Finding which value in one DataFrame column occurs most often for each individual element of another column (a column of lists of strings)
My pandas DataFrame has a column named 'tags' in which each row is a list of several strings:
[abc, 123, xyz]
[456, 123]
[abc, 123, xyz]
I also have another column, tech, which holds a single string per row:
win
mac
win
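Is there a way to find out which value in tech occurs most frequently for each individual element of tags?
For example, 'abc' is associated with 'win' more often than with any other tech. So the output should look like this: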
abc win
123 win
xyz win
456 mac
IIUC, you can explode the Tags column and use crosstab with idxmax:
Input:
import pandas as pd

d = {'Tags': [['abc', 123, 'xyz'], [456, 123], ['abc', 123, 'xyz']],
     'tech': ['win', 'mac', 'win']}
df = pd.DataFrame(d)
print(df)
Tags tech
0 [abc, 123, xyz] win
1 [456, 123] mac
2 [abc, 123, xyz] win
Solution:
m = df.explode('Tags')  # one row per (tag, tech) pair
out = pd.crosstab(m['Tags'], m['tech']).idxmax(axis=1)  # most frequent tech per tag
print(out)
Tags
123 win
456 mac
abc win
xyz win
dtype: object
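To see why idxmax picks these values, it can help to look at the intermediate count table. The following is a minimal sketch, not part of the original answer; it assumes every tag is written as a string (the original data mixes strings and integers) so that the index has a single type.

```python
import pandas as pd

# Same shape as the example above, but with all tags as strings
# (an assumption made here to keep the index homogeneous).
df = pd.DataFrame({
    "Tags": [["abc", "123", "xyz"], ["456", "123"], ["abc", "123", "xyz"]],
    "tech": ["win", "mac", "win"],
})

m = df.explode("Tags")  # one row per (tag, tech) pair

# Rows are tags, columns are tech values, cells are co-occurrence counts
counts = pd.crosstab(m["Tags"], m["tech"])
print(counts)

# For each row, take the column label that holds the largest count
out = counts.idxmax(axis=1)
print(out)
```

Because idxmax returns the first maximum it finds, a tag that is tied between two tech values gets whichever column comes first in the count table.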
Hi, I suggest the following:
import pandas as pd

# Reproduce the example
df = pd.DataFrame({"tags": [["abc", "123", "xyz"], ["456", "123"], ["abc", "123", "xyz"]],
                   "tech": ["win", "mac", "win"]})
# Explode to get one row per tag
df = df.explode(column="tags")
# Index the rows by tag
df = df.set_index("tags").sort_index()

# Take the most frequent value per tag with a small mode function
def mode(x):
    '''Return the most frequent value of x.'''
    return x.value_counts().index[0]

res = df.groupby(level=0).agg(mode)
print(res)
This gives:
tech
tags
123 win
456 mac
abc win
xyz win
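As a possible variation on the custom function, pandas also provides Series.mode. The sketch below is not from the original answer and assumes the same string-only data. Series.mode returns every tied value in sorted order, so taking the first entry resolves ties alphabetically, which may differ from what the value_counts approach picks.

```python
import pandas as pd

df = pd.DataFrame({
    "tags": [["abc", "123", "xyz"], ["456", "123"], ["abc", "123", "xyz"]],
    "tech": ["win", "mac", "win"],
})

exploded = df.explode("tags")  # one row per (tag, tech) pair

# Group the tech values by tag and keep the first of the (sorted) modes
res = exploded.groupby("tags")["tech"].agg(lambda s: s.mode().iloc[0])
print(res)
```

Here the result is a Series indexed by tag rather than a one-column DataFrame.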
If you also want the frequency associated with each tag:
import pandas as pd
from collections import Counter

df = pd.DataFrame({'tech': ['win', 'mac', 'win'],
                   'tags': [['abc', 123, 'xyz'], [456, 123], ['abc', 234, 'xyz']]})
df = df.groupby('tech').sum()  # concatenate the tag lists of each tech
df['freq'] = [Counter(el) for el in df['tags']]  # map each list to a Counter of tag frequencies

# Flatten the Counters into one row per (tech, tag) pair
rows = []
for tech, row in df.iterrows():  # tech is the index label of the grouped frame
    for key, value in row['freq'].items():
        rows.append({'tech': tech, 'tag': key, 'frequency': value})
# DataFrame.append was removed in pandas 2.0, so build the frame in one step
final_df = pd.DataFrame(rows)
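The same frequency table can probably be produced without an explicit loop. This is a sketch under assumptions, not the original author's method: it swaps the Counter approach for explode plus groupby(...).size(), writes all tags as strings, and uses the column name `frequency` only to mirror the code above.

```python
import pandas as pd

df = pd.DataFrame({
    "tech": ["win", "mac", "win"],
    "tags": [["abc", "123", "xyz"], ["456", "123"], ["abc", "234", "xyz"]],
})

freq = (
    df.explode("tags")            # one row per (tech, tag) pair
      .groupby(["tech", "tags"])  # group identical pairs together
      .size()                     # count the rows in each group
      .reset_index(name="frequency")
)
print(freq)
```

From this table, the most frequent tech per tag could then be taken by sorting on `frequency` and keeping the top row for each tag.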