Get the highest occurrence between 2 columns
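I have 2 columns called decision1 and decision2.
I want to compare them and get the value that occurs most often across the two, so that whichever count is largest wins, whether the value sits in decision1 or decision2. My attempt so far is below, but it does not work: it only gives me the most frequent value in EACH column, not in the two combined.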
# weightage option
if args['weightage'] == "yes":
    attr1 = data['decision'].value_counts().idxmax()  # highest occurrence in decision
    attr2 = data['decision2'].value_counts().idxmax()  # highest occurrence in decision2
    heaviest_attribute = data.groupby(['decision','decision2']).size()
Ideally, I just need some kind of max() function between attr1 and attr2, but I don't know how to go about it.
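For example, given this table, I want to compare the decision1 and decision2 columns as if they were a single column. In this case the expected output would be 'Yes', because it is the value that occurs most often.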
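A more elegant solution probably exists...
import pandas as pd
df = pd.DataFrame({'decision': ['Yes', 'Maybe', 'Yes', 'Maybe', 'Yes'],
                   'decision 2': ['No', 'No', 'Perhaps', 'Perhaps', 'unsure']})
# Count the occurrences of each value, one dict per column.
d1 = dict(df.groupby('decision').count().loc[:, 'decision 2'])
d2 = dict(df.groupby('decision 2').count().loc[:, 'decision'])
# Merge the two dicts (a value present in both columns keeps only the d2 count).
d1.update(d2)
max(d1, key=d1.get)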
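One caveat with the dictionary approach: dict.update overwrites the count of any value that appears in both columns instead of adding to it. Below is a minimal sketch that sums the per-column counts instead. It assumes the same column names; the overlapping sample data is made up for illustration and is not from the question.

```python
import pandas as pd

# Hypothetical data in which 'No' and 'Yes' appear in both columns.
df = pd.DataFrame({'decision': ['Yes', 'No', 'Yes', 'No', 'Maybe'],
                   'decision 2': ['No', 'No', 'Perhaps', 'Perhaps', 'Yes']})

# Count each column separately, then add the counts label by label.
# fill_value=0 keeps values that occur in only one of the columns.
counts = df['decision'].value_counts().add(
    df['decision 2'].value_counts(), fill_value=0)

most_common = counts.idxmax()
print(most_common)
```

Here 'No' occurs twice in each column, so its combined count is what decides the result.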
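Here is a simple solution.
It is easiest to convert the contents to lists; finding the most frequent element of a list is then straightforward.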
import pandas as pd
data = pd.DataFrame({'decision': ['yes', 'maybe', 'yes', 'maybe', 'yes'],
                     'decision 2': ['No', 'No', 'Perhaps', 'Perhaps', 'unsure']
                     })
a = list(data['decision'])+list(data['decision 2'])
a = max(set(a), key=a.count)
print(a)
Output:
yes
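Note that max(set(a), key=a.count) rescans the whole list once per distinct value. For larger data, a sketch along the following lines, using collections.Counter from the standard library, counts everything in a single pass. The names values, value and count are only illustrative.

```python
from collections import Counter

import pandas as pd

data = pd.DataFrame({'decision': ['yes', 'maybe', 'yes', 'maybe', 'yes'],
                     'decision 2': ['No', 'No', 'Perhaps', 'Perhaps', 'unsure']})

# Concatenate both columns into one list and count every value in one pass.
values = data['decision'].tolist() + data['decision 2'].tolist()

# most_common(1) returns a list holding one (value, count) pair.
value, count = Counter(values).most_common(1)[0]
print(value, count)
```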
Use DataFrame.melt with Series.mode, and select the first value by position with Series.iat:
a = df[['decision','decision 2']].melt()['value'].mode().iat[0]
Or reshape with DataFrame.stack:
a = df[['decision','decision 2']].stack().mode().iat[0]
print (a)
Yes
Details:
print (df[['decision','decision 2']].melt()['value'])
0        Yes
1      Maybe
2        Yes
3      Maybe
4        Yes
5         No
6         No
7    Perhaps
8    Perhaps
9     unsure
Name: value, dtype: object
print (df[['decision','decision 2']].stack())
0  decision          Yes
   decision 2         No
1  decision        Maybe
   decision 2         No
2  decision          Yes
   decision 2    Perhaps
3  decision        Maybe
   decision 2    Perhaps
4  decision          Yes
   decision 2     unsure
dtype: object
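Series.mode returns every value tied for first place, in sorted order, so .iat[0] silently picks one of them when there is a tie. If the counts themselves matter, a sketch like this one, which uses Series.value_counts on the melted column, shows them and keeps all tied winners. The names counts and top are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'decision': ['Yes', 'Maybe', 'Yes', 'Maybe', 'Yes'],
                   'decision 2': ['No', 'No', 'Perhaps', 'Perhaps', 'unsure']})

# Treat both columns as one long column and count each distinct value.
counts = df[['decision', 'decision 2']].melt()['value'].value_counts()

# Keep every value whose count equals the maximum (more than one on a tie).
top = counts[counts == counts.max()]
print(top)
```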
Edit: to get the name of the first column that contains this value:
s = df.eq(a).any()
col = s.index[s][0]
print (col)
decision
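The check with any() reports the first column that contains the value at all. If the value can appear in both columns and the goal is the column where it occurs most, a possible variation is to count the matches per column. This is a sketch under that assumption; per_column is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'decision': ['Yes', 'Maybe', 'Yes', 'Maybe', 'Yes'],
                   'decision 2': ['No', 'No', 'Perhaps', 'Perhaps', 'unsure']})

# Most frequent value across both columns, as in the answer above.
a = df[['decision', 'decision 2']].melt()['value'].mode().iat[0]

# Number of cells equal to that value in each column.
per_column = df.eq(a).sum()

# Column label with the largest number of matches.
col = per_column.idxmax()
print(per_column)
print(col)
```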