pandas: use value_counts in an apply function
Here is a toy example of my pandas DataFrame:
country_market language_market
0 United States English
1 United States French
2 Not used Not used
3 Canada OR United States English
4 Germany English
5 United Kingdom French
6 United States German
7 United Kingdom English
8 United Kingdom English
9 Not used Not used
10 United States French
11 United States English
12 United Kingdom English
13 United States French
14 Not used English
15 Not used English
16 United States French
17 United States Not used
18 Not used English
19 United States German
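I want to add a column, top_country, that shows the value of country_market when that value is one of the two most frequent countries in the data, and 'Other' otherwise. I want to repeat this for language_market (and for all the other market columns I haven't shown here).
This is how I want the data to look after processing: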
country_market language_market top_country top_language
0 United States English United States English
1 United States French United States French
2 Not used Not used Not used Other
3 Canada OR United States English Other English
4 Germany English Other English
5 United Kingdom French Other French
6 United States German United States Other
7 United Kingdom English Other English
8 United Kingdom English Other English
9 Not used Not used Not used Other
10 United States French United States French
11 United States English United States English
12 United Kingdom English Other English
13 United States French United States French
14 Not used English Not used English
15 Not used English Not used English
16 United States French United States French
17 United States Not used United States Other
18 Not used English Not used English
19 United States German United States Other
I wrote a function, original_top_markets_function, to do this, but I can't figure out how to make the value_counts part of the function work with pandas apply. I keep getting AttributeError: 'str' object has no attribute 'value_counts'.
def original_top_markets_function(x):
    top2 = x.value_counts().nlargest(2).index
    for i in x:
        if i in top2:
            return i
        else:
            return 'Other'
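I understand this happens because apply looks at each element of my target column one at a time, but the function also needs to see the whole column at once so that it can use value_counts. I don't know how to do that.
So I came up with this top_markets function as a workaround. It builds a list and does what I want, but it isn't efficient. I need to apply it to many different market columns, so I'd like something more Pythonic.
def top_markets(x):
    top2 = x.value_counts().nlargest(2).index
    results = []
    for i in x:
        if i in top2:
            results.append(i)
        else:
            results.append('Other')
    return results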
Here is a reproducible example. Can you help me fix the top_markets function so that I can use it with apply?
import pandas as pd
d = {0: {'country_market': 'United States', 'language_market': 'English'},
1: {'country_market': 'United States', 'language_market': 'French'},
2: {'country_market': 'Not used', 'language_market': 'Not used'},
3: {'country_market': 'Canada OR United States',
'language_market': 'English'},
4: {'country_market': 'Germany', 'language_market': 'English'},
5: {'country_market': 'United Kingdom', 'language_market': 'French'},
6: {'country_market': 'United States', 'language_market': 'German'},
7: {'country_market': 'United Kingdom', 'language_market': 'English'},
8: {'country_market': 'United Kingdom', 'language_market': 'English'},
9: {'country_market': 'Not used', 'language_market': 'Not used'},
10: {'country_market': 'United States', 'language_market': 'French'},
11: {'country_market': 'United States', 'language_market': 'English'},
12: {'country_market': 'United Kingdom', 'language_market': 'English'},
13: {'country_market': 'United States', 'language_market': 'French'},
14: {'country_market': 'Not used', 'language_market': 'English'},
15: {'country_market': 'Not used', 'language_market': 'English'},
16: {'country_market': 'United States', 'language_market': 'French'},
17: {'country_market': 'United States', 'language_market': 'Not used'},
18: {'country_market': 'Not used', 'language_market': 'English'},
19: {'country_market': 'United States', 'language_market': 'German'}}
df = pd.DataFrame.from_dict(d, orient='index')
def top_markets(x):
    top2 = x.value_counts().nlargest(2).index
    results = []
    for i in x:
        if i in top2:
            results.append(i)
        else:
            results.append('Other')
    return results
df['top_country'] = top_markets(df['country_market'])
df['top_language'] = top_markets(df['language_market'])
df
I think you can use:
import numpy as np
df['top_country'] = np.where(df['country_market'].isin(df['country_market'].value_counts().nlargest(2).index), df['country_market'], 'Other')
df['top_language'] = np.where(df['language_market'].isin(df['language_market'].value_counts().nlargest(2).index), df['language_market'], 'Other')
If you want to use your own function, you can use:
df['top_country'] = df[['country_market']].apply(top_markets)
df['top_language'] = df[['language_market']].apply(top_markets)
#OR
df[['top_country', 'top_language']] = df[['country_market', 'language_market']].apply(top_markets)
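The double brackets matter here: df[['country_market']] is a one-column DataFrame, and DataFrame.apply hands each column to the function as a whole Series, whereas Series.apply hands over one element at a time. A minimal sketch of the difference, using a small made-up Series (s and the variable names are only for illustration):

```python
import pandas as pd

s = pd.Series(['a', 'a', 'b', 'c'], name='col')  # made-up data, not the question's frame

# Series.apply calls the function once per element, so each x is a plain str
element_types = s.apply(lambda x: type(x).__name__)

# DataFrame.apply (default axis=0) calls the function once per column,
# so each x is a whole Series and x.value_counts() is available
column_types = s.to_frame().apply(lambda x: type(x).__name__)

# Calling value_counts on a single element reproduces the error from the question
try:
    s.apply(lambda x: x.value_counts())
except AttributeError as e:
    error_message = str(e)

print(set(element_types))
print(column_types['col'])
print(error_message)
```

This is why the list-returning top_markets works when it receives a column from a DataFrame but fails when called through Series.apply.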
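Edit, based on the discussion in the comments:
def top_markets(x, top):
    # x is a single value; top holds the most frequent values of the column
    if x in top:
        return x
    else:
        return 'Other'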
top_country = df['country_market'].value_counts().nlargest(2).index
top_languages = df['language_market'].value_counts().nlargest(2).index
df['top_country'] = df['country_market'].apply(lambda x: top_markets(x, top_country))
df['top_language'] = df['language_market'].apply(lambda x: top_markets(x, top_languages))
If you need to process several columns with one function via DataFrame.apply, for example with a lambda function, use:
cols = ['language_market', 'country_market']
f = lambda x: np.where(x.isin(x.value_counts().nlargest(2).index), x, 'Other')
df = df.join(df[cols].apply(f).add_prefix('total_'))
A solution without a lambda function:
def top_markets(x):
    return np.where(x.isin(x.value_counts().nlargest(2).index), x, 'Other')
df = df.join(df[cols].apply(top_markets).add_prefix('total_'))
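If you would rather stay in pandas and skip np.where, Series.where should give the same result, since it keeps values where the condition is True and substitutes the second argument elsewhere. A rough sketch, assuming a hypothetical helper top_n_or_other (the name, the n and other parameters, and the small demo frame are mine, not from the question):

```python
import pandas as pd

def top_n_or_other(col, n=2, other='Other'):
    """Keep the n most frequent values of col and replace the rest with `other`."""
    top = col.value_counts().nlargest(n).index
    # Series.where keeps values where the mask is True, else uses `other`
    return col.where(col.isin(top), other)

# Small made-up frame with no ties among the top two values
demo = pd.DataFrame({
    'country_market': ['US', 'US', 'US', 'UK', 'UK', 'DE'],
    'language_market': ['en', 'en', 'en', 'fr', 'fr', 'de'],
})

# DataFrame.apply passes each column as a Series, so the helper sees the whole column
result = demo.join(demo.apply(top_n_or_other).add_prefix('top_'))
print(result)
```

Because the helper returns a Series with the original index, the joined columns line up with the source rows even when the index is not a default range. Note that if several values tie for second place, which one nlargest keeps is something to check on your own data.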