pandas: use value_counts in an apply function

Here is a toy example of my pandas DataFrame:

    country_market  language_market
0   United States   English
1   United States   French
2   Not used    Not used
3   Canada OR United States English
4   Germany English
5   United Kingdom  French
6   United States   German
7   United Kingdom  English
8   United Kingdom  English
9   Not used    Not used
10  United States   French
11  United States   English
12  United Kingdom  English
13  United States   French
14  Not used    English
15  Not used    English
16  United States   French
17  United States   Not used
18  Not used    English
19  United States   German

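I want to add a column, top_country, that shows whether the value in country_market is one of the two most common countries in the data. If it is, the new top_country column should show the value from country_market; if not, it should show 'Other'. I want to repeat this process for language_market (and for all the other market columns that I haven't shown here).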

This is what I want the data to look like after processing:

    country_market  language_market top_country top_language
0   United States   English United States   English
1   United States   French  United States   French
2   Not used    Not used    Not used    Other
3   Canada OR United States English Other   English
4   Germany English Other   English
5   United Kingdom  French  Other   French
6   United States   German  United States   Other
7   United Kingdom  English Other   English
8   United Kingdom  English Other   English
9   Not used    Not used    Not used    Other
10  United States   French  United States   French
11  United States   English United States   English
12  United Kingdom  English Other   English
13  United States   French  United States   French
14  Not used    English Not used    English
15  Not used    English Not used    English
16  United States   French  United States   French
17  United States   Not used    United States   Other
18  Not used    English Not used    English
19  United States   German  United States   Other

I wrote a function, original_top_markets_function, to do this, but I can't figure out how to make the value_counts part of the function work with pandas apply. I keep getting AttributeError: 'str' object has no attribute 'value_counts'.

def original_top_markets_function(x):
    top2 = x.value_counts().nlargest(2).index
    for i in x:
        if i in top2:
            return i
        else:
            return 'Other'

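I know this happens because apply looks at each element of my target column one at a time, but I also need the function to consider the whole column at once so that I can use value_counts. I don't know how to do that.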

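So I came up with this top_markets function as a workaround. It builds a list and does what I want, but it isn't very efficient. I need to apply this function to many different market columns, so I'd like something more pythonic.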

def top_markets(x):
    top2 = x.value_counts().nlargest(2).index
    results = []
    for i in x:
        if i in top2:
            results.append(i)
        else:
            results.append('Other')
    return results

Here is a reproducible example. Can someone please help me fix the top_markets function so that I can use it with apply?

import pandas as pd

d = {0: {'country_market': 'United States', 'language_market': 'English'},
 1: {'country_market': 'United States', 'language_market': 'French'},
 2: {'country_market': 'Not used', 'language_market': 'Not used'},
 3: {'country_market': 'Canada OR United States',
  'language_market': 'English'},
 4: {'country_market': 'Germany', 'language_market': 'English'},
 5: {'country_market': 'United Kingdom', 'language_market': 'French'},
 6: {'country_market': 'United States', 'language_market': 'German'},
 7: {'country_market': 'United Kingdom', 'language_market': 'English'},
 8: {'country_market': 'United Kingdom', 'language_market': 'English'},
 9: {'country_market': 'Not used', 'language_market': 'Not used'},
 10: {'country_market': 'United States', 'language_market': 'French'},
 11: {'country_market': 'United States', 'language_market': 'English'},
 12: {'country_market': 'United Kingdom', 'language_market': 'English'},
 13: {'country_market': 'United States', 'language_market': 'French'},
 14: {'country_market': 'Not used', 'language_market': 'English'},
 15: {'country_market': 'Not used', 'language_market': 'English'},
 16: {'country_market': 'United States', 'language_market': 'French'},
 17: {'country_market': 'United States', 'language_market': 'Not used'},
 18: {'country_market': 'Not used', 'language_market': 'English'},
 19: {'country_market': 'United States', 'language_market': 'German'}}

df = pd.DataFrame.from_dict(d, orient='index')

def top_markets(x):
    top2 = x.value_counts().nlargest(2).index
    results = []
    for i in x:
        if i in top2: 
            results.append(i)
        else: 
            results.append('Other')         
    return results

df['top_country'] = top_markets(df['country_market'])
df['top_language'] = top_markets(df['language_market'])

df

I think you can use:

import numpy as np

df['top_country'] = np.where(df['country_market'].isin(df['country_market'].value_counts().nlargest(2).index), df['country_market'], 'Other')
df['top_language'] = np.where(df['language_market'].isin(df['language_market'].value_counts().nlargest(2).index), df['language_market'], 'Other')
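
A pandas-only variant of the same idea is Series.where, which keeps values where the condition holds and substitutes a replacement elsewhere. This is only a minimal sketch; the Series `s` below is made-up sample data for illustration, not the DataFrame from the question:

```python
import pandas as pd

# Hypothetical sample column, used only for illustration.
s = pd.Series(['US', 'US', 'UK', 'DE', 'US', 'UK', 'FR'])

# The two most frequent values in the column.
top2 = s.value_counts().nlargest(2).index

# Keep values that are in top2; replace everything else with 'Other'.
result = s.where(s.isin(top2), 'Other')
print(result.tolist())
# ['US', 'US', 'UK', 'Other', 'US', 'UK', 'Other']
```

Unlike np.where, this returns a Series that keeps the original index.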

If you want to use your own function, you can use:

df['top_country'] = df[['country_market']].apply(top_markets)
df['top_language'] = df[['language_market']].apply(top_markets)

#OR
df[['top_country', 'top_language']] = df[['country_market', 'language_market']].apply(top_markets)
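
The double brackets matter here: selecting with a list returns a DataFrame, and DataFrame.apply passes each whole column to the function, whereas Series.apply passes one element at a time (which is where the 'str' object error comes from). A small sketch of the difference, using a made-up `df_demo` rather than the question's data:

```python
import pandas as pd

# Hypothetical one-column frame, used only for illustration.
df_demo = pd.DataFrame({'country_market': ['US', 'UK', 'US']})

# Series.apply calls the function once per element, so x is a str here
# and x.value_counts() would raise AttributeError.
element_types = df_demo['country_market'].apply(lambda x: type(x).__name__)

# DataFrame.apply (default axis=0) calls the function once per column,
# so x is a whole Series and x.value_counts() is available.
column_types = df_demo[['country_market']].apply(lambda x: type(x).__name__)

print(element_types.tolist())  # ['str', 'str', 'str']
print(column_types.tolist())   # ['Series']
```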

Edit based on the discussion in the comments:

def top_markets(x, top):
    # x is a single value; top holds the most frequent values of its column
    if x in top:
        return x
    else:
        return 'Other'

top_country = df['country_market'].value_counts().nlargest(2).index
top_languages = df['language_market'].value_counts().nlargest(2).index

df['top_country'] = df['country_market'].apply(lambda x: top_markets(x, top_country))
df['top_language'] = df['language_market'].apply(lambda x: top_markets(x, top_languages))

If you need to process multiple columns in a function with DataFrame.apply, for example with a lambda function as here, use:

cols = ['language_market', 'country_market']

f = lambda x: np.where(x.isin(x.value_counts().nlargest(2).index), x, 'Other')
df = df.join(df[cols].apply(f).add_prefix('total_'))

Solution without a lambda function:

def top_markets(x):
    return np.where(x.isin(x.value_counts().nlargest(2).index), x, 'Other')

df = df.join(df[cols].apply(top_markets).add_prefix('total_'))
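
To get the top_country / top_language names from the question for many columns at once, the new columns can be renamed before joining. The following is only a sketch: it assumes every column to process ends in '_market', and `df_demo`, `market_cols`, and the `n` parameter are hypothetical names introduced for illustration:

```python
import numpy as np
import pandas as pd

def top_markets(x, n=2):
    # x is a whole column (Series); keep its n most frequent values.
    top = x.value_counts().nlargest(n).index
    return np.where(x.isin(top), x, 'Other')

# Hypothetical stand-in data, used only for illustration.
df_demo = pd.DataFrame({
    'country_market': ['US', 'US', 'UK', 'UK', 'DE'],
    'language_market': ['en', 'en', 'fr', 'fr', 'de'],
})

# Assumption: every column to process ends in '_market'.
market_cols = [c for c in df_demo.columns if c.endswith('_market')]

top = df_demo[market_cols].apply(top_markets)
# Rename, for example, 'country_market' -> 'top_country'.
top.columns = ['top_' + c[:-len('_market')] for c in market_cols]
df_demo = df_demo.join(top)

print(df_demo['top_country'].tolist())
# ['US', 'US', 'UK', 'UK', 'Other']
```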