Assigning True/False if a token is present in a data-frame

My current data frame is:

     |articleID | keywords                                               | 
     |:-------- |:------------------------------------------------------:| 
0    |58b61d1d  | ['Second Avenue (Manhattan, NY)']                      |     
1    |58b6393b  | ['Crossword Puzzles']                                  |          
2    |58b6556e  | ['Workplace Hazards and Violations', 'Trump, Donald J']|            
3    |58b657fa  | ['Trump, Donald J', 'Speeches and Statements']         |

I want a data frame like the one below, with an extra column that is True when the token 'Trump, Donald J' appears in the row's keywords and False otherwise:

     |articleID | keywords                                               | trumpMention |
     |:-------- |:------------------------------------------------------:| ------------:|
0    |58b61d1d  | ['Second Avenue (Manhattan, NY)']                      | False        |      
1    |58b6393b  | ['Crossword Puzzles']                                  | False        |          
2    |58b6556e  | ['Workplace Hazards and Violations', 'Trump, Donald J']| True         |           
3    |58b657fa  | ['Trump, Donald J', 'Speeches and Statements']         | True         |

I tried several approaches using df functions, but none of them gave the result I want. Some of the things I tried:

df['trumpMention'] = np.where(any(df['keywords']) == 'Trump, Donald J', True, False) 

df['trumpMention'] = df['keywords'].apply(lambda x: any(token == 'Trump, Donald J') for token in x) 

lst = ['Trump, Donald J']  
df['trumpMention'] = df['keywords'].apply(lambda x: ([ True for token in x if any(token in lst)]))   

Original input:

df = pd.DataFrame({'articleID': ['58b61d1d', '58b6393b', '58b6556e', '58b657fa'],
                   'keywords': [['Second Avenue (Manhattan, NY)'],
                                ['Crossword Puzzles'],
                                ['Workplace Hazards and Violations', 'Trump, Donald J'],
                                ['Trump, Donald J', 'Speeches and Statements']],
                   'trumpMention': [False, False, True, True]})

Try:

df["trumpMention"] = df["keywords"].apply(lambda x: "Trump, Donald J" in x)

How about applying a function that checks set membership?

df['trumpMention'] = df['keywords'].apply(lambda x: 'Trump, Donald J' in set(x))

Output:

  articleID                                           keywords  trumpMention
0  58b61d1d                    [Second Avenue (Manhattan, NY)]         False
1  58b6393b                                [Crossword Puzzles]         False
2  58b6556e  [Workplace Hazards and Violations, Trump, Dona...          True
3  58b657fa         [Trump, Donald J, Speeches and Statements]          True

Regarding your attempts:

np.where(any(df['keywords']) == 'Trump, Donald J', True, False) 

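won't work because any(df['keywords']) collapses the whole column into a single value, which is True here since every keyword list is non-empty. True is not equal to 'Trump, Donald J', so the expression always returns array(False).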
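To make that concrete, here is a minimal check on the same keywords column. It is only a sketch; the variable names collapsed and result are mine, not from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'keywords': [['Second Avenue (Manhattan, NY)'],
                                ['Crossword Puzzles'],
                                ['Workplace Hazards and Violations', 'Trump, Donald J'],
                                ['Trump, Donald J', 'Speeches and Statements']]})

# any() iterates over the Series; every element is a non-empty list, hence truthy,
# so the whole column collapses to one Python bool instead of one value per row.
collapsed = any(df['keywords'])

# Comparing that bool with a string gives False, so np.where sees a scalar condition
# and returns a zero-dimensional array rather than a column of flags.
result = np.where(collapsed == 'Trump, Donald J', True, False)
print(collapsed, result, result.shape)
```

Because the result has no row dimension, assigning it to a column broadcasts the same False to every row.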

df['keywords'].apply(lambda x: any(token == 'Trump, Donald J') for token in x) 

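doesn't work: it raises a TypeError, because the for token in x clause sits outside the call to any, so any is not given a comprehension to iterate over.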
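For comparison, a sketch of the same idea with the generator expression moved inside any(...), so that any receives one boolean per token. The data frame is rebuilt here only to keep the snippet self-contained.

```python
import pandas as pd

df = pd.DataFrame({'articleID': ['58b61d1d', '58b6393b', '58b6556e', '58b657fa'],
                   'keywords': [['Second Avenue (Manhattan, NY)'],
                                ['Crossword Puzzles'],
                                ['Workplace Hazards and Violations', 'Trump, Donald J'],
                                ['Trump, Donald J', 'Speeches and Statements']]})

# The closing parenthesis of any(...) now comes after "for token in x",
# so any() iterates over the comparisons for each token in the row's list.
df['trumpMention'] = df['keywords'].apply(
    lambda x: any(token == 'Trump, Donald J' for token in x)
)
print(df)
```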

df['keywords'].apply(lambda x: ([ True for token in x if any(token in lst)]))  

doesn't work because token in lst is already a single boolean, so

any(token in lst)

makes no sense: any expects an iterable, not a bare True or False.
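A small sketch of both points, assuming the same lst and keywords as in the question: calling any on a bare boolean fails, while passing it a generator of booleans gives one flag per row. The name error_name is only there to record what happened.

```python
import pandas as pd

lst = ['Trump, Donald J']
df = pd.DataFrame({'keywords': [['Second Avenue (Manhattan, NY)'],
                                ['Crossword Puzzles'],
                                ['Workplace Hazards and Violations', 'Trump, Donald J'],
                                ['Trump, Donald J', 'Speeches and Statements']]})

# any() needs something it can iterate over; a single bool is rejected.
try:
    any('Trump, Donald J' in lst)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# With a generator expression, any() checks each token of the row against lst.
df['trumpMention'] = df['keywords'].apply(
    lambda x: any(token in lst for token in x)
)
print(error_name)
print(df)
```

This form also extends naturally to several tokens: add them to lst and a row is flagged when any of its keywords is in the list.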

Try my approach. I build a list of flags first and then add it to the data frame.

import pandas as pd

def mentioned_Trump(s, lst):
    # True if the token s appears in the keyword list lst
    return s in lst

# Each row holds an ID and that article's list of keywords
rows = [[1, ['Second Avenue (Manhattan, NY)']], [2, ['Crossword Puzzles']],
        [3, ['Workplace Hazards and Violations', 'Trump, Donald J']],
        [4, ['Trump, Donald J', 'Speeches and Statements']]]

df = pd.DataFrame(rows, columns=['ID', 'keywords'])

# Build the list of flags first, then attach it as a new column
keywords = list(df['keywords'])
flags = [mentioned_Trump('Trump, Donald J', x) for x in keywords]

df['trumpMention'] = flags
print(df)

Use the vectorized string methods, which can be faster than apply.

df.keywords.astype(str).str.contains("Trump, Donald J")
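One thing worth checking before relying on this: str.contains looks for a substring in the string form of each list, so it can flag a longer keyword that merely contains the token. The sketch below compares it with an exact membership test; the third row's keyword is invented purely for illustration, and regex=False is added so the token is treated as literal text.

```python
import pandas as pd

df = pd.DataFrame({'keywords': [['Crossword Puzzles'],
                                ['Trump, Donald J'],
                                ['Trump, Donald J Jr']]})  # hypothetical keyword, not from the question

# Substring match on the stringified list, e.g. "['Trump, Donald J Jr']".
by_substring = df['keywords'].astype(str).str.contains('Trump, Donald J', regex=False)

# Exact membership test on the list itself.
by_membership = df['keywords'].apply(lambda x: 'Trump, Donald J' in x)

print(pd.DataFrame({'by_substring': by_substring, 'by_membership': by_membership}))
```

If the keywords never overlap like this, both approaches give the same column; otherwise the membership test is the safer choice.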