How to check a dataframe consisting of lists of strings against a lookup dataframe and perform calculations?

I have a dataframe, df1, containing multiple rows of tokenized strings:

import pandas as pd

df1 = pd.DataFrame(data={'tokens': [
    ['auditioned', 'lead', 'role', 'play', 'play'],
    ['kittens', 'adopted', 'family'],
    ['peanut', 'butter', 'jelly', 'sandwiches', 'favorite'],
    ['committee', 'decorated', 'gym'],
    ['surprise', 'party', 'best', 'friends']]})

I also have a dataframe, df2, containing single-word strings and a score associated with each word:

df2 = pd.DataFrame(data = {'word' : ['adopted', 'auditioned',
'favorite', 'gym', 'play', 'sandwiches'], 'score' : [1, 2, 3, 4, 5,
6]})

What is the best way to use df2 as a kind of lookup "table" that I can also use to help perform the calculations?

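For each row in df1, I need to check whether any of its words exist in df2. If so, count the number of words found and store the result in a series called word_count (if a particular word appears more than once in a row of df1, count every occurrence). In addition, when a word in df1 exists in df2, add that word's score to the scores of any other words found in the same row and store the sum in a series called total_score. The final output should look like df3: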

df3 = pd.DataFrame(data = {'tokens' : [['auditioned', 'lead', 'role', 'play', 'play'], ['kittens', 'adopted', 'family'], ['peanut', 'butter', 'jelly', 'sandwiches', 'favorite'], ['committee', 'decorated', 'gym'], ['surprise', 'party', 'best', 'friends']], 'word_count' : [3, 1, 2, 1, 0], 'total_score' : [12, 1, 9, 4, None]})

You can do this:

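# Plain dict mapping each word to its score
d = dict(zip(df2.word, df2.score))

# One column per token position; words not found in d become missing values
helpdf = df1.tokens.apply(lambda x: pd.Series([d.get(y) for y in x]))
df1['Total'] = helpdf.sum(axis=1)
df1['count'] = helpdf.notnull().sum(axis=1)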
df1
Out[338]: 
                                          tokens  Total  count
0           [auditioned, lead, role, play, play]   12.0      3
1                     [kittens, adopted, family]    1.0      1
2  [peanut, butter, jelly, sandwiches, favorite]    9.0      2
3                    [committee, decorated, gym]    4.0      1
4               [surprise, party, best, friends]    0.0      0
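The expected df3 in the question shows None, not 0, as the total for the row with no matches. A minimal sketch of one way to get that, assuming a pandas version where sum accepts min_count (0.22 or later, as far as I know). The column names follow the question rather than the answer above, and the explicit float64 dtype is my own addition so that missing words are stored as NaN:

```python
import pandas as pd

df1 = pd.DataFrame({'tokens': [
    ['auditioned', 'lead', 'role', 'play', 'play'],
    ['kittens', 'adopted', 'family'],
    ['peanut', 'butter', 'jelly', 'sandwiches', 'favorite'],
    ['committee', 'decorated', 'gym'],
    ['surprise', 'party', 'best', 'friends']]})
df2 = pd.DataFrame({'word': ['adopted', 'auditioned', 'favorite',
                             'gym', 'play', 'sandwiches'],
                    'score': [1, 2, 3, 4, 5, 6]})

d = dict(zip(df2.word, df2.score))

# One column per token position; words missing from d are stored as NaN.
helpdf = df1.tokens.apply(
    lambda x: pd.Series([d.get(y) for y in x], dtype='float64'))

# min_count=1 makes sum return NaN for a row with no matched word at all.
df1['total_score'] = helpdf.sum(axis=1, min_count=1)
df1['word_count'] = helpdf.notnull().sum(axis=1)
print(df1)
```

With the default min_count=0 the last row sums to 0.0, which is what the output above shows.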

Use:

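# Series of scores indexed by word, used as the lookup table
d = df2.set_index('word')['score']

def f(x):
    # Keep the score of every token found in the lookup, repeats included
    y = [d.get(a) for a in x if a in d]
    return pd.Series([len(y), sum(y)], index=['word_count', 'total_score'])

df3[['word_count', 'total_score']] = df3['tokens'].apply(f)
print(df3)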
                                          tokens  word_count  total_score
0           [auditioned, lead, role, play, play]           3           12
1                     [kittens, adopted, family]           1            1
2  [peanut, butter, jelly, sandwiches, favorite]           2            9
3                    [committee, decorated, gym]           1            4
4               [surprise, party, best, friends]           0            0
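Note that d here is a pandas Series indexed by word, not a dict. The filter `a in d` still does the right thing because `in` on a Series tests the index labels, not the values. A small sketch of that behaviour, using a cut-down df2 (the variable names in_index, in_values and missing are only for illustration):

```python
import pandas as pd

df2 = pd.DataFrame({'word': ['adopted', 'auditioned', 'play'],
                    'score': [1, 2, 5]})
d = df2.set_index('word')['score']

in_index = 'play' in d       # True: 'play' is an index label
in_values = 5 in d           # False: 5 is a value, not a label
missing = d.get('kittens')   # None: Series.get returns its default

# Same list comprehension as in f, applied to a single row of tokens
tokens = ['auditioned', 'lead', 'role', 'play', 'play']
scores = [d.get(a) for a in tokens if a in d]
print(len(scores), sum(scores))
```

Without the `if a in d` filter the list would contain None for unmatched words and sum would raise a TypeError.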

Option 1

Create a basic dictionary to use for mapping inside apply:

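m0 = dict(df2.values)          # word -> score
m1 = lambda x: m0.get(x, 0)    # score of a word, 0 if it is not in the lookup
m2 = lambda x: int(x in m0)    # 1 if the word is in the lookup, else 0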
df1.assign(
    word_count=df1.tokens.apply(lambda x: sum(map(m2, x))),
    Total=df1.tokens.apply(lambda x: sum(map(m1, x)))
)

                                          tokens  word_count  Total
0           [auditioned, lead, role, play, play]           3     12
1                     [kittens, adopted, family]           1      1
2  [peanut, butter, jelly, sandwiches, favorite]           2      9
3                    [committee, decorated, gym]           1      4
4               [surprise, party, best, friends]           0      0

Option 2

Create a new series that expands the words in df1 but keeps the index values, so that we can aggregate with count and sum.

import numpy as np

# Repeat each row's index label once per token in that row
idx = df1.index.repeat(df1.tokens.str.len())
s1 = pd.Series(np.concatenate(df1.tokens), idx)
s2 = s1.map(dict(df2.values)).groupby(level=0).agg(['count', 'sum'])
df1.join(s2.rename(columns=dict(count='word_count', sum='total_score')))

                                          tokens  word_count  total_score
0           [auditioned, lead, role, play, play]           3         12.0
1                     [kittens, adopted, family]           1          1.0
2  [peanut, butter, jelly, sandwiches, favorite]           2          9.0
3                    [committee, decorated, gym]           1          4.0
4               [surprise, party, best, friends]           0          0.0
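On newer pandas the same expansion can probably be written with Series.explode instead of index.repeat plus np.concatenate. This is a sketch of that variant, not part of the answer above, and it assumes pandas 0.25 or later, where explode is available; the name result is only for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'tokens': [
    ['auditioned', 'lead', 'role', 'play', 'play'],
    ['kittens', 'adopted', 'family'],
    ['peanut', 'butter', 'jelly', 'sandwiches', 'favorite'],
    ['committee', 'decorated', 'gym'],
    ['surprise', 'party', 'best', 'friends']]})
df2 = pd.DataFrame({'word': ['adopted', 'auditioned', 'favorite',
                             'gym', 'play', 'sandwiches'],
                    'score': [1, 2, 3, 4, 5, 6]})

# explode gives one row per token and repeats the original index label.
s1 = df1['tokens'].explode()

# Map tokens to scores (NaN when absent), then aggregate per original row.
s2 = (s1.map(dict(zip(df2['word'], df2['score'])))
        .groupby(level=0)
        .agg(['count', 'sum'])
        .rename(columns={'count': 'word_count', 'sum': 'total_score'}))

result = df1.join(s2)
print(result)
```

count ignores the NaN entries, so it gives the number of matched tokens, and sum skips them too, which is why the last row comes out as 0 and 0.0.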