How to check a dataframe consisting of a list of strings against a lookup dataframe and perform calculations?
I have a dataframe, df1, containing multiple rows of tokenized strings:
df1 = pd.DataFrame(data = {'tokens' : [['auditioned', 'lead', 'role', 'play',
'play'], ['kittens', 'adopted', 'family'], ['peanut', 'butter', 'jelly',
'sandwiches', 'favorite'], ['committee', 'decorated', 'gym'], ['surprise',
'party', 'best', 'friends']]})
I also have a dataframe, df2, containing single-word strings along with a score associated with each word:
df2 = pd.DataFrame(data = {'word' : ['adopted', 'auditioned',
'favorite', 'gym', 'play', 'sandwiches'], 'score' : [1, 2, 3, 4, 5,
6]})
What is the best way to use df2 as a kind of lookup "table" that I can also use to help perform calculations?
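For each row in df1, I need to check whether any of its words exist in df2. If so, count the number of words found and store the result in a series called word_count (if a particular word appears more than once in a row of df1, count every occurrence). In addition, when a word in df1 exists in df2, add that word's score to the scores of any other words found, and store the sum in a series called total_score. The final output should look like df3: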
df3 = pd.DataFrame(data = {'tokens' : [['auditioned', 'lead', 'role', 'play', 'play'], ['kittens', 'adopted', 'family'], ['peanut', 'butter', 'jelly', 'sandwiches', 'favorite'], ['committee', 'decorated', 'gym'], ['surprise', 'party', 'best', 'friends']], 'word_count' : [3, 1, 2, 1, 0], 'total_score' : [12, 1, 9, 4, None]})
You can do:
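# Build a word -> score lookup dictionary from df2
d = dict(zip(df2.word, df2.score))
# One column per token position; words missing from d become NaN
helpdf = df1.tokens.apply(lambda x: pd.Series([d.get(y) for y in x]))
df1['Total'] = helpdf.sum(axis=1)            # NaN is skipped when summing
df1['count'] = helpdf.notnull().sum(axis=1)  # number of matched tokens
df1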
Out[338]:
                                          tokens  Total  count
0           [auditioned, lead, role, play, play]   12.0      3
1                     [kittens, adopted, family]    1.0      1
2  [peanut, butter, jelly, sandwiches, favorite]    9.0      2
3                    [committee, decorated, gym]    4.0      1
4               [surprise, party, best, friends]    0.0      0
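Note that the last row's Total comes out as 0.0, whereas the df3 in the question has None there. A minimal sketch of one way to get a missing value instead, assuming the `min_count` argument of pandas' `sum` is acceptable for this purpose (`row_scores` and `result` are illustrative names, not part of the answer above):

```python
import pandas as pd

df1 = pd.DataFrame({'tokens': [
    ['auditioned', 'lead', 'role', 'play', 'play'],
    ['kittens', 'adopted', 'family'],
    ['peanut', 'butter', 'jelly', 'sandwiches', 'favorite'],
    ['committee', 'decorated', 'gym'],
    ['surprise', 'party', 'best', 'friends']]})
df2 = pd.DataFrame({
    'word': ['adopted', 'auditioned', 'favorite', 'gym', 'play', 'sandwiches'],
    'score': [1, 2, 3, 4, 5, 6]})

d = dict(zip(df2.word, df2.score))

def row_scores(tokens):
    # Scores of the tokens found in the lookup, as a float Series (possibly empty)
    return pd.Series([d[t] for t in tokens if t in d], dtype='float64')

result = df1.assign(
    word_count=df1.tokens.apply(lambda toks: row_scores(toks).count()),
    # min_count=1 makes the sum NaN when no token matched, instead of 0
    total_score=df1.tokens.apply(lambda toks: row_scores(toks).sum(min_count=1)),
)
print(result)
```

With this variant, rows without any match should show NaN in total_score while word_count stays 0.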
Use:
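# Series of scores indexed by word, so `a in d` tests the index labels
d = df2.set_index('word')['score']
def f(x):
    # Scores of the tokens that appear in the lookup
    y = [d.get(a) for a in x if a in d]
    return pd.Series([len(y), sum(y)], index=['word_count','total_score'])
df3[['word_count','total_score']] = df3['tokens'].apply(f)
print(df3)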
                                          tokens  word_count  total_score
0           [auditioned, lead, role, play, play]           3           12
1                     [kittens, adopted, family]           1            1
2  [peanut, butter, jelly, sandwiches, favorite]           2            9
3                    [committee, decorated, gym]           1            4
4               [surprise, party, best, friends]           0            0
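The `a in d` filter in this answer relies on the fact that `in` on a pandas Series tests the index labels rather than the values; since `d` is indexed by word, membership acts as a word lookup. A small sketch illustrating that behaviour with the same df2 (the variable names below are only for illustration):

```python
import pandas as pd

df2 = pd.DataFrame({
    'word': ['adopted', 'auditioned', 'favorite', 'gym', 'play', 'sandwiches'],
    'score': [1, 2, 3, 4, 5, 6]})

# Series of scores, indexed by word
d = df2.set_index('word')['score']

in_index = 'play' in d         # tested against the index labels (the words)
in_labels = 5 in d             # 5 is a score (a value), not an index label
in_values = 5 in d.values      # explicit membership test on the values
found = d.get('play')          # score for a known word
missing = d.get('kittens')     # Series.get falls back to None for an unknown label

print(in_index, in_labels, in_values, found, missing)
```

A plain dict built with `dict(zip(df2.word, df2.score))` would support the same `in` and `get` pattern, which is why the answers in this thread can use either form.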
Option 1
Create a base dictionary to use for mapping inside apply:
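m0 = dict(df2.values)         # word -> score
m1 = lambda x: m0.get(x, 0)   # score of a word, 0 if it is not in the lookup
m2 = lambda x: int(x in m0)   # 1 if the word is in the lookup, else 0
df1.assign(
    word_count=df1.tokens.apply(lambda x: sum(map(m2, x))),
    Total=df1.tokens.apply(lambda x: sum(map(m1, x)))
)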
                                          tokens  word_count  Total
0           [auditioned, lead, role, play, play]           3     12
1                     [kittens, adopted, family]           1      1
2  [peanut, butter, jelly, sandwiches, favorite]           2      9
3                    [committee, decorated, gym]           1      4
4               [surprise, party, best, friends]           0      0
Option 2
Create a new series that flattens the words in df1 while keeping the index values, so that we can aggregate with count and sum.
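import numpy as np
# Repeat each row label once per token so the flat Series keeps the original index
idx = df1.index.repeat(df1.tokens.str.len())
s1 = pd.Series(np.concatenate(df1.tokens), idx)
# Map tokens to scores (NaN if absent), then count and sum per original row
s2 = s1.map(dict(df2.values)).groupby(level=0).agg(['count', 'sum'])
df1.join(s2.rename(columns=dict(count='word_count', sum='total_score')))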
                                          tokens  word_count  total_score
0           [auditioned, lead, role, play, play]           3         12.0
1                     [kittens, adopted, family]           1          1.0
2  [peanut, butter, jelly, sandwiches, favorite]           2          9.0
3                    [committee, decorated, gym]           1          4.0
4               [surprise, party, best, friends]           0          0.0
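On pandas 0.25 or later, the same flatten-then-aggregate idea can probably be written with `Series.explode`, which avoids building the repeated index by hand. This is an alternative sketch rather than part of the answer above; `lookup`, `flat`, `agg` and `result` are illustrative names:

```python
import pandas as pd

df1 = pd.DataFrame({'tokens': [
    ['auditioned', 'lead', 'role', 'play', 'play'],
    ['kittens', 'adopted', 'family'],
    ['peanut', 'butter', 'jelly', 'sandwiches', 'favorite'],
    ['committee', 'decorated', 'gym'],
    ['surprise', 'party', 'best', 'friends']]})
df2 = pd.DataFrame({
    'word': ['adopted', 'auditioned', 'favorite', 'gym', 'play', 'sandwiches'],
    'score': [1, 2, 3, 4, 5, 6]})

# Series of scores indexed by word, used as the mapping
lookup = df2.set_index('word')['score']

# explode yields one row per token and repeats the original row label;
# map turns each token into its score, or NaN when the word is not in the lookup
flat = df1['tokens'].explode().map(lookup)

# count ignores NaN, and sum skips NaN (an all-NaN group sums to 0.0)
agg = flat.groupby(level=0).agg(['count', 'sum'])
agg = agg.rename(columns={'count': 'word_count', 'sum': 'total_score'})

result = df1.join(agg)
print(result)
```

As with Option 2, total_score ends up as a float column because unmatched tokens pass through as NaN before aggregation.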