Pandas: use a DataFrame as a dictionary or lookup
I'm not winning here. I need to use a free-text field in one dataframe to look up different columns in a second dataframe:
df1 = pd.read_csv('Hotel_reviews.csv')
...
user:     Review:
Julie     'Sheets were dirty'
Samantha  'Meal arrived cold'
Rachel    'Cocktails were delicious'
...
Imagine a lot more data above ^
df2 = [{'Keyword': ['Sheets', 'Cocktails', 'Meal'],
        'Department': ['Bedrooms', 'Restaurant', 'Restaurant'],
        'Issue Type': ['Beds', 'Drinks', 'Food']}]
I've tried a number of ways to get to this:
df3 =
user:     Review:                     Department:   Issue Type:
Julie     'Sheets were dirty'         'Bedrooms'    'Beds'
Samantha  'Meal arrived cold'         'Restaurant'  'Food'
Rachel    'Cocktails were delicious'  'Restaurant'  'Drinks'
Here is what I've tried:

Attempt 1

def find_dept(review):
    words = review.split(' ')
    for word in words:
        if word.isin(df2['Keyword']):
            return df2[df2['word'] == word]['Department']

dept = df['Review'].apply(find_dept)

Attempt 2

for dept in df2['Department']:
    if dept.isin(review):
        return True

Attempt 3

review_dict = df2.to_dict('series')

def r_dict(review):
    return review_dict[review]

dept = df['Review'].apply(r_dict)

Needless to say, I'm struggling...
Apologies for the not-quite-right formatting; this is a made-up example and my caffeine levels are running low.
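Here is one way. The idea is to convert your mapping dictionary into the format keyword: (department, issue).
Then use a generator expression to find the first match, looping over your new dictionary.
Finally, split the series of tuples into 2 columns via pd.Series.apply(pd.Series).
Note that dictionaries are not considered ordered on Python versions before 3.7 (from 3.7 onward a plain dict preserves insertion order). So on older versions, when a review matches several keywords, you are leaving to chance which match is selected. If you want to search in a specific order there, use an ordered dictionary (look up collections.OrderedDict).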
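As a side illustration of the ordering caveat, here is a minimal sketch in plain Python, separate from the pandas code in this answer. The names `priority` and `first_match` and the sample review are made up for the example, and it assumes Python 3.7+, where insertion order of a dict is guaranteed and can therefore act as search priority.

```python
# Hypothetical mapping: insertion order doubles as search priority
# (guaranteed for plain dicts on Python 3.7+).
priority = {
    'Meal': ('Restaurant', 'Food'),
    'Sheets': ('Bedrooms', 'Beds'),
}

def first_match(review, mapping):
    """Return the value for the first keyword (in mapping order) found in review, else None."""
    key = next((k for k in mapping if k in review), None)
    return mapping.get(key)  # mapping.get(None) is None when nothing matched

# This review contains both keywords; 'Meal' wins because it comes first in the dict.
result = first_match('Sheets were dirty and the Meal arrived cold', priority)
no_match = first_match('Lovely view', priority)
```

Reordering the entries of `priority` changes which pair is returned for a review that mentions both keywords, which is the behaviour the note warns about.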
# Full example of the approach described above
import pandas as pd

df = pd.DataFrame([['Julie', 'Sheets were dirty'],
                   ['Samantha', 'Meal arrived cold'],
                   ['Rachel', 'Cocktails were delicious']],
                  columns=['User', 'Review'])

d = {'Keyword': ['Sheets', 'Cocktails', 'Meal'],
     'Department': ['Bedrooms', 'Restaurant', 'Restaurant'],
     'Issue Type': ['Beds', 'Drinks', 'Food']}

# Reshape the mapping to keyword -> (department, issue type)
d2 = {key: (dep, iss) for key, dep, iss in
      zip(d['Keyword'], d['Department'], d['Issue Type'])}

def mapper(x):
    # First keyword found as a substring of the review; None if no keyword matches
    return d2.get(next((i for i in d2 if i in x), None))

# Split the series of tuples into two columns
df[['Department', 'IssueType']] = df['Review'].apply(mapper).apply(pd.Series)
Result:

       User                    Review  Department IssueType
0     Julie         Sheets were dirty    Bedrooms      Beds
1  Samantha         Meal arrived cold  Restaurant      Food
2    Rachel  Cocktails were delicious  Restaurant    Drinks
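
If you would rather keep the lookup as a DataFrame, a possible alternative is to extract the keyword with a regular expression and then merge. This is only a sketch under some assumptions: the `lookup` frame, the extra 'Tom' row (added to show a review with no keyword), and the temporary `Keyword` column are made up for the illustration, and it swaps the dictionary loop for `Series.str.extract` plus `DataFrame.merge`.

```python
import re
import pandas as pd

df = pd.DataFrame({'User': ['Julie', 'Samantha', 'Rachel', 'Tom'],
                   'Review': ['Sheets were dirty', 'Meal arrived cold',
                              'Cocktails were delicious', 'Lovely view']})

# Hypothetical lookup table, same content as the mapping in the question.
lookup = pd.DataFrame({'Keyword': ['Sheets', 'Cocktails', 'Meal'],
                       'Department': ['Bedrooms', 'Restaurant', 'Restaurant'],
                       'Issue Type': ['Beds', 'Drinks', 'Food']})

# One alternation pattern such as (Sheets|Cocktails|Meal); re.escape guards
# against keywords that contain regex metacharacters.
pattern = '(' + '|'.join(re.escape(k) for k in lookup['Keyword']) + ')'

# str.extract keeps the leftmost match in each review (NaN when nothing matches).
df['Keyword'] = df['Review'].str.extract(pattern, expand=False)

# A left merge keeps every review; rows without a keyword get NaN in the new columns.
df3 = df.merge(lookup, on='Keyword', how='left').drop(columns='Keyword')
```

One difference from the dictionary version: when a review mentions several keywords, this picks the one that appears first in the text rather than the one that comes first in the mapping.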