Looking up multiple dictionary keys in a Pandas DataFrame and returning multiple values for matches
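First time posting, so apologies if the formatting is off.
Here is my problem:
I've created a Pandas DataFrame that contains several rows of text:
import numpy as np
import pandas as pd
d = {'keywords': ['cheap shoes', 'luxury shoes', 'cheap hiking shoes']}
keywords = pd.DataFrame(d, columns=['keywords'])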
In [7]: keywords
Out[7]:
             keywords
0         cheap shoes
1        luxury shoes
2  cheap hiking shoes
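Now I have a dictionary with the following keys/values:
labels = {'cheap': 'budget', 'luxury': 'expensive', 'hiking': 'sport'}
What I'd like to do is check whether any of the dictionary keys occur in the DataFrame and, if so, return the corresponding value.
I was able to get part of the way there with the following:
for k, v in labels.items():
    keywords['Labels'] = np.where(keywords['keywords'].str.contains(k), v, 'No Match')
However, the output is missing the first two keys and only captures the last one, "hiking":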
             keywords    Labels
0         cheap shoes  No Match
1        luxury shoes  No Match
2  cheap hiking shoes     sport
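I'd also like to know whether there is a way to capture multiple values from the dictionary, separated by |, so the ideal output would look like this: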
             keywords          Labels
0         cheap shoes          budget
1        luxury shoes       expensive
2  cheap hiking shoes  budget | sport
Any help or guidance is much appreciated.
Cheers
It's certainly possible. Here is one way.
d = {'keywords': ['cheap shoes', 'luxury shoes', 'cheap hiking shoes', 'nothing']}
keywords = pd.DataFrame(d, columns=['keywords'])
labels = {'cheap': 'budget', 'luxury': 'expensive', 'hiking': 'sport'}
df = pd.DataFrame(d)
def matcher(k):
    # Collect every dictionary key that occurs as a substring of the text
    x = (i for i in labels if i in k)
    # Look up the value for each matching key and join them
    return ' | '.join(map(labels.get, x))
df['values'] = df['keywords'].map(matcher)
#              keywords          values
# 0         cheap shoes          budget
# 1        luxury shoes       expensive
# 2  cheap hiking shoes  budget | sport
# 3             nothing
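One thing to keep in mind with the function above: `i in k` is a substring test, so a key such as 'cheap' would also match 'cheaper'. If that matters for your data, a whole-word variant could look roughly like the sketch below. The name `match_words`, its `default` argument, and the extra 'cheaper shoes' row are my own additions for illustration, not part of the question.

```python
import pandas as pd

labels = {'cheap': 'budget', 'luxury': 'expensive', 'hiking': 'sport'}
df = pd.DataFrame({'keywords': ['cheap shoes', 'luxury shoes',
                                'cheap hiking shoes', 'cheaper shoes']})

def match_words(text, default='No Match'):
    # Hypothetical helper: compare whole whitespace-separated words
    # instead of substrings, so 'cheaper' does not trigger 'cheap'.
    words = set(text.split())
    # Dicts keep insertion order (Python 3.7+), so labels come out
    # in the order the keys were defined.
    found = [value for key, value in labels.items() if key in words]
    return ' | '.join(found) if found else default

df['values'] = df['keywords'].map(match_words)
print(df)
```

Here the last row gets 'No Match' rather than 'budget', and the fallback text replaces the empty string that `matcher` returns for unmatched rows.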
You can use "|".join(labels.keys()) to build the pattern used by re.findall().
import pandas as pd
import re
d = {'keywords': ['cheap shoes', 'luxury shoes', 'cheap hiking shoes']}
keywords = pd.DataFrame(d, columns=['keywords'])
labels = {'cheap': 'budget', 'luxury': 'expensive', 'hiking': 'sport'}
# Alternation of all keys: 'cheap|luxury|hiking'
pattern = "|".join(labels.keys())
def f(s):
    # Find every key in the string and join the corresponding values
    return "|".join(labels[word] for word in re.findall(pattern, s))
keywords.keywords.map(f)
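If the keys might contain regex metacharacters, or should only match as whole words, the pattern can be hardened a little. The following is a sketch under those assumptions; it uses `re.escape`, `\b` word boundaries, and the pandas `Series.str.findall` accessor instead of a plain Python function.

```python
import re
import pandas as pd

labels = {'cheap': 'budget', 'luxury': 'expensive', 'hiking': 'sport'}
keywords = pd.DataFrame(
    {'keywords': ['cheap shoes', 'luxury shoes', 'cheap hiking shoes']}
)

# Escape each key so characters like '+' or '.' are matched literally,
# and wrap the alternation in \b so only whole words match.
pattern = r'\b(?:' + '|'.join(map(re.escape, labels)) + r')\b'

keywords['Labels'] = (
    keywords['keywords']
    .str.findall(pattern)  # list of matched keys per row
    .map(lambda words: ' | '.join(labels[w] for w in words))
)
print(keywords)
```

Note that with this approach the labels appear in the order the keys occur in the text, and a key that occurs twice in a row produces its label twice.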
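Sticking with your approach, you could do, for example:
# One row per keyword, one column per dictionary key
arr = np.array([np.where(keywords['keywords'].str.contains(k), v, 'No Match') for k, v in labels.items()]).T
# Keep the matched values per row, or 'No Match' if nothing matched at all
keywords["Labels"] = ["|".join(set(item[ind if ind.sum() == ind.shape[0] else ~ind])) for item, ind in zip(arr, (arr == "No Match"))]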
Out[97]:
             keywords        Labels
0         cheap shoes        budget
1        luxury shoes     expensive
2  cheap hiking shoes  sport|budget
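For context, the loop in the question only keeps "hiking" because every iteration assigns the whole `Labels` column again, so the result of the last key overwrites the earlier ones. Also, since the one-liner above goes through a `set`, the order of the joined labels is not guaranteed (hence `sport|budget`). A possibly easier to read variant that keeps the dictionary order might look like this; the intermediate `hits` frame and the extra 'plain shoes' row are names and data I made up for the example.

```python
import numpy as np
import pandas as pd

labels = {'cheap': 'budget', 'luxury': 'expensive', 'hiking': 'sport'}
keywords = pd.DataFrame(
    {'keywords': ['cheap shoes', 'luxury shoes',
                  'cheap hiking shoes', 'plain shoes']}
)

# One column per key: the label where the key occurs, '' elsewhere.
# Nothing is overwritten because each key gets its own column.
hits = pd.DataFrame(
    {k: np.where(keywords['keywords'].str.contains(k, regex=False), v, '')
     for k, v in labels.items()},
    index=keywords.index,
)

# Join the non-empty labels of each row in dictionary order.
joined = hits.apply(lambda row: ' | '.join(x for x in row if x), axis=1)

# Rows without any hit end up as '' and get the fallback text.
keywords['Labels'] = joined.replace('', 'No Match')
print(keywords)
```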
You can split the strings into separate columns, then stack them into a multi-index so that you can map the labels dictionary onto the values. Then groupby the initial index and concatenate the strings belonging to each index.
keywords['Labels'] = keywords.keywords.str.split(expand=True).stack()\
    .map(labels).groupby(level=0)\
    .apply(lambda x: x.str.cat(sep=' | '))
             keywords          Labels
0         cheap shoes          budget
1        luxury shoes       expensive
2  cheap hiking shoes  budget | sport
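A similar idea can be written with `Series.explode` (available since pandas 0.25), which avoids building the wide intermediate frame. This is only a sketch of an alternative, not a claim that it is faster. The 'plain shoes' row and the 'No Match' fallback are assumptions I added to show what happens to rows without any hit.

```python
import pandas as pd

labels = {'cheap': 'budget', 'luxury': 'expensive', 'hiking': 'sport'}
keywords = pd.DataFrame(
    {'keywords': ['cheap shoes', 'luxury shoes',
                  'cheap hiking shoes', 'plain shoes']}
)

matched = (
    keywords['keywords']
    .str.split()            # list of words per row
    .explode()              # one word per row, original index repeated
    .map(labels)            # word -> label, NaN for unknown words
    .dropna()
    .groupby(level=0)       # regroup by the original row index
    .agg(' | '.join)
)

# Rows with no matching word are absent from `matched`,
# so reindex to bring them back with a fallback value.
keywords['Labels'] = matched.reindex(keywords.index, fill_value='No Match')
print(keywords)
```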
I like the idea of using replace first and then finding the values.
keywords.assign(
    values=
    keywords.keywords.replace(labels, regex=True)
    .str.findall(f'({"|".join(labels.values())})')
    .str.join(' | ')
)
             keywords          values
0         cheap shoes          budget
1        luxury shoes       expensive
2  cheap hiking shoes  budget | sport