Use keywords from a dataframe to detect if any are present in another dataframe or string
I have two questions. The first is...
I have a dataframe that contains categories and keywords like this:
   Category  Keywords
0  Fruit     ['apple', 'pear', 'plum', 'grape']
1  Color     ['red', 'purple', 'green']
And another dataframe like this:
Summary
0 This is a basket of red apples. They are sour.
1 We found a bushel of fruit. They are red.
2 There is a peck of pears that taste sweet.
3 We have a box of plums.
I want a final result like this:
   Category      Summary
0  Fruit, Color  This is a basket of red apples. They are sour.
1  Color         We found a bushel of fruit. They are red.
2  Fruit, Color  There is a peck of green pears that taste sweet.
3  Fruit         We have a box of plums.
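The second is...
I should be able to check whether a string contains any of the keywords and, if it does, output a list of the matching categories.
Example: sample_sentence = "This line contains a red plum?"
Output: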
result_list = ['color','Fruit']
Edit: It is somewhat similar but not the same. Use this for reference:
Edit 2:
I also have another version of the first dataframe, which looks like this:
   Category  Filters
0  Fruit     apple, pear, plum, grape
1  Color     red, purple, green
You can use a list comprehension to achieve this:
DataFrame setup:
import pandas as pd

df1 = pd.DataFrame({'Category': {0: 'Fruit', 1: 'Color'},
                    'Keywords': {0: 'apple,pear,plum,grape', 1: 'red,purple,green'}})
df2 = pd.DataFrame({'Summary': {0: 'This is a basket of red apples. They are sour.',
                                1: 'We found a bushel of fruit. They are red.',
                                2: 'There is a peck of pears that taste sweet.',
                                3: 'We have a box of plums.'}})
# Turn each comma-separated string into a list of keywords
df1['Keywords'] = df1['Keywords'].str.split(',')
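If the keywords come from the "Filters" version in Edit 2, where a space follows each comma, splitting on ',' alone leaves a leading space on every keyword after the first (' pear', ' plum'), and those would no longer match any word. A minimal sketch, assuming a hypothetical frame named df1_filters built from that example, strips the whitespace after splitting:

```python
import pandas as pd

# Hypothetical frame mirroring the "Filters" version from the question,
# where each keyword is followed by a comma and a space.
df1_filters = pd.DataFrame({'Category': ['Fruit', 'Color'],
                            'Filters': ['apple, pear, plum, grape',
                                        'red, purple, green']})

# Split on commas, then strip surrounding whitespace from every keyword.
df1_filters['Keywords'] = (df1_filters['Filters']
                           .str.split(',')
                           .apply(lambda kws: [k.strip() for k in kws]))

print(df1_filters['Keywords'].tolist())
```

The resulting 'Keywords' column holds clean lists, so it can be used in place of df1['Keywords'] in the matching code.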
Code:
# set() removes duplicate categories; the order of the joined categories is not guaranteed.
df2['Category'] = (df2['Summary'].str.split(' ').apply(
    lambda x: list(set([str(a)
                        for y in x
                        for a, b in zip(df1['Category'], df1['Keywords'])
                        for c in b
                        # Or you can use "if str(c) == str(y)" for exact matches, or
                        # "if str(c).lower() == str(y).lower()" to ignore case.
                        if str(c) in str(y)]))).str.join(', '))
df2
Output:
Out[1]:
   Summary                                         Category
0  This is a basket of red apples. They are sour.  Fruit, Color
1  We found a bushel of fruit. They are red.       Color
2  There is a peck of pears that taste sweet.      Fruit
3  We have a box of plums.                         Fruit
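a, b and x iterate over the rows (vertically). c and y iterate over the lists within each row (horizontally). To start iterating horizontally over the lists, you first need to iterate vertically over the rows. That is why we have all these variables (see the figure). You can use zip to iterate over two or more columns of the first dataframe at the same time.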
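For the second part of the question (checking a single string), the same iteration can be written as explicit loops. This is only a sketch: find_categories is a hypothetical helper name, it reuses the variable names a, b, c and y from the explanation above, and it assumes df1['Keywords'] already holds lists. Unlike the set-based version, it keeps categories in the order they are first found and ignores case.

```python
import pandas as pd

df1 = pd.DataFrame({'Category': ['Fruit', 'Color'],
                    'Keywords': [['apple', 'pear', 'plum', 'grape'],
                                 ['red', 'purple', 'green']]})

def find_categories(text, categories=df1):
    """Return the categories whose keywords occur in `text` (hypothetical helper)."""
    found = []
    for y in text.split(' '):  # y: one word of the sentence
        # a, b: category name and keyword list of one row of df1
        for a, b in zip(categories['Category'], categories['Keywords']):
            for c in b:  # c: one keyword of that category
                # Case-insensitive substring test, so 'plum' matches 'plum?'
                if c.lower() in y.lower() and a not in found:
                    found.append(a)
    return found

sample_sentence = "This line contains a red plum?"
result_list = find_categories(sample_sentence)
print(result_list)  # ['Color', 'Fruit']
```

The same function can fill the dataframe column with df2['Summary'].apply(find_categories).str.join(', '). Note that substring matching also matches keywords inside longer words (for example 'red' inside 'hundred'), so an exact comparison on cleaned words may be preferable for real data.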