See which words from a list are in each item using str.contains
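I am trying to extract the words that were found in a str.contains() search, as shown in the image below (but using pandas and str.contains, not VBA). I am trying to recreate the output in the VBA result column.
I use this to simply tell me whether the words are found in each review: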
searchfor = list(terms['term'])
found = [reviews['review_trimmed'].str.contains(x) for x in searchfor]
result = pd.DataFrame(found)
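This is great because I know which reviews contain the words I am looking for, but I don't know which words it found for each review. I would like the answer to use str.contains for consistency.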
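One way to stay with str.contains, as the question asks, is to keep one boolean column per term and then read off the matching column names row by row. This is only a minimal sketch: the `reviews` and `terms` frames below are small hypothetical stand-ins for the ones in the question, and `case=False` plus `re.escape` are my assumptions about the desired matching.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the `reviews` and `terms` frames in the question.
reviews = pd.DataFrame({"review_trimmed": ["dog and cat", "noodle soup", "chilli"]})
terms = pd.DataFrame({"term": ["cat", "dog", "soup"]})

searchfor = list(terms["term"])

# One boolean column per term: True where the review contains that term.
# re.escape keeps terms with regex metacharacters literal (an assumption).
hits = pd.DataFrame({
    term: reviews["review_trimmed"].str.contains(re.escape(term), case=False)
    for term in searchfor
})

# For each review (row), keep the names of the columns that are True.
reviews["found"] = hits.apply(
    lambda row: ";".join(term for term, hit in row.items() if hit), axis=1
)
print(reviews)
```

The terms come out in the order of `searchfor`, not in the order they appear in the review, and this does one pass over the column per term, so it may be slow for long term lists.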
Using numpy:
searchfor=[wrd.lower() for wrd in searchfor]
searchfor=set(searchfor)
df["found"]=np.bitwise_and(df["review_trimmed"].str.lower().str.split(r"[^\w+]").map(set), searchfor)
To show the output, using dummy data:
import pandas as pd
import numpy as np
df=pd.DataFrame({"review_trimmed": ["dog and cat", "Cat chases mouse", "horrible thing", "noodle soup", "chilli", "pizza is Good"]})
searchfor="yes cat Dog soup good bad horrible".split(" ")
searchfor=[wrd.lower() for wrd in searchfor]
searchfor=set(searchfor)
df["found"]=np.bitwise_and(df["review_trimmed"].str.lower().str.split(r"[^\w+]").map(set), searchfor)
print(searchfor)
print(df)
Output:
#searchfor:
{'cat', 'good', 'yes', 'dog', 'bad', 'horrible', 'soup'}
#df:
     review_trimmed       found
0       dog and cat  {cat, dog}
1  Cat chases mouse       {cat}
2    horrible thing  {horrible}
3       noodle soup      {soup}
4            chilli          {}
5     pizza is Good      {good}
Edit
IIUC, just add .str.join(";")
searchfor=[wrd.lower() for wrd in searchfor]
searchfor=set(searchfor)
df["found"]=np.bitwise_and(df["review_trimmed"].str.lower().str.split(r"[^\w+]").map(set), searchfor).str.join(";")
print(searchfor)
print(df)
Output:
{'dog', 'soup', 'cat', 'bad', 'good', 'yes', 'horrible'}
     review_trimmed     found
0       dog and cat   dog;cat
1  Cat chases mouse       cat
2    horrible thing  horrible
3       noodle soup      soup
4            chilli
5     pizza is Good      good
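For what it's worth, the same set-intersection idea can be written without numpy, using Python's own `&` on sets inside a small helper. The sketch below is not the code above but a variant of it: `found_terms` is a hypothetical name, it splits on `\W+` rather than `[^\w+]`, and it sorts the matches so the joined string does not depend on set iteration order.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "review_trimmed": ["dog and cat", "Cat chases mouse", "chilli", "pizza is Good"]
})
# Lower-case the search terms once and keep them in a set for fast lookup.
searchfor = {w.lower() for w in "yes cat Dog soup good bad horrible".split()}

def found_terms(text):
    # Hypothetical helper: split on runs of non-word characters,
    # then intersect the review's words with the search set.
    words = set(re.split(r"\W+", text.lower()))
    # Sort so the result is deterministic regardless of set ordering.
    return ";".join(sorted(words & searchfor))

df["found"] = df["review_trimmed"].map(found_terms)
print(df)
```

Because it compares whole words, "cat" will not match inside a longer word such as "category".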
Using Grzegorz Skibinski's setup
df = pd.DataFrame({
"review_trimmed": [
"dog and cat",
"Cat chases mouse",
"horrible thing",
"noodle soup",
"chilli",
"pizza is Good"
]
})
searchfor = "yes cat Dog soup good bad horrible".split()
df
     review_trimmed
0       dog and cat
1  Cat chases mouse
2    horrible thing
3       noodle soup
4            chilli
5     pizza is Good
_______________________________________________________
Solution (pandas.Series.str.findall)
- Use '|'.join to combine all the search terms into a single regex string that matches any one of them.
- Pass 2 as the flags argument, which means re.IGNORECASE.
df.review_trimmed.str.findall('|'.join(searchfor), 2)
0    [dog, cat]
1         [Cat]
2    [horrible]
3        [soup]
4            []
5        [Good]
Name: review_trimmed, dtype: object
We can join them with ';' like so:
df.review_trimmed.str.findall('|'.join(searchfor), 2).str.join(';')
0     dog;cat
1         Cat
2    horrible
3        soup
4
5        Good
Name: review_trimmed, dtype: object
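One caveat with a plain '|'.join pattern is that it matches substrings, so "cat" would also be found inside "Category". If whole-word matches are wanted, a possible variant (an assumption on my part, not something the question asks for) is to wrap the alternatives in `\b` word boundaries and escape each term with `re.escape`:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "review_trimmed": ["dog and cat", "Category five", "pizza is Good"]
})
searchfor = "yes cat Dog soup good bad horrible".split()

# \b anchors the alternatives at word boundaries; the non-capturing group
# (?:...) makes findall return the whole match; re.escape keeps terms literal.
pattern = r"\b(?:" + "|".join(map(re.escape, searchfor)) + r")\b"

# re.IGNORECASE is the named form of the integer flag 2.
found = df["review_trimmed"].str.findall(pattern, flags=re.IGNORECASE).str.join(";")
print(found)
```

As with the plain pattern, the matches keep the casing they have in the review ("Good", not "good").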
I tried it with a for loop:
import pandas as pd
words_to_look=['Yes','No']
sentences=['He knows Yes No Yes','No He dont know','He Know' ]
df=pd.DataFrame(sentences,columns=['Comments_to_look'])
string=""
final_list=[]
for item in df['Comments_to_look']:
    # Unique words in this comment
    items=set(item.split())
    for item2 in items:
        for item3 in words_to_look:
            if item2==item3:
                # Prepend the matched word to this comment's result
                string=item3+" "+string
                break
    final_list.append(string)
    # Reset for the next comment
    string=""
df['words occured']=final_list
print(df)
Output
      Comments_to_look words occured
0  He knows Yes No Yes        Yes No
1      No He dont know            No
2              He Know
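Since iterating over a set gives no guaranteed order, the order of the words in each row above can vary between runs. A shorter variant of the same loop, sketched here with a hypothetical helper `words_found` and a hypothetical column name, walks the search list instead so the order is stable:

```python
import pandas as pd

words_to_look = ["Yes", "No"]
df = pd.DataFrame({
    "Comments_to_look": ["He knows Yes No Yes", "No He dont know", "He Know"]
})

def words_found(comment):
    # Unique whitespace-separated tokens in this comment.
    tokens = set(comment.split())
    # Iterate over the search list, not the set, so the order is stable.
    return " ".join(w for w in words_to_look if w in tokens)

df["words_found"] = df["Comments_to_look"].map(words_found)
print(df)
```

Like the loop, this is case-sensitive and only matches whole whitespace-separated tokens.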