See which words from a list are in each item using str.contains

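I am trying to extract the words that a str.contains() search finds, as shown in the image below (but using pandas and str.contains rather than VBA). I want to recreate the output in the VBA result column.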

I use this to tell me whether each word is found in each review:

searchfor = list(terms['term'])
found = [reviews['review_trimmed'].str.contains(x) for x in searchfor]
result = pd.DataFrame(found)

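This is great because I know which reviews contain the words I am looking for, but I don't know which words were found in each review. I would like the answer to use str.contains for consistency.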

Using numpy:

searchfor=[wrd.lower() for wrd in searchfor]
searchfor=set(searchfor)
df["found"]=np.bitwise_and(df["review_trimmed"].str.lower().str.split(r"[^\w+]").map(set), searchfor)

To show the output, using dummy data:

import pandas as pd
import numpy as np
df=pd.DataFrame({"review_trimmed": ["dog and cat", "Cat chases mouse", "horrible thing", "noodle soup", "chilli", "pizza is Good"]})

searchfor="yes cat Dog soup good bad horrible".split(" ")

searchfor=[wrd.lower() for wrd in searchfor]
searchfor=set(searchfor)
df["found"]=np.bitwise_and(df["review_trimmed"].str.lower().str.split(r"[^\w+]").map(set), searchfor)
print(searchfor)
print(df)

Output:

#searchfor:
{'cat', 'good', 'yes', 'dog', 'bad', 'horrible', 'soup'}

#df:
     review_trimmed       found
0       dog and cat  {cat, dog}
1  Cat chases mouse       {cat}
2    horrible thing  {horrible}
3       noodle soup      {soup}
4            chilli          {}
5     pizza is Good      {good}
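The same intersection can also be written with plain Python sets, which does not depend on how NumPy applies bitwise_and to a column of set objects. This is only a sketch on the dummy data above; the helper name words_found is made up for illustration and is not part of pandas or of the answer.

```python
import re
import pandas as pd

df = pd.DataFrame({"review_trimmed": ["dog and cat", "Cat chases mouse", "horrible thing",
                                      "noodle soup", "chilli", "pizza is Good"]})

# Lower-cased search terms as a set, as in the answer above.
searchfor = {w.lower() for w in "yes cat Dog soup good bad horrible".split()}

def words_found(text, terms=searchfor):
    # Hypothetical helper: split on runs of non-word characters,
    # then keep only the tokens that are also search terms.
    tokens = set(re.split(r"\W+", text.lower()))
    return tokens & terms

df["found"] = df["review_trimmed"].map(words_found)
print(df)
```

Like the numpy version, this matches whole words only and discards both the original casing and the order in which the words appear.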

Edit

IIUC, just add .str.join(";")

searchfor=[wrd.lower() for wrd in searchfor]
searchfor=set(searchfor)
df["found"]=np.bitwise_and(df["review_trimmed"].str.lower().str.split(r"[^\w+]").map(set), searchfor).str.join(";")
print(searchfor)
print(df)

Output:

{'dog', 'soup', 'cat', 'bad', 'good', 'yes', 'horrible'}
     review_trimmed     found
0       dog and cat   dog;cat
1  Cat chases mouse       cat
2    horrible thing  horrible
3       noodle soup      soup
4            chilli
5     pizza is Good      good

Using Grzegorz Skibinski's setup

df = pd.DataFrame({
    "review_trimmed": [
        "dog and cat",
        "Cat chases mouse",
        "horrible thing",
        "noodle soup",
        "chilli",
        "pizza is Good"
    ]
})

searchfor = "yes cat Dog soup good bad horrible".split()

df

     review_trimmed
0       dog and cat
1  Cat chases mouse
2    horrible thing
3       noodle soup
4            chilli
5     pizza is Good

_______________________________________________________

Solution (pandas.Series.str.findall)

  • Use '|'.join to combine all the search terms into a single regex string that matches any one of them.
  • Pass flags=2, which is the integer value of re.IGNORECASE.

df.review_trimmed.str.findall('|'.join(searchfor), 2)

0    [dog, cat]
1         [Cat]
2    [horrible]
3        [soup]
4            []
5        [Good]
Name: review_trimmed, dtype: object

We can join them with ';' like this:

df.review_trimmed.str.findall('|'.join(searchfor), 2).str.join(';')

0     dog;cat
1         Cat
2    horrible
3        soup
4            
5        Good
Name: review_trimmed, dtype: object
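One caveat with a bare '|'.join pattern is that it also matches inside longer words (for example "dog" in "hotdog"), and any regex metacharacters in a term are interpreted as regex syntax. A possible variant, sketched here on a slightly different dummy frame chosen to show the difference, escapes each term with re.escape and wraps the alternation in \b word boundaries. It assumes the terms consist of word characters, since \b behaves differently next to punctuation.

```python
import re
import pandas as pd

df = pd.DataFrame({"review_trimmed": ["dog and cat", "Cat chases mouse", "hotdog stand"]})
searchfor = "yes cat Dog soup good bad horrible".split()

# Escape each term, then require a word boundary on both sides of the match.
pattern = r"\b(?:" + "|".join(map(re.escape, searchfor)) + r")\b"

found = df["review_trimmed"].str.findall(pattern, flags=re.IGNORECASE)
joined = found.str.join(";")
print(joined)
```

Using re.IGNORECASE by name instead of the literal 2 keeps the intent readable; the two are equivalent.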

I tried it with a for loop:

import pandas as pd

words_to_look=['Yes','No']
sentences=['He knows Yes No Yes','No He dont know','He Know' ]

df=pd.DataFrame(sentences,columns=['Comments_to_look'])

string=""
final_list=[]

for item in df['Comments_to_look']:
    items=set(item.split())
    for item2 in items:
        for item3 in words_to_look:
            if item2==item3:
                string=item3+" "+string
                break
    final_list.append(string)
    string=""

df['words occured']=final_list
print(df)

Output

    Comments_to_look      words occured
0   He knows Yes No Yes   Yes No
1   No He dont know       No
2   He Know
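Since the question asks to stay with str.contains, the per-term boolean masks it already builds can be turned into a found-words column as well. The following is a rough sketch, not a definitive solution: the frame and term list are dummy data, and the names hits and found are chosen only for this example.

```python
import re
import pandas as pd

reviews = pd.DataFrame({"review_trimmed": ["dog and cat", "Cat chases mouse", "chilli"]})
searchfor = ["cat", "dog", "soup"]

# One boolean column per term: True where str.contains finds that term.
hits = pd.DataFrame({
    term: reviews["review_trimmed"].str.contains(re.escape(term), case=False)
    for term in searchfor
})

# For each row, join the names of the columns whose mask is True.
reviews["found"] = hits.apply(
    lambda row: ";".join(term for term, hit in row.items() if hit), axis=1
)
print(reviews)
```

Note that str.contains matches substrings, so a term such as "cat" would also be reported for a review containing "category", and the words come back in the order of searchfor rather than in the order they appear in the review.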