Filter out a dataframe's rows when a column contains any of a list of words or phrases, even if the match is not exact

I know that to filter out the rows of a dataframe whose value appears in a list, you can use the following:

df[~df['Job Name'].isin(remover_rows)]

The problem is that in my case the column values can contain multiple words, like this:

 'Office Administrator',
 'Office Administrator',
 'Office Administrator',
 'Finance and Office Administrator',
 'Office/Accounts Administrator',
 'Office Administrator',
 'Temporary Office Administrator',
 'Accounts and Office Administrator',
 'Office Administrator',
 'Office Administrator',
 'Office Administrator',
 'Office Administrator',
 'Office Admin - Customer Support',
 'Office Administrator',
 'Office Administrator - London - Spanish Speaking'

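The solution above only works when the exact value is listed among the values to exclude. For example, with the list above, if 'Office Administrator' is in the remover_rows variable, the rows whose Job Name is exactly 'Office Administrator' are not shown. But what if I want to remove the rows whose Job Name column contains any of ['Finance', 'Spanish'], even when it also contains other words? For example, I want the rows whose Job Name is 'Finance and Office Administrator' or 'Office Administrator - London - Spanish Speaking' not to be shown.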

You can use Series.str.contains with a regular expression:

import re

words = ["Finance", "Spanish"]

x = df["Job Name"].str.contains(
    "|".join(map(re.escape, words)), flags=re.IGNORECASE
)
print(df[~x])

Prints:

                             Job Name
0                Office Administrator
1                Office Administrator
2                Office Administrator
4       Office/Accounts Administrator
5                Office Administrator
6      Temporary Office Administrator
7   Accounts and Office Administrator
8                Office Administrator
9                Office Administrator
10               Office Administrator
11               Office Administrator
12    Office Admin - Customer Support
13               Office Administrator
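If the column can hold missing values, str.contains returns NaN for them and the `~` negation may then fail or behave unexpectedly. The following is a minimal, self-contained sketch of the same regex approach with `na=False`; the sample dataframe and the names `pattern`, `mask`, and `filtered` are assumptions made for illustration, not part of the original data.

```python
import re

import pandas as pd

# Hypothetical sample data; the None entry stands in for a missing job name.
df = pd.DataFrame(
    {
        "Job Name": [
            "Office Administrator",
            "Finance and Office Administrator",
            None,
            "Office Administrator - London - Spanish Speaking",
        ]
    }
)

words = ["Finance", "Spanish"]

# re.escape keeps characters such as "/" or "+" in a word from being
# interpreted as regex syntax; "|" joins the words into one alternation.
pattern = "|".join(map(re.escape, words))

# na=False treats missing values as "no match", so the mask is purely boolean.
mask = df["Job Name"].str.contains(pattern, flags=re.IGNORECASE, na=False)

# Keep only the rows that do not match any of the words.
filtered = df[~mask]
print(filtered)
```

With `na=False`, rows with a missing Job Name are kept; pass `na=True` instead if those rows should be dropped along with the matches.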

Or without re (note that this version is case-sensitive, unlike the one above):

words = ["Finance", "Spanish"]

# True for every row whose Job Name contains at least one of the words
x = df["Job Name"].apply(lambda name: any(w in name for w in words))
print(df[~x])
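The plain `in` test is case-sensitive and raises a TypeError on non-string values such as NaN. If that matters for your data, one possible variant is sketched below; the helper `contains_any`, the sample dataframe, and the names `lowered`, `mask`, and `kept` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical sample data with mixed casing and a missing value.
df = pd.DataFrame(
    {
        "Job Name": [
            "Office Administrator",
            "FINANCE and Office Administrator",
            "Spanish Speaking Office Admin",
            None,
        ]
    }
)

words = ["Finance", "Spanish"]

# Normalise the words once so every comparison is case-insensitive.
lowered = [w.casefold() for w in words]


def contains_any(name):
    """Return True if name is a string containing any of the words."""
    # Non-string values (None, NaN) are treated as not matching.
    if not isinstance(name, str):
        return False
    folded = name.casefold()
    return any(w in folded for w in lowered)


mask = df["Job Name"].apply(contains_any)

# Keep only the rows that do not contain any of the words.
kept = df[~mask]
print(kept)
```

On large dataframes the vectorised str.contains version is usually the faster choice; the apply version is mainly useful when the matching rule is easier to express in plain Python.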