Filter out a dataframe's rows when a column contains any word or phrase from a list, even if the match is not exact
I know that to drop the rows whose value exactly matches an entry in a list, you can use the following approach:
df[~df['Job Name'].isin(remover_rows)]
The problem is that in my case the column values can contain several words, like this:
'Office Administrator',
'Office Administrator',
'Office Administrator',
'Finance and Office Administrator',
'Office/Accounts Administrator',
'Office Administrator',
'Temporary Office Administrator',
'Accounts and Office Administrator',
'Office Administrator',
'Office Administrator',
'Office Administrator',
'Office Administrator',
'Office Admin - Customer Support',
'Office Administrator',
'Office Administrator - London - Spanish Speaking'
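The approach above only works when the exact value appears in the list of words to exclude. For example, with the list above, if 'Office Administrator' is in the remover_rows variable, those rows will not be shown.
But what if I want to remove the rows whose Job Name column contains any of ['Finance', 'Spanish'], even when the value also contains other words? For example, I want the rows where Job Name is 'Finance and Office Administrator' or 'Office Administrator - London - Spanish Speaking' to be dropped.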
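To make the problem concrete, here is a minimal sketch of why isin does not help here. The three-row dataframe is a hypothetical sample built from a few of the titles listed above, not the full data.

```python
import pandas as pd

# Hypothetical sample built from a few of the job titles listed above.
df = pd.DataFrame(
    {
        "Job Name": [
            "Office Administrator",
            "Finance and Office Administrator",
            "Office Administrator - London - Spanish Speaking",
        ]
    }
)

remover_rows = ["Finance", "Spanish"]

# isin compares whole cell values, so a word inside a longer title never matches.
exact_mask = df["Job Name"].isin(remover_rows)

print(exact_mask.any())  # no cell is exactly "Finance" or "Spanish"
print(df[~exact_mask])   # so all three rows are still present
```

Because no cell equals 'Finance' or 'Spanish' exactly, the mask is False everywhere and nothing is filtered.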
You can use Series.str.contains with a regular expression:
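import re

words = ["Finance", "Spanish"]
# Join the words into one alternation pattern; re.escape keeps any
# special characters in the words literal.
x = df["Job Name"].str.contains(
    "|".join(map(re.escape, words)), flags=re.IGNORECASE
)
# Keep only the rows that match none of the words.
print(df[~x])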
Prints:
Job Name
0 Office Administrator
1 Office Administrator
2 Office Administrator
4 Office/Accounts Administrator
5 Office Administrator
6 Temporary Office Administrator
7 Accounts and Office Administrator
8 Office Administrator
9 Office Administrator
10 Office Administrator
11 Office Administrator
12 Office Admin - Customer Support
13 Office Administrator
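If the column can contain missing values, or if you only want whole-word matches, the same idea can be extended. The following is a sketch under assumptions: the sample dataframe (including the None row and the 'Financed' title) is made up for illustration, and whole-word matching is a variation, not something the question asked for.

```python
import re

import pandas as pd

# Hypothetical sample; the None row stands in for a missing job title.
df = pd.DataFrame(
    {
        "Job Name": [
            "Office Administrator",
            "Finance and Office Administrator",
            "Financed Office Administrator",
            None,
            "Office Administrator - London - Spanish Speaking",
        ]
    }
)

words = ["Finance", "Spanish"]

# \b anchors restrict matches to whole words, so "Financed" is not matched.
# The non-capturing group (?:...) keeps the alternation inside the anchors.
pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"

# na=False treats missing values as "no match", so the mask stays boolean.
mask = df["Job Name"].str.contains(pattern, flags=re.IGNORECASE, na=False)

kept = df[~mask]
print(kept)
```

Without na=False, a missing value can produce a missing entry in the mask, which then cannot be negated cleanly with ~.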
Or without re:
words = ["Finance", "Spanish"]
# Plain substring test per row; unlike the regex version, this is case-sensitive.
x = df["Job Name"].apply(lambda name: any(w in name for w in words))
print(df[~x])
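Note that the plain substring version is case-sensitive and raises a TypeError on missing values, because `in` cannot be applied to NaN. A possible variation that handles both is sketched below; the helper name contains_any and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical sample with mixed case and a missing value.
df = pd.DataFrame(
    {
        "Job Name": [
            "Office Administrator",
            "FINANCE Assistant",
            None,
            "Spanish Tutor",
        ]
    }
)

words = ["Finance", "Spanish"]
lowered = [w.lower() for w in words]


def contains_any(name):
    # Non-string cells (such as None or NaN) count as "no match".
    return isinstance(name, str) and any(w in name.lower() for w in lowered)


# astype(bool) guarantees a boolean mask that is safe to negate with ~.
mask = df["Job Name"].apply(contains_any).astype(bool)

kept = df[~mask]
print(kept)
```

Lowercasing both sides gives the same case-insensitive behaviour as flags=re.IGNORECASE for ordinary ASCII words.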