pandas find strings in common among Series

Question

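I have extracted a Series of keywords from a larger DataFrame, and I have a DataFrame with a column of strings. I would like to mask the DataFrame to find which strings contain at least one keyword. The "Keywords" Series is as follows (sorry for the weird words):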

Skilful
Wilful
Somewhere
Thing
Strange

The DataFrame looks like this:

User_ID;Tweet
01;hi all
02;see you somewhere
03;So weird
04;hi all :-)
05;next big thing
06;how can i say no?
07;so strange
08;not at all

So far I have used the str.contains() function from pandas, like this:

mask = df['Tweet'].str.contains(str(Keywords['Keyword'][4]), case=False)

This works fine for finding the string "Strange" in the DataFrame, and returns:

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
Name: Tweet, dtype: bool

What I would like to do is mask the whole DataFrame using all of the keywords, so that I get something like this:

0    False
1     True
2    False
3    False
4     True
5    False
6     True
7    False
Name: Tweet, dtype: bool

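Is this possible without looping over the array? In my real case I have to search millions of strings, so I am looking for a fast method.

Thank you for your help.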

Answer 1

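import re
# Escape the phrases and ignore case (the keywords are capitalised, the tweets are not).
pattern = '.*({0}).*'.format('|'.join(map(re.escape, phrases)))
df['Tweet'].str.match(pattern, case=False)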

where phrases is an iterable of the phrases whose presence you want to test for.
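As a minimal sketch of the same idea on the sample data from the question, the alternation can also be passed to str.contains, which searches anywhere in the string and so does not need the surrounding `.*`. The variable names `keywords`, `pattern` and `mask` are assumptions made for this illustration, and the DataFrame is rebuilt inline so the snippet runs on its own.

```python
import re

import pandas as pd

# Sample data copied from the question.
keywords = pd.Series(["Skilful", "Wilful", "Somewhere", "Thing", "Strange"])
df = pd.DataFrame({
    "User_ID": range(1, 9),
    "Tweet": [
        "hi all", "see you somewhere", "So weird", "hi all :-)",
        "next big thing", "how can i say no?", "so strange", "not at all",
    ],
})

# Build one alternation such as "Skilful|Wilful|...", escaping each keyword
# so that characters like "." or "?" are matched literally.
pattern = "|".join(re.escape(k) for k in keywords)

# A single vectorised call; case=False makes the search case-insensitive.
mask = df["Tweet"].str.contains(pattern, case=False, regex=True)
print(mask)
```

With this data, rows 1, 4 and 6 come out True, which matches the mask requested in the question.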

Answer 2

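A simple apply can solve this. If you can live with a few seconds of processing, I think this is the simplest approach available to you without venturing outside pandas.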

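import pandas as pd

df = pd.read_csv("dict.csv", delimiter=";")
ref = pd.read_csv("ref.csv")

# Lower-case the keywords once and keep them in a set for fast lookups.
kw = {k.lower() for k in ref["Keywords"]}
print(kw)

# True if any whitespace-separated word of the tweet is a keyword.
boom = lambda x: any(word in kw for word in x.lower().split())

# Note: this overwrites the Tweet column with the boolean mask.
df["Tweet"] = df["Tweet"].apply(boom)
print(df)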

I tested it against 10,165,760 rows of made-up data and it finished in 18.9 seconds. If that is not fast enough, a better approach is needed.

set(['somewhere', 'thing', 'strange', 'skilful', 'wilful'])
          User_ID  Tweet
0               1  False
1               2   True
2               3  False
3               4  False
4               5   True
5               6  False
6               7   True
7               8  False
8               1  False
9               2   True
10              3  False
11              4  False
12              5   True
13              6  False
14              7   True
15              8  False
16              1  False
17              2   True
18              3  False
19              4  False
20              5   True
21              6  False
22              7   True
23              8  False
24              1  False
25              2   True
26              3  False
27              4  False
28              5   True
29              6  False
...           ...    ...
10165730        3  False
10165731        4  False
10165732        5   True
10165733        6  False
10165734        7   True
10165735        8  False
10165736        1  False
10165737        2   True
10165738        3  False
10165739        4  False
10165740        5   True
10165741        6  False
10165742        7   True
10165743        8  False
10165744        1  False
10165745        2   True
10165746        3  False
10165747        4  False
10165748        5   True
10165749        6  False
10165750        7   True
10165751        8  False
10165752        1  False
10165753        2   True
10165754        3  False
10165755        4  False
10165756        5   True
10165757        6  False
10165758        7   True
10165759        8  False

[10165760 rows x 2 columns]
[Finished in 18.9s]

Hope it helps.
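One thing worth keeping in mind with the apply approach is that it compares whole whitespace-separated words, not substrings. The sketch below illustrates this on a few made-up tweets; the sample strings and the helper name `has_keyword` are assumptions for illustration only and are not part of the answer's test data.

```python
import pandas as pd

keywords = pd.Series(["Skilful", "Wilful", "Somewhere", "Thing", "Strange"])

# Made-up tweets chosen to show whole-word behaviour.
tweets = pd.Series(["next big thing", "something else", "So STRANGE !"])

# Lower-cased keyword set, built once.
kw = {k.lower() for k in keywords}


def has_keyword(text):
    """Return True if any whitespace-separated word of text is a keyword."""
    # isdisjoint is True when the two collections share no element.
    return not kw.isdisjoint(text.lower().split())


mask = tweets.apply(has_keyword)
print(mask.tolist())
```

Here "something else" is not flagged, because "something" is not itself a keyword even though it contains "thing". A regex or str.contains search would flag it. A word with punctuation attached, such as "thing!", would also be missed by a plain split().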

Answer 3

Another way to achieve this is to combine pd.Series.isin() with map and apply. With your sample it would look like this:

df    # DataFrame

   User_ID              Tweet
0        1             hi all
1        2  see you somewhere
2        3           So weird
3        4         hi all :-)
4        5     next big thing
5        6  how can i say no?
6        7         so strange
7        8         not at all

w    # Series

0      Skilful
1       Wilful
2    Somewhere
3        Thing
4      Strange
dtype: object

# list (list() is needed on Python 3, where map returns a lazy iterator)
masked = list(map(lambda x: any(w.apply(str.lower).isin(x)),
                  df['Tweet'].apply(str.lower).apply(str.split)))

df['Tweet_masked'] = masked

Result:

df
Out[13]: 
   User_ID              Tweet Tweet_masked
0        1             hi all        False
1        2  see you somewhere         True
2        3           So weird        False
3        4         hi all :-)        False
4        5     next big thing         True
5        6  how can i say no?        False
6        7         so strange         True
7        8         not at all        False

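As a side note, isin only matches when the whole string equals one of the values. If what you want is str.contains-style substring matching, here is a variant: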

masked = list(map(lambda x: any(k in x for k in w.apply(str.lower)),
                  df['Tweet'].apply(str.lower)))

Update: as @Alex pointed out, combining map with a regexp is more efficient. In fact I am not very fond of map + lambda either, so here we go:

import re

r = re.compile(r'.*({}).*'.format('|'.join(w.values)), re.IGNORECASE)

# list() materialises the lazy map object that Python 3 returns.
masked = list(map(bool, map(r.match, df['Tweet'])))
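For single-line text, wrapping the alternation in `.*( ... ).*` and calling match should give the same result as calling search on the bare alternation. The following is a sketch under that assumption, using the sample data from the question. The names `alternation`, `wrapped`, `bare`, `via_match` and `via_search` are introduced here for illustration, and re.escape is added in case a keyword contains regex metacharacters.

```python
import re

import pandas as pd

# Sample data copied from the question.
w = pd.Series(["Skilful", "Wilful", "Somewhere", "Thing", "Strange"])
df = pd.DataFrame({
    "User_ID": range(1, 9),
    "Tweet": [
        "hi all", "see you somewhere", "So weird", "hi all :-)",
        "next big thing", "how can i say no?", "so strange", "not at all",
    ],
})

# Escape every keyword, then join them into one alternation.
alternation = "|".join(re.escape(k) for k in w)

# Pattern anchored at the start via match, padded with ".*" on both sides.
wrapped = re.compile(r".*({}).*".format(alternation), re.IGNORECASE)
# Bare alternation, to be used with search (scans the whole string).
bare = re.compile(alternation, re.IGNORECASE)

via_match = [bool(wrapped.match(t)) for t in df["Tweet"]]
via_search = [bool(bare.search(t)) for t in df["Tweet"]]

# Plain lists can be assigned directly as a new column.
df["Tweet_masked"] = via_search
print(df)
```

The two lists agree on this data. They can differ if a tweet contains a newline before the keyword, since `.` does not match newlines by default.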
