How to search for text across multiple rows in a pandas dataframe?

Question

I'm new to Python, and I'm wondering whether I can use it to search for text across multiple rows. Here's a screenshot of my dataframe:

https://i.stack.imgur.com/jeqpv.png

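To be clearer, what I want to do is search for phrases or expressions made up of several words, such as 'New Jersey'. However, each word sits in its own row, so I don't know how to include multiple rows in a query. If possible, I'd also like to create a new column that flags matches with 'M' and non-matches with 'N'. All help is appreciated, go easy on me!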

Answer 1

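The idea is to join all the rows into one string so that you can search for several consecutive words.

For example, suppose we want to find the phrase "she wants to" anywhere in the dataframe: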

>>> df
   subtitle
0       She  # <- start here (1)
1     wants  #
2        to  # <- end here (1)
3      sing
4       she  # <- start here (2)
5     wants  #
6        to  # <- end here (2)
7       act
8       she  # <- start here (3)
9     wants  # 
10       to  # <- end here (3)
11    dance

import re
import pandas as pd

search = "she wants to"
text = " ".join(df["subtitle"])

# character offset where each word starts / ends in the joined text
end = df["subtitle"].str.len().cumsum() + pd.RangeIndex(len(df))
start = end.shift(fill_value=-1) + 1

# create additional columns
df["start"] = start.tolist()
df["end"] = end.tolist()
df["match"] = False

# flag the rows covered by each occurrence of the search text
# (assumes every match starts and ends exactly on a word; otherwise .index[0] raises IndexError)
for match in re.finditer(search, text, re.IGNORECASE):
    idx1 = df[df["start"] == match.start()].index[0]
    idx2 = df[df["end"] == match.end()].index[0]
    df.loc[idx1:idx2, "match"] = True

>>> df
   subtitle  start  end  match
0       She      0    3   True
1     wants      4    9   True
2        to     10   12   True
3      sing     13   17  False
4       she     18   21   True
5     wants     22   27   True
6        to     28   30   True
7       act     31   34  False
8       she     35   38   True
9     wants     39   44   True
10       to     45   47   True
11    dance     48   53  False
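
The question also asked for an 'M' / 'N' column rather than True / False. Here is a minimal sketch of one way to get there, assuming the column is named `subtitle` as in the example. `flag_phrase` and its parameters are hypothetical names used for illustration, not part of pandas. It differs from the loop above in three ways: it escapes the phrase, it only accepts matches that are not glued to other word characters, and it flags every word that overlaps a match, so a word with attached punctuation (e.g. "Jersey,") is still flagged instead of raising an error.

```python
import re
import pandas as pd


def flag_phrase(words, phrase, hit="M", miss="N"):
    """Return one label per word: `hit` if the word is part of an occurrence of `phrase`."""
    words = [str(w) for w in words]
    text = " ".join(words)

    # character offset where each word starts / ends in the joined text
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w)))
        pos += len(w) + 1  # +1 for the joining space

    # escape the phrase and require that it is not glued to other word characters
    pattern = rf"(?<!\w){re.escape(phrase)}(?!\w)"

    flags = [miss] * len(words)
    for m in re.finditer(pattern, text, re.IGNORECASE):
        for i, (s, e) in enumerate(spans):
            if s < m.end() and e > m.start():  # the word overlaps the match
                flags[i] = hit
    return flags


df = pd.DataFrame({"subtitle": ["She", "wants", "to", "sing", "she", "wants", "to", "act"]})
df["flag"] = flag_phrase(df["subtitle"], "she wants to")
print(df)
```

The inner loop checks every word against every match, which is fine for a few thousand rows; for much larger frames the offset lookup from the answer above would be the faster choice.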

Update: searching for several terms at once:

Only the definition of search changes:

# search = "she wants to"
search = ["she wants to", "if you", "I will"]
search = fr"({'|'.join(search)})"

# df = pd.DataFrame({'subtitle': ['She', 'wants', 'to', 'sing', 'she', 'wants', 'to', 'act', 'she', 'wants', 'to', 'dance', 'If', 'you', 'sing', 'I', 'will', 'smile', 'if', 'you', 'laugh', 'I', 'will', 'smile', 'if', 'you', 'love', 'I', 'will', 'smile']})
>>> df
   subtitle  start  end  match
0       She      0    3   True
1     wants      4    9   True
2        to     10   12   True
3      sing     13   17  False
4       she     18   21   True
5     wants     22   27   True
6        to     28   30   True
7       act     31   34  False
8       she     35   38   True
9     wants     39   44   True
10       to     45   47   True
11    dance     48   53  False
12       If     54   56   True
13      you     57   60   True
14     sing     61   65  False
15        I     66   67   True
16     will     68   72   True
17    smile     73   78  False
18       if     79   81   True
19      you     82   85   True
20    laugh     86   91  False
21        I     92   93   True
22     will     94   98   True
23    smile     99  104  False
24       if    105  107   True
25      you    108  111   True
26     love    112  116  False
27        I    117  118   True
28     will    119  123   True
29    smile    124  129  False
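
One caveat when joining terms with `|`: each term is then interpreted as a regular expression, so a term such as "a.m." would also match "axmy". A hedged sketch of a safer pattern builder follows; `build_pattern` is a hypothetical helper name, and the sample terms and text are made up for illustration. It escapes every term, tries longer terms first, and uses lookarounds so a term only matches when it is not glued to other word characters.

```python
import re


def build_pattern(terms):
    """Compile one case-insensitive pattern that matches any of the literal terms."""
    # longest first, so a long phrase wins over a shorter term that is a prefix of it
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


pattern = build_pattern(["she wants to", "if you", "I will", "a.m."])
text = "If you sing at 9 a.m. I will smile"
hits = [m.group(0) for m in pattern.finditer(text)]
print(hits)
```

The compiled pattern can be passed to `re.finditer` (or used through `pattern.finditer`) in place of the plain `search` string.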

Update 2: keeping the terms in a text file:

$ cat terms.txt
she wants to
if you
I will

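# read one term per line; skip blank lines, which would otherwise add an empty alternative
with open("terms.txt") as fh:
    search = [line.strip() for line in fh if line.strip()]
search = fr"({'|'.join(search)})"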
