How to search for text across multiple rows in a pandas dataframe?
I'm new to Python and I'm wondering whether I can use it to search for text across multiple rows. Here is a screenshot of my dataframe:
https://i.stack.imgur.com/jeqpv.png
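To be clearer, I want to search for a phrase or expression made up of several words, such as 'New Jersey'. However, each word sits in its own row, so I don't know how to make a query span multiple rows. If possible, I'd also like to create a new column that flags matches with 'M' and non-matches with 'N'. Any help that makes this easier is appreciated!
The idea is to join all the rows into one string so that you can search for several consecutive words.
For example, say we want to find the phrase "she wants to" anywhere in the dataframe: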
>>> df
subtitle
0 She # <- start here (1)
1 wants #
2 to # <- end here (1)
3 sing
4 she # <- start here (2)
5 wants #
6 to # <- end here (2)
7 act
8 she # <- start here (3)
9 wants #
10 to # <- end here (3)
11 dance
import re
import pandas as pd
search = "she wants to"
text = " ".join(df["subtitle"])
# character offsets where each row's word starts / ends in the joined text
end = df["subtitle"].apply(len).cumsum() + pd.RangeIndex(len(df))
start = end.shift(fill_value=-1) + 1
# create additional columns
df["start"] = start.tolist()
df["end"] = end.tolist()
df["match"] = False
# find all occurrences of the search text
for match in re.finditer(search, text, re.IGNORECASE):
    # rows whose word starts / ends exactly where the match starts / ends
    idx1 = df[df["start"] == match.start()].index[0]
    idx2 = df[df["end"] == match.end()].index[0]
    df.loc[idx1:idx2, "match"] = True
>>> df
subtitle start end match
0 She 0 3 True
1 wants 4 9 True
2 to 10 12 True
3 sing 13 17 False
4 she 18 21 True
5 wants 22 27 True
6 to 28 30 True
7 act 31 34 False
8 she 35 38 True
9 wants 39 44 True
10 to 45 47 True
11 dance 48 53 False
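The question also asked for an 'M' / 'N' flag column, which the boolean `match` column above does not produce directly. Here is a minimal sketch of the same join-and-offset idea wrapped in a function. The name `flag_phrase`, the `flag` column name, the use of `re.escape`, and the whitespace lookarounds are my own assumptions and not part of the answer above; they are there so a phrase only matches when it lines up with whole rows.

```python
import re

import numpy as np
import pandas as pd


def flag_phrase(df, phrase, column="subtitle", flag_column="flag"):
    """Return a copy of df with 'M' on rows that belong to a match, else 'N'."""
    words = df[column].astype(str)
    text = " ".join(words)

    # Character offsets of each row's word inside the joined text.
    lengths = words.str.len().to_numpy()
    end = np.cumsum(lengths) + np.arange(len(df))  # one space between rows
    start = end - lengths

    # re.escape treats the phrase as literal text; the lookarounds require
    # whitespace (or the text boundary) on both sides of the match.
    pattern = re.compile(rf"(?<!\S){re.escape(phrase)}(?!\S)", re.IGNORECASE)

    matched = np.zeros(len(df), dtype=bool)
    for m in pattern.finditer(text):
        # Flag rows whose word lies entirely inside the matched span.
        matched |= (start >= m.start()) & (end <= m.end())

    out = df.copy()
    out[flag_column] = np.where(matched, "M", "N")
    return out


demo = pd.DataFrame({"subtitle": ["I", "love", "New", "Jersey", "new", "jerseys"]})
flagged = flag_phrase(demo, "New Jersey")
```

Because the rows are selected by comparing offsets rather than by looking up an exact start and end position, a phrase that does not line up with row boundaries is simply left unflagged instead of raising an `IndexError`.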
Update: searching for multiple terms:
Only change:
# search = "she wants to"
search = ["she wants to", "if you", "I will"]
search = fr"({'|'.join(search)})"
# df = pd.DataFrame({'subtitle': ['She', 'wants', 'to', 'sing', 'she', 'wants', 'to', 'act', 'she', 'wants', 'to', 'dance', 'If', 'you', 'sing', 'I', 'will', 'smile', 'if', 'you', 'laugh', 'I', 'will', 'smile', 'if', 'you', 'love', 'I', 'will', 'smile']})
>>> df
subtitle start end match
0 She 0 3 True
1 wants 4 9 True
2 to 10 12 True
3 sing 13 17 False
4 she 18 21 True
5 wants 22 27 True
6 to 28 30 True
7 act 31 34 False
8 she 35 38 True
9 wants 39 44 True
10 to 45 47 True
11 dance 48 53 False
12 If 54 56 True
13 you 57 60 True
14 sing 61 65 False
15 I 66 67 True
16 will 68 72 True
17 smile 73 78 False
18 if 79 81 True
19 you 82 85 True
20 laugh 86 91 False
21 I 92 93 True
22 will 94 98 True
23 smile 99 104 False
24 if 105 107 True
25 you 108 111 True
26 love 112 116 False
27 I 117 118 True
28 will 119 123 True
29 smile 124 129 False
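One caveat when joining terms with `|`: the terms are interpreted as regular expressions, so a term containing characters such as `?` or `.` would not be matched literally. The sketch below is one possible way to build the pattern more defensively. The extra term `"what?"`, the longest-first ordering, and the whitespace lookarounds are assumptions of mine for illustration and are not part of the answer above.

```python
import re

terms = ["she wants to", "if you", "I will", "what?"]

# Escape each term so that special characters are matched literally, and try
# longer terms first so a short term cannot shadow a longer one.
escaped = [re.escape(t) for t in sorted(terms, key=len, reverse=True)]

# The lookarounds require whitespace (or the text boundary) around a match,
# so "I will" does not match inside a longer word.
pattern = re.compile(rf"(?<!\S)(?:{'|'.join(escaped)})(?!\S)", re.IGNORECASE)

text = "If you sing I will smile what? she wants to know"
spans = [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]
```

The resulting `pattern` can be passed to `re.finditer` in place of the plain `search` string, since `re.finditer` accepts a compiled pattern as well.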
Update 2: put the terms in a text file:
$ cat terms.txt
she wants to
if you
I will
# read one term per line; the with-block closes the file afterwards
with open("terms.txt") as f:
    search = [term.strip() for term in f]
search = fr"({'|'.join(search)})"
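A blank line in `terms.txt` would add an empty alternative to the pattern, and an empty alternative matches at every position in the text. The following is a small sketch of a loader that skips blank lines. The helper name `load_pattern` and the temporary file standing in for `terms.txt` are hypothetical and only used to keep the example self-contained.

```python
import re
import tempfile
from pathlib import Path


def load_pattern(path):
    """Build one case-insensitive pattern from a file with one term per line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    terms = [line.strip() for line in lines if line.strip()]  # skip blank lines
    if not terms:
        raise ValueError(f"no search terms found in {path}")
    return re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)


# Demo with a temporary file standing in for terms.txt.
with tempfile.TemporaryDirectory() as tmp:
    terms_file = Path(tmp) / "terms.txt"
    terms_file.write_text("she wants to\nif you\n\nI will\n", encoding="utf-8")
    pattern = load_pattern(terms_file)

found = pattern.findall("If you sing I will smile")
```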