Counting the occurrence of words in a dataframe column using a list of strings
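I have a list of strings and a dataframe with a text column. The text column contains several rows of text. I want to count how many times each word in the list occurs in the text column. My goal is to add two columns to the dataframe: one with the word and one with its number of occurrences. I am open to a better solution, and it would be great to learn different ways to accomplish this. Ideally, I would end up with a single dataframe.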
string_list = ['had', 'it', 'the']
Current dataframe, in code:
import pandas as pd

df = pd.DataFrame({'title': {0: 'book1', 1: 'book2', 2: 'book3', 3: 'book4', 4: 'book5'},
                   'text': {0: 'His voice had never sounded so cold',
                            1: 'When she arrived home, she noticed that the curtains were closed.',
                            2: 'He was terrified of small spaces and she knew',
                            3: "It was time. She'd fought against it for so long",
                            4: 'As he took in the view from the twentieth floor, the lights went out all over the city'},
                   'had': {0: 1, 1: 5, 2: 5, 3: 2, 4: 5},
                   'it': {0: 1, 1: 3, 2: 2, 3: 1, 4: 2},
                   'the': {0: 1, 1: 4, 2: 5, 3: 3, 4: 3}})
I am trying to get a dataframe like this:
A function that counts the matches of a given pattern:
import re

def find_match_count(word: str, pattern: str) -> int:
    return len(re.findall(pattern, word.lower()))
Then loop over each string in the list and apply this function to the 'text' column:
for col in string_list:
    df[col] = df['text'].apply(find_match_count, pattern=col)
Using the dataframe you provided (without the had, it, and the columns), this gives:
   title                                               text  had  it  the
0  book1                His voice had never sounded so cold    1   0    0
1  book2  When she arrived home, she noticed that the cu...    0   0    1
2  book3      He was terrified of small spaces and she knew    0   0    0
3  book4   It was time. She'd fought against it for so long    0   2    0
4  book5  As he took in the view from the twentieth floo...    0   1    4
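Note that these counts come from substring matching: 'it' is counted once for book5 because it occurs inside 'city'. If only whole words should count, one option is to wrap each word in \b word boundaries. The following is a minimal sketch under that assumption; count_whole_word is a hypothetical helper name, and the dataframe is rebuilt here without the had, it, and the columns so the block runs on its own.

```python
import re

import pandas as pd

string_list = ['had', 'it', 'the']

df = pd.DataFrame({
    'title': ['book1', 'book2', 'book3', 'book4', 'book5'],
    'text': [
        'His voice had never sounded so cold',
        'When she arrived home, she noticed that the curtains were closed.',
        'He was terrified of small spaces and she knew',
        "It was time. She'd fought against it for so long",
        'As he took in the view from the twentieth floor, the lights went out all over the city',
    ],
})


def count_whole_word(text: str, word: str) -> int:
    """Count case-insensitive whole-word occurrences of `word` in `text`."""
    # re.escape guards against words that contain regex metacharacters.
    pattern = rf'\b{re.escape(word)}\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# One count column per word, as in the loop above.
for col in string_list:
    df[col] = df['text'].apply(count_whole_word, word=col)

print(df[['title'] + string_list])
```

With the sample data, the only difference from the table above is book5, where 'it' drops from 1 to 0 because 'city' no longer matches.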
Define a custom regex, then use extractall, join, and melt:
regex = '|'.join(fr'(?P<{w}>\b{w}\b)' for w in string_list)
(df[['title', 'text']]
   .join(df['text'].str.extractall(regex).notna().groupby(level=0).sum())
   .fillna(0)
   .melt(id_vars=['title', 'text'], var_name='word', value_name='word count')
)
Output:
    title                                               text  word  word count
 0  book1                His voice had never sounded so cold   had         1.0
 1  book2  When she arrived home, she noticed that the cu...   had         0.0
 2  book3      He was terrified of small spaces and she knew   had         0.0
 3  book4   It was time. She'd fought against it for so long   had         0.0
 4  book5  As he took in the view from the twentieth floo...   had         0.0
 5  book1                His voice had never sounded so cold    it         0.0
 6  book2  When she arrived home, she noticed that the cu...    it         0.0
 7  book3      He was terrified of small spaces and she knew    it         0.0
 8  book4   It was time. She'd fought against it for so long    it         1.0
 9  book5  As he took in the view from the twentieth floo...    it         0.0
10  book1                His voice had never sounded so cold   the         0.0
11  book2  When she arrived home, she noticed that the cu...   the         1.0
12  book3      He was terrified of small spaces and she knew   the         0.0
13  book4   It was time. She'd fought against it for so long   the         0.0
14  book5  As he took in the view from the twentieth floo...   the         4.0
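The counts above are floats (1.0, 0.0) because the join introduces NaN for rows with no match at all (book3 here) before fillna(0) fills them in. If integer counts are preferred, a possible alternative is to build one count column per word with Series.str.count and then melt. This is a sketch rather than a drop-in replacement; the names counts and long_df are illustrative, and the matching is case-sensitive like the regex above.

```python
import re

import pandas as pd

string_list = ['had', 'it', 'the']

df = pd.DataFrame({
    'title': ['book1', 'book2', 'book3', 'book4', 'book5'],
    'text': [
        'His voice had never sounded so cold',
        'When she arrived home, she noticed that the curtains were closed.',
        'He was terrified of small spaces and she knew',
        "It was time. She'd fought against it for so long",
        'As he took in the view from the twentieth floor, the lights went out all over the city',
    ],
})

# One column per word; Series.str.count interprets its argument as a regex,
# so each word is escaped and wrapped in \b to match whole words only.
counts = pd.DataFrame({
    w: df['text'].str.count(rf'\b{re.escape(w)}\b') for w in string_list
})

# Wide to long: one row per (title, word) pair.
long_df = (
    df[['title', 'text']]
    .join(counts)
    .melt(id_vars=['title', 'text'], var_name='word', value_name='word count')
)

print(long_df)
```

Passing flags=re.IGNORECASE to str.count would also count the capitalised 'It' at the start of book4.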