Counting the occurrence of words in a dataframe column using a list of strings

Question

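I have a list of strings and a dataframe with a text column. The text column holds several rows of text. I want to count how many times each word in the list of strings appears in the text column. My goal is to add two columns to the dataframe: one containing the word and the other containing its number of occurrences. If there is a better solution, I am open to it. It would be great to learn different ways of doing this. Ideally, I want to end up with a single dataframe.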

string_list = ['had', 'it', 'the']

Current dataframe, in code:

import pandas as pd

df = pd.DataFrame({'title': {0: 'book1', 1: 'book2', 2: 'book3', 3: 'book4', 4: 'book5'},
 'text': {0: 'His voice had never sounded so cold',
  1: 'When she arrived home, she noticed that the curtains were closed.',
  2: 'He was terrified of small spaces and she knew',
  3: "It was time. She'd fought against it for so long",
  4: 'As he took in the view from the twentieth floor, the lights went out all over the city'},
 'had': {0: 1, 1: 5, 2: 5, 3: 2, 4: 5},
 'it': {0: 1, 1: 3, 2: 2, 3: 1, 4: 2},
 'the': {0: 1, 1: 4, 2: 5, 3: 3, 4: 3}})

Trying to get a dataframe like this:

Answer 1

A function that counts the matches of a given pattern:

import re

def find_match_count(word: str, pattern: str) -> int:
    return len(re.findall(pattern, word.lower()))

Then loop over each string in the list and apply this function to the 'text' column:

for col in string_list:
    df[col] = df['text'].apply(find_match_count, pattern=col)

Using the dataframe you provided (without the had, it and the columns), this gives:

   title                                               text  had  it  the
0  book1                His voice had never sounded so cold    1   0    0
1  book2  When she arrived home, she noticed that the cu...    0   0    1
2  book3      He was terrified of small spaces and she knew    0   0    0
3  book4   It was time. She'd fought against it for so long    0   2    0
4  book5  As he took in the view from the twentieth floo...    0   1    4
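Note that the output above counts substrings, which is why book5 gets one hit for 'it' (from "city"). If whole-word counts are wanted instead, a minimal sketch could look like the following. The helper name count_whole_word and the two-row sample frame are assumptions made for illustration; they are not part of the original answer.

```python
import re

import pandas as pd

# Two rows copied from the question, enough to show the difference.
df = pd.DataFrame({
    'title': ['book4', 'book5'],
    'text': [
        "It was time. She'd fought against it for so long",
        'As he took in the view from the twentieth floor, '
        'the lights went out all over the city',
    ],
})
string_list = ['had', 'it', 'the']


def count_whole_word(text: str, word: str) -> int:
    """Count case-insensitive whole-word occurrences of `word` in `text`.

    Hypothetical helper, not taken from the original answer.
    """
    # re.escape guards against regex metacharacters in the word;
    # \b anchors the match at word boundaries, so 'it' skips 'city'.
    pattern = rf'\b{re.escape(word)}\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


for word in string_list:
    # Extra keyword arguments to Series.apply are forwarded to the function.
    df[word] = df['text'].apply(count_whole_word, word=word)

print(df[['title'] + string_list])
```

With this variant, book5 should no longer report a match for 'it', while book4 still counts both "It" and "it".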

Answer 2

Craft a custom regex, then use extractall, join and melt:

regex = '|'.join(fr'(?P<{w}>\b{w}\b)' for w in string_list)

(df[['title', 'text']]
 .join(df['text'].str.extractall(regex).notna().groupby(level=0).sum())
 .fillna(0)
 .melt(id_vars=['title', 'text'], var_name='word', value_name='word count')
 )

Output:

    title                                               text word  word count
0   book1                His voice had never sounded so cold  had         1.0
1   book2  When she arrived home, she noticed that the cu...  had         0.0
2   book3      He was terrified of small spaces and she knew  had         0.0
3   book4   It was time. She'd fought against it for so long  had         0.0
4   book5  As he took in the view from the twentieth floo...  had         0.0
5   book1                His voice had never sounded so cold   it         0.0
6   book2  When she arrived home, she noticed that the cu...   it         0.0
7   book3      He was terrified of small spaces and she knew   it         0.0
8   book4   It was time. She'd fought against it for so long   it         1.0
9   book5  As he took in the view from the twentieth floo...   it         0.0
10  book1                His voice had never sounded so cold  the         0.0
11  book2  When she arrived home, she noticed that the cu...  the         1.0
12  book3      He was terrified of small spaces and she knew  the         0.0
13  book4   It was time. She'd fought against it for so long  the         0.0
14  book5  As he took in the view from the twentieth floo...  the         4.0
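Two details of this output are worth noting: the match is case-sensitive (book4 gets 1 for 'it' because "It" is skipped), and the counts are floats because the join leaves NaN for rows with no match before fillna(0) runs. As one possible alternative, assuming case-insensitive whole-word counts and integer values are what you want, Series.str.count can build one column per word before melting. The variable names counts and long_df and the two-row sample frame are illustrative assumptions.

```python
import re

import pandas as pd

# Two rows copied from the question.
df = pd.DataFrame({
    'title': ['book1', 'book4'],
    'text': [
        'His voice had never sounded so cold',
        "It was time. She'd fought against it for so long",
    ],
})
string_list = ['had', 'it', 'the']

# One integer column per word; str.count takes a regex and optional flags.
counts = pd.DataFrame({
    word: df['text'].str.count(rf'\b{re.escape(word)}\b', flags=re.IGNORECASE)
    for word in string_list
})

# Reshape from one column per word to one row per (title, word) pair.
long_df = (
    df[['title', 'text']]
    .join(counts)
    .melt(id_vars=['title', 'text'], var_name='word', value_name='word count')
)

print(long_df)
```

Because every row receives a count directly, no NaN appears and the 'word count' column stays integer without a fillna step.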

Tags: python, text, dataframe, pandas