How to count frequency of words from a list in a dataframe column?
If I have a dataframe with the following layout:
ID# Response
1234 Covid-19 was a disaster for my business
3456 The way you handled this pandemic was awesome
I want to be able to count how often specific words from a list occur.
list=['covid','COVID','Covid-19','pandemic','coronavirus']
In the end I want to generate a dictionary like the one below:
{covid:0,COVID:0,Covid-19:1,pandemic:1,'coronavirus':0}
Please help, I really don't know how to write this in Python.
import pandas as pd
import numpy as np
df = pd.DataFrame({'sheet':['sheet1', 'sheet2', 'sheet3', 'sheet2'],
'tokenized_text':[['efcc', 'fficial', 'billiontwits', 'since', 'covid', 'landed'], ['when', 'people', 'say', 'the', 'fatality', 'rate', 'of', 'coronavirus', 'is'], ['in', 'the', 'coronavirus-induced', 'crisis', 'people', 'are', 'cyvbwx'], ['in', 'the', 'be-induced', 'crisis', 'people', 'are', 'cyvbwx']] })
print(df)
words_collection = ['covid','COVID','Covid-19','pandemic','coronavirus']
# Extract the words from all rows
all_words = []
for index, row in df.iterrows():
    all_words.extend(row['tokenized_text'])
# Create a dictionary that maps each word from `words_collection` to the number of times it appears
word_to_number_of_occurences = dict()
# Go over the word collection and set each word's counter
for word in words_collection:
    word_to_number_of_occurences[word] = all_words.count(word)
# {'covid': 1, 'COVID': 0, 'Covid-19': 0, 'pandemic': 0, 'coronavirus': 1}
print(word_to_number_of_occurences)
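If the column already holds token lists, as in the example above, the same counts can probably be obtained without an explicit loop. The following is only a sketch: the shortened `tokenized_text` rows and the name `result` are made up for illustration, and it counts exact token matches only, so 'coronavirus-induced' does not count as 'coronavirus'.

```python
import pandas as pd

# Shortened, illustrative version of the tokenized column used above
df = pd.DataFrame({
    "tokenized_text": [
        ["since", "covid", "landed"],
        ["rate", "of", "coronavirus", "is"],
        ["the", "coronavirus-induced", "crisis"],
    ]
})
words_collection = ['covid', 'COVID', 'Covid-19', 'pandemic', 'coronavirus']

# explode() puts every token on its own row; value_counts() tallies them
token_counts = df["tokenized_text"].explode().value_counts()

# reindex() keeps only the words of interest and fills absent ones with 0
result = token_counts.reindex(words_collection, fill_value=0).astype(int).to_dict()
print(result)
```

The `reindex` step is what guarantees that words that never occur still show up in the dictionary with a count of 0.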
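For each string, find the number of matches.
dict((s, df['Response'].str.count(s).fillna(0).sum()) for s in list_of_strings)
Note that Series.str.count takes a regular expression as input, so characters that have a special meaning in a regex need to be escaped. You may want to append (?=\b) as a look-ahead for word endings.
Series.str.count returns NA when counting NA, hence the fill with 0. For each string, sum over the column.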
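To make the regex caveat concrete, here is a minimal sketch that escapes each word and requires that the match is not glued to other word characters. The helper name `count_words` is hypothetical, and the lookarounds `(?<!\w)` / `(?!\w)` are one possible way to express word boundaries, swapped in here instead of `(?=\b)`; adjust them to whatever "a word" should mean for your data.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "ID#": [1234, 3456],
    "Response": [
        "Covid-19 was a disaster for my business",
        "The way you handled this pandemic was awesome",
    ],
})
words = ['covid', 'COVID', 'Covid-19', 'pandemic', 'coronavirus']

def count_words(series, words):
    """Hypothetical helper: case-sensitive whole-word counts per search word."""
    counts = {}
    for w in words:
        # re.escape keeps regex metacharacters in the word literal;
        # the lookarounds reject matches embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(w) + r"(?!\w)"
        # Missing values give NA counts, so fill them with 0 before summing
        counts[w] = int(series.str.count(pattern).fillna(0).sum())
    return counts

print(count_words(df["Response"], words))
```

With this pattern, a row containing "pandemics" would not be counted for 'pandemic', whereas the plain substring pattern would count it.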
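Try np.hstack and Counter:
import numpy as np
from collections import Counter
lst = ['covid', 'COVID', 'Covid-19', 'pandemic', 'coronavirus']  # the word list from the question
# Flatten all whitespace-separated tokens, then count only those found in lst
a = np.hstack(df['Response'].str.split())
dct = {**dict.fromkeys(lst, 0), **Counter(a[np.isin(a, lst)])}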
{'covid': 0, 'COVID': 0, 'Covid-19': 1, 'pandemic': 1, 'coronavirus': 0}
You can do this very simply with a dict comprehension:
{x:df.Response.str.count(x).sum() for x in list}
Output:
{'covid': 0, 'COVID': 0, 'Covid-19': 1, 'pandemic': 1, 'coronavirus': 0}