Is there a module that can count occurrences of a list of strings in Python?
I defined a list that reads the contents of multiple files and stores all of them. How can I create a DataFrame with one row per file name, and columns that count the occurrences of each word?
For the sake of the example, assume all of this is well defined (but I can provide the original code if needed):
# define list
file1_contents = "string with dogs, cats and my pet sea turtle that lives in my box with my other turtles."
file2_contents = "another string about my squirrel, box turtle (who lives in the sea), but not my cat or dog."
words = [file1_contents, file2_contents]
filter_words = ["cat", "dog", "box turtle", "sea horse"]
The output would be something like this:
output = {'file1': {'cat': 1, 'dog': 1, 'box turtle': 1, 'sea horse': 0}, 'file2': { ...}}
I have attached a picture of my end goal. I am just starting out with Python, so I am not sure which package/module to use here. I know pandas lets you work with DataFrames.
I thought of using Counter from collections:
from collections import Counter
z = ['blue', 'red', 'blue', 'yellow', 'blue', 'red']
Counter(z)
Counter({'blue': 3, 'red': 2, 'yellow': 1})
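Since Counter only counts whitespace-separated tokens, multi-word terms like "box turtle" need separate handling. A minimal sketch, using file2's string from the question, that counts each fixed phrase by substring search instead (so "cats" would also count toward "cat", as the desired output expects):

```python
text = "another string about my squirrel, box turtle (who lives in the sea), but not my cat or dog."
filter_words = ["cat", "dog", "box turtle", "sea horse"]

# str.count handles one-word and two-word terms the same way
counts = {w: text.count(w) for w in filter_words}
print(counts)  # {'cat': 1, 'dog': 1, 'box turtle': 1, 'sea horse': 0}
```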
However, this is where I got stuck. How do I organize a table in Python that looks like the attached image?
Example output:
import pandas as pd
from collections import Counter

df_st = pd.DataFrame()
for i in range(1, 3):
    filename = 'file' + str(i) + '.txt'
    with open(filename, 'r') as f:
        list_words = []
        word_count = 0
        for line in f:
            for word in line.split():
                word_count = word_count + 1
                list_words.append(word)
    df2 = pd.DataFrame(index=(0,), data=Counter(list_words))
    df2['0_word_count'] = word_count
    df2['0_file_name'] = filename
    # DataFrame.append was removed in pandas 2.0; concat does the same job
    df_st = pd.concat([df_st, df2], ignore_index=True)
df_st
Out[2]:
(who 0_file_name 0_word_count about and another box but cat cats ... pet sea sea), squirrel, string that the turtle turtles. with
0 NaN file1.txt 18 NaN 1.0 NaN 1 NaN NaN 1.0 ... 1.0 1.0 NaN NaN 1 1.0 NaN 1 1.0 2.0
1 1.0 file2.txt 18 1.0 NaN 1.0 1 1.0 1.0 NaN ... NaN NaN 1.0 1.0 1 NaN 1.0 1 NaN NaN
The idea is to loop over each file's contents, filter the values from the list filter_words with re.findall, count them with Counter, and build a dictionary for the DataFrame:
file1_contents = "string with dogs, cats and my pet sea turtle that lives in my box with my other turtles."
file2_contents = "another string about my squirrel, box turtle (who lives in the sea), but not my cat or dog."
import re
import pandas as pd
from collections import Counter

words = {'file1': file1_contents, 'file2': file2_contents}
filter_words = ["cat", "dog", "box turtle", "sea horse"]

out = {}
for k, w in words.items():
    new = []
    for fw in filter_words:
        new.extend(re.findall(r"{}".format(fw), w))
    out[k] = dict(Counter(new))
print(out)
{'file1': {'cat': 1, 'dog': 1}, 'file2': {'cat': 1, 'dog': 1, 'box turtle': 1}}
df = pd.DataFrame.from_dict(out, orient='index').fillna(0).astype(int)
print (df)
cat dog box turtle
file1 1 1 0
file2 1 1 1
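One caveat: re.findall with the raw term also matches substrings, which is exactly why file1 yields 'cat': 1 and 'dog': 1 above (via "cats" and "dogs"). If only exact word matches were wanted instead, a sketch with re.escape and \b boundaries might look like this (note that file1 then contains none of the terms):

```python
import re

file1_contents = ("string with dogs, cats and my pet sea turtle "
                  "that lives in my box with my other turtles.")
filter_words = ["cat", "dog", "box turtle", "sea horse"]

# \b anchors each escaped term to word boundaries, so "dogs" no longer matches "dog"
strict = {fw: len(re.findall(r"\b{}\b".format(re.escape(fw)), file1_contents))
          for fw in filter_words}
print(strict)  # {'cat': 0, 'dog': 0, 'box turtle': 0, 'sea horse': 0}
```

Whether substring or strict matching is right depends on whether plurals should count; the lemmatization approach below handles that more systematically.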
To do this properly there are quite a few things to consider, such as handling punctuation, plurals, one-word vs. two-word terms, and so on:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')
import string
import pandas as pd

def preproc(x):
    # make a translator object that strips punctuation
    trans = str.maketrans('', '', string.punctuation)
    wnl = WordNetLemmatizer()
    # lemmatize each token so "dogs" -> "dog", "turtles" -> "turtle"
    x = ' '.join([wnl.lemmatize(e) for e in x.translate(trans).split()])
    return x

vectorizer = CountVectorizer(vocabulary=filter_words,
                             ngram_range=(1, 2),
                             preprocessor=preproc)
# words is the list [file1_contents, file2_contents] from the question
X = vectorizer.fit_transform(words)
pd.DataFrame(columns=filter_words,
             data=X.todense())
Output:
cat dog box turtle sea horse
0 1 1 0 0
1 1 1 1 0
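Since fit_transform keeps the rows in input order, the integer index can be swapped for file names afterwards. A small standalone sketch, with the counts matrix from the output above reproduced as a plain list:

```python
import pandas as pd

filter_words = ["cat", "dog", "box turtle", "sea horse"]
# the counts matrix produced by X.todense() above
counts = [[1, 1, 0, 0],
          [1, 1, 1, 0]]

# label the rows with the file names instead of 0, 1
df = pd.DataFrame(counts, columns=filter_words, index=["file1", "file2"])
print(df)
```

This reproduces the dict-of-dicts layout from the question, one file per row.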