How to count words from strings read from a file?
I am trying to create a program that takes all the text files in a given path and saves all of their strings in a list:
import os

vocab = set()
path = 'a/path/'  # note: a string literal cannot end with a single backslash
listing = os.listdir(path)
unwanted_chars = ".,-_/()*"

for file in listing:
    # print('Current file : ', file)
    pos_review = open(path + file, "r", encoding='utf8')
    words = pos_review.read().split()
    # print(type(words))
    vocab.update(words)  # set.update adds every word to the vocabulary
    pos_review.close()

print(vocab)
pos_dict = dict.fromkeys(vocab, 0)
print(pos_dict)
Input
file1.txt: A quick brown fox.
file2.txt: a quick boy ran.
file3.txt: fox ran away.
Output
A : 2
quick : 2
brown : 1
fox : 2
boy : 1
ran : 2
away : 1
So far I can build a dictionary of these strings, but I am not sure how to build the key/value pairs mapping each string to its frequency across all the text files.
Hope this helps:
import os

path = 'a/path/'  # note: a string literal cannot end with a single backslash
listing = os.listdir(path)
whole = []

for file in listing:
    # print('Current file : ', file)
    pos_review = open(path + file, "r", encoding='utf8')
    words = pos_review.read().split()
    whole.extend(words)  # collect every word from every file
    pos_review.close()

d = {}  # creating an empty dictionary
for item in whole:
    if item in d:     # 'item in d' is equivalent to 'item in d.keys()' and faster
        d[item] += 1  # update count
    else:
        d[item] = 1
print(d)
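The if/else tally above can also be written more compactly with `dict.get`, which supplies a default when a key is missing. A minimal sketch, using a hypothetical word list in place of `whole`:

```python
# A hypothetical word list standing in for 'whole'; dict.get supplies a
# default of 0, so the if/else branch is not needed
d = {}
for item in ['fox', 'ran', 'fox']:
    d[item] = d.get(item, 0) + 1
print(d)  # {'fox': 2, 'ran': 1}
```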
Use collections.Counter:

Counter is a dict subclass designed for counting hashable items in an iterable.
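As a quick illustration of that behaviour (the word list here is made up for the demo):

```python
from collections import Counter

# Counter tallies each element of the iterable it is given
c = Counter(['a', 'quick', 'fox', 'a'])
print(c['a'])      # 2
c.update(['fox'])  # update() adds to existing counts instead of replacing them
print(c['fox'])    # 2
```

This additive `update` is exactly what makes Counter a good fit for merging word counts across several files.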
Data

- Given 3 files, named t1.txt, t2.txt, and t3.txt
- Each file contains the following 3 lines of text
file1 txt A quick brown fox.
file2 txt a quick boy ran.
file3 txt fox ran away.
Code:

Get the files:
from pathlib import Path
files = list(Path('e:/PythonProjects/stack_overflow/t-files').glob('t*.txt'))
print(files)
# Output
[WindowsPath('e:/PythonProjects/stack_overflow/t-files/t1.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/t2.txt'),
WindowsPath('e:/PythonProjects/stack_overflow/t-files/t3.txt')]
Collect and count the words:

- Create a separate function, clean_string, to clean each line of text
  - str.lower to lowercase the text
  - str.translate, str.maketrans & string.punctuation for highly optimized punctuation removal
    - from Best way to strip punctuation from a string
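The translate-based removal works by building a translation table once and applying it in a single pass, a minimal sketch:

```python
import string

# maketrans with two empty strings and a deletion string builds a table
# that deletes every punctuation character in one pass
table = str.maketrans('', '', string.punctuation)
print('A quick brown fox.'.lower().translate(table))  # a quick brown fox
```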
from collections import Counter
import string

def clean_string(value: str) -> list:
    value = value.lower()
    value = value.translate(str.maketrans('', '', string.punctuation))
    return value.split()
words = Counter()
for file in files:
    with file.open('r') as f:
        for line in f.readlines():
            words.update(clean_string(line))
print(words)
# Output
Counter({'file1': 3,
'txt': 9,
'a': 6,
'quick': 6,
'brown': 3,
'fox': 6,
'file2': 3,
'boy': 3,
'ran': 6,
'file3': 3,
'away': 3})
List of the words:
list_words = list(words.keys())
print(list_words)
>>> ['file1', 'txt', 'a', 'quick', 'brown', 'fox', 'file2', 'boy', 'ran', 'file3', 'away']
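Counter also exposes the sorted counts directly via `most_common`, which returns the highest-frequency pairs first. A small sketch using counts taken from the output above:

```python
from collections import Counter

# Counts copied from the Counter output above; most_common(n)
# returns the n highest-frequency (word, count) pairs
words = Counter({'txt': 9, 'a': 6, 'fox': 6, 'brown': 3})
print(words.most_common(2))  # [('txt', 9), ('a', 6)]
```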
This also works:
import glob

import pandas as pd

files = glob.glob('test*.txt')
txts = []
for f in files:
    with open(f, 'r') as t:
        txts.append(t.read())

texts = ' '.join(txts)
df = pd.DataFrame({'words': texts.split()})
out = df.words.value_counts().to_dict()