Python 3.5 - Get counter to report zero-frequency items
I am doing text analysis on texts that, because of PDF-to-txt conversion errors, sometimes run words together. So instead of matching words, I want to match strings.
For example, I have the string:
mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'
and then I search for
key_words=['loss', 'debt', 'debts', 'elephant']
The output should be of the following form:
Filename Debt Debts Loss Elephant
mystring 2 1 1 0
My code works fine, except for a couple of glitches: 1) it does not report the frequency of zero-frequency words (so 'Elephant' does not show up in the output); 2) the order of the words in key_words seems to matter (i.e., sometimes I get 1 count each for 'debt' and 'debts', and sometimes it reports only 2 counts for 'debt' while 'debts' is not reported). I could live with the second point if I managed to "print" the variable names into the dataset... but I am not sure how.
The relevant code is below. Thanks!
PS. Needless to say, it is not the most elegant piece of code, but I am slowly learning.
import collections
import csv
import glob
import re
from string import punctuation

bad = set(['debts', 'debt'])
csvfile = open("freq_10k_test.csv", "w", newline='', encoding='cp850', errors='replace')
writer = csv.writer(csvfile)

for filename in glob.glob('*.txt'):
    with open(filename, encoding='utf-8', errors='ignore') as f:
        file_name = []
        file_name.append(filename)
        new_review = [f.read()]
        freq_all = []
        rev = []
        for review in new_review:
            # strip punctuation so glued words can still be matched as substrings
            review_processed = review.lower()
            for p in list(punctuation):
                review_processed = review_processed.replace(p, '')
            pattern = re.compile("|".join(bad), flags=re.IGNORECASE)
            freq_iter = collections.Counter(pattern.findall(review_processed))
            # keys with zero hits never enter the Counter, so they drop out here
            frequency = [value for (key, value) in sorted(freq_iter.items())]
            freq_all.append(frequency)
        freq = [v for v in freq_all]
    fulldata = [[file_name[i]] + freq for i, freq in enumerate(freq)]
    writer = csv.writer(open("freq_10k_test.csv", 'a', newline='', encoding='cp850', errors='replace'))
    writer.writerows(fulldata)
    csvfile.flush()
As you wish:
mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words=['loss', 'debt', 'debts', 'elephant']
for kw in key_words:
    count = mystring.count(kw)
    print('%s %s' % (kw, count))
Or, put differently:
from collections import defaultdict

# tally every word token (note: punctuation stays attached, so 'debts.' != 'debts')
words = mystring.split()
key_words = ['loss', 'debt', 'debts', 'elephant']

d = defaultdict(int)
for word in words:
    d[word] += 1

for kw in key_words:
    print('%s %s' % (kw, d[kw]))
You can pre-initialize the counter, like this:
freq_iter = collections.Counter()
freq_iter.update({x:0 for x in bad})
freq_iter.update(pattern.findall(review_processed))
One nice thing about Counter is that you don't actually have to pre-initialize it - you can just do c = Counter(); c['key'] += 1 - but nothing stops you from pre-initializing some values to 0 if you want to.
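Here is a minimal, self-contained sketch of that idea (the sample text and key list are illustrative only), showing that the zero-count keys now survive into the output:
import collections
import re

bad = ['debt', 'debts', 'elephant']
pattern = re.compile("|".join(bad), flags=re.IGNORECASE)

freq_iter = collections.Counter()
freq_iter.update({x: 0 for x in bad})  # every key starts at 0
freq_iter.update(pattern.findall('go into debt, like some debts'))

print(sorted(freq_iter.items()))
# [('debt', 2), ('debts', 0), ('elephant', 0)] - zero-frequency items are reported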
As for the debt/debts thing - it's simply an underspecified problem. What do you want the code to do in that case? If you want it to match the longest matching pattern, sorting the list longest-first will take care of it. If you want both to be reported, you may need to do multiple searches and save all the results.
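For example, here is a small sketch of the longest-first approach (the sample strings are illustrative):
import re

bad = set(['debts', 'debt'])
# sort the alternatives longest-first so 'debts' is tried before 'debt'
pattern = re.compile("|".join(sorted(bad, key=len, reverse=True)), flags=re.IGNORECASE)

print(pattern.findall('debtor debts my debt'))
# ['debt', 'debts', 'debt'] - 'debts' now wins over 'debt' wherever both fit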
Updated to add some information on why it can't find debts: this has more to do with regex findall than anything else. With an alternation, re.findall tries the alternatives left to right and takes the first one that matches (so debt|debts stops at debt), and once text has been matched it is not included in subsequent matches:
In [2]: re.findall('(debt|debts)', 'debtor debts my debt')
Out[2]: ['debt', 'debt', 'debt']
If you really want to find all instances of each word, you need to do them separately:
In [3]: re.findall('debt', 'debtor debts my debt')
Out[3]: ['debt', 'debt', 'debt']
In [4]: re.findall('debts', 'debtor debts my debt')
Out[4]: ['debts']
However, maybe what you are really looking for is words. In that case, use the \b anchor to require a word break:
In [13]: re.findall(r'\bdebt\b', 'debtor debts my debt')
Out[13]: ['debt']
In [14]: re.findall(r'(\b(?:debt|debts)\b)', 'debtor debts my debt')
Out[14]: ['debts', 'debt']
I don't know if that's what you want... in this case it correctly distinguishes debt from debts, but it misses debtor because that only matches a substring, and we asked it not to.
Depending on your use case, you may want to look into stemming the text... I believe there is a pretty simple stemmer in nltk (I have only used it once, so I won't try to post an example... this question, Combining text stemming and removal of punctuation in NLTK and scikit-learn, may be useful), which is meant to reduce debt, debts, and debtor to the same root word debt, and to do similar things for other words. This may or may not help; I don't know what you are doing with it.
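As a rough sketch (assuming nltk is installed; note that in practice the Porter stemmer reduces debts to debt but leaves debtor unchanged, so results depend on the stemmer you pick):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['debt', 'debts', 'debtor']:
    print(word, '->', stemmer.stem(word))
# debt -> debt
# debts -> debt
# debtor -> debtor  (Porter does not strip '-or'; a lemmatizer may do better)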
A slick solution is to use the regex module:
import regex
mystring='The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words=['loss', 'debt', 'debts', 'elephant']
print({k: len(regex.findall(k, mystring, overlapped=True)) for k in key_words})
Result:
{'loss': 1, 'debt': 2, 'debts': 1, 'elephant': 0}
Counting the occurrences can be done in a simple one-liner:
counts = {k: mystring.count(k) for k in key_words}
Putting it together with csv.DictWriter yields:
import csv
mystring = 'The lossof our income made us go into debt but this is not too bad as we like some debts.'
key_words = ['loss', 'debt', 'debts', 'elephant']
counts = {k: mystring.count(k) for k in key_words}
print(counts) # {'loss': 1, 'debt': 2, 'debts': 1, 'elephant': 0}
# write out
with open('out.csv', 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=counts, delimiter=' ')
    # header row: the key words
    writer.writeheader()
    # data row: the counts
    writer.writerow(counts)
# out.csv:
# loss debt debts elephant
# 1 2 1 0