How to count only the words in a dictionary, while returning the counts under the dictionary key names
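I have an Excel file that is sent to me. First, I have to concatenate all rows into one big block of text. Then I scan the text for words that appear in a dictionary. If a word is found, it should be counted under its dictionary key name. Finally, I return the counted words as a relational table [word, count].
I can count the words, but I can't get the dictionary part to work.
My questions are:
- Am I going about this the right way?
- Is this possible, and if so, how?
Code adapted from the internet: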
import collections
import re
import matplotlib.pyplot as plt
import pandas as pd
#% matplotlib inline
#file = open('PrideAndPrejudice.txt', 'r')
#file = file.read()
''' Convert excel column/ rows into a string of words'''
#text_all = pd.read_excel('C:\Python_Projects\Rake\data_file.xlsx')
#df=pd.DataFrame(text_all)
#case_words= df['case_text']
#print(case_words)
#case_concat= case_words.str.cat(sep=' ')
#print (case_concat)
text_all = ("Billy was glad to see jack. Jack was estatic to play with Billy. Jack and Billy were lonely without eachother. Jack is tall and Billy is clever.")
''' done'''
import collections
import pandas as pd
import matplotlib.pyplot as plt
#% matplotlib inline
# Read input file, note the encoding is specified here
# It may be different in your text file
# Startwords
startwords = {'happy':'glad','sad': 'lonely','big': 'tall', 'smart': 'clever'}
#startwords = startwords.union(set(['happy','sad','big','smart']))
# Instantiate a dictionary, and for every word in the file,
# Add to the dictionary if it doesn't exist. If it does, increase the count.
wordcount = {}
# To eliminate duplicates, remember to split by punctuation, and use case delimiters.
for word in text_all.lower().split():
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace(":","")
    word = word.replace("\"","")
    word = word.replace("!","")
    word = word.replace("“","")
    word = word.replace("‘","")
    word = word.replace("*","")
    if word in startwords:
        if word in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
# Print most common word
n_print = int(input("How many most common words to print: "))
print("\nOK. The {} most common words are as follows\n".format(n_print))
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(n_print):
    print(word, ": ", count)
# Close the file
#file.close()
# Create a data frame of the most common words
# Draw a bar chart
lst = word_counter.most_common(n_print)
df = pd.DataFrame(lst, columns = ['Word', 'Count'])
df.plot.bar(x='Word',y='Count')
Error: Empty 'DataFrame': no numeric data to plot
Expected output:
- happy 1
- sad 1
- big 1
- smart 1
if word in startwords:
    if word in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
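This part looks wrong. You first check whether word is in startwords, then whether it is in wordcount. If it is already in wordcount, then by your own logic its count should be incremented, not reset. So I believe you have to swap the two branches: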
if word in wordcount:
    # already in the dict: increment the count
    wordcount[word] += 1
else:
    # first occurrence: set the count to 1
    wordcount[word] = 1
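Swapping the branches may not be enough on its own. In the question's code, `word in startwords` tests the dictionary keys ('happy', 'sad', ...), while the text contains the values ('glad', 'lonely', ...), so nothing would ever be counted. The following is a minimal sketch of one way around that, assuming the goal is to count each matched word under its key name. The name `word_to_key` and the use of `string.punctuation` for stripping are illustrative choices, not part of the original code.

```python
from collections import Counter
import string

text_all = ("Billy was glad to see jack. Jack was estatic to play with Billy. "
            "Jack and Billy were lonely without eachother. "
            "Jack is tall and Billy is clever.")

# Same mapping as in the question: key name -> word to look for in the text.
startwords = {'happy': 'glad', 'sad': 'lonely', 'big': 'tall', 'smart': 'clever'}

# Invert the mapping so that a word found in the text leads back to its key name.
# (word_to_key is a hypothetical helper name introduced for this sketch.)
word_to_key = {word: key for key, word in startwords.items()}

wordcount = Counter()
for raw in text_all.lower().split():
    word = raw.strip(string.punctuation)  # drop leading/trailing ASCII punctuation
    key = word_to_key.get(word)           # None if the word is not a startword
    if key is not None:
        wordcount[key] += 1               # Counter treats missing keys as 0

print(wordcount.most_common())
# [('happy', 1), ('sad', 1), ('big', 1), ('smart', 1)]
```

Using `collections.Counter` removes the need for the if/else entirely, because a missing key starts at 0.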
Here is an approach that should work with the latest version of pandas (0.25.3 at the time of writing):
# Setup
df = pd.DataFrame({'case_text': ["Billy was glad to see jack. Jack was estatic to play with Billy. Jack and Billy were lonely without eachother. Jack is tall and Billy is clever."]})
startwords = {"happy": ["glad", "estatic"],
              "sad": ["depressed", "lonely"],
              "big": ["tall", "fat"],
              "smart": ["clever", "bright"]}
# First you need to rearrange your startwords dict
startwords_map = {w: k for k, v in startwords.items() for w in v}
# regex=True is passed explicitly because it is no longer the default in pandas 2.0+
(df['case_text'].str.lower()                    # casts to lower case
    .str.replace(r'[.,*!?:]', '', regex=True)   # removes punctuation and special characters
    .str.split()                                # splits the text on whitespace
    .explode()                                  # expands into a single pandas.Series of words
    .map(startwords_map)                        # maps the words to the startwords keys
    .value_counts()                             # counts word occurrences
    .to_dict())                                 # outputs to dict
[out]
{'happy': 2, 'big': 1, 'smart': 1, 'sad': 1}
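The question also asks for a [word, count] table. The sketch below shows one way to turn a dict of counts like the one above into that shape. It assumes the same 'Word' and 'Count' column names as the question's code, and the variable name `table` is only illustrative.

```python
import pandas as pd

# Counts as produced by the pipeline above.
counts = {'happy': 2, 'big': 1, 'smart': 1, 'sad': 1}

# One row per key name, with the column names used in the question.
table = pd.DataFrame(list(counts.items()), columns=['Word', 'Count'])
print(table)

# With matplotlib installed, the bar chart from the question can then be drawn:
# table.plot.bar(x='Word', y='Count')
```

The plotting error in the question is consistent with an empty table: if no word is ever counted, the DataFrame has no rows and therefore no numeric data to plot.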