Counter() and most_common
I am using Counter() to count words in an Excel file. My goal is to get the most common words from the document. The problem is that Counter() does not handle my file correctly.
Here is the code:
#1. Building a Counter with bag-of-words
import pandas as pd
df = pd.read_excel('combined_file.xlsx', index_col=None)
import nltk
from nltk.tokenize import word_tokenize
# Tokenize the article: tokens
df['tokens'] = df['body'].apply(nltk.word_tokenize)
# Convert the tokens into string values
df_tokens_list = df.tokens.tolist()
# Convert the tokens into lowercase: lower_tokens
lower_tokens = [[string.lower() for string in sublist] for sublist in df_tokens_list]
# Import Counter
from collections import Counter
# Create a Counter with the lowercase tokens: bow_simple
bow_simple = Counter(x for xs in lower_tokens for x in set(xs))
# Print the 10 most common tokens
print(bow_simple.most_common(10))
#2. Text preprocessing practice
# Import WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
# Retain alphabetic words: alpha_only
alpha_only = [t for t in bow_simple if t.isalpha()]
# Remove all stop words: no_stops
from nltk.corpus import stopwords
no_stops = [t for t in alpha_only if t not in stopwords.words("english")]
# Instantiate the WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
# Lemmatize all tokens into a new list: lemmatized
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
# Create the bag-of-words: bow
bow = Counter(lemmatized)
print(bow)
# Print the 10 most common tokens
print(bow.most_common(10))
The most common words after preprocessing are:
[('dry', 3), ('try', 3), ('clean', 3), ('love', 2), ('one', 2), ('serum', 2), ('eye', 2), ('boot', 2), ('woman', 2), ('cream', 2)]
This does not match what we get when we count these words manually in Excel.
Do you know what might be wrong with my code? Any help would be greatly appreciated.
The link to the file is here:
https://www.dropbox.com/scl/fi/43nu0yf45obbyzprzc86n/combined_file.xlsx?dl=0&rlkey=7j959kz0urjxflf6r536brppt
The problem is that bow_simple is a Counter, and you feed it into the rest of the pipeline. Iterating over a Counter yields each key only once, so the final result merely counts how many variants of a word end up in the Counter after lowercasing and NLTK processing. The solution is to build a flattened word list and feed that into alpha_only:
# Create a Counter with the lowercase tokens: bow_simple
wordlist = [item for sublist in lower_tokens for item in sublist] #flatten list of lists
bow_simple = Counter(wordlist)
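To make the difference concrete, here is a tiny sketch with made-up tokens (the data is hypothetical, just to illustrate):
from collections import Counter

toy = [['dry', 'dry', 'skin'], ['dry', 'cream']]

# Original approach: set(xs) collapses repeats within each row, so this
# counts how many rows contain a word (document frequency), not how
# often the word occurs.
doc_freq = Counter(x for xs in toy for x in set(xs))
print(doc_freq)               # Counter({'dry': 2, 'skin': 1, 'cream': 1})

# Iterating over a Counter yields each key exactly once, so a list
# comprehension over it throws the counts away.
print([t for t in doc_freq])  # ['dry', 'skin', 'cream']

# Flattening first keeps every occurrence (term frequency).
flat = [item for sublist in toy for item in sublist]
print(Counter(flat))          # Counter({'dry': 3, 'skin': 1, 'cream': 1})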
Then use wordlist in alpha_only:
alpha_only = [t for t in wordlist if t.isalpha()]
Output:
[('eye', 3617), ('product', 2567), ('cream', 2278), ('skin', 1791), ('good', 1081), ('use', 1006), ('really', 984), ('using', 928), ('feel', 798), ('work', 785)]
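For reference, a complete corrected pipeline might look like the sketch below (assuming the same combined_file.xlsx and body column as in the question; the NLTK punkt, stopwords, and wordnet resources need to be downloaded once via nltk.download):
import pandas as pd
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

df = pd.read_excel('combined_file.xlsx', index_col=None)

# Tokenize each document and lowercase the tokens
df['tokens'] = df['body'].apply(nltk.word_tokenize)
lower_tokens = [[t.lower() for t in doc] for doc in df['tokens'].tolist()]

# Flatten the list of lists so every occurrence is counted
wordlist = [t for doc in lower_tokens for t in doc]

# Keep alphabetic tokens, drop English stop words, then lemmatize
alpha_only = [t for t in wordlist if t.isalpha()]
stops = set(stopwords.words('english'))  # set membership test avoids a scan per token
no_stops = [t for t in alpha_only if t not in stops]
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(t) for t in no_stops]

# Bag of words over the full, flattened token stream
bow = Counter(lemmatized)
print(bow.most_common(10))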