Create a vocabulary dictionary for text mining
I have the following code:
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")
Now I am trying to compute the term frequencies like this:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
Next I want to print the vocabulary, so I do this:
vectorizer.fit_transform(train_set)
print vectorizer.vocabulary
Now I get the output None, although I was expecting something like this:
{'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
Is something wrong here?
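I think you can try this instead. Attributes that scikit-learn learns during fitting end with a trailing underscore, so the fitted mapping is vocabulary_; vocabulary is only the constructor parameter, which defaults to None:
print(vectorizer.vocabulary_)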
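To make the difference between the two attributes concrete, here is a minimal sketch, assuming a reasonably recent scikit-learn and the default CountVectorizer settings (lowercasing, no stop-word removal). The variable names `counts` and `vocab` are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

vectorizer = CountVectorizer()

# `vocabulary` is the constructor argument, not a learned attribute.
print(vectorizer.vocabulary)  # prints None

# fit_transform learns the vocabulary and returns the document-term matrix.
counts = vectorizer.fit_transform(train_set)

# The learned term -> column index mapping is stored in `vocabulary_`.
vocab = vectorizer.vocabulary_
print(sorted(vocab.items(), key=lambda item: item[1]))
```

With the defaults, "the" and "is" stay in the vocabulary, so the result has six entries rather than the four expected in the question. Passing stop_words="english" to the constructor should drop those two words.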
CountVectorizer does not directly support what you are looking for. If you want per-word counts, you can use the Counter class:
from collections import Counter
train_set = ("The sky is blue.", "The sun is bright.")
word_counter = Counter()
for s in train_set:
    word_counter.update(s.split())
print(word_counter)
This gives:
Counter({'is': 2, 'The': 2, 'blue.': 1, 'bright.': 1, 'sky': 1, 'sun': 1})
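Note that the output above keeps punctuation attached ('blue.') and treats 'The' as distinct from 'the', because str.split() only splits on whitespace. A minimal sketch of one way to normalize the tokens first, using only the standard library; the `tokenize` helper and its regular expression are assumptions made for this example, not part of any library:

```python
import re
from collections import Counter

train_set = ("The sky is blue.", "The sun is bright.")


def tokenize(text):
    # Lowercase, then keep runs of letters, digits, or apostrophes as tokens.
    return re.findall(r"[a-z0-9']+", text.lower())


word_counter = Counter()
for sentence in train_set:
    word_counter.update(tokenize(sentence))

print(word_counter.most_common())
```

Here 'blue.' is counted as 'blue', and both occurrences of 'The' are counted under 'the'.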
Or you can use FreqDist from nltk:
from nltk import FreqDist
train_set = ("The sky is blue.", "The sun is bright.")
word_dist = FreqDist()
for s in train_set:
    word_dist.update(s.split())
print(dict(word_dist))
This gives:
{'blue.': 1, 'bright.': 1, 'is': 2, 'sky': 1, 'sun': 1, 'The': 2}
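Both Counter and FreqDist map words to counts, whereas the dictionary expected in the question maps words to indices. A rough sketch of how such an index dictionary could be derived from the counts, using only the standard library; the tiny `STOP_WORDS` set is a hypothetical list chosen for this example, and alphabetical ordering of the indices is an assumption:

```python
import re
from collections import Counter

train_set = ("The sky is blue.", "The sun is bright.")

# Hypothetical, deliberately tiny stop-word list for this example only.
STOP_WORDS = {"the", "is"}

counts = Counter()
for sentence in train_set:
    # Lowercase and keep alphabetic runs so "blue." is counted as "blue".
    counts.update(re.findall(r"[a-z]+", sentence.lower()))

# Assign each remaining term an index in alphabetical order.
terms = sorted(term for term in counts if term not in STOP_WORDS)
vocabulary = {term: index for index, term in enumerate(terms)}

print(vocabulary)
```

This yields the same four words as in the question, although the specific index assigned to each word depends on the ordering you choose.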