Does gensim.corpora.Dictionary have term frequency saved?
From a gensim.corpora.Dictionary, you can get the document frequency of a word (i.e. how many documents a particular word appears in):
from nltk.corpus import brown
from gensim.corpora import Dictionary
documents = brown.sents()
brown_dict = Dictionary(documents)
# The 100th word in the dictionary: 'these'
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents')
[out]:
The word "these" appears in 1213 documents
and there is a filter_n_most_frequent(remove_n) function that removes the remove_n most frequent tokens:
filter_n_most_frequent(remove_n)
Filter out the ‘remove_n’ most frequent tokens that appear in the documents.
After the pruning, shrink resulting gaps in word ids.
Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!
Does the filter_n_most_frequent function remove the top n by document frequency or by term frequency?
If the latter, is there any way to access the term frequencies of the words in a gensim.corpora.Dictionary object?
No, gensim.corpora.Dictionary does not save term frequencies. You can see the source code here. The class only stores the following member variables:
self.token2id = {} # token -> tokenId
self.id2token = {} # reverse mapping for token2id; only formed on request, to save memory
self.dfs = {} # document frequencies: tokenId -> in how many documents this token appeared
self.num_docs = 0 # number of documents processed
self.num_pos = 0 # total number of corpus positions
self.num_nnz = 0 # total number of non-zeroes in the BOW matrix
This means that everything in the class defines frequency as document frequency, never term frequency, since the latter is never stored globally. This applies to filter_n_most_frequent(remove_n) and to every other method.
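Per the quoted docstring, "most frequent" here therefore means most document-frequent. A minimal stdlib-only sketch of that selection rule (illustrative only, not gensim's actual implementation):

```python
from collections import Counter

documents = [["cat", "sat"], ["dog", "sat"], ["cat", "dog", "sat"]]

# Document frequency: count each token at most once per document.
dfs = Counter()
for doc in documents:
    dfs.update(set(doc))

# Pick the remove_n most document-frequent tokens, which is what
# filter_n_most_frequent(remove_n) prunes.
remove_n = 1
most_frequent = [tok for tok, _ in dfs.most_common(remove_n)]
print(most_frequent)  # ['sat']
```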
Could you do something like this?
import pandas as pd
from gensim import corpora

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(sent) for sent in documents]
vocab = list(dictionary.values())  # list of terms in the dictionary
vocab_tf = [dict(i) for i in corpus]
vocab_tf = list(pd.DataFrame(vocab_tf).sum(axis=0))  # list of term frequencies
The dictionary doesn't have it, but the corpus does:
# Term frequency
from gensim import corpora

# load dictionary
dictionary = corpora.Dictionary.load('YourDict.dict')
# load corpus
corpus = corpora.MmCorpus('YourCorpus.mm')
# per-document list of (word, frequency) pairs; a nested list avoids
# the ragged-array problem a numpy array() would have here
corpus_term_frequency = [[(dictionary[id], freq) for id, freq in cp] for cp in corpus]
An efficient way to compute term frequencies from the BOW representation, rather than creating dense vectors:
corpus = [dictionary.doc2bow(sent) for sent in documents]
vocab_tf = {}
for i in corpus:
    for item, count in dict(i).items():
        if item in vocab_tf:
            vocab_tf[item] += count
        else:
            vocab_tf[item] = count
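The same aggregation can be written more compactly with collections.Counter. A sketch using a hand-built BOW corpus in place of the gensim one (each document is a list of (token_id, count) pairs, the same shape doc2bow() returns):

```python
from collections import Counter

# Stand-in for `corpus`, in the (token_id, count) format doc2bow() produces.
corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 1)], [(0, 1), (2, 4)]]

vocab_tf = Counter()
for bow in corpus:
    vocab_tf.update(dict(bow))  # Counter.update adds the counts per token id

print(dict(vocab_tf))  # {0: 3, 1: 4, 2: 5}
```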
I had the same simple question. It seems the term frequency is hidden and not accessible in the object. I don't know why; it makes testing and validation a pain. What I did was export the dictionary as text:
dictionary.save_as_text(r'c:\research\gensimDictionary.txt')  # raw string, so \r is not a carriage return
In that text file there are three columns: key, word, frequency. For example, here are the words "summit", "summon" and "sumo":

key	word	frequency
10	summit	1227
3658	summon	118
8477	sumo	40
I found a solution: .cfs holds the word frequencies. See https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary
print(str(dictionary[10]), str(dictionary.cfs[10]))
summit 1227
Simple: gensim.corpora.Dictionary now stores term frequencies in its cfs attribute. You can see the documentation here.
cfs
Collection frequencies: token_id -> how many instances of this token are contained in the documents.
Type: dict of (int, int)
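To make the cfs vs dfs distinction concrete, here is a stdlib-only sketch that computes both quantities the way gensim defines them (gensim itself is not required to run it):

```python
from collections import Counter

documents = [["the", "cat", "sat"],
             ["the", "dog", "sat", "sat"],
             ["a", "dog"]]

cfs = Counter()  # collection frequency: total occurrences over all documents
dfs = Counter()  # document frequency: number of documents containing the token
for doc in documents:
    counts = Counter(doc)
    cfs.update(counts)         # every occurrence counts
    dfs.update(counts.keys())  # at most one count per document

print(cfs["sat"], dfs["sat"])  # 3 2
```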