Removing words in a list that appear infrequently
I have a number of documents that have been tokenized into lists whose elements are the tokens - I then put all of those lists into one list, so I end up with a list of token lists.
A simple example:
[["egg","apple","bread","milk","pear"], ["egg","apple","bread","milk"], ["egg","apple","bread","milk"]]
I want to remove tokens that appear in fewer than x% of the documents (for example "pear" above, since it only appears in one of the three documents). However, I'm not sure how to do this efficiently - I know the data structure is probably questionable, but I need the output in this format for the next part of my code.
My current code is below, and it's clearly not very efficient when there are a lot of documents:
min_docs = 0.05*len(tokenized_document_list)
whitelist = []
for document in tokenized_document_list: #Go through each document
    for token in document: #Go through each token in each document
        if token in whitelist:
            continue
        else:
            token_count = 0
            for document_t in tokenized_document_list: #Go through each document looking for token
                if token in document_t:
                    token_count = token_count + 1
                if token_count > min_docs:
                    whitelist.append(token)
                    break
            if token_count < min_docs:
                document.remove(token)
Any suggestions would be greatly appreciated!
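One way to avoid the nested rescans is to make a single pass that counts how many documents each token appears in, and then strip every token whose document count falls below the threshold: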
from collections import defaultdict
import six


def calc_token_frequencies(doc_list):
    frequencies = defaultdict(int)  # Each dict item will start off as int(0)
    for token_set in doc_list:
        for token in token_set:
            frequencies[token] += 1
    return frequencies


if __name__ == '__main__':
    # Use a list of sets here in order to leverage set features
    tokenized_document_list = [{"egg", "apple", "bread", "milk", "pear"},
                               {"egg", "apple", "bread", "milk"},
                               {"egg", "apple", "bread", "milk"}]
    # Count the number of documents each token was in.
    token_frequencies = calc_token_frequencies(tokenized_document_list)
    # I used 50% here instead of the example 5% so that it would do something useful.
    token_min_docs = 0.5*len(tokenized_document_list)
    # Calculate the black list via set comprehension.
    token_blacklist = {token for token, doc_count in six.iteritems(token_frequencies)
                       if doc_count < token_min_docs}
    # Remove items on the black list.
    for doc_tokens in tokenized_document_list:
        doc_tokens.difference_update(token_blacklist)
    print(tokenized_document_list)
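If the next part of your code really does need plain lists rather than sets, the same counting idea works while preserving the original list-of-lists structure and the token order within each document. Here is a minimal sketch using collections.Counter; the function name filter_rare_tokens and the min_fraction parameter are illustrative, not from the original post:

from collections import Counter


def filter_rare_tokens(tokenized_docs, min_fraction=0.05):
    # Document frequency: count each token at most once per document.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    min_docs = min_fraction * len(tokenized_docs)
    # Keep only tokens that appear in at least min_docs documents.
    return [[token for token in doc if doc_freq[token] >= min_docs]
            for doc in tokenized_docs]


docs = [["egg", "apple", "bread", "milk", "pear"],
        ["egg", "apple", "bread", "milk"],
        ["egg", "apple", "bread", "milk"]]
print(filter_rare_tokens(docs, min_fraction=0.5))
# [['egg', 'apple', 'bread', 'milk'], ['egg', 'apple', 'bread', 'milk'], ['egg', 'apple', 'bread', 'milk']]

This is one pass over the corpus to build the counts plus one pass to filter, so it avoids rescanning every document for every token as the original loop does.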