Removing words in a list that appear infrequently

I have a number of documents that have been tokenized into lists, with the tokens as the elements. I then put all of those lists into a single list, so I end up with a list of token lists.

A simple example:

[["egg","apple","bread","milk","pear"], ["egg","apple","bread","milk"], ["egg","apple","bread","milk"]]

I want to remove tokens that appear in fewer than x% of the documents (for example "pear" above, since it only occurs in one of the three documents). However, I'm not sure how to do this efficiently. I know the data structure is probably part of the problem, but I need the output in this format for the next part of my code.

The code I have at the moment is below, and it is obviously not very efficient when there are a lot of documents:

min_docs = 0.05*len(tokenized_document_list)
whitelist = []
for document in tokenized_document_list: #Go through each document
    for token in document: #Go through each token in each document
        if token in whitelist:
            continue
        else:
            token_count = 0
            for document_t in tokenized_document_list: #Go through each document looking for token
                if token in document_t:
                    token_count = token_count + 1
                    if token_count > min_docs:
                        whitelist.append(token)
                        break
            if token_count < min_docs:
                document.remove(token)

Any suggestions would be greatly appreciated!
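The approach below avoids re-scanning the corpus for every token: it counts, in a single pass, how many documents each token appears in (the documents are kept as sets, so a token counts at most once per document), builds a blacklist of the tokens that fall below the threshold, and then removes those tokens from every document with a set difference.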

from collections import defaultdict
import six


def calc_token_frequencies(doc_list):
    frequencies = defaultdict(int)  # Each dict item will start off as int(0)
    for token_set in doc_list:
        for token in token_set:
            frequencies[token] += 1
    return frequencies


if __name__ == '__main__':
    # Use a list of sets here in order to leverage set features
    tokenized_document_list = [{"egg", "apple", "bread", "milk", "pear"},
                               {"egg", "apple", "bread", "milk"},
                               {"egg", "apple", "bread", "milk"}]

    # Count the number of documents each token was in.
    token_frequencies = calc_token_frequencies(tokenized_document_list)

    # I used 50% here instead of the example 5% so that it would do something useful.
    token_min_docs = 0.5*len(tokenized_document_list)

    # Calculate the black list via set comprehension.
    token_blacklist = {token for token, doc_count in six.iteritems(token_frequencies)
                       if doc_count < token_min_docs}

    # Remove items on the black list
    for doc_tokens in tokenized_document_list:
        doc_tokens.difference_update(token_blacklist)

    print(tokenized_document_list)
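
With the three example documents and the 50% threshold, token_min_docs is 1.5, so only "pear" (document count 1) lands on the blacklist, and the script prints three identical sets containing egg, apple, bread and milk (set ordering aside).

Note that converting each document to a set drops duplicate tokens and their original order. If the downstream code really needs the list-of-lists format mentioned in the question, the same counting idea can filter the lists directly. A minimal sketch, assuming duplicates and token order must be preserved (the filter_rare_tokens helper and the 0.5 threshold are illustrative, not from the original code):

from collections import Counter


def filter_rare_tokens(docs, min_fraction):
    # Document frequency: count each token at most once per document.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    min_docs = min_fraction * len(docs)
    keep = {token for token, count in doc_freq.items() if count >= min_docs}
    # Rebuild each document, keeping the order and duplicates of the surviving tokens.
    return [[token for token in doc if token in keep] for doc in docs]


docs = [["egg", "apple", "bread", "milk", "pear"],
        ["egg", "apple", "bread", "milk"],
        ["egg", "apple", "bread", "milk"]]
print(filter_rare_tokens(docs, 0.5))
# [['egg', 'apple', 'bread', 'milk'], ['egg', 'apple', 'bread', 'milk'], ['egg', 'apple', 'bread', 'milk']]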