NumPy or Dictionary?

Question

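I have to process a large dataset. I need to store the term frequency of each sentence, which I can do with either a list of dictionaries or a NumPy array.

However, I will have to sort and append (in case the word already exists). Which is better in this case?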

Answer 1

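As you mentioned in the comments, you don't know the final size of the words/tweets matrix, so an array becomes a cumbersome solution: a NumPy array has a fixed size, and growing it means allocating a new array and copying the data.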

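For the reasons you mentioned, a dictionary feels more natural here. The keys of the dictionary would be the words in the tweets, and each value could be a list of (tweet_id, term_frequency) elements.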
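A minimal sketch of that layout, assuming the tweets are already tokenized and keyed by an integer id (the sample data and the names `tweets` and `index` are made up for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical sample data: tweet_id -> already-tokenized tweet.
tweets = {
    0: ["hello", "world", "hello"],
    1: ["goodbye", "cruel", "world"],
}

# word -> list of (tweet_id, term_frequency) pairs
index = defaultdict(list)
for tweet_id, tokens in tweets.items():
    # Counter gives the frequency of each word within one tweet.
    for word, freq in Counter(tokens).items():
        index[word].append((tweet_id, freq))

print(index["hello"])  # [(0, 2)]
print(index["world"])  # [(0, 1), (1, 1)]
```

Appending to a list inside a dictionary is cheap, so this structure can grow as new tweets and new words arrive without knowing the final size up front.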

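Eventually you may want to do something else with your term frequencies (classification, for example). I suspect that is why you wanted to use a NumPy array from the start. If so, converting the dictionary to a NumPy array afterwards should not be too hard.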
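One way the conversion could look, as a sketch rather than a definitive recipe. It assumes the dictionary maps each word to a list of (tweet_id, term_frequency) pairs and that tweet ids are contiguous integers starting at 0; the names `index`, `vocab`, and `column` are illustrative:

```python
import numpy as np

# Hypothetical dictionary: word -> list of (tweet_id, term_frequency) pairs.
index = {
    "hello": [(0, 2)],
    "world": [(0, 1), (1, 1)],
    "goodbye": [(1, 1)],
    "cruel": [(1, 1)],
}

# Fix a column order for the words, then work out the number of rows.
vocab = sorted(index)
column = {word: j for j, word in enumerate(vocab)}
n_tweets = 1 + max(tid for postings in index.values() for tid, _ in postings)

# Rows are tweets, columns are words; cells default to zero.
dense = np.zeros((n_tweets, len(vocab)), dtype=int)
for word, postings in index.items():
    for tweet_id, freq in postings:
        dense[tweet_id, column[word]] = freq

print(vocab)  # ['cruel', 'goodbye', 'hello', 'world']
print(dense)
```

The array is allocated once, after the vocabulary is complete, which avoids the repeated resizing that makes arrays awkward while the data is still growing.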

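Note, however, that this array is likely to be very large (1M * number of words) and very sparse, meaning it will contain mostly zeros. Because a NumPy array would spend a lot of memory storing all those zeros, you may want to look at a more memory-efficient data structure for storing sparse matrices (see scipy.sparse).

Hope this helps.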
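As a rough sketch of the sparse alternative, the same kind of dictionary can be turned into a scipy.sparse matrix by collecting (row, column, value) triples, so that only the non-zero counts are stored. The sample data and variable names are assumptions for illustration, and tweet ids are assumed to be integers starting at 0:

```python
from scipy.sparse import coo_matrix

# Hypothetical dictionary: word -> list of (tweet_id, term_frequency) pairs.
index = {
    "hello": [(0, 2)],
    "world": [(0, 1), (1, 1)],
    "goodbye": [(1, 1)],
    "cruel": [(1, 1)],
}

vocab = sorted(index)
column = {word: j for j, word in enumerate(vocab)}

# One (row, col, value) triple per non-zero cell.
rows, cols, vals = [], [], []
for word, postings in index.items():
    for tweet_id, freq in postings:
        rows.append(tweet_id)
        cols.append(column[word])
        vals.append(freq)

shape = (max(rows) + 1, len(vocab))
# Build in COO format, then convert to CSR for fast row access.
sparse = coo_matrix((vals, (rows, cols)), shape=shape, dtype=int).tocsr()

print(sparse.nnz)        # 5 stored values for a 2 x 4 matrix
print(sparse.toarray())
```

On a toy example the saving is negligible, but with around a million tweets and a large vocabulary the number of stored values scales with the words actually used rather than with rows times columns.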

Answer 2

The solution to the problem you describe is SciPy's sparse matrix.

A small example:

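from scipy.sparse import csr_matrix

docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
indptr = [0]     # indptr[i]:indptr[i + 1] is the slice of row i
indices = []     # column index of each stored value
data = []        # the stored values themselves
vocabulary = {}  # term -> column index
for d in docs:
    for term in d:
        # Assign the next free column to a term seen for the first time.
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))

# Duplicate (row, column) entries are summed, which yields the counts.
print(csr_matrix((data, indices, indptr), dtype=int).toarray())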

Each sentence is a row and each term is a column.

One more tip: take a look at CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# min_df=2 keeps only terms that occur in at least two documents
vectorizer = CountVectorizer(min_df=2)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
# prints a mapping equivalent to (key order may differ)
# {'this': 4, 'is': 2, 'the': 3, 'document': 0, 'first': 1}
X = vectorizer.transform(corpus)

print(X.toarray())
# prints
# [[1 1 1 1 1]
#  [1 0 1 1 1]
#  [0 0 0 1 0]
#  [1 1 1 1 1]]

Now X is your document-term matrix (note that X is a csr_matrix). You can also use TfidfTransformer if you want to apply tf-idf weighting to it.
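A short sketch of the tf-idf step, assuming scikit-learn is installed and reusing the same toy corpus. The transformer is left at its default settings, and the variable names `counts` and `tfidf` are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Raw term counts as a sparse document-term matrix.
counts = CountVectorizer(min_df=2).fit_transform(corpus)

# Reweight the counts by inverse document frequency; the result stays sparse.
tfidf = TfidfTransformer().fit_transform(counts)

print(tfidf.shape)  # (4, 5)
print(tfidf.toarray().round(2))
```

If the raw counts are not needed on their own, scikit-learn's TfidfVectorizer combines both steps in a single estimator.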
