Any alternate approaches to calculate co-occurrence matrix in Python?

Question

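I am trying to compute a co-occurrence matrix for a large corpus, but it takes a very long time (6+ hours). Is there a faster way?

My approach:

Treat this array as the corpus, and each element of the corpus as a context: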

corpus = [
    'where python is used',
    'what is python used in',
    'why python is best',
    'what companies use python'
]

Algorithm:

import numpy as np

words = list(set(' '.join(corpus).split(' ')))
c_matrix = np.zeros((len(words), len(words)), dtype='int')

for context in corpus:
    context = context.split(' ')
    for i in range(len(context)):
        for j in range(i + 1, len(context)):
            row = words.index(context[i])
            column = words.index(context[j])
            c_matrix[row][column] += 1

Answer 1

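The algorithm shown is inefficient because it recomputes words.index(...) many times, and each call is a linear scan of the list. You can precompute the indices once per context and then build the matrix. Here is a significantly better solution: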

words = list(set(' '.join(corpus).split(' ')))
c_matrix = np.zeros((len(words), len(words)), dtype='int')

for context in corpus:
    context = context.split(' ')
    index = [words.index(item) for item in context]
    for i in range(len(context)):
        for j in range(i + 1, len(context)):
            c_matrix[index[i]][index[j]] += 1

Moreover, you can convert index to a NumPy array and use Numba (or Cython) to build c_matrix from it very quickly.
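As a rough illustration of the Numba suggestion, here is a minimal sketch. The helper name add_pairs is hypothetical, and the try/except fallback is an assumption added so the snippet still runs (without any speedup) when Numba is not installed. The vocabulary is sorted only to make the row order reproducible.

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        # Fallback (assumption): run as plain Python if Numba is missing.
        return func


@njit
def add_pairs(c_matrix, index):
    # For every ordered pair (i < j) in one context, increment the matching cell.
    n = index.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            c_matrix[index[i], index[j]] += 1


corpus = [
    'where python is used',
    'what is python used in',
    'why python is best',
    'what companies use python'
]

words = sorted(set(' '.join(corpus).split(' ')))
c_matrix = np.zeros((len(words), len(words)), dtype=np.int64)

for context in corpus:
    # Precompute the integer ids once per context, as a NumPy array.
    index = np.array([words.index(item) for item in context.split(' ')],
                     dtype=np.int64)
    add_pairs(c_matrix, index)
```

The compiled loop only pays off when contexts are long or numerous; for a corpus this small the compilation time likely dominates.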

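Finally, you can convert words into a dictionary (with the strings of the current list as keys and their positions in the list as values) so that lookups are much faster (constant time on average).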
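The dictionary idea could look roughly like this. The name word_to_id is chosen here for illustration and does not appear in the original code; the rest follows the loop from the answer.

```python
import numpy as np

corpus = [
    'where python is used',
    'what is python used in',
    'why python is best',
    'what companies use python'
]

words = list(set(' '.join(corpus).split(' ')))
# Map each word to its row/column position; dict lookups avoid scanning the list.
word_to_id = {word: i for i, word in enumerate(words)}

c_matrix = np.zeros((len(words), len(words)), dtype='int')

for context in corpus:
    # One dictionary lookup per token instead of one list scan per pair.
    index = [word_to_id[item] for item in context.split(' ')]
    for i in range(len(index)):
        for j in range(i + 1, len(index)):
            c_matrix[index[i], index[j]] += 1
```

With a large vocabulary this removes the linear cost of words.index(...) entirely, so the remaining work is proportional to the number of word pairs per context.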

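The resulting algorithm should be several orders of magnitude faster. If that is still not enough, you may need to replace the matrix c_matrix with a more advanced (but also more complex) sparse data structure, depending on your needs.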
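The answer does not say which sparse structure to use. As one simple possibility, the sketch below stores only the non-zero cells in a collections.Counter keyed by (row, column). This is a swapped-in stand-in for a real sparse matrix type, and the names word_to_id and counts are illustrative.

```python
from collections import Counter
from itertools import combinations

corpus = [
    'where python is used',
    'what is python used in',
    'why python is best',
    'what companies use python'
]

word_to_id = {}     # word -> integer id, assigned on first sight
counts = Counter()  # (row id, column id) -> co-occurrence count

for context in corpus:
    ids = [word_to_id.setdefault(token, len(word_to_id))
           for token in context.split(' ')]
    # combinations() yields every (ids[i], ids[j]) with i < j, in order,
    # which matches the nested loops of the dense version.
    counts.update(combinations(ids, 2))
```

Memory then grows with the number of distinct word pairs that actually co-occur rather than with the square of the vocabulary size. If SciPy is available, the stored (row, column, count) triples could later be loaded into a type such as scipy.sparse.coo_matrix.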

python

numpy

machine-learning