Create a 'virtual' numpy array from scipy frequency matrix and array of values

Question

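I have an M-by-W frequency matrix doc_word_freqs, stored as a scipy CSR matrix, that gives the number of times word w appears in document m. I also have a W-dimensional vector z_scores (called word_z_scores in the code below) holding a value associated with each word. In my particular case it is the z-score of each word's log-odds ratio between two subsets of the corpus, but that is not germane to the question.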

I want to compute some metric (in this case, the variance) over the set of z-scores for each document. That is, something like:

np.var(doc_z_scores, axis=1)

where doc_z_scores has M rows, each containing the list of z-scores for every word in document m. This is what I have now, but it is rather inelegant and very slow:

import numpy as np

def doc_variances(doc_word_freqs, word_z_scores):  # wrapper name added so the snippet runs
    # Make a list of M independent empty lists ([[]] * M would alias one list M times)
    docs = [[] for _ in range(doc_word_freqs.shape[0])]
    for m, w in zip(*doc_word_freqs.nonzero()):
        # For each non-zero index in doc_word_freqs, append the z-score
        # of that word the appropriate number of times
        for _ in range(doc_word_freqs[m, w]):
            docs[m].append(word_z_scores[w])
    # Calculate the variance of each of the resulting lists and return
    return np.array([np.var(d) for d in docs])

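Is there any way to do this without actually creating the per-document arrays over which the variance (or whatever other metric) is computed?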

Answer 1

I'm not 100% sure I understand your question correctly. You could use matrix-vector multiplication:

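import numpy as np

# np.asarray(...).ravel() gives a flat array whether the product is an ndarray or an np.matrix
weight = np.asarray(doc_word_freqs @ np.ones_like(word_z_scores)).ravel()  # words per document
mean = np.asarray(doc_word_freqs @ word_z_scores).ravel() / weight
raw_2nd = np.asarray(doc_word_freqs @ (word_z_scores**2)).ravel()  # sum of squared z-scores
variance = raw_2nd / weight - mean**2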

For the "unbiased" variance, use -1 where appropriate, i.e. multiply the result by weight / (weight - 1).
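As a sanity check, here is a minimal sketch that compares the matrix-vector approach with a brute-force reference that materializes each document's z-score list. The toy data, the helper name weighted_variance and its ddof parameter are illustrative assumptions, not part of the original post.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy data: 3 documents, 4 words.
doc_word_freqs = sparse.csr_matrix(np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 1, 1, 1],
]))
word_z_scores = np.array([0.5, -1.0, 2.0, 0.0])


def weighted_variance(freqs, values, ddof=0):
    """Per-row variance of `values`, with each value counted freqs[m, w] times."""
    # Total word count per document (row sums of the frequency matrix).
    weight = np.asarray(freqs.sum(axis=1)).ravel()
    # First and second raw moments via sparse matrix-vector products.
    mean = np.asarray(freqs @ values).ravel() / weight
    raw_2nd = np.asarray(freqs @ values**2).ravel() / weight
    # ddof=0 gives the population variance; ddof=1 the "unbiased" one.
    return (raw_2nd - mean**2) * weight / (weight - ddof)


# Brute-force reference: repeat each z-score by its count, then call np.var.
dense = doc_word_freqs.toarray()
expected = np.array([np.var(np.repeat(word_z_scores, row)) for row in dense])
expected_unbiased = np.array(
    [np.var(np.repeat(word_z_scores, row), ddof=1) for row in dense]
)

fast = weighted_variance(doc_word_freqs, word_z_scores)
fast_unbiased = weighted_variance(doc_word_freqs, word_z_scores, ddof=1)
```

Note that a document with no words has weight 0 and yields nan (with a division warning), as does a one-word document when ddof=1. Subtracting mean**2 from the second raw moment can also lose precision when the variance is tiny relative to the mean, so treat this as a sketch rather than a numerically hardened routine.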


python

numpy

scipy

sparse-matrix