scikit-learn HashingVectorizer on sparse matrix

In scikit-learn, how do I run the HashingVectorizer on data that already lives in a scipy.sparse matrix?

My data is in svmlight format, so I load it with sklearn.datasets.load_svmlight_file and get a scipy.sparse matrix to work on.

The TfidfTransformer from scikit-learn can be fed such a sparse matrix to transform it, but how can I give the same sparse matrix to the HashingVectorizer instead?

Edit: Would it perhaps be possible to chain a series of method calls on the sparse matrix, maybe using FeatureHasher?

Edit 2: After a helpful discussion with user cfh below, my goal is to go from the input, a sparse count matrix obtained from the svmlight data, to the output, a matrix of token occurrences such as the HashingVectorizer produces. How can this be done?

I provide sample code below and would greatly appreciate help on how to do this. Thanks in advance:

from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from scipy.sparse import csr_matrix

# example data
X_train = np.array([[1., 1.], [2., 3.], [4., 0.]])
print("X_train: \n", X_train)
# convert to scipy.sparse.csr_matrix to be consistent with the output of load_svmlight_file
X_train_crs = csr_matrix(X_train)
print("X_train_crs: \n", X_train_crs)
# no problem running TfidfTransformer() on this csr matrix to get a transformed csr matrix
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X_train_crs)
print("tfidf: \n", tfidf)
# How do I use the HashingVectorizer with X_train_crs?
hv = HashingVectorizer(n_features=2)
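One possible workaround (my own suggestion, not from the original question): HashingVectorizer expects raw text, but FeatureHasher accepts {feature_name: value} dicts, so each row of the count matrix can be turned into a dict keyed by its stringified column index and hashed directly. A minimal sketch, assuming the toy X_train_crs above:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction import FeatureHasher

X_train_crs = csr_matrix(np.array([[1., 1.], [2., 3.], [4., 0.]]))

# Build one {feature_name: count} dict per row of the sparse matrix;
# the feature "names" here are just the stringified column indices.
row_dicts = [
    {str(j): X_train_crs[i, j] for j in X_train_crs[i].nonzero()[1]}
    for i in range(X_train_crs.shape[0])
]

hasher = FeatureHasher(n_features=4, input_type='dict')
X_hashed = hasher.transform(row_dicts)  # scipy.sparse matrix of shape (3, 4)
```

Note that FeatureHasher uses signed hashing by default, so colliding features may partially cancel rather than simply add up.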

Hashing basically combines words at random into a smaller number of buckets. With an already-computed frequency matrix, you can simulate it like this:

n_features = X_train.shape[1]
n_desired_features = n_features // 5
# assign each original column to a random bucket
buckets = np.random.randint(0, n_desired_features, size=n_features)
X_new = np.zeros((X_train.shape[0], n_desired_features), dtype=X_train.dtype)
for i in range(n_features):
    X_new[:, buckets[i]] += X_train[:, i]

Of course you can adjust n_desired_features as needed. Just make sure to use the same buckets for the test data as well.

If you need to do the same on a sparse matrix, you can do this:

from scipy.sparse import coo_matrix

# M[i, buckets[i]] = 1, so X_train.dot(M) sums each column into its bucket
M = coo_matrix((np.repeat(1, n_features), (np.arange(n_features), buckets)),
               shape=(n_features, n_desired_features))
X_new = X_train.dot(M)
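To convince yourself the sparse projection is equivalent, you can compare it against the dense loop on the same buckets. A small self-contained check with my own toy data:

```python
import numpy as np
from scipy.sparse import csr_matrix, coo_matrix

rng = np.random.RandomState(0)
X_dense = rng.randint(0, 5, size=(4, 10)).astype(float)
X_train = csr_matrix(X_dense)

n_features = X_dense.shape[1]
n_desired_features = 3
buckets = rng.randint(0, n_desired_features, size=n_features)

# dense loop version
X_loop = np.zeros((X_dense.shape[0], n_desired_features))
for i in range(n_features):
    X_loop[:, buckets[i]] += X_dense[:, i]

# sparse projection version: M[i, buckets[i]] = 1
M = coo_matrix((np.repeat(1.0, n_features), (np.arange(n_features), buckets)),
               shape=(n_features, n_desired_features))
X_new = X_train.dot(M)

# both routes produce the same bucketed matrix
assert np.allclose(X_new.toarray(), X_loop)
```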