List of TFIDF points for Scikit Nearest Neighbor

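I can run a nearest-neighbor query for a single TF-IDF point, but not for a list of them.

Before the details, I should mention why I want this: querying the neighbors of each data point one at a time takes a long time, and I expect that passing a list of points to kneighbors would be optimized internally.

According to the NearestNeighbors documentation: http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html#sklearn.neighbors.NearestNeighbors.kneighbors

it says I can query multiple points: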

>>> X = [[0., 1., 0.], [1., 0., 1.]]
>>> neigh.kneighbors(X, return_distance=False)
array([[1],
       [2]]...)
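
For context, the quoted snippet depends on a fitted `neigh` object that is not shown above. The sketch below is one self-contained way to reproduce it; the training `samples` are an assumption chosen to match the quoted output, not something stated in this post.

```python
from sklearn.neighbors import NearestNeighbors

# Assumed training data (not given in the post); three points in 3-D space.
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]

neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(samples)

# Two query points passed in a single call: one row per query.
X = [[0., 1., 0.], [1., 0., 1.]]
indices = neigh.kneighbors(X, return_distance=False)
print(indices)
```

The result has one row per query point and one column per requested neighbor, which is the behavior the question hopes to get for TF-IDF vectors.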

I am trying to do the same thing. I can run the neighbor query for each point individually:

from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

samples = ["This is a test","a very good test","some more text"]
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(samples)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
neigh = NearestNeighbors(n_neighbors=1, n_jobs=-1) 
neigh.fit(X_train_tfidf)

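ll = []
test = ["Test if this works", "Zoom zoom"]
for k in test:
    # Vectorize one document at a time: each result is a 1 x n_features sparse matrix
    predict = count_vect.transform([k])
    X_tfidf2 = tfidf_transformer.transform(predict)
    ll.append(X_tfidf2)
    # Querying a single point works
    res = neigh.kneighbors(X_tfidf2, return_distance=False)
# Passing the Python list of sparse matrices raises the ValueError
#res = neigh.kneighbors(ll, return_distance=False)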

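When I collect all the TF-IDF sparse matrices in a list and pass that list to kneighbors, I get an error. Uncomment the last line to reproduce it.

Error: ValueError: setting an array element with a sequence (on the line res = neigh.kneighbors...)

Try stacking the per-document sparse matrices into a single sparse matrix before querying: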

from scipy import sparse

from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

samples = ["This is a test","a very good test","some more text"]
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(samples)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
neigh = NearestNeighbors(n_neighbors=1, n_jobs=-1) 
neigh.fit(X_train_tfidf)

ll = []
test = ["Test if this works", "Zoom zoom"]
for k in test:
    predict = count_vect.transform([k])
    X_tfidf2 = tfidf_transformer.transform(predict)
    ll.append(X_tfidf2)

# Stack the 1 x n_features rows into one n_queries x n_features sparse matrix
ll = sparse.vstack(ll)
res = neigh.kneighbors(ll, return_distance=False)
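
To see why stacking helps, it can be useful to compare shapes. The sketch below reuses the sample data from this post; the variable names `rows` and `stacked` are illustrative only. A Python list of sparse matrices is not itself a 2-D array, whereas the stacked result is a single sparse matrix with one row per query, which is the kind of input kneighbors accepts.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import NearestNeighbors

samples = ["This is a test", "a very good test", "some more text"]
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(count_vect.fit_transform(samples))

neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(X_train_tfidf)

# One 1 x n_features sparse matrix per test document.
test = ["Test if this works", "Zoom zoom"]
rows = [tfidf_transformer.transform(count_vect.transform([doc])) for doc in test]
print(type(rows), len(rows), rows[0].shape)

# vstack joins them row-wise into one n_queries x n_features sparse matrix.
stacked = sparse.vstack(rows)
print(type(stacked), stacked.shape)

res = neigh.kneighbors(stacked, return_distance=False)
print(res.shape)
```

Each row of `res` corresponds to one test document, in the same order as `test`.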

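Without the loop (the vectorizer can transform the whole list of test documents at once, so scipy.sparse is not needed here):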

from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

samples = ["This is a test","a very good test","some more text"]
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(samples)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
neigh = NearestNeighbors(n_neighbors=1, n_jobs=-1) 
neigh.fit(X_train_tfidf)

test=["Test if this works","Zoom zoom"]
X_test_counts = count_vect.transform(test)

X_test_tfidf = tfidf_transformer.transform(X_test_counts)

res = neigh.kneighbors(X_test_tfidf, return_distance=False)
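
As a possible simplification, the two-step CountVectorizer + TfidfTransformer can be replaced with TfidfVectorizer, which scikit-learn documents as equivalent to that combination. This is a swapped-in alternative, not the code from the post, and it is only a minimal sketch on the same sample data. It also asks for distances so that poorly matched queries can be inspected.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

samples = ["This is a test", "a very good test", "some more text"]

# TfidfVectorizer performs counting and TF-IDF weighting in one step.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(samples)

neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(X_train)

# Transform all test documents at once and query them in a single call.
test = ["Test if this works", "Zoom zoom"]
X_test = vectorizer.transform(test)
distances, indices = neigh.kneighbors(X_test)

for doc, dist, idx in zip(test, distances[:, 0], indices[:, 0]):
    print(f"{doc!r} -> nearest: {samples[idx]!r} (distance {dist:.3f})")
```

Note that "Zoom zoom" shares no vocabulary with the training samples, so its TF-IDF row is all zeros and every training point is equally far away. The index returned for it is not meaningful, and checking the distance is one way to detect such cases.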