How can I change an RDD to Vectors.dense in PySpark?
I'm new to PySpark and I need to transform my RDD:
tfidf.collect()
output:
[('fuel', 0.06190145817054232),
('months', 0.03095072908527116),
('lasting', 0.03095072908527116),
('noticeably', 0.03095072908527116),
('gravitational', 0.06190145817054232),
('minor', 0.03095072908527116),
('mass', 0.03095072908527116),
('apollo', 0.03095072908527116),
('missions', 0.03095072908527116),
('possible', 0.03095072908527116),
('perturbations', 0.03095072908527116),
('quite', 0.03095072908527116),
('crash', 0.03095072908527116),
('mapped', 0.03095072908527116),
('irregular', 0.03095072908527116),
('field', 0.06190145817054232),
('none', 0.03095072908527116),
('earth', 0.03095072908527116),
('issues', 0.03095072908527116),
('altitudes', 0.03095072908527116),
('know', 0.03095072908527116),
('big', 0.03095072908527116),
('problem', 0.03095072908527116)]
Here is something similar that I found in another example:
# this is a whole other example
data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),), (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
I don't know how to turn my tfidf RDD into data shaped like that.
Use the RDD's map:
from pyspark.ml.linalg import Vectors
tfidf.map(lambda x: (x[0], Vectors.dense(x[1]))).collect()
[('fuel', DenseVector([0.0619])),
('months', DenseVector([0.031])),
('lasting', DenseVector([0.031])),
('noticeably', DenseVector([0.031])),
('gravitational', DenseVector([0.0619])),
('minor', DenseVector([0.031])),
('mass', DenseVector([0.031])),
('apollo', DenseVector([0.031])),
('missions', DenseVector([0.031])),
('possible', DenseVector([0.031])),
('perturbations', DenseVector([0.031])),
('quite', DenseVector([0.031])),
('crash', DenseVector([0.031])),
('mapped', DenseVector([0.031])),
('irregular', DenseVector([0.031])),
('field', DenseVector([0.0619])),
('none', DenseVector([0.031])),
('earth', DenseVector([0.031])),
('issues', DenseVector([0.031])),
('altitudes', DenseVector([0.031])),
('know', DenseVector([0.031])),
('big', DenseVector([0.031])),
('problem', DenseVector([0.031]))]