How can I change an RDD to Vectors.dense in PySpark?
I'm new to PySpark and I need to transform my RDD:
tfidf.collect()
output:
[('fuel', 0.06190145817054232),
('months', 0.03095072908527116),
('lasting', 0.03095072908527116),
('noticeably', 0.03095072908527116),
('gravitational', 0.06190145817054232),
('minor', 0.03095072908527116),
('mass', 0.03095072908527116),
('apollo', 0.03095072908527116),
('missions', 0.03095072908527116),
('possible', 0.03095072908527116),
('perturbations', 0.03095072908527116),
('quite', 0.03095072908527116),
('crash', 0.03095072908527116),
('mapped', 0.03095072908527116),
('irregular', 0.03095072908527116),
('field', 0.06190145817054232),
('none', 0.03095072908527116),
('earth', 0.03095072908527116),
('issues', 0.03095072908527116),
('altitudes', 0.03095072908527116),
('know', 0.03095072908527116),
('big', 0.03095072908527116),
('problem', 0.03095072908527116)]
Here is something similar that I found in another example:
# this is a whole other example
data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),), (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),), (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
I don't know how to turn my tfidf RDD into data shaped like that.
Use the RDD's map:
from pyspark.ml.linalg import Vectors
tfidf.map(lambda x: (x[0], Vectors.dense(x[1]))).collect()
[('fuel', DenseVector([0.0619])),
('months', DenseVector([0.031])),
('lasting', DenseVector([0.031])),
('noticeably', DenseVector([0.031])),
('gravitational', DenseVector([0.0619])),
('minor', DenseVector([0.031])),
('mass', DenseVector([0.031])),
('apollo', DenseVector([0.031])),
('missions', DenseVector([0.031])),
('possible', DenseVector([0.031])),
('perturbations', DenseVector([0.031])),
('quite', DenseVector([0.031])),
('crash', DenseVector([0.031])),
('mapped', DenseVector([0.031])),
('irregular', DenseVector([0.031])),
('field', DenseVector([0.0619])),
('none', DenseVector([0.031])),
('earth', DenseVector([0.031])),
('issues', DenseVector([0.031])),
('altitudes', DenseVector([0.031])),
('know', DenseVector([0.031])),
('big', DenseVector([0.031])),
('problem', DenseVector([0.031]))]