PySpark: expand a DenseVector into a tuple in an RDD
I have the following RDD, where each record is a tuple of (bigint, vector):
myRDD.take(5)
[(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])),
(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])),
(0, DenseVector([5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0])),
(1, DenseVector([9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432])),
(1, DenseVector([9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432]))]
How do I expand the DenseVector and make its elements part of the tuple? That is, I want the above to become:
[(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432),
(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432),
(0, 5.0, 20.0, 0.3444, 0.3295, 54.3122, 4.0),
(1, 9.2463, 1.0, 0.392, 0.3381, 162.6437, 7.9432),
(1, 9.2463, 2.0, 0.392, 0.3381, 162.6437, 7.9432)]
Thanks!
Well, since pyspark.ml.linalg.DenseVector (or the mllib one) is iterable (it provides __len__ and __getitem__ methods), you can treat it like any other Python collection, for example:
def as_tuple(kv):
    """
    >>> as_tuple((1, DenseVector([9.25, 1.0, 0.31, 0.31, 162.37])))
    (1, 9.25, 1.0, 0.31, 0.31, 162.37)
    """
    k, v = kv
    # Use *v.toArray() if you want to support Sparse one as well.
    return (k, *v)
For Python 2, replace:
(k, *v)
with:
from itertools import chain
tuple(chain([k], v))
or:
(k, ) + tuple(v)
If you want to convert the values to Python (not NumPy) scalars, use:
v.toArray().tolist()
in place of v.
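To see why unpacking works without any Spark-specific call, here is a minimal sketch that runs without a Spark installation. SeqVector is a hypothetical stand-in, not part of PySpark: it only assumes what the answer states about DenseVector, namely that it provides __len__ and __getitem__. Python falls back to index-based iteration for such objects, so all three forms above should produce the same tuple.

```python
from itertools import chain


class SeqVector:
    """Hypothetical stand-in for DenseVector (not a PySpark class).

    It exposes only __len__ and __getitem__, which is enough for Python
    to iterate over it by index.
    """

    def __init__(self, values):
        self._values = [float(x) for x in values]

    def __len__(self):
        return len(self._values)

    def __getitem__(self, index):
        # The IndexError raised past the last element ends the iteration.
        return self._values[index]


def as_tuple(kv):
    # Python 3 form: unpack the vector directly into the tuple.
    k, v = kv
    return (k, *v)


def as_tuple_chain(kv):
    # Python 2 compatible form using itertools.chain.
    k, v = kv
    return tuple(chain([k], v))


def as_tuple_concat(kv):
    # Python 2 compatible form using tuple concatenation.
    k, v = kv
    return (k,) + tuple(v)


# Assumed sample records, shaped like the (label, vector) pairs in the question.
records = [(1, SeqVector([9.25, 1.0, 0.31])), (0, SeqVector([5.0, 20.0, 0.34]))]
flattened = [as_tuple(r) for r in records]
print(flattened)
```

With the real RDD from the question, the same function would presumably be applied as myRDD.map(as_tuple) rather than through a list comprehension.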