Pyspark: How to convert a string (created from a dense vector) back to a dense vector?
I have a large dataset (roughly 10 million rows) and I'm looking for an efficient way to recreate dense vectors from their string representation.
Here is my schema:
root
|-- features: string (nullable = true)
|-- id_index: double (nullable = true)
And here is the first row:
train.first()
Row(features='[-1.8744251359864337,0.8208032878135908,1.6772737952383912,0.5074761601167237,-0.9327241948725055,1.064324833351145,-0.026543021475899584,-0.2738297628597614,1.1621882143427753,0.022718595764125882,-0.480804744856163,-0.058405708900107677,0.05971905240143063,-0.3469121380857816,-0.18753641543435115,-0.07209073425907712,0.3231645936694398,0.19913281255794962,-0.27914981007260486,-0.14564720252350738,0.20391682163361805,-0.32573666381677435,0.7576647591212007,0.4242633700261033,-0.15593357299211452,0.017449221887097507,0.05121680297513904,0.5842733444225926,0.10450917006313973,-0.24553120193983335,-0.5334612434119697,0.5517353774258191,-0.3116056252939926,-0.9396807558084017,0.12348781369817632,0.6166678815053761,0.05457562154488685,-0.13311701358504352,0.003852337914245302,-0.3513220177034468,0.23513621861470274,0.30291278930119236,-0.29289442414132855]', id_index=34823.0)
The features column was created with PCA; I then had to convert the vectors to strings in order to resample, and now I want to recreate the dense vectors so that I can use spark.ml.
Any suggestions?
Thanks!
You can use from_json to parse the string into an array, then array_to_vector to create the dense vector (for Spark versions >= 3.1.0):
import pyspark.sql.functions as F
from pyspark.ml.functions import array_to_vector

# Parse the JSON array string into array<double>, then convert it to a dense vector
train = train.withColumn('features_vector', array_to_vector(F.from_json('features', "array<double>")))
train.printSchema()
# root
# |-- features: string (nullable = true)
# |-- id_index: double (nullable = true)
# |-- features_vector: vector (nullable = true)
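As a quick sanity check (a minimal sketch; the output shown is illustrative and truncated), you can inspect the first row of the new column:
train.select('features_vector').first()
# Row(features_vector=DenseVector([-1.8744, 0.8208, 1.6773, ...]))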
Or use a UDF for Spark versions < 3.1.0:
import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

# Wrap the parsed array<double> in a DenseVector via a Python UDF
arraytovector = F.udf(lambda vs: Vectors.dense(vs), VectorUDT())
train = train.withColumn('features_vector', arraytovector(F.from_json('features', "array<double>")))
train.printSchema()
# root
# |-- features: string (nullable = true)
# |-- id_index: double (nullable = true)
# |-- features_vector: vector (nullable = true)
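Note that array_to_vector runs natively inside the JVM, while the UDF route serializes every row to a Python worker and back, so on ~10 million rows the built-in function should be noticeably faster; prefer it whenever your Spark version allows.
Once the vector column exists, it can be fed straight into any spark.ml estimator. A minimal sketch (KMeans and k=10 are arbitrary choices here, just to show the column plugs in):
from pyspark.ml.clustering import KMeans

# Hypothetical model, purely for illustration
kmeans = KMeans(featuresCol='features_vector', k=10)
model = kmeans.fit(train)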