Pyspark: How to convert a string (created from a dense vector) back to a dense vector?
I have a large dataset (roughly 10 million rows) and I'm looking for an efficient way to recreate dense vectors from their string representation.
Here is my schema:
root
|-- features: string (nullable = true)
|-- id_index: double (nullable = true)
And here is the first row:
train.first()
Row(features='[-1.8744251359864337,0.8208032878135908,1.6772737952383912,0.5074761601167237,-0.9327241948725055,1.064324833351145,-0.026543021475899584,-0.2738297628597614,1.1621882143427753,0.022718595764125882,-0.480804744856163,-0.058405708900107677,0.05971905240143063,-0.3469121380857816,-0.18753641543435115,-0.07209073425907712,0.3231645936694398,0.19913281255794962,-0.27914981007260486,-0.14564720252350738,0.20391682163361805,-0.32573666381677435,0.7576647591212007,0.4242633700261033,-0.15593357299211452,0.017449221887097507,0.05121680297513904,0.5842733444225926,0.10450917006313973,-0.24553120193983335,-0.5334612434119697,0.5517353774258191,-0.3116056252939926,-0.9396807558084017,0.12348781369817632,0.6166678815053761,0.05457562154488685,-0.13311701358504352,0.003852337914245302,-0.3513220177034468,0.23513621861470274,0.30291278930119236,-0.29289442414132855]', id_index=34823.0)
The features column was created with PCA; I then had to convert the vectors to strings in order to resample, and now I want to recreate the dense vectors so that I can use spark.ml.
Any suggestions?
Thanks!
You can use from_json to parse the string into an array, then array_to_vector to create the dense vector (for Spark versions >= 3.1.0):
import pyspark.sql.functions as F
from pyspark.ml.functions import array_to_vector

# Parse the JSON array string into array<double>, then convert it to a dense vector
train = train.withColumn('features_vector', array_to_vector(F.from_json('features', "array<double>")))
train.printSchema()
# root
# |-- features: string (nullable = true)
# |-- id_index: double (nullable = true)
# |-- features_vector: vector (nullable = true)
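As a quick sanity check (a minimal sketch; the output shown is illustrative and truncated), you can inspect the first row of the new column:
train.select('features_vector').first()
# Row(features_vector=DenseVector([-1.8744, 0.8208, 1.6773, ...]))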
Or use a UDF for Spark versions < 3.1.0:
import pyspark.sql.functions as F
from pyspark.ml.linalg import Vectors, VectorUDT

# Wrap the parsed array<double> in a DenseVector via a Python UDF
arraytovector = F.udf(lambda vs: Vectors.dense(vs), VectorUDT())
train = train.withColumn('features_vector', arraytovector(F.from_json('features', "array<double>")))
train.printSchema()
# root
# |-- features: string (nullable = true)
# |-- id_index: double (nullable = true)
# |-- features_vector: vector (nullable = true)
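Note that array_to_vector runs natively inside the JVM, while the UDF route serializes every row to a Python worker and back, so on ~10 million rows the built-in function should be noticeably faster; prefer it whenever your Spark version allows.
Once the vector column exists, it can be fed straight into any spark.ml estimator. A minimal sketch (KMeans and k=10 are arbitrary choices here, just to show the column plugs in):
from pyspark.ml.clustering import KMeans

# Hypothetical model, purely for illustration
kmeans = KMeans(featuresCol='features_vector', k=10)
model = kmeans.fit(train)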