Create a Spark dataframe with thousands of columns, then add an ArrayType column that holds them all

Question

I want to use Scala to create a Spark dataframe like this:

col_1	col_2	col_3	..	col_2048
0.123	0.234	...	...	0.323
0.345	0.456	...	...	0.534

Then I want to add an extra ArrayType column that holds the values of all 2048 columns in a single column:

col_1	col_2	col_3	..	col_2048	array_col
0.123	0.234	...	...	0.323	[0.123, 0.234, ..., 0.323]
0.345	0.456	...	...	0.534	[0.345, 0.456, ..., 0.534]

Answer 1

PySpark:

Build the list of columns and apply Python's map to it.

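from pyspark.sql import functions as f

cols = df.columns

# Convert each column name to a Column and pack them all into one array column
df.withColumn('array_col', f.array(*map(lambda c: f.col(c), cols)))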
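
The snippet above assumes df already exists. As a minimal sketch of the whole flow, assuming an active SparkSession is passed in as spark: the helper names (column_names, random_rows, build_wide_df) are hypothetical, and the random values only stand in for the real data.

```python
import random


def column_names(n_cols):
    """Return ['col_1', ..., 'col_<n_cols>'], the naming used in the question."""
    return [f"col_{i}" for i in range(1, n_cols + 1)]


def random_rows(n_rows, n_cols, seed=0):
    """Generate n_rows tuples of n_cols random floats as placeholder data."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(n_cols)) for _ in range(n_rows)]


def build_wide_df(spark, n_rows=2, n_cols=2048):
    """Create the wide dataframe, then append an ArrayType column holding every value."""
    # Imported here so the pure-Python helpers above work without Spark installed.
    from pyspark.sql import functions as f

    names = column_names(n_cols)
    df = spark.createDataFrame(random_rows(n_rows, n_cols), schema=names)
    # f.array accepts the columns as separate arguments, hence the * unpacking.
    return df.withColumn("array_col", f.array(*[f.col(c) for c in names]))
```

With a session in hand, build_wide_df(spark).select("array_col").show(truncate=False) should display the packed arrays. For real data, replace random_rows with the actual source.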

Answer 2

Try this:

import org.apache.spark.sql.functions.{array, col}
df.withColumn("array_col", array(df.columns.map(col): _*)).show
