How to extract array column by selecting one field of struct-array column in PySpark
I have a dataframe df with an array-of-structs column properties (each element is a struct with fields x and y), and I want to create a new array column by extracting the x values from the properties column.
A sample input dataframe looks like this:
import pyspark.sql.functions as F
from pyspark.sql.types import *

data = [
    (1, [{'x': 11, 'y': 'str1a'}]),
    (2, [{'x': 21, 'y': 'str2a'}, {'x': 22, 'y': 0.22, 'z': 'str2b'}]),
]
my_schema = StructType([
    StructField('id', LongType()),
    StructField('properties', ArrayType(
        StructType([
            StructField('x', LongType()),
            StructField('y', StringType()),
        ])
    )),
])
df = spark.createDataFrame(data, schema=my_schema)
df.show()
# +---+--------------------+
# | id| properties|
# +---+--------------------+
# | 1| [[11, str1a]]|
# | 2|[[21, str2a], [22...|
# +---+--------------------+
df.printSchema()
# root
# |-- id: long (nullable = true)
# |-- properties: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- x: long (nullable = true)
# | | |-- y: string (nullable = true)
The desired output df_new, on the other hand, should look like:
df_new.show()
# +---+--------------------+--------+
# | id| properties|x_values|
# +---+--------------------+--------+
# | 1| [[11, str1a]]| [11]|
# | 2|[[21, str2a], [22...|[21, 22]|
# +---+--------------------+--------+
df_new.printSchema()
# root
# |-- id: long (nullable = true)
# |-- properties: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- x: long (nullable = true)
# | | |-- y: string (nullable = true)
# |-- x_values: array (nullable = true)
# | |-- element: long (containsNull = true)
Does anyone know a solution for this kind of task?
Ideally, I am looking for a solution that operates row by row, without relying on F.explode. In my actual database I have not yet identified a column equivalent to id, and after calling F.explode I am not sure how to merge the exploded values back together.
Try using properties.x to extract all the x values from the properties array.
Example:
from pyspark.sql.functions import col, expr

# Select the field directly from the array of structs
df.withColumn("x_values", col("properties.x")).show(10, False)

# Or use a higher-order function
df.withColumn("x_values", expr("transform(properties, p -> p.x)")).show(10, False)

#+---+-------------------------+--------+
#|id |properties               |x_values|
#+---+-------------------------+--------+
#|1  |[[11, str1a]]            |[11]    |
#|2  |[[21, str2a], [22, 0.22]]|[21, 22]|
#+---+-------------------------+--------+