Replace values in a deeply nested schema Spark DataFrame
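I am new to PySpark. I am trying to understand how to access Parquet files that contain multiple levels of nested structs and arrays. I need to replace some values in a DataFrame (with a nested schema) with null. I have seen this solution; it works for structs, but I am not sure how it would work with arrays.
My schema looks like this: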
|-- unitOfMeasure: struct
| |-- raw: struct
| | |-- id: string
| | |-- codingSystemId: string
| | |-- display: string
| |-- standard: struct
| | |-- id: string
| | |-- codingSystemId: string
|-- Id: string
|-- actions: array
| |-- element: struct
| | |-- action: string
| | |-- actionDate: string
| | |-- actor: struct
| | | |-- actorId: string
| | | |-- aliases: array
| | | | |-- element: struct
| | | | | |-- value: string
| | | | | |-- type: string
| | | | | |-- assigningAuthority: string
| | | |-- fullName: string
What I want to do is replace unitOfMeasure.raw.id with null, actions.element.action with null, and actions.element.actor.aliases.element.value with null, while keeping the rest of my DataFrame unchanged.
Is there any way to achieve this?
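For array columns, it is a bit more complicated than for struct fields.
One option is to explode the array into a new column so that you can access and update the nested structs. After the update, you have to rebuild the original array column.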
But I prefer to use the higher-order function transform, which was introduced in Spark >= 2.4.
Here is an example:
Input DF:
|-- actions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- action: string (nullable = true)
| | |-- actionDate: string (nullable = true)
| | |-- actor: struct (nullable = true)
| | | |-- actorId: long (nullable = true)
| | | |-- aliases: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- assigningAuthority: string (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | | | |-- value: string (nullable = true)
| | | |-- fullName: string (nullable = true)
+--------------------------------------------------------------+
|actions |
+--------------------------------------------------------------+
|[[action_name1, 2019-12-08, [2, [[aa, t1, v1]], full_name1]]] |
|[[action_name2, 2019-12-09, [3, [[aaa, t2, v2]], full_name2]]]|
+--------------------------------------------------------------+
We pass a lambda function to transform that selects all the struct fields and replaces actions.action and actions.actor.aliases.value with null.
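from pyspark.sql.functions import expr

# Rebuild every struct field by name; only the two target fields become null.
# cast(null as string) keeps the fields typed as string instead of NullType.
transform_expr = """transform(actions, x ->
    struct(cast(null as string) as action,
           x.actionDate as actionDate,
           struct(x.actor.actorId as actorId,
                  transform(x.actor.aliases, y ->
                      struct(cast(null as string) as value,
                             y.type as type,
                             y.assigningAuthority as assigningAuthority)
                  ) as aliases,
                  x.actor.fullName as fullName
           ) as actor
    ))"""

df.withColumn("actions", expr(transform_expr)).show(truncate=False)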
Output DF:
+------------------------------------------------+
|actions |
+------------------------------------------------+
|[[, 2019-12-08, [2, [[, t1, aa]], full_name1]]] |
|[[, 2019-12-09, [3, [[, t2, aaa]], full_name2]]]|
+------------------------------------------------+
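Writing the expression by hand means listing every field, which gets tedious for wide schemas. As a rough sketch, the same kind of expression could be generated from the schema instead. The helper name build_null_expr and its parameters are hypothetical, not part of Spark; it assumes the schema is given in the dict form that DataType.jsonValue() produces, that field names need no backtick quoting, and that the atomic type names are accepted by cast. The demo schema below is abbreviated to the keys the helper actually reads.

```python
def build_null_expr(ref, dtype, path, targets, depth=0):
    """Return a Spark SQL expression that rebuilds `ref`, nulling target leaves.

    ref     -- SQL reference to the current value (column name or lambda variable)
    dtype   -- schema node in the dict form produced by DataType.jsonValue()
    path    -- dotted field path of the current node (array levels add nothing)
    targets -- set of dotted leaf paths to replace with a typed null
    """
    if isinstance(dtype, str):  # atomic type such as "string" or "long"
        return f"cast(null as {dtype})" if path in targets else ref
    if dtype["type"] == "struct":
        parts = []
        for field in dtype["fields"]:
            name = field["name"]
            child = build_null_expr(
                f"{ref}.{name}", field["type"], f"{path}.{name}", targets, depth
            )
            parts.append(f"{child} as {name}")
        return "struct(" + ", ".join(parts) + ")"
    if dtype["type"] == "array":
        var = f"x{depth}"  # fresh lambda variable per array nesting level
        body = build_null_expr(var, dtype["elementType"], path, targets, depth + 1)
        return f"transform({ref}, {var} -> {body})"
    return ref  # other types (e.g. map) are passed through unchanged


# Abbreviated, hand-written schema for illustration only.
actions_type = {
    "type": "array",
    "elementType": {"type": "struct", "fields": [
        {"name": "action", "type": "string"},
        {"name": "actionDate", "type": "string"},
    ]},
}

sql = build_null_expr("actions", actions_type, "actions", {"actions.action"})
print(sql)
# transform(actions, x0 -> struct(cast(null as string) as action, x0.actionDate as actionDate))
```

With a real DataFrame, the schema dict would presumably come from df.schema["actions"].dataType.jsonValue(), and the resulting string would be passed to expr as in the answer. The same helper also covers struct-only columns such as unitOfMeasure, with the target path unitOfMeasure.raw.id. Because it walks the fields in schema order, the rebuilt structs keep their original field order.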