Relationalize JSON deep nested array

Question

I have the following catalog table and want to flatten it using AWS Glue:

| accountId | resourceId | items                                                           |
|-----------|------------|-----------------------------------------------------------------|
| 1         | r1         | {application:{component:[{name: "tool", version: "1.0"}, {name: "app", version: "1.0"}]}} |
| 1         | r2         | {application:{component:[{name: "tool", version: "2.0"}, {name: "app", version: "2.0"}]}} |
| 2         | r3         | {application:{component:[{name: "tool", version: "3.0"}, {name: "app", version: "3.0"}]}} |

This is my schema:

root
 |-- accountId: 
 |-- resourceId: 
 |-- PeriodId: 
 |-- items: 
 |    |-- application: 
 |    |    |-- component: array

I want to flatten it into the following:

| accountId | resourceId | name | version |
|-----------|------------|------|---------|
| 1         | r1         | tool | 1.0     |
| 1         | r1         | app  | 1.0     |
| 1         | r2         | tool | 2.0     |
| 1         | r2         | app  | 2.0     |
| 2         | r3         | tool | 3.0     |
| 2         | r3         | app  | 3.0     |

Answer 1

From what I can tell from your schema and data, this is a deeply nested structure, so you can explode on items.application.component and then select your name and version columns.

This link may help you understand: https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html
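To make the explode step concrete, here is a plain-Python model of what it does to each record. This is only an illustration and not Spark code: the helper name `explode_components` and the dict-based records are hypothetical, chosen to mirror the table in the question.

```python
def explode_components(records):
    """Yield one flat row per element of items.application.component."""
    for rec in records:
        # Like Spark's explode, a null or empty array produces no output rows.
        components = rec["items"]["application"]["component"] or []
        for comp in components:
            yield {
                "accountId": rec["accountId"],
                "resourceId": rec["resourceId"],
                "name": comp["name"],
                "version": comp["version"],
            }


# Sample records mirroring the question's table (first two rows only).
records = [
    {"accountId": 1, "resourceId": "r1", "items": {"application": {"component": [
        {"name": "tool", "version": "1.0"}, {"name": "app", "version": "1.0"}]}}},
    {"accountId": 1, "resourceId": "r2", "items": {"application": {"component": [
        {"name": "tool", "version": "2.0"}, {"name": "app", "version": "2.0"}]}}},
]

flat_rows = list(explode_components(records))
for row in flat_rows:
    print(row)
```

Each input record with two components becomes two output rows, with the parent columns repeated on each. If rows with a null or empty array must be kept, Spark provides `explode_outer` for that purpose.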

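from pyspark.sql import functions as F

# Explode the nested array so that each component becomes its own row,
# then pull the struct fields up into top-level columns.
df.withColumn("items", F.explode(F.col("items.application.component")))\
.select("accountId","resourceId","items.name","items.version").show()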


    +---------+----------+----+-------+
    |accountId|resourceId|name|version|
    +---------+----------+----+-------+
    |        1|        r1|tool|    1.0|
    |        1|        r1| app|    1.0|
    |        1|        r2|tool|    2.0|
    |        1|        r2| app|    2.0|
    |        2|        r3|tool|    3.0|
    |        2|        r3| app|    3.0|
    +---------+----------+----+-------+

pyspark

aws-glue