Relationalize a deeply nested JSON array
I have the following catalog table and want to flatten it with AWS Glue:
| accountId | resourceId | items |
|-----------|------------|-----------------------------------------------------------------|
| 1 | r1 | {application:{component:[{name: "tool", version: "1.0"}, {name: "app", version: "1.0"}]}} |
| 1 | r2 | {application:{component:[{name: "tool", version: "2.0"}, {name: "app", version: "2.0"}]}} |
| 2 | r3 | {application:{component:[{name: "tool", version: "3.0"}, {name: "app", version: "3.0"}]}} |
This is my schema:
root
|-- accountId:
|-- resourceId:
|-- PeriodId:
|-- items:
| |-- application:
| | |-- component: array
I want to flatten it into the following:
| accountId | resourceId | name | version |
|-----------|------------|------|---------|
| 1 | r1 | tool | 1.0 |
| 1 | r1 | app | 1.0 |
| 1 | r2 | tool | 2.0 |
| 1 | r2 | app | 2.0 |
| 2 | r3 | tool | 3.0 |
| 2 | r3 | app | 3.0 |
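From what I can tell from your schema and data, your structure is deeply nested, so you can `explode` on `items.application.component` to get one row per array element, and then `select` your `name` and `version` columns from the exploded struct.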
This link may help you understand: https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html
from pyspark.sql import functions as F
df.withColumn("items", F.explode(F.col("items.application.component")))\
.select("accountId","resourceId","items.name","items.version").show()
+---------+----------+----+-------+
|accountId|resourceId|name|version|
+---------+----------+----+-------+
| 1| r1|tool| 1.0|
| 1| r1| app| 1.0|
| 1| r2|tool| 2.0|
| 1| r2| app| 2.0|
| 2| r3|tool| 3.0|
| 2| r3| app| 3.0|
+---------+----------+----+-------+
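To make the row-level effect of `explode` followed by `select` concrete, here is a minimal plain-Python sketch of the same transformation on the sample data from the question. It does not use Spark or Glue; the `rows` list and the `flatten` helper are hypothetical names written only for illustration.

```python
# Plain-Python illustration of explode + select on the question's sample rows.
# `rows` and `flatten` are hypothetical; they are not Spark or Glue APIs.
rows = [
    {"accountId": 1, "resourceId": "r1", "items": {"application": {"component": [
        {"name": "tool", "version": "1.0"}, {"name": "app", "version": "1.0"}]}}},
    {"accountId": 1, "resourceId": "r2", "items": {"application": {"component": [
        {"name": "tool", "version": "2.0"}, {"name": "app", "version": "2.0"}]}}},
    {"accountId": 2, "resourceId": "r3", "items": {"application": {"component": [
        {"name": "tool", "version": "3.0"}, {"name": "app", "version": "3.0"}]}}},
]


def flatten(rows):
    """Return one (accountId, resourceId, name, version) tuple per array element."""
    out = []
    for row in rows:
        # Treat a missing (None) array as empty, so the row yields no output.
        components = row["items"]["application"]["component"] or []
        for comp in components:
            # The parent columns are repeated for every element of the array.
            out.append((row["accountId"], row["resourceId"],
                        comp["name"], comp["version"]))
    return out


flat = flatten(rows)
for record in flat:
    print(record)
```

Like `F.explode`, this sketch emits nothing for a row whose array is null or empty. In Spark, `F.explode_outer` keeps such rows, with null `name` and `version`.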