How to bring JSON format to relational form?

Question

This is my code:

%spark.pyspark

df_principalBody = spark.sql("""
 SELECT
      gtin
      , principalBodyConstituents
      --, principalBodyConstituents.coatings.materialType.value
    FROM
      v_df_source""")

df_principalBody.createOrReplaceTempView("v_df_principalBody")

df_principalBody.collect();

This is the output:

[Row(gtin='7617014161936', principalBodyConstituents=[Row(coatings=[Row(materialType=Row(value='003', valueRange='405')

How can I read the value and valueRange fields in relational form? I tried explode and flatten, but it did not work.

Part of my JSON:

{
  "gtin": "7617014161936",
  "timePeriods": [
    {
      "fractionData": {
        "principalBody": {
          "constituents": [
            {
              "coatings": [
                {
                  "materialType": {
                    "value": "003",
                    "valueRange": "405"
                  },
                  "percentage": 0.1
                }
              ],
...

Answer 1

You can call .items() on the dictionary (res_dict in the code below) to list its key/value pairs.

I used part of your JSON as follows:

str1 = """{"gtin": "7617014161936","timePeriods": [{"fractionData": {"principalBody": {"constituents": [{"coatings": [
                {
                  "materialType": {
                    "value": "003",
                    "valueRange": "405"
                  },
                  "percentage": 0.1
                }
             ]}]}}}]}"""

import json

res = json.loads(str1)

res_dict = res['timePeriods'][0]['fractionData']['principalBody']['constituents'][0]['coatings'][0]['materialType']

# Each (key, value) pair becomes one row; list() materializes the dict view first.
df = spark.createDataFrame(data=list(res_dict.items()))
df.show()

Output:

+----------+---+
|        _1| _2|
+----------+---+
|     value|003|
|valueRange|405|
+----------+---+

You can even specify your own schema:

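from pyspark.sql.types import StructType, StructField, StringType

# Name both columns explicitly instead of relying on the defaults _1 and _2.
schema = StructType(fields=[
    StructField("key", StringType()),
    StructField("value", StringType())])

df = spark.createDataFrame(list(res_dict.items()), schema=schema)
df.show()  # show() returns None, so call it separately to keep df usable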

This results in:

+----------+-----+
|       key|value|
+----------+-----+
|     value|  003|
|valueRange|  405|
+----------+-----+
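
The approach above reads only the first coating of the first constituent and produces key/value rows. If the goal is one relational row per coating, a minimal sketch in plain Python could walk every nesting level instead. The helper name coating_rows and the column list are hypothetical, and the sample document is the question's excerpt closed off by hand, so the real data may have more fields:

```python
import json

# Sample shaped like the JSON excerpt in the question (brackets closed by hand).
doc = json.loads("""
{"gtin": "7617014161936",
 "timePeriods": [{"fractionData": {"principalBody": {"constituents": [
   {"coatings": [{"materialType": {"value": "003", "valueRange": "405"},
                  "percentage": 0.1}]}]}}}]}
""")


def coating_rows(document):
    """Yield one flat tuple per coating: (gtin, value, valueRange, percentage).

    Hypothetical helper: missing keys fall back to empty containers or None
    instead of raising, which is an assumption about how gaps should be handled.
    """
    for period in document.get("timePeriods", []):
        body = period.get("fractionData", {}).get("principalBody", {})
        for constituent in body.get("constituents", []):
            for coating in constituent.get("coatings", []):
                material = coating.get("materialType", {})
                yield (
                    document.get("gtin"),
                    material.get("value"),
                    material.get("valueRange"),
                    coating.get("percentage"),
                )


columns = ["gtin", "value", "valueRange", "percentage"]
rows = list(coating_rows(doc))
print(rows)
```

With a live SparkSession, passing these tuples to spark.createDataFrame(rows, schema=columns) should give a DataFrame with one row per coating. This runs on the driver, so it only suits data that fits in memory there.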

json

pyspark