Extract Schema from nested Json-String column in Pyspark

Question

Suppose I have the following table:

body
{"Day":1,"vals":[{"id":"1", "val":"3"}, {"id":"2", "val":"4"}]}

My goal is to write out the schema for this nested JSON column in PySpark. I tried the following two things:

schema = StructType([
  StructField("Day", StringType()),
  StructField(
  "vals",
  StructType([
    StructType([
      StructField("id", StringType(), True),
      StructField("val", DoubleType(), True)
    ])
    StructType([
      StructField("id", StringType(), True),
      StructField("val", DoubleType(), True)
    ])
  ])
  )
])

Here I get the following error:

'StructType' object has no attribute 'name'

Another approach was to declare the nested arrays as ArrayType:

schema = StructType([
  StructField("Day", StringType()),
  StructField(
  "vals",
  ArrayType(
    ArrayType(
        StructField("id", StringType(), True),
        StructField("val", DoubleType(), True)
      , True)
    ArrayType(
        StructField("id", StringType(), True),
        StructField("val", DoubleType(), True)
      , True)
    , True)
  )
])

Here I get the following error:

takes from 2 to 3 positional arguments but 5 were given

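This probably comes from ArrayType accepting only SQL types as arguments.

Can anyone tell me how they would create this schema? I am completely new to this topic.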

Answer 1

This is the structure you are looking for:

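from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# In Databricks a session named `spark` already exists; getOrCreate() reuses it.
spark = SparkSession.builder.getOrCreate()

# One row: Day, plus a list of (id, val) tuples
Data = [
    (1, [("1", "3"), ("2", "4")])
]

# vals is an array whose elements are structs with the fields id and val
schema = StructType([
    StructField('Day', IntegerType(), True),
    StructField('vals', ArrayType(StructType([
        StructField('id', StringType(), True),
        StructField('val', StringType(), True)
    ]), True))
])
df = spark.createDataFrame(data=Data, schema=schema)
df.printSchema()
df.show(truncate=False)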

This gives you the following output:

root
 |-- Day: integer (nullable = true)
 |-- vals: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- val: string (nullable = true)

+---+----------------+
|Day|vals            |
+---+----------------+
|1  |[{1, 3}, {2, 4}]|
+---+----------------+

Tags: python, json, pyspark, databricks