Can't extract value from <>: need struct type but got string

Question

I have some nested JSON that I have parallelized and written back out as JSON. A complete record looks like this:

{
   "id":"1",
   "type":"site",
   "attributes":{
      "description":"Number 1 Park",
      "activeInactive":{
         "text":"Active",
         "colour":"#4CBB17"
      },
      "lastUpdated":"2019-12-05T08:51:39"
   },
   "relationships":{
      "region":{
         "data":{
            "type":"region",
            "id":"1061",
            "meta":{
               "displayValue":"Park Region"
            }
         }
      }
   }
}

However, the data is awaiting a data cleanse, and the region field is currently not populated:

{
   "id":"1",
   "type":"site",
   "attributes":{
      "description":"Number 1 Park",
      "activeInactive":{
         "text":"Active",
         "colour":"#4CBB17"
      },
      "lastUpdated":"2019-12-05T08:51:39"
   },
   "relationships":{
      "region":{
         "data": null
      }
   }
}

If the relationship does not exist (i.e. it is an orphaned site), the data element is null.

I load this JSON through an RDD into a Spark DataFrame. The DataFrame's schema is:

attributes:struct
    activeInactive:struct
       colour:string
       text:string
    description:string
    lastUpdated:string
id:string
relationships:struct
    region:struct
       data:string

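I get the error in the title when I use df.select(col('relationships.region.data.meta.displayValue')) to read the region, which is written as if the nested fields were present. I assume this is because it conflicts with the DataFrame's schema, where data has been inferred as a string.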

The question is: how can I make this more dynamic, so that I still get displayValue once the field is populated, without having to revisit the code?

Answer 1

When reading the JSON file, you can impose a schema on the resulting DataFrame with the following syntax:

df = spark.read.json("<path to json file>", schema = <schema object>)

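This way the data field will still show as null, but it will be a StructType() with the full nested structure, so the nested column path resolves.
Based on the data snippet provided, the applicable schema object looks like this: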

from pyspark.sql.types import StructType, StructField, StringType

# Every field is nullable (True), so records with "data": null still load.
schemaObject = StructType([
  StructField('id', StringType(), True),
  StructField('type', StringType(), True),
  StructField('attributes', StructType([
    StructField('description', StringType(), True),
    StructField('activeInactive', StructType([
      StructField('text', StringType(), True),
      StructField('colour', StringType(), True)
    ]), True),
    StructField('lastUpdated', StringType(), True)
  ]), True),
  StructField('relationships', StructType([
    StructField('region', StructType([
      StructField('data', StructType([
        StructField('type', StringType(), True),
        StructField('id', StringType(), True),
        StructField('meta', StructType([
          StructField('displayValue', StringType(), True)
        ]), True)
      ]), True)
    ]), True)
  ]), True)
])

apache-spark

pyspark