Elegant JSON flattening in Spark

Question

I have the following dataframe in Spark:

val test = sqlContext.read.json(path = "/path/to/jsonfiles/*")  
test.printSchema
root
 |-- properties: struct (nullable = true)
 |    |-- prop_1: string (nullable = true)
 |    |-- prop_2: string (nullable = true)
 |    |-- prop_3: boolean (nullable = true)
 |    |-- prop_4: long (nullable = true)
...

What I would like to do is flatten this dataframe so that prop_1 ... prop_n exist at the top level, i.e.

test.printSchema
root
|-- prop_1: string (nullable = true)
|-- prop_2: string (nullable = true)
|-- prop_3: boolean (nullable = true)
|-- prop_4: long (nullable = true)
...

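There are several solutions to similar problems. The best one I could find was posed in a related question. However, that solution only works if properties is of type Array. In my case, properties is of type StructType.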

An alternative approach would be something like:

test.registerTempTable("test")
val test2 = sqlContext.sql("""SELECT properties.prop_1, ... FROM test""")

But in this case I have to list every field explicitly, which is inelegant.

What is the best way to solve this problem?

Answer 1

If you are not looking for a recursive solution, then in 1.6+ the dot syntax with a star should work just fine:

val df = sqlContext.read.json(sc.parallelize(Seq(
  """{"properties": {
       "prop1": "foo", "prop2": "bar", "prop3": true, "prop4": 1}}"""
)))

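import sqlContext.implicits._  // enables the $"..." column syntax (already in scope in spark-shell)

df.select($"properties.*").printSchema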
// root
//  |-- prop1: string (nullable = true)
//  |-- prop2: string (nullable = true)
//  |-- prop3: boolean (nullable = true)
//  |-- prop4: long (nullable = true)

Unfortunately, this does not work in 1.5 and earlier.

In that case you can extract the required information directly from the schema. One linked example should be easy to adjust to fit this scenario, and another one shows recursive schema flattening in Python.
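For illustration, extracting the information from the schema could look roughly like the Scala sketch below. This is not the code from the linked answers; FlattenSketch, leafPaths and flatColumns are hypothetical names made up for this example. It assumes field names contain no dots and that leaf names are unique across the nested structs (otherwise the aliases would collide), and it does not descend into arrays or maps.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical helper, not part of Spark.
object FlattenSketch {
  // Recursively collect the dotted path of every leaf (non-struct) field,
  // e.g. "properties.prop_1". Arrays and maps are treated as leaves.
  def leafPaths(schema: StructType, prefix: Option[String] = None): Seq[String] =
    schema.fields.toSeq.flatMap { field =>
      val path = prefix.map(p => s"$p.${field.name}").getOrElse(field.name)
      field.dataType match {
        case nested: StructType => leafPaths(nested, Some(path))
        case _                  => Seq(path)
      }
    }

  // Select each leaf by its full path and rename it to its last segment.
  // Assumption: leaf names are unique, so the aliases do not clash.
  def flatColumns(schema: StructType): Seq[Column] =
    leafPaths(schema).map(p => col(p).alias(p.split('.').last))
}

// Usage with the dataframe from the question:
//   val flat = test.select(FlattenSketch.flatColumns(test.schema): _*)
//   flat.printSchema
```

Because it only reads the schema and builds ordinary column expressions, this approach does not depend on the star expansion added in 1.6.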

