Spark DataFrames: convert nested JSON to separate columns

Question

I have a stream of JSON records with the following structure, which I convert to a DataFrame:

{
  "a": 3936,
  "b": 123,
  "c": "34",
  "attributes": {
    "d": "146",
    "e": "12",
    "f": "23"
  }
}

Calling show on the DataFrame produces the following output:

sqlContext.read.json(jsonRDD).show

+----+-----------+---+---+
|   a| attributes|  b|  c|
+----+-----------+---+---+
|3936|[146,12,23]|123| 34|
+----+-----------+---+---+

How can I split the attributes column (the nested JSON structure) into attributes.d, attributes.e and attributes.f as separate columns in a new DataFrame, so that the new DataFrame has the columns a, b, c, attributes.d, attributes.e and attributes.f?

Answer 1

Using the attributes.d notation, you can create new columns and include them in your DataFrame. Take a look at the withColumn() method in the Java API.

Answer 2

If you want the columns named a through f:

df.select("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")

If you want the columns to keep the attributes. prefix:

df.select($"a", $"b", $"c", $"attributes.d" as "attributes.d", $"attributes.e" as "attributes.e", $"attributes.f" as "attributes.f")

If the column names are supplied from an external source (for example, configuration):

val colNames = Seq("a", "b", "c", "attributes.d", "attributes.e", "attributes.f")

df.select(colNames.head, colNames.tail: _*).toDF(colNames:_*)

Answer 3

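Using Python:

Load the data into a DataFrame with the pandas library.
Change the data type of each value from 'str' to 'dict'.
Get the value of each feature.
Save the result to a new file.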

import ast
import pandas as pd

data = pd.read_csv("data.csv")        # load the CSV file from disk
json_data = data['Desc']              # the Desc column holds dict-like strings
data = data.drop(columns='Desc')      # drop the original Desc column
total, defective = [], []             # one list per extracted feature

for raw in json_data:
    record = ast.literal_eval(raw)    # parse 'str' into 'dict' (safer than eval)
    total.append(record['Total'])          # collect the 'Total' feature
    defective.append(record['Defective'])  # collect the 'Defective' feature

# finally, attach the extracted values as new columns
data['Total'] = total
data['Defective'] = defective

data.to_csv("result.csv")             # save to result.csv and check it
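
For the JSON shape in the question, the same "parse the string, then pull out each nested value" idea can be sketched with only the standard library. This is a rough illustration, not part of the answer above: flatten is a hypothetical helper name (it is not a pandas or Spark function), and the sketch assumes each record is valid JSON text, so it uses json.loads rather than eval.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining key paths with `sep`.

    Hypothetical helper for illustration only.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the nested object, carrying the dotted prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# The sample record from the question, as a JSON string.
raw = '{"a": 3936, "b": 123, "c": "34", "attributes": {"d": "146", "e": "12", "f": "23"}}'

row = flatten(json.loads(raw))
print(list(row))  # column names: a, b, c, attributes.d, attributes.e, attributes.f
```

A list of such flat dicts can then be passed to pd.DataFrame(...) to get one column per dotted key.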

apache-spark

apache-spark-sql

spark-dataframe