csv-like input text to json string

Question

I have a csv-like input file:

"2017-06-01T01:01:01Z";"{\"name\":\"aaa\",\"properties\":{"\"propA\":\"some value\",\"propB\":\"other value\"}}"
"2017-06-01T01:01:01Z";"{\"name\":\"bbb\",\"properties\":{"\"propB\":\"some value\","\"propC\":\"some value\",\"propD\":\"other value\"}}"

I want to get a json string like the following, so that I can create a dataframe from the plain json string:

[{
  "createdTime": "...",
  "value":{
    "name":"...",
    "properties": {
      "propA":"...",
      "propB":"..."
    }
  }
},{
  "createdTime": "...",
  "value":{
    "name":"...",
    "properties": {
      "propB":"...",
      "propC":"...",
      "propD":"..."
    }
  }
}]

This is semi-structured data. Some rows may have property A, but others may not.

How can I do this in Spark using Scala?

Answer 1

From what I understand of your question, you want to create a dataframe from the csv-like file you have. If I am right, here is what you can do:

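val data = sc.textFile("path to your csv-like file")
// Split each line into the timestamp and the quoted, escaped JSON value,
// then clean the value up so that every record is one valid JSON object
val jsonrdd = data.map(line => line.split(";", 2))
  .map(array => "{\"createdTime\":"+array(0)+",\"value\":"+ array(1).replace(",\"", ",").replace("\\\"", "\"").replace("\"{", "{").replace("{\"\"", "{\"").replace("}\"", "}")+"}")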

val df = sqlContext.read.json(jsonrdd)
df.show(false)

You should get a dataframe like this:

+--------------------+----------------------------------------------+
|createdTime         |value                                         |
+--------------------+----------------------------------------------+
|2017-06-01T01:01:01Z|[aaa,[some value,other value,null,null]]      |
|2017-06-01T01:01:01Z|[bbb,[null,some value,some value,other value]]|
+--------------------+----------------------------------------------+

The schema of the above dataframe will be:

root
 |-- createdTime: string (nullable = true)
 |-- value: struct (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- properties: struct (nullable = true)
 |    |    |-- propA: string (nullable = true)
 |    |    |-- propB: string (nullable = true)
 |    |    |-- propC: string (nullable = true)
 |    |    |-- propD: string (nullable = true)
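If you also need the plain json array string from the question, and not only the dataframe, the same replace chain can be applied without Spark. The following is only a minimal sketch under that assumption: `CsvLikeToJson`, `toJsonRecord`, `toJsonArray` and `sample` are hypothetical names made up for illustration, and the cleanup is tuned to the quoting seen in the two sample lines, so it may need adjusting for other input.

```scala
object CsvLikeToJson {
  // Hypothetical helper: turns one csv-like line into one JSON object string.
  // Assumes the line contains a ';' between the timestamp and the value;
  // a line without one fails with a MatchError.
  def toJsonRecord(line: String): String = {
    val Array(created, rawValue) = line.split(";", 2)
    val value = rawValue
      .replace(",\"", ",")      // drop a stray quote that directly follows a comma
      .replace("\\\"", "\"")    // unescape \" to "
      .replace("\"{", "{")      // drop the quote in front of the opening brace
      .replace("{\"\"", "{\"")  // collapse the doubled quote after a brace
      .replace("}\"", "}")      // drop the quote after the closing brace
    "{\"createdTime\":" + created + ",\"value\":" + value + "}"
  }

  // Hypothetical helper: joins all records into a single JSON array string.
  def toJsonArray(lines: Seq[String]): String =
    lines.map(toJsonRecord).mkString("[", ",", "]")

  // The two sample lines from the question, kept verbatim in a raw string.
  val sample: Seq[String] =
    """
      |"2017-06-01T01:01:01Z";"{\"name\":\"aaa\",\"properties\":{"\"propA\":\"some value\",\"propB\":\"other value\"}}"
      |"2017-06-01T01:01:01Z";"{\"name\":\"bbb\",\"properties\":{"\"propB\":\"some value\","\"propC\":\"some value\",\"propD\":\"other value\"}}"
      |""".stripMargin.trim.split("\n").map(_.trim).toSeq

  def main(args: Array[String]): Unit =
    println(toJsonArray(sample))
}
```

For small inputs this gives you one string to work with. For large files, building the records inside the RDD as shown above is probably the better choice, since the array string has to fit in memory on a single machine.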

scala

apache-spark