Spark Scala: extracting XML from one column

Question

Suppose df has the following structure:

root
 |-- id: decimal(38,0) (nullable = true)
 |-- text: string (nullable = true)

Here text contains a string of roughly XML-type records. I can then apply the following steps to extract the necessary entries into a flat table:

First, add a root node, since the original text has none. (Question #1: is this step necessary, or can it be omitted?)

val df2 = df.withColumn("text", concat(lit("<root>"),$"text",lit("</root>")))

Next, parse the XML:

val payloadSchema = schema_of_xml(df.select("text").as[String])
val df3 = spark.read.option("rootTag","root").option("rowTag","row").schema(payloadSchema).xml(df2.select("text").as[String])

This produces df3:

root
 |-- row: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)

Finally, I explode the array:

val df4 = df3.withColumn("exploded_cols", explode($"row"))

which gives

root
 |-- row: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)
 |-- exploded_cols: struct (nullable = true)
 |    |-- key: string (nullable = true)
 |    |-- value: string (nullable = true)

My goal is the following table:

val df5 = df4.select("exploded_cols.*")

with the schema

root
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)

Main question: I would like the final table to also contain the id: decimal(38,0) (nullable = true) entry alongside the exploded key and value columns, e.g.

root
 |-- id: decimal(38,0) (nullable = true)
 |-- key: string (nullable = true)
 |-- value: string (nullable = true)

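However, I am not sure how to call spark.read.option without separately selecting df2.select("text").as[String] inside the method (see df3). Can this script be simplified?

This should be simple, so I am not sure whether a reproducible example is needed. Also, I am coming to this blind from an R background, so I am missing all the Scala basics, but I am trying to learn.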

Answer 1

Use the from_xml function from the spark-xml library.

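import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType
import com.databricks.spark.xml.functions.from_xml

val df: DataFrame = ???       // read the source data
val schema: StructType = ???  // define the schema of the XML text
df.withColumn("xmlData", from_xml(col("xmlColName"), schema)) // from_xml takes a Column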
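
For the main question (keeping id next to key and value), here is a minimal sketch of how from_xml might replace the spark.read step. Because from_xml works on a column of the existing DataFrame, id stays on the same row and carries through the explode. The sketch assumes the spark-xml library is on the classpath, uses hypothetical sample data in place of the real df, and writes the schema by hand from the df3 schema shown in the question instead of inferring it. The names parsed, kv and result are only illustrative. The root wrapper is kept, since each text value holds several top-level row elements and an XML parser generally expects a single root element.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat, explode, lit}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}
import com.databricks.spark.xml.functions.from_xml

// Local session for the sketch; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("xml-column-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the question's df.
val df = Seq(
  (BigDecimal(1), "<row><key>a</key><value>1</value></row><row><key>b</key><value>2</value></row>"),
  (BigDecimal(2), "<row><key>c</key><value>3</value></row>")
).toDF("id", "text")
  .withColumn("id", $"id".cast("decimal(38,0)"))

// Hand-written schema mirroring df3 in the question: an array of (key, value) structs.
val rowType = StructType(Seq(
  StructField("key", StringType),
  StructField("value", StringType)
))
val payloadSchema = StructType(Seq(StructField("row", ArrayType(rowType))))

val result = df
  // Wrap the fragments in a single root element so each string is one XML document.
  .withColumn("text", concat(lit("<root>"), $"text", lit("</root>")))
  // Parse the XML string in place; the other columns (id) are untouched.
  .withColumn("parsed", from_xml($"text", payloadSchema))
  // One output row per <row> element, still paired with its id.
  .select($"id", explode($"parsed.row").as("kv"))
  .select($"id", $"kv.key", $"kv.value")

result.printSchema()
result.show()
```

If the real schema is not known up front, schema_of_xml(df2.select("text").as[String]) from the question could be passed to from_xml in place of the hand-written payloadSchema.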

scala

apache-spark