Spark XML: tags are missing when values are null

Below is my DataFrame:


+-------+----+----------+
|   city|year|saleAmount|
+-------+----+----------+
|Toronto|2017|      50.0|
|Toronto|null|      50.0|
|Sanjose|2017|     200.0|
|Sanjose|null|     200.0|
|  Plano|2015|      50.0|
|  Plano|2016|      50.0|
|  Plano|null|     100.0|
|Newyork|2016|     150.0|
|Newyork|null|     150.0|
| Dallas|2016|     100.0|
| Dallas|2017|     120.0|
| Dallas|null|     220.0|
|   null|null|     720.0|
+-------+----+----------+

I tried converting it to XML using

df.write.format("com.databricks.spark.xml")
    .mode("overwrite")
    .option("treatEmptyValuesAsNulls", "true")
    .option("rowTag", "ROW")
    .save("myxml") 

but some tags are missing in the XML, as shown below:

<ROWS>
    <ROW>
        <city>Toronto</city>
        <year>2017</year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Toronto</city>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Sanjose</city>
        <year>2017</year>
        <saleAmount>200.0</saleAmount>
    </ROW>
    <ROW>
        <city>Sanjose</city>
        <saleAmount>200.0</saleAmount>
    </ROW>
    <ROW>
        <city>Plano</city>
        <year>2015</year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Plano</city>
        <year>2016</year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Plano</city>
        <saleAmount>100.0</saleAmount>
    </ROW>
    <ROW>
        <city>Newyork</city>
        <year>2016</year>
        <saleAmount>150.0</saleAmount>
    </ROW>
    <ROW>
        <city>Newyork</city>
        <saleAmount>150.0</saleAmount>
    </ROW>
    <ROW>
        <city>Dallas</city>
        <year>2016</year>
        <saleAmount>100.0</saleAmount>
    </ROW>
    <ROW>
        <city>Dallas</city>
        <year>2017</year>
        <saleAmount>120.0</saleAmount>
    </ROW>
    <ROW>
        <city>Dallas</city>
        <saleAmount>220.0</saleAmount>
    </ROW>
    <ROW>
        <saleAmount>720.0</saleAmount>
    </ROW>
</ROWS>

When the DataFrame is printed, as shown above, all the null values appear correctly. But when it is converted to XML, the corresponding XML element tags go missing. Is this just how the Databricks XML API works?

In the XML above, the year tag is missing because its value is null in the DataFrame.

Is there any option in spark-xml to also emit tags for null values?

If you want empty tags in the output, you need to provide a default nullValue, which will appear inside the tag:

df.write.format("xml")
    .mode("overwrite")
    .option("nullValue", "")
    .option("rowTag", "ROW")
    .save("myxml") 

will produce

<ROWS>
    <ROW>
        <city>Toronto</city>
        <year>2017</year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Toronto</city>
        <year></year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Sanjose</city>
        <year>2017</year>
        <saleAmount>200.0</saleAmount>
    </ROW>
    <ROW>
        <city>Sanjose</city>
        <year></year>
        <saleAmount>200.0</saleAmount>
    </ROW>
    <ROW>
        <city>Plano</city>
        <year>2015</year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    ...
</ROWS>

Now, this may well be a bad idea, because you cannot specify a different value per tag, so it is easy to generate XML files that do not conform to whatever XSD they are supposed to satisfy.
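If you do need a different placeholder per column, one workaround is to fill the nulls per column with na.fill before writing, instead of relying on a single global nullValue. A minimal sketch (the column names match the question's DataFrame; the "ALL" and 0 placeholders are hypothetical choices, not anything spark-xml mandates):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("PerColumnNullFill")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (Option("Toronto"), Option(2017), 50.0),
  (Option("Toronto"), Option.empty[Int], 50.0),
  (Option.empty[String], Option.empty[Int], 720.0)
).toDF("city", "year", "saleAmount")

// na.fill takes a per-column Map, so each tag can get its own
// schema-friendly placeholder instead of one shared empty string.
val filled = df.na.fill(Map("city" -> "ALL", "year" -> 0))
filled.show()

// then write as before (requires spark-xml on the classpath):
// filled.write.format("com.databricks.spark.xml")
//   .mode("overwrite").option("rowTag", "ROW").save("myxml")
```

The trade-off is that the placeholders become real values in the XML, so readers must know to interpret "ALL" and 0 as "no value".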

To read the files generated above back in, you need to set the treatEmptyValuesAsNulls option to true or specify a nullValue option:

val df = spark.read.format("xml").option("treatEmptyValuesAsNulls","true").load("myxml")

or 

val df = spark.read.format("xml").option("nullValue","").load("myxml")

.option("nullValue", "") worked for me, rather than .option("treatEmptyValuesAsNulls", "true"), which you had already used without getting the desired output.

Using the following dependencies/versions:

    <!-- https://mvnrepository.com/artifact/com.databricks/spark-xml -->
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-xml_2.11</artifactId>
        <version>0.4.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.4.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.1</version>
    </dependency>

Full example:

package examples

import org.apache.log4j.Level
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, sum}

object SparkXmlTest extends App with Logging {
  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)
  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]").getOrCreate()

  import spark.implicits._

  val sales = Seq(
    ("Dallas", 2016, 100d),
    ("Dallas", 2017, 120d),
    ("Sanjose", 2017, 200d),
    ("Plano", 2015, 50d),
    ("Plano", 2016, 50d),
    ("Newyork", 2016, 150d),
    ("Toronto", 2017, 50d)

  ).toDF("city", "year", "saleAmount")
  sales.printSchema()

  val first = sales
    .rollup("city", "year")
    .agg(sum("saleAmount") as "saleAmount").sort($"city".desc_nulls_last, $"year".asc_nulls_last)


  logInfo("group by city and year")
  // The above query is semantically equivalent to the following
  val second = sales
    .groupBy("city", "year") // <-- subtotals (city, year)
    .agg(sum("saleAmount") as "saleAmount")
  second.show


  logInfo("group by city ")
  val third = sales
    .groupBy("city") // <-- subtotals (city)
    .agg(sum("saleAmount") as "saleAmount")
    .select($"city", lit(null) as "year", $"saleAmount") // <-- year is null

  third.show

  logInfo("group by for grand total")

  logInfo("final df using union of group by city and year / group by city /group by to get grand total")
  val fourth = sales
    .groupBy() // <-- grand total
    .agg(sum("saleAmount") as "saleAmount")
    .select(lit(null) as "city", lit(null) as "year", $"saleAmount") // <-- city and year are null

  fourth.show

  fourth.printSchema()
  first.union(second).union(third).union(fourth)
    .coalesce(1).write.format("com.databricks.spark.xml")
    .mode("overwrite")
    //.option("treatEmptyValuesAsNulls", "true")
    .option("nullValue", "")
    .option("rowTag", "ROW")
    .save("sparkxmltest.xml")
}

Resulting XML with the desired output:

<ROWS>
    <ROW>
        <city>Toronto</city>
        <year>2017</year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Toronto</city>
        <year></year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Sanjose</city>
        <year>2017</year>
        <saleAmount>200.0</saleAmount>
    </ROW>
    <ROW>
        <city>Sanjose</city>
        <year></year>
        <saleAmount>200.0</saleAmount>
    </ROW>
    <ROW>
        <city>Plano</city>
        <year>2015</year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Plano</city>
        <year>2016</year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Plano</city>
        <year></year>
        <saleAmount>100.0</saleAmount>
    </ROW>
    <ROW>
        <city>Newyork</city>
        <year>2016</year>
        <saleAmount>150.0</saleAmount>
    </ROW>
    <ROW>
        <city>Newyork</city>
        <year></year>
        <saleAmount>150.0</saleAmount>
    </ROW>
    <ROW>
        <city>Dallas</city>
        <year>2016</year>
        <saleAmount>100.0</saleAmount>
    </ROW>
    <ROW>
        <city>Dallas</city>
        <year>2017</year>
        <saleAmount>120.0</saleAmount>
    </ROW>
    <ROW>
        <city>Dallas</city>
        <year></year>
        <saleAmount>220.0</saleAmount>
    </ROW>
    <ROW>
        <city></city>
        <year></year>
        <saleAmount>720.0</saleAmount>
    </ROW>
    <ROW>
        <city>Toronto</city>
        <year>2017</year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Sanjose</city>
        <year>2017</year>
        <saleAmount>200.0</saleAmount>
    </ROW>
    <ROW>
        <city>Dallas</city>
        <year>2017</year>
        <saleAmount>120.0</saleAmount>
    </ROW>
    <ROW>
        <city>Plano</city>
        <year>2015</year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Newyork</city>
        <year>2016</year>
        <saleAmount>150.0</saleAmount>
    </ROW>
    <ROW>
        <city>Dallas</city>
        <year>2016</year>
        <saleAmount>100.0</saleAmount>
    </ROW>
    <ROW>
        <city>Plano</city>
        <year>2016</year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Dallas</city>
        <year></year>
        <saleAmount>220.0</saleAmount>
    </ROW>
    <ROW>
        <city>Plano</city>
        <year></year>
        <saleAmount>100.0</saleAmount>
    </ROW>
    <ROW>
        <city>Newyork</city>
        <year></year>
        <saleAmount>150.0</saleAmount>
    </ROW>
    <ROW>
        <city>Toronto</city>
        <year></year>
        <saleAmount>50.0</saleAmount>
    </ROW>
    <ROW>
        <city>Sanjose</city>
        <year></year>
        <saleAmount>200.0</saleAmount>
    </ROW>
    <ROW>
        <city></city>
        <year></year>
        <saleAmount>720.0</saleAmount>
    </ROW>
</ROWS>