Spark XML tags are missing when values are null
Below is my DataFrame:
+-------+----+----------+
| city|year|saleAmount|
+-------+----+----------+
|Toronto|2017| 50.0|
|Toronto|null| 50.0|
|Sanjose|2017| 200.0|
|Sanjose|null| 200.0|
| Plano|2015| 50.0|
| Plano|2016| 50.0|
| Plano|null| 100.0|
|Newyork|2016| 150.0|
|Newyork|null| 150.0|
| Dallas|2016| 100.0|
| Dallas|2017| 120.0|
| Dallas|null| 220.0|
| null|null| 720.0|
+-------+----+----------+
I tried to convert it to XML using:
df.write.format("com.databricks.spark.xml")
  .mode("overwrite")
  .option("treatEmptyValuesAsNulls", "true")
  .option("rowTag", "ROW")
  .save("myxml")
But some tags are missing in the resulting XML, as shown below:
<ROWS>
<ROW>
<city>Toronto</city>
<year>2017</year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Toronto</city>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Sanjose</city>
<year>2017</year>
<saleAmount>200.0</saleAmount>
</ROW>
<ROW>
<city>Sanjose</city>
<saleAmount>200.0</saleAmount>
</ROW>
<ROW>
<city>Plano</city>
<year>2015</year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Plano</city>
<year>2016</year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Plano</city>
<saleAmount>100.0</saleAmount>
</ROW>
<ROW>
<city>Newyork</city>
<year>2016</year>
<saleAmount>150.0</saleAmount>
</ROW>
<ROW>
<city>Newyork</city>
<saleAmount>150.0</saleAmount>
</ROW>
<ROW>
<city>Dallas</city>
<year>2016</year>
<saleAmount>100.0</saleAmount>
</ROW>
<ROW>
<city>Dallas</city>
<year>2017</year>
<saleAmount>120.0</saleAmount>
</ROW>
<ROW>
<city>Dallas</city>
<saleAmount>220.0</saleAmount>
</ROW>
<ROW>
<saleAmount>720.0</saleAmount>
</ROW>
</ROWS>
When the DataFrame is printed, as shown above, all the null values show up correctly. But when it is converted to XML, the corresponding XML element tags are missing. Is this how the Databricks XML API works?
In the XML above, the year tag is missing wherever the year value is null in the DataFrame.
Is there any option in spark-xml to also emit tags for null values?
If you want to output empty tags, you need to provide a default nullValue, which will appear inside the tags:
df.write.format("xml")
  .mode("overwrite")
  .option("nullValue", "")
  .option("rowTag", "ROW")
  .save("myxml")
This will produce:
<ROWS>
<ROW>
<city>Toronto</city>
<year>2017</year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Toronto</city>
<year></year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Sanjose</city>
<year>2017</year>
<saleAmount>200.0</saleAmount>
</ROW>
<ROW>
<city>Sanjose</city>
<year></year>
<saleAmount>200.0</saleAmount>
</ROW>
<ROW>
<city>Plano</city>
<year>2015</year>
<saleAmount>50.0</saleAmount>
</ROW>
...
</ROWS>
Now, this may be a very bad idea, because you cannot specify a different value per tag, so it is easy to produce XML files that do not conform to whatever XSD they are supposed to follow.
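If you do need a different placeholder per column, one workaround (a minimal sketch, not a spark-xml feature; the column names match the example above, but the placeholder values ALL_CITIES and -1 are arbitrary assumptions) is to fill the nulls column by column before the write, so the writer never sees a null:

import org.apache.spark.sql.functions.{coalesce, col, lit}

// Sketch: per-column placeholders instead of one global nullValue.
val withDefaults = df
  .na.fill(Map("city" -> "ALL_CITIES"))               // string placeholder for city
  .withColumn("year", coalesce(col("year"), lit(-1))) // numeric placeholder for year

withDefaults.write.format("com.databricks.spark.xml")
  .mode("overwrite")
  .option("rowTag", "ROW")
  .save("myxml")

Whether such sentinel values are acceptable of course depends on the XSD you are targeting.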
To read back the files generated in the example above, set the treatEmptyValuesAsNulls option to true, or specify a nullValue option:
val df = spark.read.format("xml").option("treatEmptyValuesAsNulls","true").load("myxml")
or
val df = spark.read.format("xml").option("nullValue","").load("myxml")
.option("nullValue","")
对我有用,而不是 option("treatEmptyValuesAsNull","true")
你已经使用过但没有得到想要的输出
Used the following dependencies/versions:
<!-- https://mvnrepository.com/artifact/com.databricks/spark-xml -->
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-xml_2.11</artifactId>
<version>0.4.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.1</version>
</dependency>
Full example:
package examples

import org.apache.log4j.Level
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, sum}

object SparkXmlTest extends App with Logging {

  val logger = org.apache.log4j.Logger.getLogger("org")
  logger.setLevel(Level.WARN)

  val spark = SparkSession.builder()
    .appName(this.getClass.getName)
    .config("spark.master", "local[*]").getOrCreate()

  import spark.implicits._

  val sales = Seq(
    ("Dallas", 2016, 100d),
    ("Dallas", 2017, 120d),
    ("Sanjose", 2017, 200d),
    ("Plano", 2015, 50d),
    ("Plano", 2016, 50d),
    ("Newyork", 2016, 150d),
    ("Toronto", 2017, 50d)
  ).toDF("city", "year", "saleAmount")

  sales.printSchema()

  val first = sales
    .rollup("city", "year")
    .agg(sum("saleAmount") as "saleAmount")
    .sort($"city".desc_nulls_last, $"year".asc_nulls_last)

  logInfo("group by city and year")
  // The rollup above is semantically equivalent to the following
  val second = sales
    .groupBy("city", "year") // <-- subtotals (city, year)
    .agg(sum("saleAmount") as "saleAmount")
  second.show

  logInfo("group by city")
  val third = sales
    .groupBy("city") // <-- subtotals (city)
    .agg(sum("saleAmount") as "saleAmount")
    .select($"city", lit(null) as "year", $"saleAmount") // <-- year is null
  third.show

  logInfo("group by for grand total")
  logInfo("final df using union of group by city and year / group by city / group by to get grand total")
  val fourth = sales
    .groupBy() // <-- grand total
    .agg(sum("saleAmount") as "saleAmount")
    .select(lit(null) as "city", lit(null) as "year", $"saleAmount") // <-- city and year are null
  fourth.show
  fourth.printSchema()

  first.union(second).union(third).union(fourth)
    .coalesce(1).write.format("com.databricks.spark.xml")
    .mode("overwrite")
    //.option("treatEmptyValuesAsNulls", "true")
    .option("nullValue", "")
    .option("rowTag", "ROW")
    .save("sparkxmltest.xml")
}
Resulting XML with the desired output (note that first, built with rollup, already includes the subtotal and grand-total rows, so the union makes each ROW appear twice):
<ROWS>
<ROW>
<city>Toronto</city>
<year>2017</year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Toronto</city>
<year></year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Sanjose</city>
<year>2017</year>
<saleAmount>200.0</saleAmount>
</ROW>
<ROW>
<city>Sanjose</city>
<year></year>
<saleAmount>200.0</saleAmount>
</ROW>
<ROW>
<city>Plano</city>
<year>2015</year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Plano</city>
<year>2016</year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Plano</city>
<year></year>
<saleAmount>100.0</saleAmount>
</ROW>
<ROW>
<city>Newyork</city>
<year>2016</year>
<saleAmount>150.0</saleAmount>
</ROW>
<ROW>
<city>Newyork</city>
<year></year>
<saleAmount>150.0</saleAmount>
</ROW>
<ROW>
<city>Dallas</city>
<year>2016</year>
<saleAmount>100.0</saleAmount>
</ROW>
<ROW>
<city>Dallas</city>
<year>2017</year>
<saleAmount>120.0</saleAmount>
</ROW>
<ROW>
<city>Dallas</city>
<year></year>
<saleAmount>220.0</saleAmount>
</ROW>
<ROW>
<city></city>
<year></year>
<saleAmount>720.0</saleAmount>
</ROW>
<ROW>
<city>Toronto</city>
<year>2017</year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Sanjose</city>
<year>2017</year>
<saleAmount>200.0</saleAmount>
</ROW>
<ROW>
<city>Dallas</city>
<year>2017</year>
<saleAmount>120.0</saleAmount>
</ROW>
<ROW>
<city>Plano</city>
<year>2015</year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Newyork</city>
<year>2016</year>
<saleAmount>150.0</saleAmount>
</ROW>
<ROW>
<city>Dallas</city>
<year>2016</year>
<saleAmount>100.0</saleAmount>
</ROW>
<ROW>
<city>Plano</city>
<year>2016</year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Dallas</city>
<year></year>
<saleAmount>220.0</saleAmount>
</ROW>
<ROW>
<city>Plano</city>
<year></year>
<saleAmount>100.0</saleAmount>
</ROW>
<ROW>
<city>Newyork</city>
<year></year>
<saleAmount>150.0</saleAmount>
</ROW>
<ROW>
<city>Toronto</city>
<year></year>
<saleAmount>50.0</saleAmount>
</ROW>
<ROW>
<city>Sanjose</city>
<year></year>
<saleAmount>200.0</saleAmount>
</ROW>
<ROW>
<city></city>
<year></year>
<saleAmount>720.0</saleAmount>
</ROW>
</ROWS>