My Spark job is deleting the target folder inside HDFS

Question

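I have a script that writes the contents of a Hive table to a CSV file in HDFS. The target folder name is given in a JSON parameter file. When I launch the script, I notice that the folder I had already created is deleted automatically, and then an error is thrown saying that the target file does not exist. Here is my script: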

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

sigma.cache() // sigma is the DataFrame that contains the Hive table. Tested OK
// Parametre_vigiliste.cible is the variable inside the JSON file that contains the target folder name
sigma.repartition(1).write.mode(SaveMode.Overwrite).format("csv").option("header", true).option("delimiter", "|").save(Parametre_vigiliste.cible)
val conf = new Configuration()
val fs = FileSystem.get(conf)
// Find the single part file written by Spark and rename it
val file = fs.globStatus(new Path(Parametre_vigiliste.cible + "/part*"))(0).getPath().getName()
fs.rename(new Path(Parametre_vigiliste.cible + "/" + file), new Path(Parametre_vigiliste.cible + "/" + "FIC_PER_DATALAKE_.txt"))
sigma.unpersist()

The error thrown:

exception caught: java.lang.UnsupportedOperationException: CSV data source does not support null data type.

Can this code delete the folder for some reason? Thanks.

Answer 1

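So, as Prateek suggested, I ran sigma.printSchema and found that some columns had the null type. I fixed those columns and the job worked. That most likely also explains the missing folder: SaveMode.Overwrite removes the existing target before writing, so when the CSV write failed on the null-typed columns, the folder was already gone.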
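
For anyone hitting the same exception, here is a minimal sketch of one way to find the null-typed columns and make them writable. It is not necessarily what I did in my own job. The object name NullColumnCheck, the helpers nullTypedColumns and castNullColumns, and the demo DataFrame are all hypothetical names made up for the illustration. Casting to string is an assumption; dropping the columns or giving them a real type upstream may suit your data better.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.{NullType, StringType}

object NullColumnCheck {
  // Names of the top-level columns whose type is NullType
  // (printSchema shows these as "null").
  def nullTypedColumns(df: DataFrame): Seq[String] =
    df.schema.fields.filter(_.dataType == NullType).map(_.name).toSeq

  // Cast every NullType column to string so the CSV writer accepts the schema.
  // Assumption: a string column full of nulls is acceptable in the output.
  def castNullColumns(df: DataFrame): DataFrame =
    nullTypedColumns(df).foldLeft(df) { (acc, name) =>
      acc.withColumn(name, col(name).cast(StringType))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("null-column-check")
      .getOrCreate()

    // Hypothetical stand-in for sigma: lit(null) produces a NullType column.
    val demo = spark.range(3).withColumn("empty_col", lit(null))

    println(nullTypedColumns(demo))      // lists the offending column names
    castNullColumns(demo).printSchema()  // empty_col is now a string column

    spark.stop()
  }
}
```

In the script from the question, this would mean calling NullColumnCheck.castNullColumns(sigma) before the write, so that the schema check passes before SaveMode.Overwrite touches the target folder.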

csv

scala

file

hdfs

apache-spark