How to suppress "No input paths specified in job" and return an empty RDD / DataFrame instead?

Question

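I am writing an internal API to simplify using Spark with our custom measuring data formats. Because the schema differs depending on the type of measuring data, I use the DataFrame API. I read the files with Hadoop's FileInputFormat and sc.newAPIHadoopFile, since the measuring data formats cannot be reduced to simple text files.

In my API, I want to return an empty DataFrame instead of throwing the "No input paths specified in job" exception, so I first took the naive approach: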

try {
  spark
    .sparkContext
    .newAPIHadoopFile(inPath,
                      classOf[OneOfMyCustomMeasuringDataInputFormat],
                      classOf[SomeAppropriateKeyWritable],
                      classOf[SomeAppropriateValueWritable],
                      conf)
    .map {
      case (k, v) => SomeAppropriateRecordCaseClass(/* data from k and v */)
    }
    .toDF
} catch {
  case e: IOException if e.getMessage.equals("No input paths specified in job") =>
    spark.createDataFrame(spark.sparkContext.emptyRDD[Row],
                          // Some implicits I made to simplify schema construction:
                          ("foo" of SomeType) ::
                          ("bar" of SomeOtherType) ::
                          // more ::
                          Nil : StructType)
}

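However, because RDDs are lazy, this exception is not triggered inside the try block when there are no input paths. It only surfaces once the DataFrame is actually accessed.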
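To illustrate where the exception actually surfaces, here is a minimal sketch that forces the input splits to be computed while still inside the try block. It assumes that asking an RDD for its `partitions` is what makes Spark call the input format's `getSplits` (and therefore `listStatus`), which matches my reading of the Spark source but is worth verifying against your version. The names `EmptyOnNoInput` and `rddOrEmpty` are hypothetical helpers, not part of Spark.

```scala
import java.io.IOException

import scala.reflect.ClassTag

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of Spark.
object EmptyOnNoInput {
  private val NoInputPaths = "No input paths specified in job"

  /** Builds the RDD, forces its partitions to be computed, and falls back
    * to an empty RDD if that fails with the "no input paths" IOException.
    * Any other exception is propagated unchanged.
    */
  def rddOrEmpty[T: ClassTag](sc: SparkContext)(build: => RDD[T]): RDD[T] =
    try {
      val rdd = build
      // Assumption: this triggers getSplits/listStatus on the input format,
      // so the IOException is thrown here instead of at the first action.
      rdd.partitions
      rdd
    } catch {
      case e: IOException if e.getMessage == NoInputPaths =>
        sc.emptyRDD[T]
    }
}
```

With something like this, the `newAPIHadoopFile(...).map { ... }` chain from the question would be passed as the `build` argument, and `.toDF` would be called on the result. The trade-off is that the file listing happens at construction time rather than at the first action.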

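For now, I handle this in all of my FileInputFormats, and I have instructed my coworkers, who may add more formats in the future, to catch this exception in the listStatus method and return an empty list. I wonder, though, whether this can be solved more generally.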

Answer 1

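After digging into the source code of Hadoop and Spark, I found that, the way things are currently coded, the best solution at the moment really is to handle this in the FileInputFormats. I added an additional option, named FileInputFormat.dontThrowOnEmptyPaths, that is put into my Hadoop Configuration and that my custom input formats respect. They catch the respective IOException as in my code example above, and rethrow it only if this option is unset or set to false.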
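The approach described above could look roughly like the following sketch. It is not the original code: the base class name `LenientFileInputFormat` is made up for illustration, and only the option name FileInputFormat.dontThrowOnEmptyPaths is taken from the text. It assumes the new (`org.apache.hadoop.mapreduce`) API, where `listStatus` is a protected method of `FileInputFormat`.

```scala
import java.io.IOException
import java.util.{Collections, List => JList}

import org.apache.hadoop.fs.FileStatus
import org.apache.hadoop.mapreduce.JobContext
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

object LenientFileInputFormat {
  // Option name from the answer; it is a custom key, not a Hadoop built-in.
  val DontThrowOnEmptyPaths = "FileInputFormat.dontThrowOnEmptyPaths"
  val NoInputPaths = "No input paths specified in job"
}

// Hypothetical base class for custom input formats. Subclasses still have
// to implement createRecordReader as usual.
abstract class LenientFileInputFormat[K, V] extends FileInputFormat[K, V] {
  import LenientFileInputFormat._

  override protected def listStatus(job: JobContext): JList[FileStatus] =
    try {
      super.listStatus(job)
    } catch {
      case e: IOException if e.getMessage == NoInputPaths =>
        // Swallow the exception only when the option is explicitly true;
        // if it is unset or false, rethrow as before.
        if (job.getConfiguration.getBoolean(DontThrowOnEmptyPaths, false))
          Collections.emptyList[FileStatus]()
        else
          throw e
    }
}
```

Putting the check in a shared base class means new formats inherit the behavior instead of each author having to remember it. Matching on the exception message is brittle, though, since it depends on the exact wording used by the Hadoop version in use.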

This is a workaround; I posted an enhancement suggestion to the JIRA about this.

scala

apache-spark

rdd

spark-dataframe