Why Spark DataFrame doesn't have saveAsORCFile() method in Java?

JavaRDD<Record> dataTable = sc.textFile("hdfs://path/to/files/hdfs").map(
     new Function<String, Record>() {
       public Record call(String line) throws Exception {
         String[] fields = line.split(",");
         Record sd = new Record(fields[0], fields[1], fields[2], fields[3]);
         return sd;
       }
});

 HiveContext hiveContext = new HiveContext(sc);
 DataFrame dataFrameAsORC = hiveContext.applySchema(dataTable, Record.class);
 dataFrameAsORC.saveAsORCFile("/to/hadoop/path"); // does not compile

I'm using Spark 1.4, and I found this Spark-JIRA which mentions that saveAsORCFile is supported in 1.4, but I can't find it in the DataFrame JavaDoc.

I'm new to Spark.

saveAsORCFile is an implicit method on DataFrames.

You have to import it first like this:

import org.apache.spark.sql.hive.orc._

before your code.

Thanks to Holden's comment:

[Scala] implicit conversions aren't supported in Java, instead you should use the standard save API for dataframes and specify the orc type when saving

Here is the code using the OP's example:

JavaRDD<Record> dataTable = sc.textFile("hdfs://path/to/files/hdfs").map(
     new Function<String, Record>() {
       public Record call(String line) throws Exception {
         String[] fields = line.split(",");
         Record sd = new Record(fields[0], fields[1], fields[2], fields[3]);
         return sd;
       }
});

// Pretty sure you can use HiveContext interchangeably here but I had
// better luck with SQLContext
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);

// Schema for the DataFrame is created automatically from Record.class
DataFrame dataFrame = sqlContext.createDataFrame(dataTable, Record.class);
// Record.class needs all the fields with getters/setters
// Need to do a select first or else save doesn't work
// not sure what fields OP's Record has, so I made some up
dataFrame.select("id", "field_one", "field_two", "create_tmsp")
   .save("/to/hadoop/path", "org.apache.spark.sql.hive.orc", SaveMode.ErrorIfExists);

Note: org.apache.spark.sql.hive.orc comes from spark-hive:

<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-hive_2.10</artifactId>
   <version>${spark.version}</version>
   <scope>provided</scope>
</dependency>
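
For what it's worth, Spark 1.4 also introduced the DataFrameWriter API (DataFrame.write()), so the same save should be expressible like this (same made-up columns and data source as above):

// Equivalent save through the DataFrameWriter API added in 1.4,
// still pointing at the fully-qualified ORC data source from spark-hive.
dataFrame.select("id", "field_one", "field_two", "create_tmsp")
   .write()
   .format("org.apache.spark.sql.hive.orc")
   .mode(SaveMode.ErrorIfExists)
   .save("/to/hadoop/path");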