Why doesn't Spark's DataFrame have a saveAsORCFile() method in Java?
JavaRDD<Record> dataTable = sc.textFile("hdfs://path/to/files/hdfs").map(
    new Function<String, Record>() {
        public Record call(String line) throws Exception {
            String[] fields = line.split(",");
            Record sd = new Record(fields[0], fields[1], fields[2], fields[3]);
            return sd;
        }
    });
HiveContext hiveContext = new HiveContext(sc);
DataFrame dataFrameAsORC = hiveContext.applySchema(dataTable, Record.class);
dataFrameAsORC.saveAsORCFile("/to/hadoop/path"); // does not compile
I'm using Spark 1.4, and I found this Spark JIRA which mentions that saveAsORCFile is supported in 1.4, but I can't find it in the DataFrame JavaDoc.

I'm new to Spark.
saveAsORCFile is an implicit method on DataFrame. You have to import it first, like this:

import org.apache.spark.sql.hive.orc._

before your code.
Thanks to Holden's comment:

[Scala] implicit conversions aren't supported in Java, instead you should use the standard save API for dataframes and specify the orc type when saving
Here's the code, using the OP's example:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;

JavaRDD<Record> dataTable = sc.textFile("hdfs://path/to/files/hdfs").map(
    new Function<String, Record>() {
        public Record call(String line) throws Exception {
            String[] fields = line.split(",");
            Record sd = new Record(fields[0], fields[1], fields[2], fields[3]);
            return sd;
        }
    });

// Pretty sure you can use HiveContext interchangeably here, but I had
// better luck with SQLContext
SQLContext sqlContext = new SQLContext(sc);

// The schema for the DataFrame is created automatically from Record.class
DataFrame dataFrame = sqlContext.createDataFrame(dataTable, Record.class);

// Record.class needs all the fields with getters/setters.
// Need to do a select first, or else save doesn't work.
// Not sure what fields the OP's Record has, so I made some up.
dataFrame.select("id", "field_one", "field_two", "create_tmsp")
    .save("/to/hadoop/path", "org.apache.spark.sql.hive.orc", SaveMode.ErrorIfExists);
Note: org.apache.spark.sql.hive.orc comes from spark-hive:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_2.10</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
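As an aside, Spark 1.4 deprecated the save(path, source, mode) variants in favor of the then-new DataFrameWriter/DataFrameReader API. A sketch of the equivalent save and a later read, assuming the same dataFrame, sqlContext, and path as above (the fully qualified source name is the same one passed to save() earlier; newer Spark versions also accept the short name "orc"):

// Equivalent save via the DataFrameWriter API (new in Spark 1.4)
dataFrame.select("id", "field_one", "field_two", "create_tmsp")
    .write()
    .format("org.apache.spark.sql.hive.orc")
    .mode(SaveMode.ErrorIfExists)
    .save("/to/hadoop/path");

// Reading the ORC files back into a DataFrame later
DataFrame loaded = sqlContext.read()
    .format("org.apache.spark.sql.hive.orc")
    .load("/to/hadoop/path");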