AWS EMR Spark - get a CSV and use it with the Spark SQL API

Question

// download the CSV file
ByteArrayOutputStream downloadedFile = downloadFile();

// save the file to a temp folder
java.io.File tmpCsvFile = save(downloadedFile);

// read it
Dataset<Row> ds = session
        .read()
        .option("header", "true")
        .csv(tmpCsvFile.getAbsolutePath());

The tmpCsv file is saved at the following path:

/mnt/yarn/usercache/hadoop/appcache/application_1511379756333_0001/container_1511379756333_0001_02_000001/tmp/1OkYaovxMsmR7iPoPnb8mx45MWvwr6k1y9xIdh8g7K0Q3118887242212394029.csv

Reading it throws this exception:

org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://ip-33-33-33-33.ec2.internal:8020/mnt/yarn/usercache/hadoop/appcache/application_1511379756333_0001/container_1511379756333_0001_02_000001/tmp/1OkYaovxMsmR7iPoPnb8mx45MWvwr6k1y9xIdh8g7K0Q3118887242212394029.csv;

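I think the problem is that the file is saved locally, so when I try to read it through the spark-sql API the file cannot be found. I have already tried sparkContext.addFile(), but it did not help.

Is there a solution?

Thanks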

Answer 1

Spark supports a large number of file systems for reading and writing:

Local/Regular (file://)
S3 (s3://)
HDFS (hdfs://)

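By default, when a path carries no URI scheme, spark-sql resolves it against the cluster's default file system, which on this EMR cluster is HDFS: hdfs://driver_address:port/path.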
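
The resolution rule above explains the path in the exception: the local path was given without a scheme, so it was looked up on HDFS. The following is a simplified illustration of that rule using only java.net.URI. It is not Hadoop's implementation; the class and method names are hypothetical, and it assumes an absolute path made of URI-safe characters.

```java
import java.net.URI;

// Simplified illustration of resolving a scheme-less path against a default
// file system. Not Hadoop's code; PathQualifier and qualify are hypothetical names.
final class PathQualifier {

    /** Returns the path unchanged if it already has a scheme, else prefixes the default file system. */
    static String qualify(String path, URI defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return path; // already explicit, e.g. file:///..., s3://..., hdfs://...
        }
        // No scheme: borrow scheme and authority (host:port) from the default file system.
        return defaultFs.getScheme() + "://" + defaultFs.getAuthority() + path;
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://ip-33-33-33-33.ec2.internal:8020");
        // A bare local path ends up pointing at HDFS.
        System.out.println(qualify("/mnt/yarn/tmp/file.csv", defaultFs));
        // A path with an explicit scheme is left alone.
        System.out.println(qualify("file:///mnt/yarn/tmp/file.csv", defaultFs));
    }
}
```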

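Prefixing the path with file:/// only works in client mode; in my case (cluster mode) it did not. When the driver creates the tasks that read the file, they are sent to executors, and some of those run on nodes that do not have the file.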
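
For reference, the file:/// variant mentioned above could be built as in the sketch below. It only makes the path explicit as a local-file URI. It does not copy the file anywhere, so the file still has to exist at that path on every node that runs a read task. The class name, method name and sample path are hypothetical.

```java
import java.io.File;

// Sketch: turn a local java.io.File into a URI string with the "file" scheme,
// so the path is not resolved against the default file system (HDFS).
final class LocalPathUri {

    static String toFileUri(File file) {
        // File.toURI() produces the "file:/abs/path" form for an absolute path.
        return file.getAbsoluteFile().toURI().toString();
    }

    public static void main(String[] args) {
        File tmpCsvFile = new File("/tmp/example.csv"); // hypothetical sample path
        System.out.println(toFileUri(tmpCsvFile));
    }
}
```

The resulting string would be passed to .csv(...) in place of tmpCsvFile.getAbsolutePath().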

What can we do? Write the file to Hadoop (HDFS).

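   import java.io.ByteArrayOutputStream;
   import java.io.InputStream;
   import java.io.OutputStream;
   import java.net.URI;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.IOUtils;

   Configuration conf = new Configuration();
   ByteArrayOutputStream downloadedFile = downloadFile();
   // convert the output stream into an input stream (Functions is my own helper)
   InputStream is = Functions.FROM_BAOS_TO_IS.apply(downloadedFile);
   // no scheme, so this resolves on the default file system (HDFS here)
   String myfile = "miofile.csv";
   // acquire the file system
   FileSystem fs = FileSystem.get(URI.create(myfile), conf);
   // open an output stream to Hadoop
   OutputStream outf = fs.create(new Path(myfile));
   // write the file; the final 'true' closes both streams when done
   IOUtils.copyBytes(is, outf, 4096, true);
   // read the file back with Spark
   Dataset<Row> ds = session
    .read()
    .option("header", "true")
    .csv(myfile);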
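
The Functions.FROM_BAOS_TO_IS helper used in the snippet above is not shown in the answer. One plausible stand-in, using only the JDK, is sketched below; the class and method names are hypothetical, and the real helper may work differently.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the answer's Functions.FROM_BAOS_TO_IS helper.
final class StreamBridge {

    /** Exposes the bytes collected in a ByteArrayOutputStream as an input stream. */
    static ByteArrayInputStream toInputStream(ByteArrayOutputStream baos) {
        // toByteArray() copies the buffer, so later writes to baos are not visible here.
        return new ByteArrayInputStream(baos.toByteArray());
    }

    public static void main(String[] args) {
        byte[] csv = "id,name\n1,alice\n".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write(csv, 0, csv.length);

        ByteArrayInputStream in = toInputStream(baos);
        byte[] buf = new byte[csv.length];
        int n = in.read(buf, 0, buf.length);
        System.out.print(new String(buf, 0, n, StandardCharsets.UTF_8));
    }
}
```

Because toByteArray() copies the whole buffer, this holds the file in memory twice for a moment, which is fine for small CSV files but worth reconsidering for large downloads.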

Thanks; better solutions are welcome.

java

emr

apache-spark

apache-spark-sql

spark-dataframe