Spark Streaming - Parquet file upload to S3 error
I'm completely new to the Spark Streaming topic.
Through a streaming application I'm creating Parquet files of roughly 2.5 MB each and storing them either on S3 or in a local directory.
The method I use is:
data.write.parquet(destination)
where "data" is a DataFrame.
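For completeness, a minimal self-contained version of the job looks roughly like this (Spark 1.5-era API; the bucket name, input source, and environment-variable credentials are placeholders of mine, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Minimal sketch; the bucket and credentials below are placeholders.
val conf = new SparkConf().setAppName("ParquetToS3").setMaster("local[*]")
val sc = new SparkContext(conf)

// s3n reads its credentials from the Hadoop configuration.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val sqlContext = new SQLContext(sc)
val data = sqlContext.read.json("input.json") // any DataFrame source works
data.write.parquet("s3n://my-bucket/directory/filename")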
If the destination is a local path everything works like a charm, but if I send it to S3 with a path like "s3n://bucket/directory/filename", I get the following exception:
15/12/17 10:47:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:557)
at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:187)
at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.newBackupFile(NativeS3FileSystem.java:263)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.<init>(NativeS3FileSystem.java:245)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:412)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:176)
at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:160)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon.newInstance(ParquetRelation.scala:272)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:234)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$$anonfun$apply$mcV$sp.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$$anonfun$apply$mcV$sp.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Reading from the bucket works fine.
Despite the error, something does get stored in the bucket: the folder structure for the given path is created, but at the end, instead of the file itself, there is only a "filename_$folder$" marker file (the empty objects s3n uses to represent folders).
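For reference, the read that works is just the mirror-image call (same placeholder bucket as in the sketch above):

// Reading the same path back succeeds, as noted above.
val readBack = sqlContext.read.parquet("s3n://my-bucket/directory/filename")
readBack.show()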
Technical details:
- S3 Browser
- Windows 8.1
- IntelliJ CE 14.1.5
- Spark Streaming application
- Spark 1.5 for Hadoop 2.6.0
The problem was in the Hadoop libraries. I had to rebuild winutils (winutils.exe) and the native library (hadoop.dll) with Windows SDK 7, then move them to %HADOOP_HOME%\bin and add %HADOOP_HOME%\bin to the Path variable. The project to rebuild can be found under hadoop-2.7.1-src\hadoop-common-project\hadoop-common\target. For winutils I recommend the Windows-optimized branch: http://svn.apache.org/repos/asf/hadoop/common/branches/branch-trunk-win/
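As a side note, when launching from an IDE such as IntelliJ it can be convenient to also point Hadoop at those binaries programmatically, before the SparkContext is created (the path below is an example, not from the original setup):

// Must run before any Hadoop class loads; the directory should contain
// bin\winutils.exe and bin\hadoop.dll. Example path - adjust to your install.
System.setProperty("hadoop.home.dir", "C:\\hadoop-2.7.1")

Note that hadoop.dll itself is still resolved through the Path variable, so the bin directory has to stay on Path either way.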