PySpark with Zeppelin on EMR gives NoClassDefFoundError

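I am running Zeppelin on EMR and using PySpark to process some log files.

I get this error: "java.lang.NoClassDefFoundError: com/amazonaws/services/s3/AmazonS3".

I'm not sure how to resolve it. I have looked through various resources without success. Any help is appreciated.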

---Error log---

Py4JJavaError: An error occurred while calling o188.partitions.
: java.lang.NoClassDefFoundError: com/amazonaws/services/s3/AmazonS3
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.initialize(EmrFileSystem.java:99)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2644)
    at org.apache.hadoop.fs.FileSystem.access0(FileSystem.java:90)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:200)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:279)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.api.java.JavaRDDLike$class.partitions(JavaRDDLike.scala:65)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:47)
    at sun.reflect.GeneratedMethodAccessor67.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.s3.AmazonS3
    at java.net.URLClassLoader.run(URLClassLoader.java:366)
    at java.net.URLClassLoader.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 32 more

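Sorry for the inconvenience! This is due to a change introduced in emr-4.2.0 that accidentally removed the AWS Java SDK libraries from the effective Zeppelin classpath. The fix has been pushed out to most regions over the past few days and will be pushed out to all remaining regions by the end of this week, so this should work again in emr-4.2.0 now.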
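
If you want to see for yourself which jars on a node actually ship the class named in the error, a rough diagnostic like the one below may help. It is only a sketch: the function name find_jars_with_class is made up for illustration, and it uses nothing but the Python standard library to look inside jar files (which are zip archives). You have to supply the directories to scan yourself, since where EMR and Zeppelin keep their jars is not stated in this thread.

```python
import sys
import zipfile
from pathlib import Path

# The class named in the NoClassDefFoundError, written as a path inside a jar.
MISSING_CLASS_ENTRY = "com/amazonaws/services/s3/AmazonS3.class"


def find_jars_with_class(directories, class_entry=MISSING_CLASS_ENTRY):
    """Return the sorted paths of jars under `directories` containing `class_entry`.

    Hypothetical helper for illustration; not part of EMR, Zeppelin or Spark.
    """
    matches = []
    for directory in directories:
        root = Path(directory)
        if not root.is_dir():
            continue  # skip paths that do not exist on this node
        for jar_path in root.rglob("*.jar"):
            try:
                with zipfile.ZipFile(jar_path) as jar:
                    if class_entry in jar.namelist():
                        matches.append(str(jar_path))
            except (zipfile.BadZipFile, OSError):
                continue  # unreadable or corrupt jar; ignore it
    return sorted(matches)


if __name__ == "__main__":
    # Directories to scan are taken from the command line.
    for path in find_jars_with_class(sys.argv[1:]):
        print(path)
```

Run it on the master node with the directories you suspect as arguments. Keep in mind that a jar existing on disk does not prove it is on Zeppelin's effective classpath. An empty result for the directories Zeppelin loads from would, however, be consistent with the problem described above.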