Running spark-redshift from Intellij using Gradle
I am trying to use the spark-redshift library, but I can't operate on the DataFrame created by the sqlContext.read() call (which reads from Redshift).
Here is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

// Load the Redshift JDBC driver
Class.forName("com.amazon.redshift.jdbc41.Driver")

val conf = new SparkConf().setAppName("Spark Application").setMaster("local[2]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Credentials for the S3 tempdir that spark-redshift uses to stage data
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "****")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "****")

val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://URL")
  .option("dbtable", "table")
  .option("tempdir", "s3n://bucket/folder")
  .load()

df.registerTempTable("table")
val data = sqlContext.sql("SELECT * FROM table")
data.show()
This is the error I get when I run the above code in the main method of a Scala object:
Exception in thread "main" java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.rdd.RDDOperationScope$
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1096)
at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:116)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun.apply(DataSourceStrategy.scala:53)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun.apply(DataSourceStrategy.scala:53)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject.apply(DataSourceStrategy.scala:279)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject.apply(DataSourceStrategy.scala:278)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:310)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:274)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:49)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:374)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun.apply(QueryPlanner.scala:58)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun.apply(QueryPlanner.scala:58)
at scala.collection.Iterator$$anon.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:926)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:924)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:930)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:930)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1314)
at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1377)
at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:401)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:362)
at org.apache.spark.sql.DataFrame.show(DataFrame.scala:370)
at com.triplelift.spark.Main$.main(Main.scala:37)
at com.triplelift.spark.Main.main(Main.scala)
In case it helps, here are my Gradle dependencies as well:
dependencies {
    compile (
        'com.amazonaws:aws-java-sdk:1.10.31',
        'com.amazonaws:aws-java-sdk-redshift:1.10.31',
        'org.apache.spark:spark-core_2.10:1.5.1',
        'org.apache.spark:spark-streaming_2.10:1.5.1',
        'org.apache.spark:spark-mllib_2.10:1.5.1',
        'org.apache.spark:spark-sql_2.10:1.5.1',
        'com.databricks:spark-redshift_2.10:0.5.2',
        'com.fasterxml.jackson.core:jackson-databind:2.6.3'
    )
    testCompile group: 'junit', name: 'junit', version: '4.11'
}
Needless to say, the error is thrown when data.show() is evaluated.
On an unrelated note... does anyone using IntelliJ 14 know how to permanently add the Redshift driver to the module? Every time I do a Gradle refresh, it gets removed from the dependencies in Project Structure. Strange.
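One workaround worth trying (a sketch, not something from the original setup): IntelliJ rebuilds the module's dependency list from the Gradle model on every refresh, so anything added only through Project Structure gets wiped. Declaring the driver jar directly in build.gradle should survive refreshes; the path and filename below are placeholders for wherever the locally downloaded, Amazon-provided Redshift JDBC jar actually lives:

dependencies {
    // Hypothetical path - point this at the Redshift JDBC driver jar,
    // which (at the time) Amazon distributed outside of Maven Central
    compile files('libs/RedshiftJDBC41-1.1.10.1010.jar')
}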
The original problem was getting this error:
com.fasterxml.jackson.databind.JsonMappingException:
Could not find creator property with name 'id' (in class org.apache.spark.rdd.RDDOperationScope)
So I followed this answer here:
I added the line 'com.fasterxml.jackson.core:jackson-databind:2.6.3' and toggled it between different versions (e.g. 2.4.4), then started looking through my external libraries in the Project view... I removed the new jackson-databind dependency because I wanted to see all of the Jackson libraries Spark pulls in... that's when I noticed the Jackson libraries were all at 2.5.1, except for jackson-module-scala_2.10, which was at 2.4.4 - so instead of fiddling with the jackson-databind dependency, I added this:
compile 'com.fasterxml.jackson.module:jackson-module-scala_2.10:2.6.3'
Now my code works. It seems like spark-core 1.5.1 wasn't built quite right before being published to Maven? Not sure.
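For anyone who would rather not bump jackson-module-scala by hand, another option is to pin all of the Jackson artifacts to one version with Gradle's resolution strategy. This is just a sketch of the same idea, not what I actually ran, and the version number is only an example - use whatever level the rest of your dependency tree expects:

configurations.all {
    resolutionStrategy {
        // Keep jackson-databind and jackson-module-scala on the same version
        // so Spark doesn't end up deserializing with mismatched Jackson APIs
        force 'com.fasterxml.jackson.core:jackson-databind:2.6.3',
              'com.fasterxml.jackson.module:jackson-module-scala_2.10:2.6.3'
    }
}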
Note: always check your transitive dependencies and their versions...
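If it helps, one way to check them without digging through the IDE is Gradle's built-in dependency report; assuming a standard setup, something like the following prints the resolved tree, including which transitive versions won any conflicts:

gradle dependencies --configuration compile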