Reading a JSON file in Spark fails with "missing json datasource"
I am trying to read a sample JSON file into a SQLContext with the code below, but it fails with a datasource error:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "C:\samplepath\sample.json"
val jsondata = sqlContext.read.json(path)
java.lang.ClassNotFoundException: Failed to find data source: json.
Please find packages at http://spark-packages.org
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244)
at org.apache.spark.deploy.SparkSubmit$.doRunMain(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.ClassNotFoundException: json.DefaultSource
at scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$$anonfun$apply.apply(ResolvedDataSource.scala:62)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$$anonfun$apply.apply(ResolvedDataSource.scala:62)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun.apply(ResolvedDataSource.scala:62)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun.apply(ResolvedDataSource.scala:62)
at scala.util.Try.orElse(Try.scala:82)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
... 50 more
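The bottom of the trace hints at how Spark 1.6 resolves a datasource name: `ResolvedDataSource.lookupDataSource` tries the provider name as a class, then falls back to the name with `.DefaultSource` appended, which is why the final failure mentions `json.DefaultSource`. A simplified sketch of that fallback (the real lookup also consults a `ServiceLoader` of registered datasources first, which is how the short name `json` normally resolves):

```scala
import scala.util.Try

// Simplified sketch of Spark 1.6's ResolvedDataSource.lookupDataSource fallback:
// try the provider string as a class name, then provider + ".DefaultSource".
def lookupDataSource(provider: String): Try[Class[_]] =
  Try(Class.forName(provider))
    .orElse(Try(Class.forName(provider + ".DefaultSource")))

// "json" is not a loadable class, so both attempts fail with
// ClassNotFoundException, matching the
// "Caused by: java.lang.ClassNotFoundException: json.DefaultSource"
// at the bottom of the trace above.
```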
I searched for a Spark package that might be missing, but could not find anything that fixed it.
I tried similar code with PySpark, and it failed with the same ClassNotFoundException for the json datasource.
After further experimenting, I was able to get results by converting an existing RDD to a JSON RDD. Is there something I am missing? I am using Spark 1.6.1 with Scala 2.10.5. Any help is appreciated. Thanks.
val stringRDD = sc.parallelize(Seq("""
{ "isActive": false,
"balance": ",431.73",
"picture": "http://placehold.it/32x32",
"age": 35,
"eyeColor": "blue"
}""",
"""{
"isActive": true,
"balance": ",515.60",
"picture": "http://placehold.it/32x32",
"age": 34,
"eyeColor": "blue"
}""",
"""{
"isActive": false,
"balance": ",765.29",
"picture": "http://placehold.it/32x32",
"age": 26,
"eyeColor": "blue"
}""")
)
sqlContext.jsonRDD(stringRDD).registerTempTable("testjson")
sqlContext.sql("SELECT age from testjson").collect
I had built the jar from source myself, so I suspect some resources were missing from it. I downloaded the latest jar from the Spark website instead, and it worked as expected.
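For anyone hitting this with a self-built jar: a likely cause (an assumption, since the exact build setup isn't shown) is that the assembly step dropped `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister`, the file Spark's `ServiceLoader` reads to resolve short names like `json`. The fact that `jsonRDD` worked shows the JSON classes themselves were on the classpath; only the registration appears to have been lost. If the jar is built with sbt-assembly, a merge strategy along these lines (a hypothetical `build.sbt` fragment) keeps those service files instead of letting one jar's copy overwrite another's:

```scala
// Hypothetical build.sbt fragment for sbt-assembly: concatenate
// META-INF/services registration files rather than discarding them.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "services", _ @ _*) => MergeStrategy.concat
  case PathList("META-INF", xs @ _*)            => MergeStrategy.discard
  case x                                        => MergeStrategy.first
}
```

As a stopgap, `lookupDataSource` also accepts a fully qualified class name, so `sqlContext.read.format("org.apache.spark.sql.execution.datasources.json.DefaultSource").load(path)` may work even without the service file; note that this is an internal class name and can change between Spark versions.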