Reading a JSON file in Spark fails with "Failed to find data source: json"

I am trying to read a sample JSON file into a SQLContext with the code below, but it fails with a data source error:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val path = "C:\\samplepath\\sample.json"  // backslashes must be escaped in Scala string literals
val jsondata = sqlContext.read.json(path)

java.lang.ClassNotFoundException: Failed to find data source: json. Please find packages at http://spark-packages.org
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
    at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:244)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: json.DefaultSource
    at scala.tools.nsc.interpreter.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:83)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$$anonfun$apply.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun$$anonfun$apply.apply(ResolvedDataSource.scala:62)
    at scala.util.Try$.apply(Try.scala:161)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun.apply(ResolvedDataSource.scala:62)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$$anonfun.apply(ResolvedDataSource.scala:62)
    at scala.util.Try.orElse(Try.scala:82)
    at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:62)
    ... 50 more

I searched for a Spark package that might be missing, but could not find anything that fixes this.
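Looking at the trace, "Caused by: java.lang.ClassNotFoundException: json.DefaultSource" suggests Spark failed to resolve the short name json and then fell back to loading a class literally named json.DefaultSource. One workaround I have seen suggested (a sketch, untested against my build) is to name the built-in JSON source by its fully qualified package, which Spark 1.6 resolves by appending .DefaultSource:

// Sketch: refer to Spark 1.6's built-in JSON source by package name instead
// of the short "json" alias. This only helps if the DefaultSource class is
// actually present on the classpath.
val jsondata = sqlContext.read
  .format("org.apache.spark.sql.execution.datasources.json")
  .load(path)

(As it turned out, the class itself was missing from the jar I built, so this alone would not have fixed it for me.)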

I tried similar code with PySpark, and it failed with the same ClassNotFoundException for the json data source.

After further experimenting, I was able to get results by converting an existing RDD with jsonRDD, as shown below. Is there something I am missing? I am on Spark 1.6.1 with Scala 2.10.5. Any help is appreciated. Thanks.

val stringRDD = sc.parallelize(Seq(
  """{
    "isActive": false,
    "balance": ",431.73",
    "picture": "http://placehold.it/32x32",
    "age": 35,
    "eyeColor": "blue"
  }""",
  """{
    "isActive": true,
    "balance": ",515.60",
    "picture": "http://placehold.it/32x32",
    "age": 34,
    "eyeColor": "blue"
  }""",
  """{
    "isActive": false,
    "balance": ",765.29",
    "picture": "http://placehold.it/32x32",
    "age": 26,
    "eyeColor": "blue"
  }"""))

sqlContext.jsonRDD(stringRDD).registerTempTable("testjson")
sqlContext.sql("SELECT age FROM testjson").collect
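For reference, jsonRDD is deprecated in Spark 1.6 in favor of the reader API. My reading of the 1.6 source (an assumption worth verifying) is that the RDD[String] overload constructs the JSON relation directly instead of going through the data source name lookup, which would explain why this route worked while read.json(path) did not:

// Non-deprecated equivalent in Spark 1.6. The RDD[String] overload appears
// to bypass the "json" short-name lookup that fails above.
val df = sqlContext.read.json(stringRDD)
df.registerTempTable("testjson")
sqlContext.sql("SELECT age FROM testjson").collect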

It turned out I had built the Spark jar from source myself, so I suspect some resources were missing from my build. I downloaded the latest prebuilt jar from the Spark website and everything worked as expected.
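In case it helps anyone else: my assumption (not verified line by line) is that Spark 1.6 resolves short names like json via java.util.ServiceLoader, reading META-INF/services/org.apache.spark.sql.sources.DataSourceRegister from the jar, and that my hand-built assembly lost that service file because the build did not merge service registrations. For an sbt-assembly build, a merge strategy along these lines would preserve it:

// Hypothetical build.sbt fragment (sbt-assembly): concatenate service
// registration files instead of letting one jar's copy clobber the rest.
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  case PathList("META-INF", xs @ _*)             => MergeStrategy.discard
  case _                                         => MergeStrategy.first
}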