Compilation issue in spark scala script containing join on RDDs with 2 columns
I am trying to compile the following script with the sbt package command.
import org.apache.spark.SparkContext, org.apache.spark.SparkConf, org.apache.spark.rdd.PairRDDFunctions

object CusMaxRevenue {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("CusMaxRevenue")
    val sc = new SparkContext(conf)

    val ordersRDD = sc.textFile("/user/sk/sqoop_import/orders")
    val orderItemsRDD = sc.textFile("/user/sk/sqoop_import/order_items")

    // val ordersParsedRDD = ordersRDD.map( rec => ((rec.split(",")(0).toInt), (rec.split(",")(1),rec.split(",")(2)) ))
    val ordersParsedRDD = ordersRDD.map( rec => ((rec.split(",")(0).toInt), rec.split(",")(1) ))
    val orderItemsParsedRDD = orderItemsRDD.map(rec => ((rec.split(",")(1)).toInt, rec.split(",")(4).toFloat))

    val ordersJoinOrderItems = orderItemsParsedRDD.join(ordersParsedRDD)
  }
}
I get the following error:
[info] Set current project to Customer with Max revenue (in build file:/home/sk/scala/app3/)
[info] Compiling 1 Scala source to /home/sk/scala/app3/target/scala-2.10/classes...
[error] /home/sk/scala/app3/src/main/scala/CusMaxRevenue.scala:14: value join is not a member of org.apache.spark.rdd.RDD[(Int, Float)]
[error] val ordersJoinOrderItems = orderItemsParsedRDD.join(ordersParsedRDD)
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
Sample data:
--ordersParsedRDD
(1,2013-07-25 00:00:00.0)
(2,2013-07-25 00:00:00.0)
(3,2013-07-25 00:00:00.0)
(4,2013-07-25 00:00:00.0)
(5,2013-07-25 00:00:00.0)
--orderItemsParsedRDD
(9.98)
(2,199.99)
(2,250.0)
(2,129.99)
(4,49.98)
When I execute the statements one by one at the spark scala prompt, the join appears to work.
PS: The RDDs actually have several columns, but to narrow the problem down I kept only 2, and I still hit the compilation issue!
Additional information:
Contents of my CusMaxRevenue.sbt file:
name := "Customer with Max revenue"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1"
Try this:
val orderItemsParsedRDD = orderItemsRDD.map(rec => (rec.split(",")(1).toInt, rec.split(",")(4).toFloat))
You need to add this import:
import org.apache.spark.SparkContext._
It brings in all the implicit conversions, including the one that adds join to RDDs of pairs.
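For reference, a minimal sketch of how the script might look with that import added (assuming Spark 1.2.x, where the pair-RDD implicits live on the SparkContext companion object, and keeping the paths and field positions from the question); with it in scope, join is available on the RDD[(Int, Float)] / RDD[(Int, String)] pairs and the file should compile:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // brings rddToPairRDDFunctions into scope, which provides join

object CusMaxRevenue {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("CusMaxRevenue")
    val sc = new SparkContext(conf)

    val ordersRDD = sc.textFile("/user/sk/sqoop_import/orders")
    val orderItemsRDD = sc.textFile("/user/sk/sqoop_import/order_items")

    // (order_id, order date string)
    val ordersParsedRDD = ordersRDD.map(rec => (rec.split(",")(0).toInt, rec.split(",")(1)))
    // (order_id, revenue amount)
    val orderItemsParsedRDD = orderItemsRDD.map(rec => (rec.split(",")(1).toInt, rec.split(",")(4).toFloat))

    // RDD[(Int, (Float, String))] -- join keyed by order_id
    val ordersJoinOrderItems = orderItemsParsedRDD.join(ordersParsedRDD)
  }
}

The spark-shell performs this import automatically at startup, which would explain why the same statements joined fine at the prompt but failed when compiled with sbt package.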