Transforming RDD[String] to RDD[myclass]

Question

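I am trying to transform an RDD[String] into an RDD[Picture], but I cannot get it to work. Once I have an RDD[Picture], I plan to use def hasValidCountry to check whether the latitude and longitude values in the picture metadata are valid. After that, I want to use def hasTags in the Picture class to check whether the user tags are valid. The problems I am running into: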

Implicit conversion found: row ⇒ augmentString(row): scala.collection.immutable.StringOps
type mismatch; found: String required: Array[String]
value InterestingPics is not a member of Array[Nothing] possible cause: maybe a semicolon is missing before `value InterestingPics'?

My intention is to select the rows that have a valid country and tags, and to convert all of those rows into a new RDD[Picture].

ScalaFile1 (I have updated the ScalaFile):

  object Part2 {
      def main(args: Array[String]): Unit = {
        var spark: SparkSession = null
        try {
          spark = SparkSession.builder().appName("Flickr using dataframes").config("spark.master", "local[*]").getOrCreate()
          val originalFlickrMeta: RDD[String] = spark.sparkContext.textFile("flickrSample.txt")        
          
      val InterestingPics = originalFlickrMeta.map(row => row.split('\t')).map(field => Picture(field(0).toString())
      InterestingPics.collect
      InterestingPics.take(5).foreach(println)

Answer 1

This works, for example:

case class case_for_rdd(c1: Int, c2: String, c3: String)

val rdd_data = spark.sparkContext.textFile("/FileStore/tables/csv01-4.txt")
val rdd = rdd_data.map(row => row.split(',')).map(field => case_for_rdd(field(0).toInt, field(1), field(2)))
rdd.collect
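
The map above throws at run time if a row has too few fields or a non-numeric first column. A minimal sketch of a more defensive parser, assuming such rows should simply be skipped; the names CaseForRdd, ParseSketch, and parseLine are made up for illustration and are not from the original post:

```scala
import scala.util.Try

// Same shape as case_for_rdd above, renamed to follow Scala naming conventions.
final case class CaseForRdd(c1: Int, c2: String, c3: String)

object ParseSketch {
  // Returns None for rows that do not have exactly three fields
  // or whose first field is not an integer (assumed policy: skip bad rows).
  def parseLine(line: String): Option[CaseForRdd] =
    line.split(",", -1) match {
      case Array(c1, c2, c3) => Try(c1.trim.toInt).toOption.map(CaseForRdd(_, c2, c3))
      case _                 => None
    }
}
```

With Spark this could be applied as rdd_data.flatMap(ParseSketch.parseLine), so malformed rows are dropped instead of failing the job.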

A more complex example that reads an RDD from a file containing an array column. The array needs its own delimiter.

1,10,100,aa|bb|cc
2,20,200,xxxxxx|yyyyyyyy|z|aaa

Some sample code. Use a List rather than an Array: printing an Array shows its object reference (strange strings such as [Ljava.lang.String;@...) instead of its elements. Credit for that goes to smarter people here:

case class case_for_rdd(c1: Int, c2: String, c3: String, a4: List[String])  
val rdd_data = spark.sparkContext.textFile("/FileStore/tables/csv03.txt")
val myCaseRdd = rdd_data.map(row => row.split(',')).map(field => case_for_rdd(field(0).toInt, field(1), field(2), field(3).split('|').toList))
myCaseRdd.collect
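
Both points in this example, splitting on a literal | and printing a List instead of an Array, can be checked in plain Scala without Spark. A minimal sketch; the object name ArrayVsList and its member names are made up for illustration:

```scala
object ArrayVsList {
  val field3 = "aa|bb|cc"

  // split(String) takes a regex, where a bare | means alternation, so it must be escaped.
  // split(Char) treats the character literally.
  val viaRegex: List[String] = field3.split("\\|").toList
  val viaChar: List[String]  = field3.split('|').toList

  // An Array inherits Java's toString (type tag plus hash code),
  // while a List prints its elements.
  val arrayText: String = field3.split('|').toString
  val listText: String  = viaChar.toString
}
```

Here listText is "List(aa, bb, cc)", while arrayText starts with "[Ljava.lang.String;@", which is the kind of string that shows up when an Array field is printed after collect.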

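My advice would be to use DataFrames, which make splitting and similar operations easier. Also, when you manipulate the RDD through conversions, the case class typing is lost. Arrays handled with the DataFrame API have no such issue.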

Answer 2

With help from @thebluephantom, I found the solution to my problem. Many thanks.

val InterestingPics = originalFlickrMeta.map(line => (new Picture(line.split("\t")))).filter(f => f.c != null && f.userTags.length > 0)
      InterestingPics.collect().foreach(println)
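
The line above resolves the type mismatch because the whole Array[String] from split is passed to the Picture constructor, rather than a single field. The Picture class itself is not shown in the post, so the following is only a sketch under stated assumptions: the class body, the column positions (0 for the country code, 1 for comma-separated tags), and the sample lines are all hypothetical, and a local Seq stands in for the RDD[String].

```scala
// Hypothetical stand-in for the asker's Picture class, which the post does not show.
// Assumed columns: 0 = country code, 1 = comma-separated user tags.
class Picture(fields: Array[String]) {
  // null when the country is missing, to match the f.c != null check in the answer.
  val c: String = if (fields.length > 0 && fields(0).nonEmpty) fields(0) else null
  val userTags: Array[String] =
    if (fields.length > 1 && fields(1).nonEmpty) fields(1).split(",") else Array.empty[String]
  override def toString: String =
    "Picture(" + c + ", " + userTags.mkString("[", ", ", "]") + ")"
}

object PictureSketch {
  // Made-up sample rows; on an RDD[String] the map and filter calls have the same shape.
  val lines: Seq[String] = Seq("FR\tparis,tower", "\tnotags", "US\t")

  val interesting: Seq[Picture] =
    lines
      .map(line => new Picture(line.split("\t")))
      .filter(p => p.c != null && p.userTags.length > 0)
}
```

Only the first sample row survives the filter: the second has no country and the third has no tags.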
