How to create multiple RDD rows from a single file record in Apache Spark

I'm implementing the following logic with Apache Spark. My input file contains pipe-delimited rows in this format:

14586|9297,0.000128664|9298,0.0683921
14587|4673,0.00730174
14588|9233,1.15112e-07|9234,4.80094e-05|9235,1.91492e-05|9236,0.00776722

The first column is a key; one or more columns may follow it. Each subsequent column carries a secondary key and a value, like this: 4673,0.00730174. When reading this file, I'd like the resulting RDD to have only 3 columns, flattening the columns after the first one while repeating the primary key, like this:

14586|9297,0.000128664
14586|9298,0.0683921
14587|4673,0.00730174
14588|9233,1.15112e-07
14588|9234,4.80094e-05
14588|9235,1.91492e-05
14588|9236,0.00776722

How can I do this in Scala?

Have you considered using flatMap? It allows you to create 0-n output rows from a single input row. Just parse the line and reconstruct one row per secondary pair, reusing the primary key.
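
To see the 0-to-n expansion in isolation, here is a minimal sketch using plain Scala collections (flatMap on an RDD behaves the same way; the sample strings are made up for illustration):

// Each input element yields zero or more output elements.
Seq("a|1|2", "b", "c|3").flatMap { s =>
  s.split('|').tail
}
// => Seq("1", "2", "3")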

Is this what you're looking for?

import org.apache.spark.SparkContext

val sc: SparkContext = ...
val rdd = sc.parallelize(Seq(
  "14586|9297,0.000128664|9298,0.0683921",
  "14587|4673,0.00730174",
  "14588|9233,1.15112e-07|9234,4.80094e-05|9235,1.91492e-05|9236,0.00776722"
)).flatMap { line =>
  val splits = line.split('|')
  val key = splits.head   // primary key in the first field
  val pairs = splits.tail // one or more "secondaryKey,value" fields

  // Emit one output row per pair, prefixed with the primary key.
  pairs.map { pair =>
    s"$key|$pair"
  }
}

rdd.collect().foreach(println)

Output:

14586|9297,0.000128664
14586|9298,0.0683921
14587|4673,0.00730174
14588|9233,1.15112e-07
14588|9234,4.80094e-05
14588|9235,1.91492e-05
14588|9236,0.00776722
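
If you would rather work with typed records than re-joined strings (for example, to feed reduceByKey or a join afterwards), the same flatMap can emit tuples instead. This is only a sketch and assumes every line is well-formed, as in your sample:

// Same flattening, but producing (key, (secondaryKey, value)) tuples.
val typed = sc.parallelize(Seq(
  "14586|9297,0.000128664|9298,0.0683921",
  "14587|4673,0.00730174"
)).flatMap { line =>
  val splits = line.split('|')
  val key = splits.head.toLong
  splits.tail.map { pair =>
    val Array(subKey, value) = pair.split(',')
    (key, (subKey.toLong, value.toDouble))
  }
}
// typed: RDD[(Long, (Long, Double))]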