Text manipulation in Spark and Scala
Here is my data:
review/text: The product picture and part number match, but they together do not math the description.
review/text: A necessity for the Garmin. Used the adapter to power the unit on my motorcycle. Works like a charm.
review/text: This power supply did the job and got my computer back online in a hurry.
review/text: Not only did the supply work. it was easy to install, a lot quieter than the PowMax that fried.
review/text: This is an awesome power supply that was extremely easy to install.
review/text: I had my doubts since best buy would end up charging me . at the time I bought my camera for the card and the cable.
review/text: Amazing... Installed the board, and that's it, no driver needed. Work great, no error messages.
This is what I tried:
import org.apache.spark.{SparkContext, SparkConf}

object test12 {
  def filterfunc(s: String): Array[String] = {
    s.split("""\.""")
      .map(_.split(" ")
        .filter(_.nonEmpty)
        .map(_.replaceAll("""\W""", "").toLowerCase)
        .filter(_.nonEmpty))
      .flatMap(x => x)
  }

  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("pre2").setMaster("local")
    val sc = new SparkContext(conf1)
    val rdd = sc.textFile("data/2012/2012.txt")
    val stopWords = sc.broadcast(List[String]("reviewtext", "a", "about", "above",
      "according", "accordingly", "across", "actually", ...))
    var grouped_doc_words = rdd.flatMap({ (line) =>
      val words = line.map(filterfunc).filter(word_filter.value)
      words.map(w => {
        (line.hashCode(), w)
      })
    }).groupByKey()
  }
}
This is the output I want to generate:
doc1: product picture number match together not math description.
doc2: necessity garmin. adapter power unit my motorcycle. works like charm.
doc3: power supply job computer online hurry.
doc4: not supply work. easy install quieter powmax fried.
...
A couple of exceptions: 1- (not, n't, non, none) must not be dropped; 2- all dot (.) characters must be kept.
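The two exception rules above can be sketched as a word-level predicate (a minimal plain-Scala sketch; `ExceptionRules`, `keepWord`, and the tiny stop-word set are hypothetical names for illustration, not part of my real code):

```scala
// Sketch of the two exception rules, no Spark needed:
// 1) negations (not, n't, non, none) are never dropped, even if stop-listed;
// 2) dots are tokenized separately beforehand, so they pass the filter too.
object ExceptionRules {
  val negations = Set("not", "n't", "non", "none")

  // Keep a word if it is a negation, otherwise keep it only when
  // it is not a stop word. The "." token is never in the stop list.
  def keepWord(w: String, stopWords: Set[String]): Boolean =
    negations.contains(w) || !stopWords.contains(w)

  def main(args: Array[String]): Unit = {
    val stop = Set("the", "do", "not", "a") // "not" is listed, but rule 1 overrides
    val words = "they do not match .".split(" ")
    println(words.filter(keepWord(_, stop)).mkString(" ")) // they not match .
  }
}
```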
My code above doesn't work very well.
I think the error is in the marked line:
var grouped_doc_words = rdd.flatMap({ (line) =>
  val words = line.map(filterfunc).filter(word_filter.value) // **
  words.map(w => {
    (line.hashCode(), w)
  })
}).groupByKey()
Here:
line.map(filterfunc)
should be:
filterfunc(line)
Explanation: line is a String, and map runs over a collection of items. When you write line.map(...), it applies the passed function to each individual Char — which is not what you want:
scala> val line2 = "This is a long string"
line2: String = This is a long string
scala> line2.map(_.length)
<console>:13: error: value length is not a member of Char
line2.map(_.length)
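To make the contrast concrete, here is a minimal plain-Scala sketch (no Spark; `MapVsApply` and the `toWords` stand-in for filterfunc are hypothetical names for illustration):

```scala
// Why f(line) differs from line.map(f) when line is a String.
object MapVsApply {
  // Simplified stand-in for filterfunc: split a sentence into words.
  def toWords(s: String): Array[String] =
    s.split(" ").filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val line = "This is a long string"

    // toWords(line): the function receives the whole String once.
    println(toWords(line).mkString(", ")) // This, is, a, long, string

    // line.map(...): the lambda receives one Char at a time,
    // so only Char-level operations are possible here.
    println(line.map(_.toUpper)) // THIS IS A LONG STRING
  }
}
```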
Also, I don't know what you intend with this part of filterfunc:
.map(_.replaceAll( """\W""", "")
In the end I couldn't get it to run properly in spark-shell. Could you update the question if this solves your problem?
Why not something like this?
That way you don't need any grouping or flatMapping.
EDIT:
I wrote that by hand and it had some small errors, but I hope the idea was clear. Here is tested code:
def processLine(s: String, stopWords: Set[String]): List[String] = {
  s.toLowerCase()
    .replaceAll("""[^a-zA-Z\. ]""", "") // keep letters, dots, and spaces
    .replaceAll("""\.""", " .")         // make each dot its own token
    .split("""\s+""")
    .filter(w => !stopWords.contains(w))
    .toList
}

def main(args: Array[String]): Unit = {
  val conf1 = new SparkConf().setAppName("pre2").setMaster("local")
  val sc = new SparkContext(conf1)
  val rdd = sc.parallelize(
    List(
      "The product picture and part number match, but they together do not math the description.",
      "A necessity for the Garmin. Used the adapter to power the unit on my motorcycle. Works like a charm.",
      "This power supply did the job and got my computer back online in a hurry."
    )
  )
  val stopWords = sc.broadcast(
    Set("reviewtext", "a", "about", "above",
      "according", "accordingly",
      "across", "actually", "..."))
  val grouped_doc_words = rdd.map(processLine(_, stopWords.value))
  grouped_doc_words.collect().foreach(p => println(p))
}

The result is:
List(the, product, picture, and, part, number, match, but, they, together, do, not, math, the, description, .)
List(necessity, for, the, garmin, ., used, the, adapter, to, power, the, unit, on, my, motorcycle, ., works, like, charm, .)
List(this, power, supply, did, the, job, and, got, my, computer, back, online, in, hurry, .)
Now, if you want strings instead of lists, just do:
grouped_doc_words.map(_.mkString(" "))