Unresolved Lucene dependency in SBT

I am trying to write a text-classifier application for data scraped from the various review sites for our product. I am starting from a movie-classifier example just to get a working snippet running, and will then adapt it to my requirements.

The example uses a Lucene analyzer to stem the text descriptions, but it does not compile (I am building with SBT). The compile error is below.

> compile
[info] Updating {file:/D:/ScalaApps/MovieClassifier/}movieclassifier...
[info] Resolving com.sun.jersey.jersey-test-framework#jersey-test-framework-griz
[info] Resolving com.fasterxml.jackson.module#jackson-module-scala_2.10;2.4.4 ..
[info] Resolving org.spark-project.hive.shims#hive-shims-common-secure;0.13.1a .
[info] Resolving org.apache.lucene#lucene-analyzers-common_2.10;5.1.0 ...
[warn]  module not found: org.apache.lucene#lucene-analyzers-common_2.10;5.1.0
[warn] ==== local: tried
[warn]   C:\Users\manik.jasrotia\.ivy2\local\org.apache.lucene\lucene-analyzers-common_2.10\5.1.0\ivys\ivy.xml
[warn] ==== public: tried
[warn]   https://repo1.maven.org/maven2/org/apache/lucene/lucene-analyzers-common_2.10/5.1.0/lucene-analyzers-common_2.10-5.1.0.pom
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: org.apache.lucene#lucene-analyzers-common_2.10;5.1.0: not found
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn]  Note: Unresolved dependencies path:
[warn]          org.apache.lucene:lucene-analyzers-common_2.10:5.1.0 (D:\ScalaApps\MovieClassifier\build.sbt#L7-18)
[warn]            +- naivebayes_document_classifier:naivebayes_document_classifier_2.10:1.0
[trace] Stack trace suppressed: run last *:update for the full output.
[error] (*:update) sbt.ResolveException: unresolved dependency: org.apache.lucene#lucene-analyzers-common_2.10;5.1.0: not found
[error] Total time: 31 s, completed Dec 6, 2015 11:01:45 AM
>

I am using two Scala files (Stemmer.scala and MovieClassifier.scala). Both programs are given below, along with the build.sbt file. Any help is appreciated.

MovieClassifier.scala

import org.apache.spark.mllib.classification.NaiveBayes  
import org.apache.spark.mllib.regression.LabeledPoint  
import org.apache.spark.{SparkContext, SparkConf}  
import org.apache.spark.mllib.feature.{IDF, HashingTF}

object MovieRatingClassifier {  
  def main(args:Array[String])
    {

      val sparkConfig = new SparkConf().setAppName("Movie Rating Classifier")
      val sc = new SparkContext(sparkConfig)

      /*
    This loads the data from HDFS.
    HDFS is a distributed file storage system so this technically 
    could be a very large multi terabyte file
      */      
      val dataFile = sc.textFile("D:/spark4/mydata/naive_bayes_movie_classification.txt")

      /*
    HashingTF and IDF are helpers in MLlib that helps us vectorize our
    synopsis instead of doing it manually
      */       
      val hashingTF = new HashingTF()

      /*
    Our ultimate goal is to get our data into a collection of type LabeledPoint.
    The MLlib implementation uses LabeledPoints to train the Naive Bayes model.
    First we parse the file for ratings and vectorize the synopses
       */

      val ratings = dataFile.map { x =>
        x.split(";") match {
          case Array(rating, synopsis) =>
            rating.toDouble
        }
      }

      val synopsis_frequency_vector = dataFile.map { x =>
        x.split(";") match {
          case Array(rating, synopsis) =>
            val stemmed = Stemmer.tokenize(synopsis)
            hashingTF.transform(stemmed)
        }
      }

      synopsis_frequency_vector.cache()

      /*
       http://en.wikipedia.org/wiki/Tf%E2%80%93idf
       https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html
      */
      val idf = new IDF().fit(synopsis_frequency_vector)
      val tfidf=idf.transform(synopsis_frequency_vector)

      /*produces (rating,vector) tuples*/
      val zipped=ratings.zip(tfidf)

      /*Now we transform them into LabeledPoints*/
      val labeledPoints = zipped.map{case (label,vector)=> LabeledPoint(label,vector)}

      val model = NaiveBayes.train(labeledPoints)

      /*--- Model is trained now we get it to classify our test file with only synopsis ---*/
      val testDataFile = sc.textFile("D:/spark4/naive_bayes_movie_classification-test.txt")

      /*We only have synopsis now. The rating is what we want to achieve.*/
      val testVectors = testDataFile.map { x =>
        val stemmed = Stemmer.tokenize(x)
        hashingTF.transform(stemmed)
      }
      testVectors.cache()

      val tfidf_test = idf.transform(testVectors)

      val result = model.predict(tfidf_test)

      result.collect.foreach(x=>println("Predicted rating for the movie is: "+x))

    }
}

Stemmer.scala

import org.apache.lucene.analysis.en.EnglishAnalyzer  
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute  
import scala.collection.mutable.ArrayBuffer

object Stemmer {

  // Adapted from
  // https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/

  def tokenize(content:String):Seq[String]={
    val analyzer=new EnglishAnalyzer()
    val tokenStream=analyzer.tokenStream("contents", content)
    //CharTermAttribute is what we're extracting

    val term=tokenStream.addAttribute(classOf[CharTermAttribute])

    tokenStream.reset() // must be called by the consumer before consumption to clean the stream



    val result = ArrayBuffer.empty[String]

    while (tokenStream.incrementToken()) {
      val termValue = term.toString
      // drop tokens containing digits or dots
      if (!(termValue matches """.*[\d\.].*""")) {
        result += termValue
      }
    }
    tokenStream.end()
    tokenStream.close()
    result
  }
}
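As a quick sanity check (not part of the original post), once the Lucene dependency resolves, the tokenizer can be exercised on its own; the sample sentence below is made up:

object StemmerCheck {
  def main(args: Array[String]): Unit = {
    // Stop words and tokens containing digits are dropped and the remaining words are stemmed,
    // so this should print something like: ArrayBuffer(quick, brown, fox, jump, lazi, dog)
    println(Stemmer.tokenize("The quick brown foxes jumped over 2 lazy dogs"))
  }
}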

The build.sbt file

name := "NaiveBayes_Document_Classifier"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"

libraryDependencies += "org.apache.spark" % "spark-mllib" % "1.4.0" % "provided"

libraryDependencies += "org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"

Are you sure you didn't type

libraryDependencies += "org.apache.lucene" %% "lucene-analyzers-common" % "5.1.0"

(double %%) rather than what you have written here? Because the error shows sbt looking for a Scala-versioned build of Lucene, even though it is actually a plain Java library. Lucene should use a single %, as you wrote here; mllib, on the other hand, should use a double %%. That is, try:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.4.0" % "provided",
  "org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"
)
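For context, %% simply appends the project's Scala binary version to the artifact name, so with scalaVersion := "2.10.4" the following two lines are equivalent (a sketch to show the mechanism, not part of the fix itself):

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0"
// ...is the same as writing the Scala suffix by hand:
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.4.0"

That is why a %% on Lucene makes sbt look for org.apache.lucene#lucene-analyzers-common_2.10;5.1.0, exactly the artifact the error says cannot be found: Lucene is published without a Scala suffix.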

Note that you also seem to have introduced a regression relative to the answer you received here.

The problem was resolved by using the dependency below:

"org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"