Unresolved Lucene dependency in SBT
I am trying to write a text classifier application for my data, which is scraped from various review sites for our product. I am using a movie classifier example just to get a running snippet, which I will then adapt to my requirements.
The example uses a Lucene analyzer to tokenize and stem the text descriptions, but it does not compile (I am using SBT). The compilation error is below.
> compile
[info] Updating {file:/D:/ScalaApps/MovieClassifier/}movieclassifier...
[info] Resolving com.sun.jersey.jersey-test-framework#jersey-test-framework-griz
[info] Resolving com.fasterxml.jackson.module#jackson-module-scala_2.10;2.4.4 ..
[info] Resolving org.spark-project.hive.shims#hive-shims-common-secure;0.13.1a .
[info] Resolving org.apache.lucene#lucene-analyzers-common_2.10;5.1.0 ...
[warn] module not found: org.apache.lucene#lucene-analyzers-common_2.10;5.1.0
[warn] ==== local: tried
[warn] C:\Users\manik.jasrotia\.ivy2\local\org.apache.lucene\lucene-analyzers-common_2.10\5.1.0\ivys\ivy.xml
[warn] ==== public: tried
[warn] https://repo1.maven.org/maven2/org/apache/lucene/lucene-analyzers-common_2.10/5.1.0/lucene-analyzers-common_2.10-5.1.0.pom
[info] Resolving org.fusesource.jansi#jansi;1.4 ...
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: UNRESOLVED DEPENDENCIES ::
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn] :: org.apache.lucene#lucene-analyzers-common_2.10;5.1.0: not found
[warn] ::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] Note: Unresolved dependencies path:
[warn] org.apache.lucene:lucene-analyzers-common_2.10:5.1.0 (D:\ScalaApps\MovieClassifier\build.sbt#L7-18)
[warn] +- naivebayes_document_classifier:naivebayes_document_classifier_2.10:1.0
[trace] Stack trace suppressed: run last *:update for the full output.
[error] (*:update) sbt.ResolveException: unresolved dependency: org.apache.lucene#lucene-analyzers-common_2.10;5.1.0: not found
[error] Total time: 31 s, completed Dec 6, 2015 11:01:45 AM
>
I am using two Scala files (Stemmer.scala and MovieClassifier.scala). Both programs are given below, along with the build.sbt file. Any help is appreciated.
MovieClassifier.scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.mllib.feature.{IDF, HashingTF}

object MovieRatingClassifier {
  def main(args: Array[String]) {

    val sparkConfig = new SparkConf().setAppName("Movie Rating Classifier")
    val sc = new SparkContext(sparkConfig)

    /*
      This loads the data from HDFS.
      HDFS is a distributed file storage system, so this technically
      could be a very large, multi-terabyte file.
    */
    val dataFile = sc.textFile("D:/spark4/mydata/naive_bayes_movie_classification.txt")

    /*
      HashingTF and IDF are helpers in MLlib that help us vectorize our
      synopses instead of doing it manually.
    */
    val hashingTF = new HashingTF()

    /*
      Our ultimate goal is to get our data into a collection of type LabeledPoint.
      The MLlib implementation uses LabeledPoints to train the Naive Bayes model.
      First we parse the file for ratings and vectorize the synopses.
    */
    val ratings = dataFile.map { x =>
      x.split(";") match {
        case Array(rating, synopsis) =>
          rating.toDouble
      }
    }

    val synopsis_frequency_vector = dataFile.map { x =>
      x.split(";") match {
        case Array(rating, synopsis) =>
          val stemmed = Stemmer.tokenize(synopsis)
          hashingTF.transform(stemmed)
      }
    }

    synopsis_frequency_vector.cache()

    /*
      http://en.wikipedia.org/wiki/Tf%E2%80%93idf
      https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html
    */
    val idf = new IDF().fit(synopsis_frequency_vector)
    val tfidf = idf.transform(synopsis_frequency_vector)

    /* produces (rating, vector) tuples */
    val zipped = ratings.zip(tfidf)

    /* Now we transform them into LabeledPoints */
    val labeledPoints = zipped.map { case (label, vector) => LabeledPoint(label, vector) }

    val model = NaiveBayes.train(labeledPoints)

    /* --- Model is trained; now we get it to classify our test file, which contains only synopses --- */
    val testDataFile = sc.textFile("D:/spark4/naive_bayes_movie_classification-test.txt")

    /* We only have the synopsis now. The rating is what we want to predict. */
    val testVectors = testDataFile.map { x =>
      val stemmed = Stemmer.tokenize(x)
      hashingTF.transform(stemmed)
    }
    testVectors.cache()

    val tfidf_test = idf.transform(testVectors)
    val result = model.predict(tfidf_test)

    result.collect.foreach(x => println("Predicted rating for the movie is: " + x))
  }
}
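For clarity, the split(";") pattern above implies that each training line is a numeric rating and a free-text synopsis separated by a semicolon. The snippet below is only an illustrative sketch of that assumed format; the sample line and the FormatCheck object are made up for illustration and are not part of the actual data file or project.

object FormatCheck {
  def main(args: Array[String]): Unit = {
    // Hypothetical line, assumed format: <numeric rating>;<free-text synopsis>
    val sampleLine = "4.0;A retired detective is pulled back in for one last case."
    val Array(rating, synopsis) = sampleLine.split(";")
    println(rating.toDouble) // 4.0
    println(synopsis)        // the text that would be passed to Stemmer.tokenize
  }
}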
Stemmer.scala
import org.apache.lucene.analysis.en.EnglishAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import scala.collection.mutable.ArrayBuffer

object Stemmer {
  // Adapted from
  // https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
  def tokenize(content: String): Seq[String] = {
    val analyzer = new EnglishAnalyzer()
    val tokenStream = analyzer.tokenStream("contents", content)
    // CharTermAttribute is what we're extracting
    val term = tokenStream.addAttribute(classOf[CharTermAttribute])

    tokenStream.reset() // must be called by the consumer before consumption to clean the stream

    val result = ArrayBuffer.empty[String]
    while (tokenStream.incrementToken()) {
      val termValue = term.toString
      // skip tokens containing digits or dots
      if (!(termValue matches ".*[\\d\\.].*")) {
        result += term.toString
      }
    }
    tokenStream.end()
    tokenStream.close()
    result
  }
}
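For reference, once the Lucene dependency resolves, the tokenizer can be exercised on its own. This is only a minimal sketch with a made-up sentence and a hypothetical StemmerDemo object; the exact tokens depend on Lucene's EnglishAnalyzer (lower-casing, stop-word removal, Porter stemming).

object StemmerDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical input; the token containing digits ("2015") is dropped by the regex in Stemmer.tokenize
    val tokens = Stemmer.tokenize("The movies were surprisingly entertaining in 2015")
    println(tokens.mkString(", "))
  }
}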
build.sbt
name := "NaiveBayes_Document_Classifier"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.0" % "provided"
libraryDependencies += "org.apache.spark" % "spark-mllib" % "1.4.0" % "provided"
libraryDependencies += "org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"
Are you sure you didn't type

libraryDependencies += "org.apache.lucene" %% "lucene-analyzers-common" % "5.1.0"

(double %%) instead of what you wrote here? Because it looks like you are in fact requesting a Scala-versioned build of Lucene, when it is actually a Java library. It should be a single %, as you wrote here, whereas spark-mllib should use a double %%. That is, try:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.4.0" % "provided",
  "org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"
)
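The difference matters because %% tells sbt to append your Scala binary version to the artifact name. With scalaVersion 2.10.4, the two forms below request different artifacts; only the first exists on Maven Central, since Lucene is a plain Java library.

// With scalaVersion := "2.10.4":
"org.apache.lucene" %  "lucene-analyzers-common" % "5.1.0" // artifact lucene-analyzers-common (exists)
"org.apache.lucene" %% "lucene-analyzers-common" % "5.1.0" // artifact lucene-analyzers-common_2.10 (does not exist)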
Note that you seem to have introduced a regression from the answer you received here.
This issue was resolved by using the following dependency:
"org.apache.lucene" % "lucene-analyzers-common" % "5.1.0"