Word2vec in Spark ML
I am trying to run Spark MLlib's word2vec implementation, and I am using Scala for this. The input to my model is an array of sequences of strings. It looks like this:
scala> f.take(5)
res11: Array[org.apache.spark.sql.Row] = Array([WrappedArray(0_42)], [WrappedArray(big, baller, shoe, ?)], [WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, tribe, become, future, kal...
val v=f.map(l=>Seq(l.toString))
scala> v.take(5)
res31: Array[Seq[String]] = Array(List([WrappedArray(0_42)]), List ([WrappedArray(big, baller, shoe, ?)]), List([WrappedArray(since, eliud, win, ,, quick, fact, from, runner, from, country, kalenjins, !, write, ., happy, quick, fact, kalenjins, location, :, kenya, (, kenya's, western, highland, rift, valley, ), population, :, 4, ., 9, million, ;, compose, 11, subtribes, language, :, kalenjin, ;, swahili, ;, english, church, :, christianity, ~, africa, inland, church, [, aic, ],, church, province, kenya, [, cpk, ],, roman, catholic, church, ;, islam, translation, :, kalenjin, translate, ", tell, ", formation, :, wwii, ,, gikuyu, tribal, member, wish, separate, create, identity, ., later, ,, student, attend, alliance, high, school, (, first, british, public, school, kenya, ), form, ....
Each sentence is in a separate list, as shown above. I run the model with v as input:
scala> val model = word2vec.fit(v)
But the output of this model doesn't look right. When I save the model and read back its parquet file (a), I get the following result.
model.save(sc, "myModelPath")
val a=sqlContext.read.parquet("myModelPath")
a.show(20,false)
+--------------------------------------------------------------------+
|word |
+--------------------------------------------------------------------+
|[WrappedArray(coffee, machine)] |
|[WrappedArray(good, experience)] |
|[WrappedArray(love, room, !)] |
|[WrappedArray(parking, .)] |
|[WrappedArray(breakfast, great, !)] |
|[WrappedArray(bed, comfortable, room, spacious, .)] |
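Instead of creating a vector for each word, this word2vec model creates a vector for each array of words.
I am not sure what the correct way to feed input to this model is, or how it splits the input into sentences or words.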
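I bet that if you look at v.first, you will see List([WrappedArray(0_42)]), and if you look at v.first.head, you will see [WrappedArray(0_42)]. But v.first.head is a String, so what you are actually looking at is "[WrappedArray(0_42)]". There is no WrappedArray, just a string. Perhaps you accidentally called toString on a WrappedArray (or fell victim to an implicit conversion to String). Word2Vec really does see strings like "[WrappedArray(coffee, machine)]" in its input, and it builds a model from those strings.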
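To make the pitfall concrete, here is a minimal sketch in plain Scala with no Spark involved. It uses an ordinary List as a stand-in for one tokenized row, so the printed container name is List rather than WrappedArray; the object and value names are made up for illustration.

```scala
object ToStringPitfall {
  // Stand-in for one tokenized sentence (a plain List, not a Spark Row).
  val sentence: Seq[String] = List("coffee", "machine")

  // What the original map does: stringify the whole container, then wrap
  // that single string in a new Seq.
  val wrong: Seq[String] = Seq(sentence.toString)

  // What a word-level model needs: the tokens themselves.
  val right: Seq[String] = sentence

  def main(args: Array[String]): Unit = {
    println(wrong.size) // 1
    println(wrong.head) // List(coffee, machine)
    println(right.size) // 2
  }
}
```

The "wrong" sequence contains exactly one token, the printed form of the whole sentence, which is why the saved model lists whole arrays in its word column.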
Update
If I have the types right, f is a DataFrame in which each Row contains a single field holding a Seq[String] (actually a WrappedArray).
So, instead of
val v=f.map(l=>Seq(l.toString))
extracting that field should do the trick:
val v = f.map(r => r.getSeq[String](0))
This yields a Dataset[Seq[String]], which should be suitable as input to Word2Vec.
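An end-to-end sketch of the fix might look like the following. It rests on several assumptions that are not in the question: Spark 2.x or later with a local SparkSession, a toy DataFrame standing in for f, and small illustrative values for minCount and vectorSize. It goes through f.rdd because the fit method of org.apache.spark.mllib.feature.Word2Vec takes an RDD. On Spark 1.x, DataFrame.map already returns an RDD, so f.map would behave the same way there.

```scala
import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.sql.SparkSession

object Word2VecInputSketch {
  // Trains on a toy corpus and returns the words the model learned vectors for.
  def trainedVocabulary(spark: SparkSession): Set[String] = {
    import spark.implicits._

    // Toy stand-in for f: one array<string> column per row (assumed schema).
    val f = Seq(
      Seq("coffee", "machine"),
      Seq("good", "experience"),
      Seq("coffee", "good")
    ).toDF("words")

    // Extract the token sequence from each Row instead of calling toString.
    val v = f.rdd.map(r => r.getSeq[String](0))

    // minCount = 1 only so that the tiny toy corpus keeps every word.
    val model = new Word2Vec().setMinCount(1).setVectorSize(10).fit(v)
    model.getVectors.keySet
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("word2vec-input-sketch")
      .getOrCreate()
    trainedVocabulary(spark).foreach(println)
    spark.stop()
  }
}
```

With the field extracted this way, the vocabulary should consist of individual tokens rather than strings of the form "[WrappedArray(...)]".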