How correctly make TF-IDF vectors of sentences in Apache Spark with Java?
I have this code:
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.feature.IDF;
import org.apache.spark.mllib.feature.IDFModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.sql.SparkSession;

public class TfIdfExample {
    public static void main(String[] args) {
        JavaSparkContext sc = SparkSingleton.getContext();
        SparkSession spark = SparkSession.builder()
                .config("spark.sql.warehouse.dir", "spark-warehouse")
                .getOrCreate();
        JavaRDD<List<String>> documents = sc.parallelize(Arrays.asList(
                Arrays.asList("this is a sentence".split(" ")),
                Arrays.asList("this is another sentence".split(" ")),
                Arrays.asList("this is still a sentence".split(" "))), 2);
        HashingTF hashingTF = new HashingTF();
        documents.cache();
        JavaRDD<Vector> featurizedData = hashingTF.transform(documents);
        // alternatively, CountVectorizer can also be used to get term frequency vectors
        IDF idf = new IDF();
        IDFModel idfModel = idf.fit(featurizedData);
        featurizedData.cache();
        JavaRDD<Vector> tfidfs = idfModel.transform(featurizedData);
        System.out.println(tfidfs.collect());
        KMeansProcessor kMeansProcessor = new KMeansProcessor();
        JavaPairRDD<Vector, Integer> result = kMeansProcessor.Process(tfidfs);
        result.collect().forEach(System.out::println);
    }
}
I need to get vectors to feed into k-means, but I am getting strange vectors:
[(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),
(1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),
(1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0])]
After k-means runs I get:
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),1)
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),0)
((1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0]),1)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),1)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),1)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),0)
((1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0]),1)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),0)
((1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0]),1)
But I don't think it is working correctly, because the TF-IDF result should look different.
I thought mllib had ready-made methods for this, but I tested the documentation examples and did not get what I need, and I have not found a custom solution for Spark. Has anyone worked with this and can tell me what I am doing wrong? Maybe I am not using the mllib functionality correctly?
What you are getting after TF-IDF is a SparseVector.
To understand the values better, let me start with the TF vectors:
(1048576,[489554,540177,736740,894973],[1.0,1.0,1.0,1.0])
(1048576,[455491,540177,736740,894973],[1.0,1.0,1.0,1.0])
(1048576,[489554,540177,560488,736740,894973],[1.0,1.0,1.0,1.0,1.0])
For example, the TF vector corresponding to the first sentence is a 1048576-component (= 2^20) vector with 4 non-zero values at indices 489554, 540177, 736740 and 894973; all other values are zero and are therefore not stored in the sparse vector representation.
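If you want to see that structure explicitly, here is a minimal sketch that reuses the `tfidfs` RDD from your code (the helper class name is just for illustration): it casts each result to SparseVector and prints its size, indices and values.

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.SparseVector;
import org.apache.spark.mllib.linalg.Vector;

// Illustrative helper: prints the internal structure of each TF-IDF vector.
public class SparseVectorInspector {
    public static void inspect(JavaRDD<Vector> tfidfs) {
        for (Vector v : tfidfs.collect()) {
            SparseVector sv = (SparseVector) v;  // HashingTF/IDF produce sparse vectors
            System.out.println("size    = " + sv.size());                      // 1048576
            System.out.println("indices = " + Arrays.toString(sv.indices()));  // hashed term ids
            System.out.println("values  = " + Arrays.toString(sv.values()));   // weights
        }
    }
}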
The dimensionality of the feature vectors is equal to the number of buckets you hash into: 1048576 = 2^20 buckets in your case.
For a corpus of this size you should consider reducing the number of buckets:
HashingTF hashingTF = new HashingTF(32);
A power of 2 is recommended in order to minimize the number of hash collisions.
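Putting that together, a sketch of the same TF/IDF pipeline with a 32-bucket hash space could look like the following (the variable name `documents` is taken from your code; densifying the result at the end is optional and only an assumption about what your KMeansProcessor expects).

import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.feature.HashingTF;
import org.apache.spark.mllib.feature.IDF;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Illustrative sketch: TF-IDF with 2^5 = 32 buckets instead of the default 2^20.
public class SmallHashingTfIdf {
    public static JavaRDD<Vector> tfidf(JavaRDD<List<String>> documents) {
        HashingTF hashingTF = new HashingTF(32);           // 32 buckets, a power of 2
        JavaRDD<Vector> tf = hashingTF.transform(documents);
        tf.cache();                                        // used by both fit and transform
        JavaRDD<Vector> tfidf = new IDF().fit(tf).transform(tf);
        // Optional: with only 32 components it is cheap to densify before clustering.
        return tfidf.map(v -> Vectors.dense(v.toArray()));
    }
}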
Next, you apply the IDF weighting:
(1048576,[489554,540177,736740,894973],[0.28768207245178085,0.0,0.0,0.0])
(1048576,[455491,540177,736740,894973],[0.6931471805599453,0.0,0.0,0.0])
(1048576,[489554,540177,560488,736740,894973],[0.28768207245178085,0.0,0.6931471805599453,0.0,0.0])
If we look at the first sentence again, we get 3 zeros. This is expected: the terms "this", "is" and "sentence" appear in every document of the corpus, so by the definition of IDF their weights are equal to zero.
Why are the zero values still kept in the (sparse) vector? Because in the current implementation the size of the vector is kept the same, and only the values are multiplied by the IDF.
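As a quick sanity check of those numbers: spark.mllib's IDF uses the smoothed formula idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. A small standalone sketch (no Spark needed, with document frequencies read off your three sentences) reproduces the weights in the output above.

// Illustrative check of the IDF weights for the 3-sentence corpus (m = 3).
public class IdfCheck {
    public static void main(String[] args) {
        int m = 3;  // number of documents
        // "this", "is", "sentence" appear in all 3 documents -> 0.0
        System.out.println(Math.log((m + 1.0) / (3 + 1)));
        // "a" appears in 2 documents -> 0.28768207245178085
        System.out.println(Math.log((m + 1.0) / (2 + 1)));
        // "another" / "still" appear in 1 document -> 0.6931471805599453
        System.out.println(Math.log((m + 1.0) / (1 + 1)));
    }
}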