Apache Spark text similarity

I am trying the following example in Java:

Efficient string matching in Apache Spark

Here is my code:

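import java.util.ArrayList;
import java.util.List;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.Transformer;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.MinHashLSH;
import org.apache.spark.ml.feature.MinHashLSHModel;
import org.apache.spark.ml.feature.NGram;
import org.apache.spark.ml.feature.RegexTokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class App {
    public static void main(String[] args) {
        System.out.println("Hello World!");

        // Backslashes must be escaped inside Java string literals.
        System.setProperty("hadoop.home.dir", "D:\\del");

        List<MyRecord> firstRow = new ArrayList<MyRecord>();
        firstRow.add(new MyRecord("1", "Love is blind"));

        // id "2" matches the transformed dataset B shown below.
        List<MyRecord> secondRow = new ArrayList<MyRecord>();
        secondRow.add(new MyRecord("2", "Luv is blind"));

        SparkSession spark = SparkSession.builder().appName("LSHExample").config("spark.master", "local")
                .getOrCreate();

        Dataset<Row> firstDataFrame = spark.createDataFrame(firstRow, MyRecord.class);
        Dataset<Row> secondDataFrame = spark.createDataFrame(secondRow, MyRecord.class);

        firstDataFrame.show(20, false);
        secondDataFrame.show(20, false);

        // Split on non-word characters, so each token is a whole word.
        RegexTokenizer regexTokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words")
                .setPattern("\\W");
        // Word-level 3-grams: a three-word sentence produces exactly one n-gram.
        NGram ngramTransformer = new NGram().setN(3).setInputCol("words").setOutputCol("ngrams");
        HashingTF hashingTF = new HashingTF().setInputCol("ngrams").setOutputCol("vectors");
        // numHashTables is left at its default of 1.
        MinHashLSH minHashLSH = new MinHashLSH().setInputCol("vectors").setOutputCol("lsh");

        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[] { regexTokenizer, ngramTransformer, hashingTF, minHashLSH });

        PipelineModel model = pipeline.fit(firstDataFrame);

        Dataset<Row> dataset1 = model.transform(firstDataFrame);
        dataset1.show(20, false);

        Dataset<Row> dataset2 = model.transform(secondDataFrame);
        dataset2.show(20, false);

        // The fitted MinHashLSH model is the last stage of the pipeline.
        Transformer[] transformers = model.stages();
        MinHashLSHModel temp = (MinHashLSHModel) transformers[transformers.length - 1];
        // Keeps only pairs whose Jaccard distance is below 0.01.
        temp.approxSimilarityJoin(dataset1, dataset2, 0.01).show(20, false);
    }

    // Public static nested bean; Spark reads its columns through the getters.
    public static class MyRecord {
        private String id;
        private String text;

        private MyRecord(String id, String text) {
            this.id = id;
            this.text = text;
        }

        public String getId() {
            return id;
        }

        public String getText() {
            return text;
        }
    }
}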

Before calling approxSimilarityJoin, the two datasets look like this.

Transformed dataset A

+---+-------------+-----------------+---------------+-----------------------+----------------+
|id |text         |words            |ngrams         |vectors                |lsh             |
+---+-------------+-----------------+---------------+-----------------------+----------------+
|1  |Love is blind|[love, is, blind]|[love is blind]|(262144,[243005],[1.0])|[[2.02034596E9]]|
+---+-------------+-----------------+---------------+-----------------------+----------------+

Transformed dataset B

+---+------------+----------------+--------------+----------------------+----------------+
|id |text        |words           |ngrams        |vectors               |lsh             |
+---+------------+----------------+--------------+----------------------+----------------+
|2  |Luv is blind|[luv, is, blind]|[luv is blind]|(262144,[57733],[1.0])|[[7.79808048E8]]|
+---+------------+----------------+--------------+----------------------+----------------+

Although the texts "Love is blind" and "Luv is blind" are nearly identical, I get the following empty output.

+--------+--------+-------+
|datasetA|datasetB|distCol|
+--------+--------+-------+
+--------+--------+-------+

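Please point out any mistakes in the code above.

I also tested with the same input in both datasets; the output is below. When both datasets contain the same text, distCol is zero.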

+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+-------+
|datasetA                                                                                                                        |datasetB                                                                                                                        |distCol|
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+-------+
|[1,Love is blind,WrappedArray(love, is, blind),WrappedArray(love is blind),(262144,[243005],[1.0]),WrappedArray([2.02034596E9])]|[2,Love is blind,WrappedArray(love, is, blind),WrappedArray(love is blind),(262144,[243005],[1.0]),WrappedArray([2.02034596E9])]|0.0    |
+--------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------+-------+

The following example uses the same concept.

https://databricks.com/blog/2017/05/09/detecting-abuse-scale-locality-sensitive-hashing-uber-engineering.html

I think I am missing something fundamental in this program. Please advise.


It works after applying the suggestions given by user 8371915.

I removed the NGram stage and increased numHashTables:

MinHashLSH minHashLSH = new MinHashLSH().setInputCol("features").setOutputCol("hashValues").setNumHashTables(20);

Now I can see how this matching works.
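One way to see why more hash tables helps is a back-of-the-envelope calculation. The sketch below is plain Java with hypothetical names and rests on an idealized assumption: each MinHash table makes two rows collide with probability equal to their Jaccard similarity s, independently of the other tables. A pair is only compared if at least one table collides, so the chance of it becoming a candidate is roughly 1 - (1 - s)^k for k tables.

```java
// Idealized sketch (an assumption, not Spark's exact behavior): every hash
// table collides with probability s, independently of the other tables.
class HashTableOdds {
    /** Probability that at least one of k tables collides: 1 - (1 - s)^k. */
    static double candidateProbability(double similarity, int numHashTables) {
        return 1.0 - Math.pow(1.0 - similarity, numHashTables);
    }

    public static void main(String[] args) {
        // "Love is blind" vs "Luv is blind" as word sets: 2 shared words out of 4 distinct.
        double s = 0.5;
        for (int k : new int[] {1, 5, 20}) {
            System.out.printf("numHashTables=%d -> %.6f%n", k, candidateProbability(s, k));
        }
    }
}
```

Under that assumption a pair with similarity 0.5 has an even chance of being missed with a single table, while with 20 tables a miss becomes very unlikely.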

Here are my two datasets.

Dataset A

+---+-------------+
|id |text         |
+---+-------------+
|1  |Love is blind|
+---+-------------+

Dataset B

+---+-------------------------+
|id |text                     |
+---+-------------------------+
|1  |Love is blind            |
|2  |Luv is blind             |
|3  |Lov is blind             |
|4  |This is totally different|
|5  |God is love              |
|6  |blind love is divine     |
+---+-------------------------+

The final output is as follows.


|datasetA                                                                                                                                                                                                                                                                                                                                                                                                                                                             |datasetB                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |distCol|

|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]                            |0.0    |
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[2,Luv is blind,WrappedArray(luv, is, blind),(262144,[15889,48831,84987],[1.0,1.0,1.0]),WrappedArray([-2.021501434E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-6.70773282E8], [-6.93210471E8], [-1.205754635E9], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [4.46435174E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.036250081E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]                              |0.5    |
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[5,God is love,WrappedArray(god, is, love),(262144,[15889,57304,186480],[1.0,1.0,1.0]),WrappedArray([-7.6253133E7], [-2.6669178E7], [-1.590526534E9], [-2.83593282E8], [-1.060055906E9], [-1.411500923E9], [-9.83191394E8], [-8.0411681E7], [-1.04032919E9], [-1.373403353E9], [-5.63413619E8], [-1.240833109E9], [-1.48476096E8], [-1.7390215E9], [-1.745820849E9], [8.1559665E7], [-1.997519365E9], [-1.635066748E9], [6.38995945E8], [-1.59718287E9])]                                        |0.5    |
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[6,blind love is divine,WrappedArray(blind, love, is, divine),(262144,[15889,25596,48831,186480],[1.0,1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-1.627956291E9], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.93451596E9], [-1.882820721E9], [-7.50906814E8], [-1.152091375E9], [-1.997519365E9], [-1.380314819E9], [-8.50494401E8], [-1.869738298E9])]|0.25   |
|[1,Love is blind,WrappedArray(love, is, blind),(262144,[15889,48831,186480],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.815588486E9], [-1.411500923E9], [-6.93210471E8], [-8.0411681E7], [-1.713286948E9], [-1.698342316E9], [-9.33829921E8], [-1.240833109E9], [-1.48476096E8], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.997519365E9], [-1.380314819E9], [-5.92484283E8], [-1.869738298E9])]|[3,Lov is blind,WrappedArray(lov, is, blind),(262144,[15889,48831,81946],[1.0,1.0,1.0]),WrappedArray([-1.06555007E9], [-1.557513224E9], [-1.590526534E9], [-2.83593282E8], [-1.88316392E9], [-1.776275893E9], [-6.93210471E8], [-1.39927757E8], [-1.713286948E9], [-1.698342316E9], [-1.164990332E9], [-1.240833109E9], [-1.519529732E9], [-1.882820721E9], [-7.50906814E8], [1.99715481E8], [-1.036250081E9], [-1.380314819E9], [-1.808919173E9], [-1.869738298E9])]                            |0.5    |
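
The distCol values above are consistent with the Jaccard distance between the word sets of the two texts, which is the distance MinHashLSH works with. The following is a minimal sketch in plain Java, without Spark; the class and method names are hypothetical, and it compares word sets directly instead of the hashed vectors, so it ignores possible HashingTF collisions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class JaccardSketch {
    /** Lowercases the text and splits it on non-word characters. */
    static Set<String> words(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
    }

    /** Jaccard distance: 1 - |A intersect B| / |A union B|. */
    static double jaccardDistance(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> base = words("Love is blind");
        String[] others = {"Love is blind", "Luv is blind", "Lov is blind",
                "This is totally different", "God is love", "blind love is divine"};
        for (String other : others) {
            System.out.println(other + " -> " + jaccardDistance(base, words(other)));
        }
    }
}
```

"Luv is blind" shares two of four distinct words with "Love is blind", hence 0.5, and "blind love is divine" shares three of four, hence 0.25. "This is totally different" shares only "is" out of six distinct words, a distance of about 0.83. approxSimilarityJoin returns only pairs whose distance is below the threshold and that collide in at least one hash table, which may be why that row is missing from the output.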


I have a few suggestions:

  • If you use NGrams, consider a more granular tokenizer. The goal here is to correct for misspellings:

    RegexTokenizer regexTokenizer = new RegexTokenizer()
       .setInputCol("text")
       .setOutputCol("words")
       .setPattern("");
    
    NGram ngramTransformer = new NGram()
      .setN(3)
      .setInputCol("words")
      .setOutputCol("ngrams");
    

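    With your current code (NGram(3) applied to a three-word sentence split on \W), each sentence yields a single token, so the two texts have nothing in common and there is no similarity.

  • Increase the number of hash tables for the LSH (setNumHashTables). The default value (1) is too small for anything beyond trivial cases.

  • Normalize Unicode strings. There is a Scala Transformer for this.

  • Convert the text to lowercase. You can use SQLTransformer: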

    import org.apache.spark.ml.feature.SQLTransformer
    
    val sqlTrans = new SQLTransformer().setStatement(
       "SELECT *, lower(normalized_text) FROM __THIS__")