KeywordAnalyzer to handle different spellings of words with umlauts

How can I get KeywordAnalyzer to recognize a name like Müller regardless of how it is spelled?

KeywordAnalyzer expects an exact match. I would like it to match Müller, but also Mueller (ue digraph) and Muller.

The following custom analyzer does the trick:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.de.GermanNormalizationFilter;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;

public final class KeywordAnalyzerDE extends Analyzer {
    public KeywordAnalyzerDE() {
    }

    @Override
    protected TokenStreamComponents createComponents(final String fieldName) {
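        // Emit the whole field value as a single token, as KeywordAnalyzer does.
        final Tokenizer source = new KeywordTokenizer();

        // Fold German umlaut spellings first, then any remaining diacritics.
        TokenStream result = new GermanNormalizationFilter(source);
        result = new ASCIIFoldingFilter(result);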

        return new TokenStreamComponents(source, result);
    }
}

The key is GermanNormalizationFilter:

It allows for the fact that ä, ö and ü are sometimes written as ae, oe and ue.

  • 'ß' is replaced by 'ss'
  • 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
  • 'ae' and 'oe' are replaced by 'a', and 'o', respectively.
  • 'ue' is replaced by 'u', when not following a vowel or q.
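
To make the rules above concrete, here is a minimal plain-Java sketch of them. It is not Lucene's implementation, only a simplified model of the four rules as listed; the class name `GermanNormalizationSketch` and the method `normalize` are hypothetical, and treating 'y' as a vowel is an assumption. Like the rules themselves, it only looks at lowercase letters.

```java
final class GermanNormalizationSketch {

    /** Applies the four listed rules to the input (simplified model, not Lucene code). */
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length() + 2);
        boolean afterVowelOrQ = false; // previous char was a vowel or 'q'
        boolean absorbsE = false;      // previous char swallows a following 'e'

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case 'a':
                case 'o':
                    out.append(c);
                    absorbsE = true;           // "ae" -> "a", "oe" -> "o"
                    afterVowelOrQ = true;
                    break;
                case 'u':
                    out.append(c);
                    absorbsE = !afterVowelOrQ; // "ue" -> "u" unless after a vowel or q
                    afterVowelOrQ = true;
                    break;
                case 'e':
                    if (!absorbsE) {
                        out.append(c);         // keep 'e' unless it closes a digraph
                    }
                    absorbsE = false;
                    afterVowelOrQ = true;
                    break;
                case 'i':
                case 'y': // assumption: 'y' counts as a vowel here
                case 'q':
                    out.append(c);
                    absorbsE = false;
                    afterVowelOrQ = true;
                    break;
                case '\u00e4': // a with umlaut
                case '\u00f6': // o with umlaut
                case '\u00fc': // u with umlaut
                    out.append(c == '\u00e4' ? 'a' : c == '\u00f6' ? 'o' : 'u');
                    absorbsE = false;
                    afterVowelOrQ = true;
                    break;
                case '\u00df': // sharp s
                    out.append("ss");
                    absorbsE = false;
                    afterVowelOrQ = false;
                    break;
                default:
                    out.append(c);
                    absorbsE = false;
                    afterVowelOrQ = false;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Mueller written with u-umlaut, with the "ue" digraph, and with plain "u".
        for (String name : new String[] {"M\u00fcller", "Mueller", "Muller"}) {
            System.out.println(normalize(name)); // prints "Muller" each time
        }
    }
}
```

All three spellings collapse to the same token, which is why they match each other. The rules are a heuristic and can also shorten words that were never umlaut spellings, so the same analyzer should be used at index time and at query time.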

I added ASCIIFoldingFilter in case the processed text contains other diacritics.
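
As a rough illustration of what folding the remaining diacritics means, the sketch below uses the JDK's `java.text.Normalizer` as a stand-in. This is a different technique from ASCIIFoldingFilter, which works from its own character mapping table and also covers letters that have no Unicode decomposition; the class name `DiacriticFoldingSketch` and the method `fold` are hypothetical.

```java
import java.text.Normalizer;

final class DiacriticFoldingSketch {

    /** Strips combining marks after canonical decomposition (stand-in, not Lucene code). */
    static String fold(String input) {
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks, which are then dropped.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("Ren\u00e9")); // e with acute accent becomes plain "e"
    }
}
```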

Looking at the source code really helped: