KeywordAnalyzer to handle different spellings of words with umlauts
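How can I get KeywordAnalyzer to recognize names like Müller regardless of how they are spelled? KeywordAnalyzer expects an exact match, but I want it to match Müller as well as Mueller (the ue digraph) and Muller.
The following custom analyzer solves the problem: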
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.KeywordTokenizer;
    import org.apache.lucene.analysis.de.GermanNormalizationFilter;
    import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;

    public final class KeywordAnalyzerDE extends Analyzer {

        public KeywordAnalyzerDE() {
        }

        @Override
        protected TokenStreamComponents createComponents(final String fieldName) {
            // Like KeywordAnalyzer: emit the whole input as a single token.
            final Tokenizer source = new KeywordTokenizer();
            TokenStream result;
            // Fold German spelling variants (ü / ue -> u, ß -> ss, ...).
            result = new GermanNormalizationFilter(source);
            // Fold any remaining diacritics to their ASCII equivalents.
            result = new ASCIIFoldingFilter(result);
            return new TokenStreamComponents(source, result);
        }
    }
The key is GermanNormalizationFilter:
It allows for the fact that ä, ö and ü are sometimes written as ae, oe and ue.
- 'ß' is replaced by 'ss'
- 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
- 'ae' and 'oe' are replaced by 'a', and 'o', respectively.
- 'ue' is replaced by 'u', when not following a vowel or q.
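To make the rules above concrete, here is a minimal plain-Java sketch of them. It is not Lucene's code: the class name GermanSpellingSketch and the method normalize are made up for illustration, and, like the rules quoted above, it only looks at lower-case letters. The umlauts and ß are written as Unicode escapes so the file compiles regardless of source encoding.

```java
// Hypothetical illustration of the quoted rules; not the Lucene implementation.
final class GermanSpellingSketch {

    private enum State {
        NORMAL,     // previous character was not a vowel or q
        VOWEL,      // previous character was a vowel or q: a following "ue" is kept
        AFTER_AOU   // previous character was a, o or u: a following 'e' is dropped
    }

    static String normalize(String term) {
        StringBuilder out = new StringBuilder(term.length() + 2);
        State state = State.NORMAL;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case 'a':
                case 'o':
                    out.append(c);
                    state = State.AFTER_AOU;
                    break;
                case 'u':
                    out.append(c);
                    // 'u' only starts a "ue" digraph when not following a vowel or q
                    state = (state == State.NORMAL) ? State.AFTER_AOU : State.VOWEL;
                    break;
                case 'e':
                    if (state != State.AFTER_AOU) {
                        out.append(c); // ordinary 'e'; otherwise part of ae/oe/ue
                    }
                    state = State.VOWEL;
                    break;
                case 'i':
                case 'q':
                case 'y':
                    out.append(c);
                    state = State.VOWEL;
                    break;
                case '\u00e4': // a-umlaut -> a
                    out.append('a');
                    state = State.VOWEL;
                    break;
                case '\u00f6': // o-umlaut -> o
                    out.append('o');
                    state = State.VOWEL;
                    break;
                case '\u00fc': // u-umlaut -> u
                    out.append('u');
                    state = State.VOWEL;
                    break;
                case '\u00df': // sharp s -> ss
                    out.append("ss");
                    state = State.NORMAL;
                    break;
                default:
                    out.append(c);
                    state = State.NORMAL;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The three spellings from the question all print "Muller".
        System.out.println(normalize("M\u00fcller"));
        System.out.println(normalize("Mueller"));
        System.out.println(normalize("Muller"));
    }
}
```

With this sketch, all three spellings from the question normalize to Muller, while a word such as "quelle" keeps its "ue" because the u follows a q.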
I added ASCIIFoldingFilter in case the processed text contains other diacritics.
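For a rough idea of what folding "other diacritics" means, the sketch below uses the JDK's java.text.Normalizer as a stand-in. This is not ASCIIFoldingFilter and it covers fewer characters (letters without a canonical decomposition, such as ø or ł, are left untouched); the class name DiacriticFoldingSketch and the method stripDiacritics are hypothetical.

```java
import java.text.Normalizer;

// Stand-in illustration only; ASCIIFoldingFilter handles many more characters.
final class DiacriticFoldingSketch {

    static String stripDiacritics(String text) {
        // NFD splits an accented letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Remove the combining marks and keep the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("Ren\u00e9"));  // e-acute, prints "Rene"
        System.out.println(stripDiacritics("Mu\u00f1oz")); // n-tilde, prints "Munoz"
    }
}
```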
Looking at the source code really helped.