Elasticsearch modify asciifolding

Question

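The ASCII folding token filter folds the characters "Ə"/"ə" (U+018F / U+0259) to "A"/"a". I need to modify the folding, or add a rule, so that they fold to "E"/"e" instead. A char_filter doesn't help, because it doesn't preserve the original.

Adding the analyzer: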

curl -XPUT 'localhost:9200/myix/_settings?pretty' -H 'Content-Type: application/json' -d'
{
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "standard",
                    "filter" : ["standard", "my_ascii_folding"]
                }
            },
            "filter" : {
                "my_ascii_folding" : {
                    "type" : "asciifolding",
                    "preserve_original" : true
                }
            }
        }
}
'

Test result:

http://localhost:9200/myix/_analyze?text=üöğıəçşi_ÜÖĞIƏÇŞİ&filter=my_ascii_folding

{
  "tokens": [
    {
      "token": "uogiacsi_UOGIACSI",
      "start_offset": 0,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "üöğıəçşi_ÜÖĞIƏÇŞİ",
      "start_offset": 0,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}

Answer 1

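Looking at Lucene's ASCIIFoldingFilter.java source file, it does indeed seem that Ə gets folded into an A and not an E. Even the ICU folding filter, which is asciifolding on steroids, does the same folding.

However, there is an interesting discussion on this very topic, and it seems that, based on pronunciation, it should be folded into a rather than e: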

A quick search on English or French Wikipedia, where it currently gets folded, shows that it gets folded to an a! I would have expected an e based on orthography, but a makes sense in terms of pronunciation (in English, at least).

Someone else even argues that neither a nor e makes sense:

That seems like a really bad decision. I don't think ə should fold to either of a or e.

Anyway, I don't think there is any way around this other than using a char_filter or extending the ASCIIFoldingFilter and packaging it yourself as an ES analysis plugin.
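For the char_filter route, a minimal sketch of the analysis settings might look like the following. It assumes the built-in `mapping` character filter. The name `schwa_to_e` is hypothetical, and the index and analyzer names are carried over from the question. The `standard` token filter is left out here.

```json
{
  "analysis": {
    "char_filter": {
      "schwa_to_e": {
        "type": "mapping",
        "mappings": ["Ə => E", "ə => e"]
      }
    },
    "filter": {
      "my_ascii_folding": {
        "type": "asciifolding",
        "preserve_original": true
      }
    },
    "analyzer": {
      "default": {
        "tokenizer": "standard",
        "char_filter": ["schwa_to_e"],
        "filter": ["my_ascii_folding"]
      }
    }
  }
}
```

Because character filters run before the tokenizer, the token that `preserve_original` keeps will already contain E/e in place of Ə/ə. This is the limitation mentioned in the question. If the untouched original form must stay searchable, this sketch is not enough on its own.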
