Elasticsearch: modify asciifolding
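The ASCII folding token filter folds the characters "Ə"/"ə" (U+018F / U+0259) to "A"/"a". I need to modify that folding, or add one, so that they become "E"/"e". A char_filter didn't help, and neither did preserve_original.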
Adding the analyzer:
curl -XPUT 'localhost:9200/myix/_settings?pretty' -H 'Content-Type: application/json' -d'
{
"analysis" : {
"analyzer" : {
"default" : {
"tokenizer" : "standard",
"filter" : ["standard", "my_ascii_folding"]
}
},
"filter" : {
"my_ascii_folding" : {
"type" : "asciifolding",
"preserve_original" : true
}
}
}
}
'
Test result:
http://localhost:9200/myix/_analyze?text=üöğıəçşi_ÜÖĞIƏÇŞİ&filter=my_ascii_folding
{
"tokens": [
{
"token": "uogiacsi_UOGIACSI",
"start_offset": 0,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "üöğıəçşi_ÜÖĞIƏÇŞİ",
"start_offset": 0,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 0
}
]
}
Looking at Lucene's ASCIIFoldingFilter.java source file, it does indeed seem that Ə gets folded into an A and not an E. Even the ICU folding filter, which is asciifolding on steroids, does the same folding.
However, there is an interesting discussion on this topic, and based on pronunciation it seems that it should indeed fold to a rather than e:
A quick search on English or French Wikipedia, where it currently gets folded, shows that it gets folded to an a! I would have expected an e based on orthography, but a makes sense in terms of pronunciation (in English, at least).
Someone else even argues that neither a nor e makes sense:
That seems like a really bad decision. I don't think ə should fold to either of a or e.
Anyway, I don't think there is any other way than using a char_filter, or extending the ASCIIFoldingFilter and packaging it yourself into an ES analysis plugin.
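For the char_filter route, a minimal sketch of the settings body might look like the following. It assumes a mapping char filter, here given the made-up name schwa_to_e, attached to the default analyzer so that Ə/ə are rewritten before the token reaches my_ascii_folding. The "standard" token filter from the question is left out on purpose: it does nothing, and as far as I know later Elasticsearch releases no longer accept it.

```json
{
  "analysis": {
    "char_filter": {
      "schwa_to_e": {
        "type": "mapping",
        "mappings": ["Ə => E", "ə => e"]
      }
    },
    "filter": {
      "my_ascii_folding": {
        "type": "asciifolding",
        "preserve_original": true
      }
    },
    "analyzer": {
      "default": {
        "type": "custom",
        "char_filter": ["schwa_to_e"],
        "tokenizer": "standard",
        "filter": ["my_ascii_folding"]
      }
    }
  }
}
```

Two caveats, both hedged. First, a char filter is part of the analyzer, so an _analyze call that passes only filter=my_ascii_folding would not exercise it; testing through the analyzer itself is more likely to show the effect. Second, because the mapping runs before tokenization, the token kept by preserve_original would already contain e instead of ə, so the true original spelling is not retained with this approach.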