Elasticsearch custom analyzer randomly crashing on numeric value
I have a cluster of three nodes, configured with three shards and two replicas.
I created a custom analyzer called "name", defined as follows:
"analyzer": {
"name": {
"tokenizer": "lowercase",
"filter": [
"base_elision",
"name_synonym",
"unique_tokens"
],
"char_filter": [
"acronym_filter"
]
}
}
It is part of the mapping applied to myIndex.
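For reference, a minimal sketch of how such an analyzer is typically wired into the index settings (the filter definitions, listed in full in the Edit below, are abbreviated here with "..."):
PUT myIndex
{
  "settings": {
    "analysis": {
      "char_filter": { "acronym_filter": { ... } },
      "filter": {
        "base_elision": { ... },
        "name_synonym": { ... },
        "unique_tokens": { ... }
      },
      "analyzer": {
        "name": {
          "type": "custom",
          "tokenizer": "lowercase",
          "char_filter": [ "acronym_filter" ],
          "filter": [ "base_elision", "name_synonym", "unique_tokens" ]
        }
      }
    }
  }
}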
When I call
POST myIndex/_analyze
{
"analyzer": "name",
"text": "90000175",
"explain": true
}
it randomly gives me a 500 error (I would say about 2 times out of 3, but I have not kept any statistics):
{
"error": {
"root_cause": [
{
"type": "null_pointer_exception",
"reason": null
}
],
"type": "null_pointer_exception",
"reason": null
},
"status": 500
}
When it does answer, the token lists in the response are empty:
{
"detail" : {
"custom_analyzer" : true,
"charfilters" : [
{
"name" : "acronym_filter",
"filtered_text" : [
"90000175"
]
}
],
"tokenizer" : {
"name" : "lowercase",
"tokens" : [ ]
},
"tokenfilters" : [
{
"name" : "base_elision",
"tokens" : [ ]
},
{
"name" : "name_synonym",
"tokens" : [ ]
},
{
"name" : "unique_tokens",
"tokens" : [ ]
}
]
}
}
Also, everything works fine when the text
value is not purely numeric (e.g. "90000175a"). With that example I get every document containing a,
and the numeric part is ignored.
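The empty token lists themselves are expected: the built-in lowercase tokenizer behaves like the letter tokenizer combined with lowercasing, so it breaks text at every non-letter character and discards digits entirely. This is easy to confirm without the index, using only built-in components:
POST _analyze
{
  "tokenizer": "lowercase",
  "text": "90000175a"
}
This yields the single token "a", whereas "90000175" yields no tokens at all. What should not happen is the intermittent null_pointer_exception.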
Edit: here are the filter definitions:
"char_filter": {
"acronym_filter": {
"type": "pattern_replace",
"pattern": "(?<=([\. ][a-z])|(^[a-z]))[\. ]+(?=([a-z][\. ])|([a-z]$))",
"replacement": ""
}
}
"base_elision": {
"type": "elision",
"articles": [
"l",
"m",
"t",
"qu",
"n",
"s",
"j",
"d",
"c"
]
}
"name_synonym": {
"type": "synonym",
"lenient": true,
"synonyms": [
"mairie, commune, ville",
"etab => etablissement",
"ets => etablissement, entreprise",
"ste, soc, societe",
"exploi, exploitation",
"electricite de france => electricite de france, edf",
"pdts, produits"
]
}
"unique_tokens": {
"type": "unique"
}
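To narrow down which component misbehaves, note that the _analyze API also accepts inline definitions, so each piece can be tested in isolation without going through myIndex. For example, the char_filter alone (the sample text "e. d. f" is only an illustration; the pattern collapses it to "edf"):
POST _analyze
{
  "tokenizer": "lowercase",
  "char_filter": [
    {
      "type": "pattern_replace",
      "pattern": "(?<=([\\. ][a-z])|(^[a-z]))[\\. ]+(?=([a-z][\\. ])|([a-z]$))",
      "replacement": ""
    }
  ],
  "text": "e. d. f"
}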
It was indeed a bug, fixed in 6.8.2.
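Given that the failures were intermittent on a three-node cluster, one plausible reading is that the outcome depended on which node happened to serve the request (e.g. during a rolling upgrade). The version running on each node can be checked with:
GET _cat/nodes?v&h=name,version
Any node still older than 6.8.2 would carry the unfixed bug.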