How to get an index item that has "name": "McLaren" by searching for "mclaren" in Elasticsearch 1.7?
Here is the tokenizer -
"tokenizer": {
"filename" : {
"pattern" : "[^\p{L}\d]+",
"type" : "pattern"
}
},
The mapping -
"name": {
"type": "string",
"analyzer": "filename_index",
"include_in_all": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
},
"lower_case_sort": {
"type": "string",
"analyzer": "naturalsort"
}
}
},
The analyzer -
"filename_index" : {
"tokenizer" : "filename",
"filter" : [
"word_delimiter",
"lowercase",
"russian_stop",
"russian_keywords",
"russian_stemmer",
"czech_stop",
"czech_keywords",
"czech_stemmer"
]
},
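For reference, here is a minimal sketch of how these three snippets would fit together inside the index's analysis settings (the surrounding "settings"/"analysis" wrapper is assumed; it is not shown above):

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "filename": {
          "type": "pattern",
          "pattern": "[^\p{L}\d]+"
        }
      },
      "analyzer": {
        "filename_index": {
          "tokenizer": "filename",
          "filter": [
            "word_delimiter",
            "lowercase",
            "russian_stop",
            "russian_keywords",
            "russian_stemmer",
            "czech_stop",
            "czech_keywords",
            "czech_stemmer"
          ]
        }
      }
    }
  }
}
```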
I want to get the indexed item by searching for mclaren, but the indexed name is McLaren. I want to stick with query_string because a lot of other functionality is built on top of it. Here is the query that fails to return the expected result -
{
"query": {
"filtered": {
"query": {
"query_string" : {
"query" : "mclaren",
"default_operator" : "AND",
"analyze_wildcard" : true,
}
}
}
},
"size" :50,
"from" : 0,
"sort": {}
}
How can I achieve this? Thanks!
I got it! The problem is definitely around the word_delimiter token filter.
By default it will:
Split tokens at letter case transitions. For example: PowerShot → Power, Shot
So macLaren produces two tokens -> [mac, Laren], while maclaren produces only one token -> [maclaren].
Analysis example:
POST _analyze
{
"tokenizer": {
"pattern": """[^\p{L}\d]+""",
"type": "pattern"
},
"filter": [
"word_delimiter"
],
"text": ["macLaren", "maclaren"]
}
Response:
{
"tokens" : [
{
"token" : "mac",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "Laren",
"start_offset" : 3,
"end_offset" : 8,
"type" : "word",
"position" : 1
},
{
"token" : "maclaren",
"start_offset" : 9,
"end_offset" : 17,
"type" : "word",
"position" : 102
}
]
}
So I think one option is to configure the word_delimiter option split_on_case_change to false (see the parameters documentation).
PS: remember to remove the settings you added before (see the comments), since with those settings your query_string query would apply only to the name field, which does not exist.
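That change could be sketched as follows (an untested sketch: the custom filter name my_word_delimiter is an assumption, and the rest of the analyzer is copied from the question):

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "split_on_case_change": false
        }
      },
      "analyzer": {
        "filename_index": {
          "tokenizer": "filename",
          "filter": [
            "my_word_delimiter",
            "lowercase",
            "russian_stop",
            "russian_keywords",
            "russian_stemmer",
            "czech_stop",
            "czech_keywords",
            "czech_stemmer"
          ]
        }
      }
    }
  }
}
```

With split_on_case_change disabled, McLaren is no longer split at the case transition, so after the lowercase filter both the indexed value and the search term mclaren should produce the same single token.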