Elasticsearch跨域、边缘ngram分析器
Elastic search cross fields, edge ngram analyzer
我有 999 个文档用于试验弹性搜索。
我的类型映射中有一个字段 f4 已被分析,分析器具有以下设置:
"myNGramAnalyzer" => [
"type" => "custom",
"char_filter" => ["html_strip"],
"tokenizer" => "standard",
"filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"]
]
我的过滤器如下:
"filter" => [
"ngram_filter" => [
"type" => "edgeNGram",
"min_gram" => "2",
"max_gram" => "20"
]
]
我将字段 f4 的值设置为 "Proj1"、"Proj2"、"Proj3"......等等。
现在,当我尝试使用 "proj1" 字符串的交叉字段进行搜索时,我希望带有 "Proj1" 的文档以最高分 return 出现在响应的顶部.但事实并非如此。其余所有数据内容几乎相同。
我也不明白为什么它匹配所有999文件?
以下是我的搜索:
{
"index": "myindex",
"type": "mytype",
"body": {
"query": {
"multi_match": {
"query": "proj1",
"type": "cross_fields",
"operator": "and",
"fields": "f*"
}
},
"filter": {
"term": {
"deleted": "0"
}
}
}
}
我的搜索结果是:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 999,
"max_score": 1,
"hits": [{
"_index": "myindex",
"_type": "mytype",
"_id": "42",
"_score": 1,
"_source": {
"f1": "396","f2": "125650","f3": "BH.1511AI.001",
"f4": "Proj42",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
}, {
"_index": "myindex",
"_type": "mytype",
"_id": "47",
"_score": 1,
"_source": {
"f1": "396","f2": "137946","f3": "BH.152096.001",
"f4": "Proj47",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
},
//.......
//.......
//MANY RECORDS IN BETWEEN HERE
//.......
//.......
{
"_index": myindex,
"_type": "mytype",
"_id": "1",
"_score": 1,
"_source": {
"f1": "396","f2": "142095","f3": "BH.705215.001",
"f4": "Proj1",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
//.......
//.......
//MANY RECORDS IN BETWEEN HERE
//.......
//.......
}]
}
}
我做错了什么或遗漏了什么? (为冗长的问题道歉,但我想提供所有可能的信息,丢弃不必要的其他代码)。
已编辑:
词向量响应
{
"_index": "myindex",
"_type": "mytype",
"_id": "10",
"_version": 1,
"found": true,
"took": 9,
"term_vectors": {
"f4": {
"field_statistics": {
"sum_doc_freq": 5886,
"doc_count": 999,
"sum_ttf": 5886
},
"terms": {
"pr": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"pro": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj1": {
"doc_freq": 111,
"ttf": 111,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj10": {
"doc_freq": 11,
"ttf": 11,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
}
}
}
}
}
已编辑 2
字段 f4 的映射
"f4" : {
"type" : "string",
"index_analyzer" : "myNGramAnalyzer",
"search_analyzer" : "standard"
}
我已经更新为在查询时间使用标准分析器,这改进了结果,但仍然没有达到我的预期。
现在是 return 111 个文档,例如 "Proj1"、"Proj11"、"Proj111"......"Proj1",而不是 999(所有文档) , "Proj181".........等等
仍然 "Proj1" 介于结果之间而不是顶部。
没有index_analyzer
(至少Elasticsearch
1.7 版没有)。对于 mapping parameters,您可以使用 analyzer
和 search_analyzer
。
请尝试以下步骤以使其正常工作。
使用分析器设置创建 myindex:
PUT /myindex
{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"myNGramAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": "html_strip",
"filter": [
"lowercase",
"standard",
"asciifolding",
"stop",
"snowball",
"ngram_filter"
]
}
}
}
}
}
将映射添加到 mytype(为简短起见,我只是映射了相关字段):
PUT /myindex/_mapping/mytype
{
"properties": {
"f1": {
"type": "string"
},
"f4": {
"type": "string",
"analyzer": "myNGramAnalyzer",
"search_analyzer": "standard"
},
"deleted": {
"type": "string"
}
}
}
索引一些数据:
PUT myindex/mytype/1
{
"f1":"396",
"f4":"Proj12" ,
"deleted": "0"
}
PUT myindex/mytype/2
{
"f1":"42",
"f4":"Proj22" ,
"deleted": "1"
}
现在尝试查询:
GET myindex/mytype/_search
{
"query": {
"multi_match": {
"query": "proj1",
"type": "cross_fields",
"operator": "and",
"fields": "f*"
}
},
"filter": {
"term": {
"deleted": "0"
}
}
}
它应该 return 文档 #1
。它对我有用 Sense
。我正在使用 Elasticsearch 2.X
个版本。
希望我能帮上忙 :)
经过数小时的时间寻找解决方案,我终于成功了。
所以我保留了问题中提到的所有内容,在索引数据时使用 n gram 分析器。我唯一需要更改的是,将搜索查询中的 all
字段用作现有 multi-match
查询的 bool 查询。
现在我的搜索文本 Proj1
的结果会 return 我的结果是 Proj1
、Proj121
、Proj11
等顺序
虽然这与 return 的顺序不完全相同,例如 Proj1
、Proj11
、Proj121
等,但它仍然与我想要的结果非常相似。
我有 999 个文档用于试验弹性搜索。
我的类型映射中有一个字段 f4 已被分析,分析器具有以下设置:
"myNGramAnalyzer" => [
"type" => "custom",
"char_filter" => ["html_strip"],
"tokenizer" => "standard",
"filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"]
]
我的过滤器如下:
"filter" => [
"ngram_filter" => [
"type" => "edgeNGram",
"min_gram" => "2",
"max_gram" => "20"
]
]
我将字段 f4 的值设置为 "Proj1"、"Proj2"、"Proj3"......等等。
现在,当我尝试使用 "proj1" 字符串的交叉字段进行搜索时,我希望带有 "Proj1" 的文档以最高分 return 出现在响应的顶部.但事实并非如此。其余所有数据内容几乎相同。
我也不明白为什么它匹配所有999文件?
以下是我的搜索:
{
"index": "myindex",
"type": "mytype",
"body": {
"query": {
"multi_match": {
"query": "proj1",
"type": "cross_fields",
"operator": "and",
"fields": "f*"
}
},
"filter": {
"term": {
"deleted": "0"
}
}
}
}
我的搜索结果是:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 999,
"max_score": 1,
"hits": [{
"_index": "myindex",
"_type": "mytype",
"_id": "42",
"_score": 1,
"_source": {
"f1": "396","f2": "125650","f3": "BH.1511AI.001",
"f4": "Proj42",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
}, {
"_index": "myindex",
"_type": "mytype",
"_id": "47",
"_score": 1,
"_source": {
"f1": "396","f2": "137946","f3": "BH.152096.001",
"f4": "Proj47",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
},
//.......
//.......
//MANY RECORDS IN BETWEEN HERE
//.......
//.......
{
"_index": myindex,
"_type": "mytype",
"_id": "1",
"_score": 1,
"_source": {
"f1": "396","f2": "142095","f3": "BH.705215.001",
"f4": "Proj1",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
//.......
//.......
//MANY RECORDS IN BETWEEN HERE
//.......
//.......
}]
}
}
我做错了什么或遗漏了什么? (为冗长的问题道歉,但我想提供所有可能的信息,丢弃不必要的其他代码)。
已编辑:
词向量响应
{
"_index": "myindex",
"_type": "mytype",
"_id": "10",
"_version": 1,
"found": true,
"took": 9,
"term_vectors": {
"f4": {
"field_statistics": {
"sum_doc_freq": 5886,
"doc_count": 999,
"sum_ttf": 5886
},
"terms": {
"pr": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"pro": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj1": {
"doc_freq": 111,
"ttf": 111,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj10": {
"doc_freq": 11,
"ttf": 11,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
}
}
}
}
}
已编辑 2
字段 f4 的映射
"f4" : {
"type" : "string",
"index_analyzer" : "myNGramAnalyzer",
"search_analyzer" : "standard"
}
我已经更新为在查询时间使用标准分析器,这改进了结果,但仍然没有达到我的预期。
现在是 return 111 个文档,例如 "Proj1"、"Proj11"、"Proj111"......"Proj1",而不是 999(所有文档) , "Proj181".........等等
仍然 "Proj1" 介于结果之间而不是顶部。
没有index_analyzer
(至少Elasticsearch
1.7 版没有)。对于 mapping parameters,您可以使用 analyzer
和 search_analyzer
。
请尝试以下步骤以使其正常工作。
使用分析器设置创建 myindex:
PUT /myindex
{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"myNGramAnalyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": "html_strip",
"filter": [
"lowercase",
"standard",
"asciifolding",
"stop",
"snowball",
"ngram_filter"
]
}
}
}
}
}
将映射添加到 mytype(为简短起见,我只是映射了相关字段):
PUT /myindex/_mapping/mytype
{
"properties": {
"f1": {
"type": "string"
},
"f4": {
"type": "string",
"analyzer": "myNGramAnalyzer",
"search_analyzer": "standard"
},
"deleted": {
"type": "string"
}
}
}
索引一些数据:
PUT myindex/mytype/1
{
"f1":"396",
"f4":"Proj12" ,
"deleted": "0"
}
PUT myindex/mytype/2
{
"f1":"42",
"f4":"Proj22" ,
"deleted": "1"
}
现在尝试查询:
GET myindex/mytype/_search
{
"query": {
"multi_match": {
"query": "proj1",
"type": "cross_fields",
"operator": "and",
"fields": "f*"
}
},
"filter": {
"term": {
"deleted": "0"
}
}
}
它应该 return 文档 #1
。它对我有用 Sense
。我正在使用 Elasticsearch 2.X
个版本。
希望我能帮上忙 :)
经过数小时的时间寻找解决方案,我终于成功了。
所以我保留了问题中提到的所有内容,在索引数据时使用 n gram 分析器。我唯一需要更改的是,将搜索查询中的 all
字段用作现有 multi-match
查询的 bool 查询。
现在我的搜索文本 Proj1
的结果会 return 我的结果是 Proj1
、Proj121
、Proj11
等顺序
虽然这与 return 的顺序不完全相同,例如 Proj1
、Proj11
、Proj121
等,但它仍然与我想要的结果非常相似。