How to make Elasticsearch sort/prefer hits with exactly matching strings first
I am using the default analyzer and index settings. Suppose I have this simple mapping:
"question": {
  "properties": {
    "title": {
      "type": "string"
    },
    "answer": {
      "properties": {
        "text": {
          "type": "string"
        }
      }
    }
  }
}
(This is just an example. Apologies if it contains typos.)
Now I run the following search:
GET _search
{
  "query": {
    "query_string": {
      "query": "yes correct",
      "fields": ["answer.text"]
    }
  }
}
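The results rank a hit whose text value is "yes correct." (doc id 1) above a hit whose value is simply "yes correct" (no period, doc id 181). Both hits have the same score, but the hits array lists the one with the smaller doc ID first. I understand the default index options include sorting by doc ID, so how can I exclude that one attribute and still use the rest of the default options?
I have not set up any custom analyzers, so everything uses the Elasticsearch 2.0 defaults.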
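Elasticsearch, or rather Lucene, scoring does not take the relative position of tokens into account. It uses three criteria to compute the score:
- Term frequency - how often the search term appears in the document.
- Inverse document frequency - how often the search term appears across the whole index. The more common the term, the less it matters to the search.
- Field-length normalization - the number of tokens in the target field.
You can learn more here.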
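To make the three criteria concrete, here is a minimal Python sketch of the factors as Lucene's classic TF/IDF similarity defines them. The function names are my own, and the final product is a simplification: the real scoring formula also involves query normalization, coordination, boosts, and a lossy encoding of the field norm, so treat this as an illustration rather than a reproduction of Elasticsearch scores.

```python
import math

def tf(freq):
    """Term frequency factor: grows with the square root of the occurrence count."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency: terms found in many documents weigh less."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_tokens):
    """Field-length normalization: shorter fields weigh more."""
    return 1.0 / math.sqrt(num_tokens)

def term_weight(freq, doc_freq, num_docs, num_tokens):
    """Simplified product of the three factors (not the full Lucene formula)."""
    return tf(freq) * idf(doc_freq, num_docs) * field_norm(num_tokens)

# With the default analyzer, "yes correct." and "yes correct" both become the
# two tokens "yes" and "correct", so all three factors are identical for them.
with_period = term_weight(freq=1, doc_freq=2, num_docs=2, num_tokens=2)
without_period = term_weight(freq=1, doc_freq=2, num_docs=2, num_tokens=2)
print(with_period == without_period)
```

None of these factors looks at punctuation that the analyzer has already discarded, which is why the two documents in the question tie.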
This may be a use case for the Dis Max Query:
A query that generates the union of documents produced by its
subqueries, and that scores each document with the maximum score for
that document as produced by any subquery, plus a tie breaking
increment for any additional matching subqueries.
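The scoring rule in that description can be sketched in a few lines of Python. The function name and the use of None for a non-matching subquery are my own conventions for illustration; only the arithmetic (best score plus tie_breaker times the remaining scores) comes from the description above.

```python
def dis_max_score(sub_scores, tie_breaker=0.0):
    """Combine subquery scores the way a dis_max query does.

    sub_scores holds one entry per subquery; None means that subquery
    did not match the document (a convention of this sketch).
    """
    matching = [s for s in sub_scores if s is not None]
    if not matching:
        return None  # the document is not part of the union at all
    best = max(matching)
    # Every other matching subquery contributes a fraction of its score.
    return best + tie_breaker * (sum(matching) - best)

# A document matched by both subqueries beats one matched by a single subquery.
print(dis_max_score([1.0, 0.5], tie_breaker=0.7))
print(dis_max_score([None, 0.5], tie_breaker=0.7))
```

A document that matches both the exact subquery and the loose subquery therefore scores higher than one that matches only the loose subquery.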
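So, next, you need exact matches on your answer to score highest and receive the biggest boost. You must use a custom analyzer for that. Here is your mapping: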
PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "question": {
      "properties": {
        "title": {
          "type": "string"
        },
        "answer": {
          "type": "object",
          "properties": {
            "text": {
              "type": "string",
              "analyzer": "my_keyword",
              "fields": {
                "stemmed": {
                  "type": "string",
                  "analyzer": "standard"
                }
              }
            }
          }
        }
      }
    }
  }
}
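To see why this mapping separates the two documents, the following Python sketch imitates what the two analyzers emit. It is only an approximation under my own assumptions: the function names are hypothetical, asciifolding is imitated by stripping combining marks, and the standard analyzer is imitated with a simple regular expression, whereas the real tokenizer follows Unicode text segmentation rules.

```python
import re
import unicodedata

def ascii_fold(text):
    """Rough stand-in for the asciifolding filter: drop combining accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def keyword_analyze(text):
    """Like my_keyword: the whole input stays one token, folded and lowercased."""
    return [ascii_fold(text).lower()]

def standard_like_analyze(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens only."""
    return re.findall(r"\w+", text.lower())

for value in ("yes correct.", "yes correct"):
    print(value, keyword_analyze(value), standard_like_analyze(value))
```

Under the keyword-based analyzer the trailing period stays part of the single token, so the two values are different terms. Under the standard-like analyzer both values reduce to the same two tokens.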
Your test data:
PUT /test/question/1
{
  "title": "title nr1",
  "answer": [
    {
      "text": "yes correct."
    }
  ]
}
PUT /test/question/2
{
  "title": "title nr2",
  "answer": [
    {
      "text": "yes correct"
    }
  ]
}
Now, when you search for "yes correct." with a query like this:
POST /test/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.2,
      "queries": [
        {
          "match": {
            "answer.text": {
              "query": "yes correct.",
              "type": "phrase"
            }
          }
        },
        {
          "match": {
            "answer.text.stemmed": {
              "query": "yes correct.",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}
You get this output:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.37919715,
    "hits": [
      {
        "_index": "test",
        "_type": "question",
        "_id": "1",
        "_score": 0.37919715,
        "_source": {
          "title": "title nr1",
          "answer": [
            {
              "text": "yes correct."
            }
          ]
        }
      },
      {
        "_index": "test",
        "_type": "question",
        "_id": "2",
        "_score": 0.11261705,
        "_source": {
          "title": "title nr2",
          "answer": [
            {
              "text": "yes correct"
            }
          ]
        }
      }
    ]
  }
}
If you run exactly the same query without the trailing dot, so that the query text becomes "yes correct", you get this result:
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.37919715,
    "hits": [
      {
        "_index": "test",
        "_type": "question",
        "_id": "2",
        "_score": 0.37919715,
        "_source": {
          "title": "title nr2",
          "answer": [
            {
              "text": "yes correct"
            }
          ]
        }
      },
      {
        "_index": "test",
        "_type": "question",
        "_id": "1",
        "_score": 0.11261705,
        "_source": {
          "title": "title nr1",
          "answer": [
            {
              "text": "yes correct."
            }
          ]
        }
      }
    ]
  }
}
Hope this is what you are looking for.
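Since the two searches above differ only in the query text, the request body can be produced by a small helper. This is a sketch under my own assumptions: the function name build_exact_first_query is hypothetical, and the body simply mirrors the Elasticsearch 2.0 query shown above, with the default tie_breaker and boost taken from it.

```python
import json

def build_exact_first_query(user_text, tie_breaker=0.7, boost=1.2):
    """Build the dis_max request body from the answer for any input text."""
    return {
        "query": {
            "dis_max": {
                "tie_breaker": tie_breaker,
                "boost": boost,
                "queries": [
                    # Exact subquery against the keyword-analyzed field.
                    {"match": {"answer.text": {
                        "query": user_text, "type": "phrase"}}},
                    # Looser subquery against the standard-analyzed sub-field.
                    {"match": {"answer.text.stemmed": {
                        "query": user_text, "operator": "and"}}},
                ],
            }
        }
    }

body = build_exact_first_query("yes correct")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of POST /test/_search with whatever HTTP client you already use.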
By the way, I would recommend always using the Match query when performing text search. From the documentation:
Comparison to query_string / field
The match family of queries
does not go through a "query parsing" process. It does not support
field name prefixes, wildcard characters, or other "advanced"
features. For this reason, chances of it failing are very small / non
existent, and it provides an excellent behavior when it comes to just
analyze and run that text as a query behavior (which is usually what a
text search box does). Also, the phrase_prefix type can provide a
great "as you type" behavior to automatically load search results.