How to make Elasticsearch sort/prefer hits with exactly matching strings first

I'm using the default analyzer and index settings. Suppose I have this simple mapping:

"question": {
    "properties": {
        "title": {
            "type": "string"
        },
        "answer": {
            "properties": {
                "text": {
                    "type": "string"
                }
            }
        }
    }
}

(This is just an example. Apologies if it contains typos.)

Now I run the following search.

GET _search
{
    "query": {
        "query_string": {
            "query": "yes correct",
            "fields": ["answer.text"]
        }
    }
}

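The results rank a text value like "yes correct." (doc id 1) above the plain "yes correct" (no period, doc id 181). Both hits have the same score, but the hits array lists the one with the smaller doc ID first. I understand that the default index options include sorting by doc ID, so how can I exclude that attribute and still use the rest of the default options?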

I haven't set up any custom analyzers, so everything uses the Elasticsearch 2.0 defaults.

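Elasticsearch, or rather Lucene, scoring does not take the relative positions of tokens into account. It uses 3 different criteria to compute the score:

  1. Term frequency - how often the search term appears in the document.
  2. Inverse document frequency - how often the search term appears across the whole index. The more common the search term is, the less it counts in the search.
  3. Field-length normalization - the number of tokens present in the target field.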

You can read more about it here.
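To make the three factors concrete, here is a rough Python sketch of the classic Lucene TF-IDF formulas (the default similarity in the 2.x line). It is not Elasticsearch code: `simple_tokenize` is a hypothetical stand-in for the standard analyzer, and query normalization and the coordination factor are left out on purpose. It only illustrates why "yes correct." and "yes correct" end up with identical scores once the period is stripped at analysis time.

```python
import math
import re


def simple_tokenize(text):
    # Hypothetical stand-in for the standard analyzer:
    # lowercase and keep only runs of letters/digits, so punctuation is dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


def tf(freq):
    # Term frequency: square root of the number of occurrences in the field.
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    # Inverse document frequency: common terms get a lower weight.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_tokens):
    # Field-length normalization: shorter fields weigh more.
    return 1.0 / math.sqrt(num_tokens)


def score(query, doc_text, corpus):
    # Simplified score: sum over query terms of tf * idf^2 * norm.
    # queryNorm and coord are deliberately omitted in this sketch.
    doc_tokens = simple_tokenize(doc_text)
    corpus_tokens = [simple_tokenize(d) for d in corpus]
    total = 0.0
    for term in simple_tokenize(query):
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        doc_freq = sum(1 for tokens in corpus_tokens if term in tokens)
        weight = idf(doc_freq, len(corpus))
        total += tf(freq) * weight * weight * field_norm(len(doc_tokens))
    return total


corpus = ["yes correct.", "yes correct"]
score_with_dot = score("yes correct", corpus[0], corpus)
score_without_dot = score("yes correct", corpus[1], corpus)
```

Both documents produce the same tokens, so `score_with_dot` and `score_without_dot` are equal, and nothing in the score can tell the two apart.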

This is probably a use case for the Dis Max Query:

A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.
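In case the quoted definition is a bit dense, the combination rule can be sketched in a few lines of Python. `dis_max_score` is a hypothetical helper written for illustration, not part of any Elasticsearch client, and it ignores the query-level `boost`:

```python
def dis_max_score(sub_scores, tie_breaker=0.0):
    # sub_scores holds one entry per subquery; None means "did not match".
    matching = [s for s in sub_scores if s is not None]
    if not matching:
        return 0.0
    best = max(matching)
    # Best subquery score, plus a fraction of every other matching score.
    return best + tie_breaker * (sum(matching) - best)


# Document matched by both subqueries vs. only by the second one.
both = dis_max_score([0.5, 0.2], tie_breaker=0.7)
only_second = dis_max_score([None, 0.2], tie_breaker=0.7)
```

A document that matches both the exact subquery and the looser one therefore always outranks a document that only matches the looser one, as long as the scores are otherwise comparable.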

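So, next, you need to score exact matches on your answer separately and give them the highest boost. You have to use a custom analyzer for that. This would be your mapping: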

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "question": {
      "properties": {
        "title": {
          "type": "string"
        },
        "answer": {
          "type": "object",
          "properties": {
            "text": {
              "type": "string",
              "analyzer": "my_keyword",
              "fields": {
                "stemmed": {
                  "type": "string",
                  "analyzer": "standard"
                }
              }
            }
          }
        }
      }
    }
  }
}
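To see what the two analyzers in this mapping do to the text, here is a small Python approximation. Both functions are hypothetical stand-ins, not the real Lucene analyzers: `my_keyword_analyze` mimics keyword tokenizer + asciifolding + lowercase (folding approximated with Unicode decomposition), and `standard_analyze` is a crude word splitter. Note that the `stemmed` sub-field uses the standard analyzer, which does no actual stemming.

```python
import re
import unicodedata


def my_keyword_analyze(text):
    # Keyword tokenizer: the whole input stays one token.
    # asciifolding is approximated by dropping non-ASCII marks after NFKD.
    folded = unicodedata.normalize("NFKD", text)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    return [folded.lower()]


def standard_analyze(text):
    # Rough stand-in for the standard analyzer: lowercase word tokens,
    # punctuation such as the trailing period is dropped.
    return re.findall(r"\w+", text.lower())


keyword_with_dot = my_keyword_analyze("yes correct.")
keyword_without_dot = my_keyword_analyze("yes correct")
standard_with_dot = standard_analyze("yes correct.")
standard_without_dot = standard_analyze("yes correct")
```

On `answer.text` the two documents index different single tokens ("yes correct." vs "yes correct"), so only an exact input matches there, while on `answer.text.stemmed` both reduce to the same two tokens.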

Your test data:

PUT /test/question/1
{
  "title": "title nr1",
  "answer": [
    {
      "text": "yes correct."
    }
  ]
}

PUT /test/question/2
{
  "title": "title nr2",
  "answer": [
    {
      "text": "yes correct"
    }
  ]
}

Now, when you search for "yes correct." with a query like this:

POST /test/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.2,
      "queries": [
        {
          "match": {
            "answer.text": {
              "query": "yes correct.",
              "type": "phrase"
            }
          }
        },
        {
          "match": {
            "answer.text.stemmed": {
              "query": "yes correct.",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}

You get this output:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.37919715,
      "hits": [
         {
            "_index": "test",
            "_type": "question",
            "_id": "1",
            "_score": 0.37919715,
            "_source": {
               "title": "title nr1",
               "answer": [
                  {
                     "text": "yes correct."
                  }
               ]
            }
         },
         {
            "_index": "test",
            "_type": "question",
            "_id": "2",
            "_score": 0.11261705,
            "_source": {
               "title": "title nr2",
               "answer": [
                  {
                     "text": "yes correct"
                  }
               ]
            }
         }
      ]
   }
}

If you run the exact same query without the trailing dot, so that it becomes "yes correct", you get this result:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.37919715,
      "hits": [
         {
            "_index": "test",
            "_type": "question",
            "_id": "2",
            "_score": 0.37919715,
            "_source": {
               "title": "title nr2",
               "answer": [
                  {
                     "text": "yes correct"
                  }
               ]
            }
         },
         {
            "_index": "test",
            "_type": "question",
            "_id": "1",
            "_score": 0.11261705,
            "_source": {
               "title": "title nr1",
               "answer": [
                  {
                     "text": "yes correct."
                  }
               ]
            }
         }
      ]
   }
}

Hope this is what you're looking for.

By the way, I'd recommend always using the Match query when doing text search. From the documentation:

Comparison to query_string / field


The match family of queries does not go through a "query parsing" process. It does not support field name prefixes, wildcard characters, or other "advanced" features. For this reason, chances of it failing are very small / non existent, and it provides an excellent behavior when it comes to just analyze and run that text as a query behavior (which is usually what a text search box does). Also, the phrase_prefix type can provide a great "as you type" behavior to automatically load search results.