How to make Elasticsearch sort/prefer hits with exactly matching strings first

Question

I am using the default analyzer and default indexing. Say I have this simple mapping:

"question": {
    "properties": {
        "title": {
            "type": "string"
        },
        "answer": {
            "properties": {
                "text": {
                    "type": "string"
                }
            }
        }
    }
}

(This is just an example. Apologies if it contains typos.)

Now I run the following search.

GET _search
{
    "query": {
        "query_string": {
            "query": "yes correct",
            "fields": ["answer.text"]
        }
    }
}

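The results rank a hit whose text value is "yes correct." (doc id 1) above a hit that is simply "yes correct" (no period, doc id 181). Both hits have the same score, but the hits array lists the one with the lower doc ID first. I understand that the default indexing options include sorting by doc ID, so how can I exclude that one behavior and still use the rest of the defaults?

I have not set up any custom analyzers, so everything uses the Elasticsearch 2.0 defaults.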

Answer 1

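Elasticsearch, or more precisely Lucene, does not take the relative positions of tokens into account when scoring. It computes the score from three criteria:

Term frequency - how often the search term occurs in the document.
Inverse document frequency - how often the search term occurs across the whole index. The more common the search term, the less it counts in the search.
Field-length norm - the number of tokens in the target field.

You can read more about this here.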
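To make the three criteria concrete, here is a minimal sketch of a simplified classic TF-IDF score in Python. It is an illustration under stated assumptions, not Lucene's actual implementation: the `analyze` function is only a rough stand-in for the standard analyzer, the function names are hypothetical, and query normalization, the coordination factor, and Lucene's lossy norm encoding are all left out. It shows why "yes correct." and "yes correct" tie: after analysis both fields hold the same two tokens.

```python
import math
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens, punctuation dropped."""
    return re.findall(r"\w+", text.lower())


def classic_score(query, field_text, all_field_texts):
    """Sum tf * idf^2 * norm over the query terms (simplified classic TF-IDF)."""
    doc_tokens = analyze(field_text)
    if not doc_tokens:
        return 0.0
    corpus = [analyze(t) for t in all_field_texts]
    # Field-length norm: shorter fields weigh more.
    norm = 1.0 / math.sqrt(len(doc_tokens))
    score = 0.0
    for term in analyze(query):
        freq = doc_tokens.count(term)
        if freq == 0:
            continue
        doc_freq = sum(1 for tokens in corpus if term in tokens)
        tf = math.sqrt(freq)                                # term frequency
        idf = 1.0 + math.log(len(corpus) / (doc_freq + 1))  # inverse document frequency
        score += tf * idf ** 2 * norm
    return score


# Hypothetical three-document corpus; the first two mirror the question.
docs = ["yes correct.", "yes correct", "no, that is not correct"]
with_period = classic_score("yes correct", docs[0], docs)
without_period = classic_score("yes correct", docs[1], docs)
print(with_period == without_period)  # True: both fields analyze to ["yes", "correct"]
```

Because none of the three inputs can tell the two documents apart, the tie has to be broken by something else, which is what the question observes.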

Answer 2

This could be a use case for the Dis Max Query:

A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.

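So next, you need exact matches on your answer text to score highest and receive the biggest boost. You have to use a custom analyzer for that. This would be your mapping: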

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "question": {
      "properties": {
        "title": {
          "type": "string"
        },
        "answer": {
          "type": "object",
          "properties": {
            "text": {
              "type": "string",
              "analyzer": "my_keyword",
              "fields": {
                "stemmed": {
                  "type": "string",
                  "analyzer": "standard"
                }
              }
            }
          }
        }
      }
    }
  }
}
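As a rough illustration of what this mapping indexes, the Python sketch below approximates the two analyzers. The function names `my_keyword` and `standard` are hypothetical stand-ins, and the Unicode handling only approximates the real `asciifolding` filter. Note that, despite its name, the `stemmed` sub-field uses the standard analyzer, which tokenizes and lowercases but does not stem.

```python
import re
import unicodedata


def my_keyword(text):
    """Approximate the custom analyzer: a single token, ASCII-folded and lowercased."""
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return [folded.lower()]


def standard(text):
    """Approximate the standard analyzer: lowercase word tokens, punctuation dropped."""
    return re.findall(r"\w+", text.lower())


for text in ["yes correct.", "yes correct"]:
    print(repr(text), my_keyword(text), standard(text))
# 'yes correct.' ['yes correct.'] ['yes', 'correct']
# 'yes correct' ['yes correct'] ['yes', 'correct']
```

The keyword-tokenized field keeps the trailing period, so it can distinguish the two documents, while the standard-analyzed sub-field still matches both.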

Your test data:

PUT /test/question/1
{
  "title": "title nr1",
  "answer": [
    {
      "text": "yes correct."
    }
  ]
}

PUT /test/question/2
{
  "title": "title nr2",
  "answer": [
    {
      "text": "yes correct"
    }
  ]
}

Now, when you search for "yes correct." with a query like this:

POST /test/_search
{
  "query": {
    "dis_max": {
      "tie_breaker": 0.7,
      "boost": 1.2,
      "queries": [
        {
          "match": {
            "answer.text": {
              "query": "yes correct.",
              "type": "phrase"
            }
          }
        },
        {
          "match": {
            "answer.text.stemmed": {
              "query": "yes correct.",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}
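The way `dis_max` combines the two sub-queries follows the description quoted earlier: the best sub-query score, plus `tie_breaker` times the scores of the other matching sub-queries. Here is a minimal Python sketch of that arithmetic. The function name and the input scores are hypothetical, and query normalization is ignored, so it will not reproduce the exact `_score` values Elasticsearch reports.

```python
def dis_max_score(subquery_scores, tie_breaker=0.0, boost=1.0):
    """Best sub-query score plus tie_breaker times the rest, scaled by boost.

    A score of None means that sub-query did not match the document.
    """
    matching = [s for s in subquery_scores if s is not None]
    if not matching:
        return 0.0
    best = max(matching)
    others = sum(matching) - best
    return boost * (best + tie_breaker * others)


# Hypothetical sub-query scores: [exact phrase on answer.text, match on answer.text.stemmed]
exact_and_loose = dis_max_score([1.0, 0.5], tie_breaker=0.7, boost=1.2)
loose_only = dis_max_score([None, 0.5], tie_breaker=0.7, boost=1.2)
print(exact_and_loose > loose_only)  # True
```

A document that matches the exact-phrase clause as well as the looser clause ends up ahead of one that matches only the looser clause, which is the ordering the question asks for.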

You get this output:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.37919715,
      "hits": [
         {
            "_index": "test",
            "_type": "question",
            "_id": "1",
            "_score": 0.37919715,
            "_source": {
               "title": "title nr1",
               "answer": [
                  {
                     "text": "yes correct."
                  }
               ]
            }
         },
         {
            "_index": "test",
            "_type": "question",
            "_id": "2",
            "_score": 0.11261705,
            "_source": {
               "title": "title nr2",
               "answer": [
                  {
                     "text": "yes correct"
                  }
               ]
            }
         }
      ]
   }
}

If you run the exact same query without the trailing dot, so that the query text becomes "yes correct", you get this result:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.37919715,
      "hits": [
         {
            "_index": "test",
            "_type": "question",
            "_id": "2",
            "_score": 0.37919715,
            "_source": {
               "title": "title nr2",
               "answer": [
                  {
                     "text": "yes correct"
                  }
               ]
            }
         },
         {
            "_index": "test",
            "_type": "question",
            "_id": "1",
            "_score": 0.11261705,
            "_source": {
               "title": "title nr1",
               "answer": [
                  {
                     "text": "yes correct."
                  }
               ]
            }
         }
      ]
   }
}
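Both runs use the same request body with only the query text changed, so the body can be generated from the search string. The helper below is a small sketch in Python; the function name `exact_first_query` is hypothetical, and it only builds the JSON body rather than talking to a cluster. It keeps the `"type": "phrase"` form used in this answer for Elasticsearch 2.0; later versions express the same thing with a separate `match_phrase` query.

```python
import json


def exact_first_query(text, tie_breaker=0.7, boost=1.2):
    """Build the dis_max request body from this answer for an arbitrary search string."""
    return {
        "query": {
            "dis_max": {
                "tie_breaker": tie_breaker,
                "boost": boost,
                "queries": [
                    # Exact match against the keyword-tokenized field.
                    {"match": {"answer.text": {"query": text, "type": "phrase"}}},
                    # Looser match against the standard-analyzed sub-field.
                    {"match": {"answer.text.stemmed": {"query": text, "operator": "and"}}},
                ],
            }
        }
    }


# Print the body to send with POST /test/_search.
print(json.dumps(exact_first_query("yes correct"), indent=2))
```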

Hope this is what you are looking for.

By the way, I would recommend always using the Match query when doing text search. From the documentation:

Comparison to query_string / field
The match family of queries does not go through a "query parsing" process. It does not support field name prefixes, wildcard characters, or other "advanced" features. For this reason, chances of it failing are very small / non existent, and it provides an excellent behavior when it comes to just analyze and run that text as a query behavior (which is usually what a text search box does). Also, the phrase_prefix type can provide a great "as you type" behavior to automatically load search results.

ruby-on-rails

ruby-on-rails-3

elasticsearch