How do I select the top term buckets based on a rescore function in Elasticsearch?

Question

Consider the following query for Elasticsearch 5.6:

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "rescore": [
    {
      "window_size": 10000,
      "query": {
        "rescore_query": {
          "function_score": {
            "boost_mode": "replace",
            "script_score": {
              "script": {
                "source": "doc['topic_score'].value"
              }
            }
          }
        },
        "query_weight": 0,
        "rescore_query_weight": 1
      }
    }
  ],
  "aggs": {
    "distinct": {
      "terms": {
        "field": "identical_id",
        "order": {
          "top_score": "desc"
        }
      },
      "aggs": {
        "best_unique_result": {
          "top_hits": {
            "size": 1
          }
        },
        "top_score": {
          "max": {
            "script": {
              "inline": "_score"
            }
          }
        }
      }
    }
  }
}

This is a simplified version: the real query has a more complex main query, and the rescore function is much more intensive.

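Let me explain the purpose first, in case I'm about to spend 1000 hours developing a pen that can write in space when a pencil would actually solve my problem. I'm running a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this, I'd consider that an answer too.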

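I expected a query like this to order the results by the rescore query, group all results that share the same identical_id, and show the top hit for each such distinct group. I also assumed that, because I'm ordering the terms aggregation buckets by the max of the parent _score, they would be ordered to reflect the best result they contain, as determined by the rescore query.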

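What actually happens is that the terms buckets are ordered by the maximum query score, not the rescore query score. Strangely, the top hits within the buckets do seem to use the rescore.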

Is there a better way to achieve the end result I want, or is there some way to fix this query so that it works the way I expect?

Answer 1

From the documentation:

The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.

Since the rescore query only kicks in after the post_filter phase, I assume the terms aggregation buckets are already fixed by that point.
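The assumption above can be sketched as a toy model in Python. This is not Elasticsearch's implementation: the documents, the `query_phase` and `rescore_phase` helpers, and the `query_score` field are all hypothetical, and the model simply assumes that buckets are built while hits are collected and that rescoring afterwards only reorders the top window of hits.

```python
from collections import defaultdict

# Hypothetical documents: query_score stands in for the main query's _score,
# topic_score for the value the rescore script would return.
docs = [
    {"id": 1, "identical_id": "a", "query_score": 3.0, "topic_score": 0.1},
    {"id": 2, "identical_id": "b", "query_score": 2.0, "topic_score": 0.9},
    {"id": 3, "identical_id": "a", "query_score": 1.0, "topic_score": 0.5},
]

def query_phase(docs):
    """Collect hits and build per-term max-score buckets using query_score only."""
    buckets = defaultdict(lambda: float("-inf"))
    for doc in docs:
        key = doc["identical_id"]
        buckets[key] = max(buckets[key], doc["query_score"])
    hits = sorted(docs, key=lambda d: d["query_score"], reverse=True)
    return hits, dict(buckets)

def rescore_phase(hits, window_size):
    """Re-rank only the first window_size hits; the buckets are never revisited."""
    window = sorted(hits[:window_size], key=lambda d: d["topic_score"], reverse=True)
    return window + hits[window_size:]

hits, buckets = query_phase(docs)
rescored = rescore_phase(hits, window_size=10)
bucket_order = sorted(buckets, key=buckets.get, reverse=True)

print([d["id"] for d in rescored])  # hit order follows topic_score
print(bucket_order)                 # bucket order still follows query_score
```

Under this model the hits come back as 2, 3, 1, while the buckets stay ordered "a" then "b", which is the kind of mismatch described in the question.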

I don't know how to combine rescoring and aggregations. Sorry :(

Answer 2

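I think I have a pretty good solution to this problem, but I'll let the bounty run until it expires in case someone comes up with a better approach.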

{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10000
      },
      "aggs": {
        "distinct": {
          "terms": {
            "field": "identical_id",
            "order": {
              "top_score": "desc"
            }
          },
          "aggs": {
            "best_unique_result": {
              "top_hits": {
                "size": 1,
                "sort": [
                  {
                    "_script": {
                      "type": "number",
                      "script": {
                        "source": "doc['topic_score'].value"
                      },
                      "order": "desc"
                    }
                  }
                ]
              }
            },
            "top_score": {
              "max": {
                "script": {
                  "source": "doc['topic_score'].value"
                }
              }
            }
          }
        }
      }
    }
  }
}

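The sampler aggregation takes the top N hits per shard from the core query and runs its sub-aggregations on those hits only. Then, in the max aggregation that defines the bucket order, I use exactly the same script that I use to pick the top hit from each bucket. Now the buckets and the top hits are computed over the same top-N set of items, and the buckets are ordered by the maximum of the same score, produced by the same script. Unfortunately, I still need to run the script twice: once to order the buckets and once to pick the top hit within each bucket. You could use a rescore in place of the top_hits sort, but either way the script has to run twice, and I found it faster as a sort script than as a rescore.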
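As a rough illustration of what this aggregation tree computes, here is a minimal Python sketch. It assumes a single shard, and the `distinct_top_hits` function, the sample documents, and the `query_score` field are hypothetical names invented for the example. It only mirrors the logic (sample, group, pick the best per group, order groups); it says nothing about how Elasticsearch executes it.

```python
from collections import defaultdict

def distinct_top_hits(docs, shard_size, bucket_count=10):
    # sampler: keep only the shard_size best hits by the core query's score
    sample = sorted(docs, key=lambda d: d["query_score"], reverse=True)[:shard_size]

    # terms: group the sampled hits by identical_id
    groups = defaultdict(list)
    for doc in sample:
        groups[doc["identical_id"]].append(doc)

    # top_hits (size 1, sorted by topic_score) and max(topic_score) per bucket
    buckets = []
    for key, members in groups.items():
        best = max(members, key=lambda d: d["topic_score"])
        buckets.append({
            "key": key,
            "top_score": best["topic_score"],
            "best_unique_result": best,
        })

    # order: top_score desc; bucket_count mirrors the terms aggregation's default size of 10
    buckets.sort(key=lambda b: b["top_score"], reverse=True)
    return buckets[:bucket_count]

# Hypothetical documents; doc 4 has the best topic_score but the worst query_score.
docs = [
    {"id": 1, "identical_id": "a", "query_score": 3.0, "topic_score": 0.1},
    {"id": 2, "identical_id": "b", "query_score": 2.0, "topic_score": 0.9},
    {"id": 3, "identical_id": "a", "query_score": 1.5, "topic_score": 0.5},
    {"id": 4, "identical_id": "c", "query_score": 0.5, "topic_score": 1.0},
]

result = distinct_top_hits(docs, shard_size=3)
for bucket in result:
    print(bucket["key"], bucket["top_score"], bucket["best_unique_result"]["id"])
```

With `shard_size=3`, doc 4 never reaches the buckets even though it has the highest topic_score, so the sample size has to be large enough to cover the candidates you care about. In this sketch topic_score is a stored value read once per bucket, whereas in the query above the script is evaluated separately for the max aggregation and for the top_hits sort.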
