Does Elasticsearch score different length shingles with the same IDF?
In Elasticsearch 5.6.5, I'm searching against a field that has the following filter applied:
"filter_shingle": {
    "max_shingle_size": "4",
    "min_shingle_size": "2",
    "output_unigrams": "true",
    "type": "shingle"
}
When I search for "depreciation tax" against a document containing that exact text, I see the following score explanation:
weight(Synonym(content:depreciation content:depreciation tax)) .... [7.65]
weight(content:tax) ... [6.02]
If I change the search to "depreciation taffy" against the exact same document, whose content contains "depreciation tax", I get this explanation:
weight(Synonym(content:depreciation content:depreciation taffy)) .... [7.64]
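This is not what I expected. I thought a match on the bigram token "depreciation tax" would score much higher than a match on the unigram "depreciation". However, this scoring seems to reflect a simple unigram match. The difference is extremely small, and digging further shows it comes from termFreq=28 under the "depreciation taffy" match versus termFreq=29 under the "depreciation tax" match. I'm also not sure how that relates, since I'd imagine that across the shard holding this document the counts for "depreciation", "depreciation tax" and "depreciation taffy" are very different.
Is this expected behavior? Does ES treat all the different shingle sizes, including unigrams, with the same IDF value? Do I need to split each shingle size into its own sub-field with a different analyzer to get the behavior I expect?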
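TL;DR
Shingles and synonyms are broken in Elastic/Lucene, and a lot of hacks are needed until a fix is released (accurate as of ES 6).
- Put unigrams, bigrams and so on in individual sub-fields and search them separately, combining the scores into an overall match. Don't use a single shingle filter that emits multiple n-gram sizes on one field.
- Don't combine synonym and shingle filters on the same field.
In my case, I do a must match with synonyms on a unigram field, followed by a series of should matches, without synonyms, that boost the score for shingles of each size.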
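A minimal sketch of that query shape, built as a Python dict so it can be inspected before sending. The sub-field names (content.shingle2 and so on) and the boost values are hypothetical placeholders, not taken from my actual index; only the must-plus-should structure reflects the approach described above.

```python
import json


def build_query(text, shingle_sizes=(2, 3, 4)):
    """Bool query: required match on the unigram field, optional boosts per shingle size."""
    should = [
        # Hypothetical sub-field names and boosts; larger shingles get a larger boost here.
        {"match": {f"content.shingle{n}": {"query": text, "boost": float(n)}}}
        for n in shingle_sizes
    ]
    return {
        "query": {
            "bool": {
                # The unigram field (where synonyms would be applied) must match.
                "must": [{"match": {"content": {"query": text}}}],
                # Each shingle sub-field only adds to the score when it matches.
                "should": should,
            }
        }
    }


print(json.dumps(build_query("depreciation tax"), indent=2))
```

Because the shingle clauses sit under should, a document that matches only on unigrams is still returned, just with a lower score.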
Details
I got an answer on the Elastic support forums:
https://discuss.elastic.co/t/does-elasticsearch-score-different-length-shingles-with-the-same-idf/126653/2
Yep, this is mostly expected.
It's not really the shingles causing the scoring oddness, but the fact
that SynonymQueries do the frequency blending behavior that you're
seeing. They use frequency of the original token for all the
subsequent 'synonym' tokens, as a way to help prevent skewing the
score results. Synonyms are often relatively rare, and would
drastically affect the scoring if they each used their individual
df's.
From the Lucene docs:
For scoring purposes, this query tries to score the terms as if you
had indexed them as one term: it will match any of the terms but only
invoke the similarity a single time, scoring the sum of all term
frequencies for the document.
The SynonymQuery also sets the docFrequency to the maximum
docFrequency of the terms in the document. So for example, if
"depreciation" df == 5, "depreciation tax" df == 2 and "depreciation taffy" df == 1,
it will use 5 as the docFrequency for scoring purposes.
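To make the blending concrete, here is a rough sketch of the arithmetic, assuming the BM25 IDF formula ln(1 + (N - df + 0.5) / (df + 0.5)). The document count of 1000 and the per-term frequencies inside the document are made-up values for illustration; only the df values 5, 2 and 1 come from the example above.

```python
import math

DOC_COUNT = 1000  # assumption: total documents in the shard
DOC_FREQ = {"depreciation": 5, "depreciation tax": 2, "depreciation taffy": 1}
# Hypothetical term frequencies inside one document, chosen so the sums are 29 and 28.
TF_IN_DOC = {"depreciation": 28, "depreciation tax": 1}


def bm25_idf(doc_freq, doc_count):
    """BM25 inverse document frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def blended_idf(terms):
    """One IDF for the whole group, computed from the largest docFreq among the terms."""
    return bm25_idf(max(DOC_FREQ[t] for t in terms), DOC_COUNT)


def summed_tf(terms):
    """Term frequencies of all terms in the group are added together."""
    return sum(TF_IN_DOC.get(t, 0) for t in terms)


for term, df in DOC_FREQ.items():
    print(f"individual idf({term!r}) = {bm25_idf(df, DOC_COUNT):.3f}")

for group in (["depreciation", "depreciation tax"], ["depreciation", "depreciation taffy"]):
    print(group, "blended idf =", round(blended_idf(group), 3), "summed tf =", summed_tf(group))
```

Under these assumptions both groups get the same IDF, the one for "depreciation", and differ only in the summed term frequency. That would be consistent with the near-identical scores and the termFreq=29 versus termFreq=28 seen in the question.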
The bigger issue is that Lucene doesn't have a way to differentiate
shingles from synonyms... they both use tokens that overlap the
position of other tokens in the token stream. So if unigrams are mixed
with bi-(or larger)-grams, Lucene is tricked into thinking it's
actually a synonym situation.
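A toy illustration of the overlap, not Lucene's actual ShingleFilter: it emits (position, token) pairs the way a shingle filter with output_unigrams enabled would, with every shingle sharing the position of its first word. The function name and the whitespace tokenization are assumptions made for the sketch.

```python
def shingle_stream(text, min_size=2, max_size=4, output_unigrams=True):
    """Return (position, token) pairs; a shingle sits at the position of its first word."""
    words = text.split()
    tokens = []
    for pos in range(len(words)):
        if output_unigrams:
            tokens.append((pos, words[pos]))
        for size in range(min_size, max_size + 1):
            # Only emit shingles that fit entirely inside the text.
            if pos + size <= len(words):
                tokens.append((pos, " ".join(words[pos:pos + size])))
    return tokens


print(shingle_stream("depreciation tax"))
# [(0, 'depreciation'), (0, 'depreciation tax'), (1, 'tax')]
```

Position 0 holds both "depreciation" and "depreciation tax", which is the same shape a synonym pair would have. That lines up with the explain output in the question: one Synonym(...) clause for position 0 and a plain term clause for "tax".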
The fix is to keep your unigrams and bi-plus-grams in different
fields. That way Lucene won't attempt to use SynonymQueries in these
situations, because the positions won't be overlapping anymore.
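One way this separation could look, sketched as index settings and mappings built in Python. The filter, analyzer and sub-field names, the "doc" mapping type and the choice of the standard tokenizer are all assumptions for illustration. The point is only that each sub-field uses a shingle filter with min_shingle_size equal to max_shingle_size and output_unigrams disabled, so no field mixes token sizes.

```python
import json


def shingle_index_body(sizes=(2, 3, 4)):
    """Index body with a unigram field plus one sub-field per shingle size."""
    filters, analyzers, sub_fields = {}, {}, {}
    for n in sizes:
        # One fixed-size shingle filter per sub-field, with unigrams switched off.
        filters[f"filter_shingle{n}"] = {
            "type": "shingle",
            "min_shingle_size": n,
            "max_shingle_size": n,
            "output_unigrams": False,
        }
        analyzers[f"analyzer_shingle{n}"] = {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", f"filter_shingle{n}"],
        }
        sub_fields[f"shingle{n}"] = {"type": "text", "analyzer": f"analyzer_shingle{n}"}
    return {
        "settings": {"analysis": {"filter": filters, "analyzer": analyzers}},
        "mappings": {
            "doc": {  # hypothetical mapping type name (5.x/6.x style mappings)
                "properties": {
                    # The main field holds unigrams only.
                    "content": {"type": "text", "analyzer": "standard", "fields": sub_fields}
                }
            }
        },
    }


print(json.dumps(shingle_index_body(), indent=2))
```

With this layout the shingle sub-fields would be addressed as content.shingle2, content.shingle3 and content.shingle4 at query time.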
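Here is another related question I asked, about how actual synonyms break when combined with shingles: https://discuss.elastic.co/t/es-5-4-synonyms-and-shingles-dont-seem-to-work-together/127552
Elastic/Lucene expands the synonyms, injects them into the token stream, and then creates the shingles. For example, the query "econ supply and demand" becomes econ, economics, supply, demand, while the document "... econ foo ..." becomes econ, foo. The query now yields the shingle "econ economics", which somehow matches the document. I don't know why: I only applied synonyms to the query, not to the document, so I don't see where the match comes from. The way the shingles are created from the query is also wrong.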
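A deliberately naive model of that pipeline, assuming synonyms are simply flattened into the token list before bigrams are built. This is a toy, not Lucene's real synonym or shingle filter, and the synonym table and stopword set are invented for the example. It only shows how a shingle like "econ economics" can arise when the shingler treats injected synonyms as ordinary neighbouring tokens.

```python
SYNONYMS = {"econ": ["economics"]}  # hypothetical query-time synonym table
STOPWORDS = {"and"}                 # assumption: "and" is removed as a stopword


def expand(text):
    """Drop stopwords and inject each synonym right after its source token."""
    out = []
    for word in text.split():
        if word in STOPWORDS:
            continue
        out.append(word)
        out.extend(SYNONYMS.get(word, []))
    return out


def bigrams(tokens):
    """Shingle adjacent tokens, ignoring that some of them are alternatives."""
    return [" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


query_tokens = expand("econ supply and demand")
print(query_tokens)           # ['econ', 'economics', 'supply', 'demand']
print(bigrams(query_tokens))  # ['econ economics', 'economics supply', 'supply demand']
```

In this toy model the bigram "econ supply", which is what the user actually typed, never appears, while "econ economics" pairs a word with its own synonym.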
This is a known problem, and it is still not fully resolved. A number
of Lucene filters can't consume graphs as their inputs.
There is currently active work being done on developing a fixed
shingle filter, and also an idea to have a sub-field for indexing
shingles.