How to query a phrase with stopwords in ElasticSearch

Question

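I'm indexing some text with stopwords enabled, and I'd like to search it with a "match phrase" query, but it looks like the removed stopwords are still counted in the term positions.

Creating the index: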

PUT /fr_articles
{
   "settings": {
      "analysis": {
         "analyzer": {
            "stop": {
               "type": "standard",
               "stopwords" : ["the"]
            }
         }
      }
   },
   "mappings": {
      "test": {
         "properties": {
            "title": {
               "type": "string",
               "analyzer": "stop"
            }
         }
      }
   }
}

Adding a document:

POST /fr_articles/test/1
{
    "title" : "Tom the king of Toulon!"
}

Searching:

POST /fr_articles/_search
{
   "fields": [
      "title"
   ],
   "explain": true,
   "query": {
      "match": {
         "title": {
            "query": "tom king",
            "type" : "phrase"
         }
      }
   }
}

Nothing found ;-(

Is there a way to solve this? Multiple span queries might work, but I'd like the terms to stay close to each other.

Thanks,

Answer 1

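There is an option, enable_position_increments: false, that you could set, for example on the stop filter, but it has been deprecated since Lucene 4.4.

Here is the related Lucene issue: https://issues.apache.org/jira/browse/LUCENE-4065

In other words, the best approach for now is probably to use the slop option until the Lucene issue is resolved.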
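As a minimal sketch of the slop approach, the request body below is the question's own search with a slop added, sent to POST /fr_articles/_search as before. The value 1 is an assumption: it allows for the single position left empty by the removed stopword "the" between "tom" and "king". A larger value tolerates more removed words but also loosens the phrase match.

```json
{
   "query": {
      "match": {
         "title": {
            "query": "tom king",
            "type": "phrase",
            "slop": 1
         }
      }
   }
}
```

With this slop, the indexed title "Tom the king of Toulon!" should match, since "king" sits only one position further from "tom" than the query phrase expects.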

Answer 2

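Position increments are causing this problem, yes. The stopwords may be gone and unsearchable, but removing them does not push the surrounding terms together, so the query "tom the king" will match neither "tom king" nor "such that tom will not be their king".

Generally, when you remove something during analysis with a filter, it isn't as if it had never existed. The purpose of StopFilter in particular is to eliminate search hits produced by uninteresting terms, not to change the structure of the document or its sentences.

You used to be able to disable position increments on a StopFilter, but that option was removed as of Lucene 4.4.

Okay, forget that CharFilter foolishness. It's an ugly hack; don't do that.

To query without using position increments, you need to configure that in the query parser rather than in analysis. In elasticsearch this can be done with a Query String Query, with enable_position_increments set to false.

Something like: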

{
    "query_string" : {
        "default_field" : "title",
        "query" : "\"tom king\"",
        "enable_position_increments" : false
    }
}

As a point of interest, a similar solution exists in raw Lucene: set QueryParser.setEnablePositionIncrements.


lucene

full-text-search

elasticsearch