Handling the dot in ElasticSearch

I have a string property named summary whose analyzer is set to trigrams and whose search_analyzer is set to words.

"filter": {
    "words_splitter": {
        "type": "word_delimiter",
        "preserve_original": "true"
    },
    "english_words_filter": {
        "type": "stop",
        "stopwords": "_english_"
    },
    "trigrams_filter": {
        "type": "ngram",
        "min_gram": "2",
        "max_gram": "20"
    }
},
"analyzer": {
    "words": {
        "filter": [
            "lowercase",
            "words_splitter",
            "english_words_filter"
        ],
        "type": "custom",
        "tokenizer": "whitespace"
    },
    "trigrams": {
        "filter": [
            "lowercase",
            "words_splitter",
            "trigrams_filter",
            "english_words_filter"
        ],
        "type": "custom",
        "tokenizer": "whitespace"
    }
}
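
For context, the field mapping described at the top could be sketched as the dictionary below, written in Python so it can be serialized into a request body. The "text" field type is an assumption (older Elasticsearch versions used "string", as in the question), and the variable name is made up for illustration.

```python
import json

# Hypothetical mapping for the summary field: trigrams at index time,
# words at search time. "text" assumes a recent Elasticsearch version;
# older releases used "string" for analyzed fields.
summary_mapping = {
    "properties": {
        "summary": {
            "type": "text",
            "analyzer": "trigrams",
            "search_analyzer": "words",
        }
    }
}

print(json.dumps(summary_mapping, indent=2))
```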

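I need to match query strings given as input, such as React and HTML (or React, html), against documents whose summary contains the words React, reactjs, react.js, html, or html5. The more matching keywords a document has, the higher its score should be (ideally, documents in which only one word matches, and not even 100%, would score lower).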
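
One way to get this kind of ranking, as a sketch rather than a tuned solution, is a plain match query: with its default OR operator, documents that match more of the query terms generally score higher. The helper name below is hypothetical.

```python
import json

def build_summary_query(user_input):
    """Build a match query on the summary field (the default operator is OR)."""
    return {"query": {"match": {"summary": {"query": user_input}}}}

# Request body for a search such as "React, html".
body = build_summary_query("React, html")
print(json.dumps(body))
```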

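The problem is that, I guess, react.js is currently split into react and js, because I also get every document that contains js. On the other hand, Reactjs returns nothing. I also think words_splitter is needed in order to ignore the comma.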
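
The behaviour described here can be reproduced with a small Python model of the two analyzers. This is only an approximation written for illustration, not Lucene's actual implementation: the stop word list is a tiny subset, and split_words ignores details such as case-change splitting.

```python
import re

# Tiny stand-in for the _english_ stop word list (illustration only).
STOP_WORDS = {"a", "an", "and", "is", "or", "the"}

def split_words(token):
    """Approximate word_delimiter with preserve_original on a lowercased token:
    keep the token itself and add its letter and digit runs."""
    parts = re.findall(r"[a-z]+|[0-9]+", token)
    return [token] + [p for p in parts if p != token]

def ngrams(token, min_gram=2, max_gram=20):
    """All substrings of the token with length between min_gram and max_gram."""
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }

def search_terms(text):
    """Mimic the words analyzer: whitespace, lowercase, split, stop words."""
    terms = set()
    for token in text.lower().split():
        terms.update(split_words(token))
    return terms - STOP_WORDS

def index_terms(text):
    """Mimic the trigrams analyzer: same chain, with n-grams before stop words."""
    terms = set()
    for token in text.lower().split():
        for part in split_words(token):
            terms.update(ngrams(part))
    return terms - STOP_WORDS

indexed = index_terms("We build apps with react.js")
print(sorted(search_terms("react.js")))   # ['js', 'react', 'react.js']
print("reactjs" in indexed)               # False
print("js" in index_terms("plain js"))    # True
```

In this model a search for react.js also produces the term js, which matches any document containing js, while reactjs is never produced at index time, so a search for Reactjs finds nothing.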

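You can solve the problem with names like react.js by using a keyword marker token filter and defining an analyzer that uses it. This should keep react.js from being split into react and js tokens. Note that the analyzer below also switches to the standard tokenizer, which already emits react.js as a single token; the keyword marker then protects that token from later filters such as the stemmer.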

Here is an example configuration for the filter:

     "filter": {
        "keywords": {
           "type": "keyword_marker",
           "keywords": [
              "react.js"
           ]
        }
     }

And the analyzer:

     "analyzer": {
        "main_analyzer": {
           "type": "custom",
           "tokenizer": "standard",
           "filter": [
              "lowercase",
              "keywords",
              "synonym_filter",
              "german_stop",
              "german_stemmer"
           ]
        }
     }

You can use the analyze API to check whether your analyzer behaves as required:

GET /<index_name>/_analyze?analyzer=main_analyzer&text="react.js is a nice library"

This should return the following tokens, with react.js left unsplit:

{
   "tokens": [
      {
         "token": "react.js",
         "start_offset": 1,
         "end_offset": 9,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "is",
         "start_offset": 10,
         "end_offset": 12,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "a",
         "start_offset": 13,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "nice",
         "start_offset": 15,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "library",
         "start_offset": 20,
         "end_offset": 27,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}
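
Newer Elasticsearch versions expect the _analyze parameters in a JSON request body rather than in the query string. A minimal Python sketch of that form follows; the base URL and the index name sample_index are assumptions, and the helper name is hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request

def build_analyze_request(base_url, index_name, analyzer, text):
    """Return a POST request for the _analyze API with a JSON body."""
    payload = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_name}/_analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_analyze_request(
    "http://localhost:9200",       # assumed local cluster
    "sample_index",                # assumed index name
    "main_analyzer",
    "react.js is a nice library",
)

# To send it against a running cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```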

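For words that are similar but not identical, such as React.js and Reactjs, you can use a synonym filter. Do you have a fixed set of keywords that you want to match?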
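
As a rough illustration, the synonym_filter referenced by main_analyzer might be defined along these lines. The synonym groups are assumptions based on the keywords in the question, not a tested configuration, and the dictionary is written in Python so it can be serialized into the index settings.

```python
import json

# Hypothetical definition for the synonym_filter used by main_analyzer.
# Each string lists terms that should be treated as equivalent.
synonym_filter = {
    "synonym_filter": {
        "type": "synonym",
        "synonyms": [
            "react, reactjs, react.js",
            "html, html5",
        ],
    }
}

print(json.dumps(synonym_filter, indent=2))
```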

I found a solution.

Basically, I define the word_delimiter filter with catenate_all enabled:

"words_splitter": {
  "catenate_all": "true",
  "type": "word_delimiter",
  "preserve_original": "true"
}

I then give it to the words analyzer, which uses the keyword tokenizer:
"words": {
  "filter": [
      "words_splitter"
  ],
  "type": "custom",
  "tokenizer": "keyword"
}

Calling http://localhost:9200/sample_index/_analyze?analyzer=words&pretty=true&text=react.js, I get the following tokens:

{
  "tokens": [
    {
        "token": "react.js",
        "start_offset": 0,
        "end_offset": 8,
        "type": "word",
        "position": 0
    },
    {
        "token": "react",
        "start_offset": 0,
        "end_offset": 5,
        "type": "word",
        "position": 0
    },
    {
        "token": "reactjs",
        "start_offset": 0,
        "end_offset": 8,
        "type": "word",
        "position": 0
    },
    {
        "token": "js",
        "start_offset": 6,
        "end_offset": 8,
        "type": "word",
        "position": 1
    }
  ]
}
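
The effect of catenate_all can be illustrated with a rough Python model of the filter. This is a simplified sketch, not the real word_delimiter implementation: it does not model case-change splitting, and it returns tokens in a different order than Elasticsearch does.

```python
import re

def word_delimiter(token, catenate_all=True, preserve_original=True):
    """Rough model of word_delimiter: split on non-alphanumerics and on
    letter/digit boundaries, optionally keeping the original token and the
    concatenation of all parts."""
    parts = re.findall(r"[A-Za-z]+|[0-9]+", token)
    tokens = [token] if preserve_original else []
    tokens.extend(parts)
    if catenate_all and len(parts) > 1:
        tokens.append("".join(parts))
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(tokens))

print(word_delimiter("react.js"))
# ['react.js', 'react', 'js', 'reactjs']
print(word_delimiter("react.js", catenate_all=False))
# ['react.js', 'react', 'js']
```

With reactjs among the emitted tokens, a query for reactjs can match text that contains react.js, provided both sides are cased the same way.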