Elasticsearch - at&t and procter&gamble cases

By default, Elasticsearch with the english analyzer splits at&t into the tokens at and t, and then removes at as a stopword:

POST _analyze
{
  "analyzer": "english", 
  "text": "A word AT&T Procter&Gamble"
}

So the tokens look like this:

{
  "tokens" : [
    {
      "token" : "word",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "t",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "procter",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "gambl",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}

I want to be able to match at&t exactly, to search procter&gamble exactly as well, and at the same time to be able to search for just procter, for example.

So I want to build an analyzer that creates two tokens, at&t and t, for the string at&t, and the tokens procter, gambl, and procter&gamble for procter&gamble.

Is there a way to create such an analyzer? Or should I create two index fields instead - one with the regular english analyzer, and another analyzed like English except for the tokenization on &?
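For the two-field option, something like the following multi-field mapping is what I have in mind (a sketch only; my_index and name are placeholder names, and the built-in whitespace analyzer stands in here for an "English without &-tokenization" analyzer):

PUT my_index
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "whitespace"
          }
        }
      }
    }
  }
}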

You can tokenize on whitespace and use a word delimiter filter to create the tokens for at&t. Index settings:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_with_acronymns": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "acronymns"
          ]
        }
      },
      "filter": {
        "acronymns": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        }
      }
    }
  }
}
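To index with this analyzer, wire it to a field in the mapping. A minimal sketch, assuming an index named my_index and a text field named name:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_with_acronymns": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "acronymns"
          ]
        }
      },
      "filter": {
        "acronymns": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "whitespace_with_acronymns"
      }
    }
  }
}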

Tokens (run _analyze against an index created with these settings; my_index is a placeholder name):

POST my_index/_analyze
{
  "analyzer": "whitespace_with_acronymns", 
  "text": "A word AT&T Procter&Gamble"
}

Result: at&t is tokenized into at, t, and att, so you can search with at, t, or at&t (at query time, at&t is itself analyzed to att among others, which matches the catenated token).

{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "word",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "at",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "att",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "t",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "procter",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "proctergamble",
      "start_offset" : 12,
      "end_offset" : 26,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "gamble",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "word",
      "position" : 5
    }
  ]
}
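A quick end-to-end check (a sketch, assuming the my_index/name mapping above; ?refresh makes the document searchable immediately). Both searches should match, since AT&T is analyzed to at, t, and att at query time, and procter is indexed as its own token:

PUT my_index/_doc/1?refresh
{
  "name": "A word AT&T Procter&Gamble"
}

GET my_index/_search
{
  "query": {
    "match": {
      "name": "AT&T"
    }
  }
}

GET my_index/_search
{
  "query": {
    "match": {
      "name": "procter"
    }
  }
}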

If you want to remove the stopword "at", you can add the stop token filter, together with the rest of the english analyzer's usual filter chain:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_with_acronymns": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "acronymns",
            "english_possessive_stemmer",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      },
      "filter": {
        "acronymns": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        },
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_" 
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["example"] 
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      }
    }
  }
}
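As a sanity check, re-run _analyze with this extended chain (again a sketch; my_index is assumed to be created with the settings above). The standalone at token should now be dropped by english_stop, while the catenated att survives, and gamble/proctergamble should come out stemmed as gambl/proctergambl:

POST my_index/_analyze
{
  "analyzer": "whitespace_with_acronymns",
  "text": "A word AT&T Procter&Gamble"
}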