Getting data without matching the full string in an Elasticsearch query

Question

My data is stored in Elasticsearch in the following format:

        {
            "_index": "wallet",
            "_type": "wallet",
            "_id": "5dfcbe0a6ca963f84470d852",
            "_score": 0.69321066,
            "_source": {
                "email": "test20011@gmail.com",
                "wallet": "test20011@operatorqa2.akeodev.com",
                "countryCode": "+91",
                "phone": "7916318809",
                "name": "test20011"
            }
        },
        {
            "_index": "wallet",
            "_type": "wallet",
            "_id": "5dfcbe0a6ca9634d1c70d856",
            "_score": 0.69321066,
            "_source": {
                "email": "test50011@gmail.com",
                "wallet": "test50011@operatorqa2.akeodev.com",
                "countryCode": "+91",
                "phone": "3483330496",
                "name": "test50011"
            }
        },
        {
            "_index": "wallet",
            "_type": "wallet",
            "_id": "5dfcbe0a6ca96304b370d857",
            "_score": 0.69321066,
            "_source": {
                "email": "test110021@gmail.com",
                "wallet": "test110021@operatorqa2.akeodev.com",
                "countryCode": "+91",
                "phone": "2744697207",
                "name": "test110021"
            }
        }

With the query below, no record should be found:

{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "wallet": {
                            "query": "operatorqa2.akeodev.com",
                            "operator": "and"
                        }
                    }
                },
                {
                    "match": {
                        "email": {
                            "query": "operatorqa2.akeodev.com",
                            "operator": "and"
                        }
                    }
                }
            ]
        }
    }
}

With the query below, the record should be found:

{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "wallet": {
                            "query": "test20011@operatorqa2.akeodev.com",
                            "operator": "and"
                        }
                    }
                },
                {
                    "match": {
                        "email": {
                            "query": "test20011@operatorqa2.akeodev.com",
                            "operator": "and"
                        }
                    }
                }
            ]
        }
    }
}

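I have indexed the email and wallet fields. When a user searches by email or wallet, I cannot tell whether the string they send is an email address or a wallet address, so I use a bool query with should clauses on both fields.

A record should be found only when the user sends the full email address or the full wallet address. Please help me find a solution.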

Answer 1

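Since I don't have the mapping for your index, I am assuming you are using the Elasticsearch defaults (you can retrieve the mapping with the mapping API). In that case the wallet and email fields are defined as text, and the default analyzer is the standard analyzer.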

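This analyzer does not recognize these strings as email addresses and creates three tokens for test50011@operatorqa2.akeodev.com. Your partial query operatorqa2.akeodev.com is split the same way into operatorqa2 and akeodev.com, and both of those tokens exist in the indexed wallet value, so the match succeeds even with the and operator. You can check the tokens with the analyze API: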

http://localhost:9200/_analyze?text=test50011@operatorqa2.akeodev.com&tokenizer=standard

{
  "tokens": [
    {
      "token": "test50011",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "operatorqa2",
      "start_offset": 10,
      "end_offset": 21,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "akeodev.com",
      "start_offset": 22,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
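
Newer Elasticsearch versions may reject the query-string form of the analyze call shown above; as far as I know, the parameters then go in the request body instead. A minimal sketch of the same check, assuming it is sent as a POST to http://localhost:9200/_analyze:

```json
{
  "tokenizer": "standard",
  "text": "test50011@operatorqa2.akeodev.com"
}
```

Swapping the tokenizer value lets you compare how different tokenizers split the same address.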

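What you need here is a custom analyzer for email addresses built on the UAX URL Email tokenizer (uax_url_email), which is meant for fields that hold email addresses. It generates a single, correct token for test50011@operatorqa2.akeodev.com, as shown below: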

http://localhost:9200/_analyze?text=test50011@operatorqa2.akeodev.com&tokenizer=uax_url_email

{
  "tokens": [
    {
      "token": "test50011@operatorqa2.akeodev.com",
      "start_offset": 0,
      "end_offset": 33,
      "type": "<EMAIL>",
      "position": 1
    }
  ]
}

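As you can see, it does not split test50011@operatorqa2.akeodev.com. When you search with the same full string, the query produces the same single token, and Elasticsearch matches token against token. A partial string such as operatorqa2.akeodev.com no longer matches, because the index contains only the full address as a token.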

Let me know if you need any help; it is very simple to set up and use.
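
For illustration, here is a minimal sketch of index settings and a mapping that wire up such an analyzer. Several things in it are assumptions rather than part of the setup described above: the analyzer name example_email_analyzer is made up, the lowercase filter is my own addition so that matching ignores case, and applying the analyzer to wallet as well as email is a guess based on the sample data. The layout is the typeless mapping format of Elasticsearch 7 and later; the _type in your documents suggests an older version, where properties would be nested under the type name (wallet).

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "example_email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "email": {
        "type": "text",
        "analyzer": "example_email_analyzer"
      },
      "wallet": {
        "type": "text",
        "analyzer": "example_email_analyzer"
      }
    }
  }
}
```

The analyzer of an existing field cannot be changed in place, so this body would be used when creating a new index, followed by reindexing the data into it.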

Answer 2

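As other community members have mentioned, when asking this kind of question you should state which Elasticsearch version you are using and provide your mapping.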

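From Elasticsearch version 5 onward, with the default mapping, you only need to change your query to target the exact version of the field instead of the analyzed one. By default, Elasticsearch maps strings to a multi-field of type text (analyzed, for full-text search) and keyword (not analyzed, for exact-match search). In your query you would target the <fieldname>.keyword fields: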

{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "wallet.keyword": "test20011@operatorqa2.akeodev.com"
                    }
                },
                {
                    "match": {
                        "email.keyword": "test20011@operatorqa2.akeodev.com"
                    }
                }
            ]
        }
    }
}
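
For reference, the .keyword sub-fields come from the default dynamic mapping for strings in Elasticsearch 5 and later. Here is a sketch of what the mapping API would typically report for the email field, assuming the index was created without an explicit mapping:

```json
{
  "email": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
```

If your mapping shows no keyword sub-field, the query above will find nothing, and the field has to be remapped first.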

If you are using an Elasticsearch version earlier than 5, change the index property of the field from analyzed to not_analyzed and reindex your data.

Mapping snippet:

{
  "email": {
    "type": "string",
    "index": "not_analyzed"
  }
}

Your query still does not need the and operator. It looks the same as the query I posted above, except that you query the email and wallet fields instead of email.keyword and wallet.keyword.
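
Spelled out, the pre-5 query might look like the sketch below. It assumes both email and wallet have been remapped as not_analyzed and the data has been reindexed; the search string is the one from the question.

```json
{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "wallet": "test20011@operatorqa2.akeodev.com"
                    }
                },
                {
                    "match": {
                        "email": "test20011@operatorqa2.akeodev.com"
                    }
                }
            ]
        }
    }
}
```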

I can recommend the following Elastic blog post on this topic: Strings are dead, long live strings!
