Elasticsearch path_hierarchy tokenizes only half of the path

Question

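I am trying to index paths using the path_hierarchy tokenizer, but it seems to tokenize only half of the path I provide. I have tried different paths, and the result is the same.

My settings are: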

{
    "settings" : { 
        "number_of_shards" : 5,
        "number_of_replicas" : 0,
        "analysis":{
            "analyzer":{
                "keylower":{
                    "type": "custom",
                    "tokenizer":"keyword",
                    "filter":"lowercase"
                },
                "path_analyzer": {
                    "type": "custom",
                    "tokenizer": "path_tokenizer",
                    "filter": [ "lowercase", "asciifolding", "path_ngrams" ]
                },
                "code_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "asciifolding", "code_stemmer" ]
                },
                "not_analyzed": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [ "lowercase", "asciifolding", "code_stemmer" ]
                }
            },
            "tokenizer": {
                "path_tokenizer": {
                  "type": "path_hierarchy"
                }
            },
            "filter": {
                "path_ngrams": {
                    "type": "edgeNGram",
                    "min_gram": 3,
                    "max_gram": 15
                },
                "code_stemmer": {
                    "type": "stemmer",
                    "name": "minimal_english"
                }
            }
        }
    }
}

My mapping is as follows:

{
  "dynamic": "strict",
  "properties": {
    "depot_path": {
      "type": "string",
      "analyzer": "path_analyzer"
    }
  },
  "_all": {
      "store": "yes",
      "analyzer": "english"
  }
}

When I run the analyzer with "//cm/mirror/v1.2/Kolkata/ixin-packages/builds/" as depot_path, I find that the following tokens are produced:

               "key": "//c",
               "key": "//cm",
               "key": "//cm/",
               "key": "//cm/m",
               "key": "//cm/mi",
               "key": "//cm/mir",
               "key": "//cm/mirr",
               "key": "//cm/mirro",
               "key": "//cm/mirror",
               "key": "//cm/mirror/",
               "key": "//cm/mirror/v",
               "key": "//cm/mirror/v1",
               "key": "//cm/mirror/v1.",

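Why isn't the whole path tokenized?

My expected output is tokens all the way up to //cm/mirror/v1.2/Kolkata/ixin-packages/builds/

I have tried increasing the buffer size, but with no luck. Does anyone know what I am doing wrong?

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html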

Answer 1

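"max_gram": 15 limits the token size to 15 characters. If you increase "max_gram", you will see that more of the path gets tokenized.

Here is an example from my environment.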

"max_gram": 15
    input path: /var/log/www/html/web/
    path_analyzer tokenized this path up to /var/log/www/ht, i.e. 15 characters

"max_gram": 100
    input path: /var/log/www/html/web/WANTED
    path_analyzer tokenized this path up to /var/log/www/html/web/WANTED, i.e. 28 characters (< 100)
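
The truncation in the example above can be reproduced outside Elasticsearch with a rough model of the analyzer chain. This is only a sketch: the function names path_hierarchy, edge_ngrams and analyze are made up for illustration, and the code approximates the tokenizer and the edge n-gram filter rather than calling or reimplementing Lucene. It assumes that the filter emits prefixes between min_gram and max_gram characters long and never the full token when the token is longer than max_gram.

```python
def path_hierarchy(path, delimiter="/"):
    """Simplified model of the path_hierarchy tokenizer: one token per path level."""
    tokens = []
    for i, ch in enumerate(path):
        # Emit the prefix that ends just before each delimiter (skip the empty prefix).
        if ch == delimiter and i > 0:
            tokens.append(path[:i])
    tokens.append(path)  # the full input is the last token
    return tokens


def edge_ngrams(token, min_gram, max_gram):
    """Simplified model of the edge n-gram filter: prefixes of min_gram..max_gram chars."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(path, min_gram=3, max_gram=15):
    """Tokenize, apply the edge n-gram filter, and return the unique grams by length."""
    seen = []
    for token in path_hierarchy(path):
        for gram in edge_ngrams(token, min_gram, max_gram):
            if gram not in seen:
                seen.append(gram)
    return sorted(seen, key=len)


# Longest gram with the settings from the question (max_gram = 15).
print(analyze("/var/log/www/html/web/")[-1])                      # /var/log/www/ht
# Longest gram once max_gram is larger than the path length.
print(analyze("/var/log/www/html/web/WANTED", max_gram=100)[-1])  # /var/log/www/html/web/WANTED
```

In this model the tokenizer does produce a token for every level of the path; it is the filter that cuts each of them down to at most 15 characters afterwards, which is why the output looks as if only half of the path was tokenized.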

Answer 2

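This is because you have set "max_gram" to 15. As a result, you will notice that the longest token generated ("//cm/mirror/v1.") is 15 characters long. Change it to a much larger number and you will get the tokens you want.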
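
As a sketch of that change, the snippet below builds the "path_ngrams" filter definition from the question with a larger "max_gram" and prints it as JSON that could be pasted into the "filter" section of the settings. The value 256 (MAX_PATH_LENGTH) is a hypothetical upper bound chosen for illustration; pick one that is at least as long as the longest path you expect to index.

```python
import json

# Hypothetical upper bound on path length; not a value from the question.
MAX_PATH_LENGTH = 256

path_ngrams_filter = {
    "path_ngrams": {
        "type": "edgeNGram",          # same filter type as in the question
        "min_gram": 3,                # unchanged
        "max_gram": MAX_PATH_LENGTH,  # raised from 15 so long paths are not cut off
    }
}

# Print the fragment in the same JSON form as the settings above.
print(json.dumps(path_ngrams_filter, indent=4))
```

Keep in mind that a larger "max_gram" means more grams per token, so the index will grow accordingly.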
