Elasticsearch mapping with numeric tokens

Question

I have the mapping below, and it works fine:

{
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "0",
      "analysis": {
        "filter": {
          "stemmer_plural_portugues": {
            "name": "minimal_portuguese",
            "stopwords": ["http", "https", "ftp", "www"],
            "type": "stemmer"
          },
          "synonym_filter": {
            "type": "synonym",
            "lenient": true,
            "synonyms_path": "analysis/synonym.txt",
            "updateable": true
          },
          "shingle_filter": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3
          }
        },
        "analyzer": {
          "analyzer_customizado": {
            "filter": [
              "lowercase",
              "stemmer_plural_portugues",
              "asciifolding",
              "synonym_filter",
              "shingle_filter"
            ],
            "tokenizer": "lowercase"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "long"
      },
      "data": {
        "type": "date"
      },
      "quebrado": {
        "type": "byte"
      },
      "pgrk": {
        "type": "integer"
      },
      "url_length": {
        "type": "integer"
      },
      "title": {
        "analyzer": "analyzer_customizado",
        "type": "text",
        "fields": {
          "keyword": {
            "ignore_above": 256,
            "type": "keyword"
          }
        }
      },
      "description": {
        "analyzer": "analyzer_customizado",
        "type": "text",
        "fields": {
          "keyword": {
            "ignore_above": 256,
            "type": "keyword"
          }
        }
      },
      "url": {
        "analyzer": "analyzer_customizado",
        "type": "text",
        "fields": {
          "keyword": {
            "ignore_above": 256,
            "type": "keyword"
          }
        }
      }
    }
  }
}

I index the document below:

{
    "title": "rocket 1960",
    "description": "space",
    "url": "www.nasa.com"
}

If I run the query below with the AND operator, it finds the document as expected, because all the searched terms are present in the document.

{
  "from": 0,
  "size": 10,
  "query": {
    "multi_match": {
      "query": "space nasa rocket",
      "type": "cross_fields",
      "fields": [
        "title",
        "description",
        "url"
      ],
      "operator": "and"
    }
  }
}

But if I also include "1960" in the search, as in the query below, nothing is returned.

{
  "from": 0,
  "size": 10,
  "query": {
    "multi_match": {
      "query": "1960 space nasa rocket",
      "type": "cross_fields",
      "fields": [
        "title",
        "description",
        "url"
      ],
      "operator": "and"
    }
  }
}

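I found that my "lowercase" tokenizer does not generate tokens for numbers. So I changed the tokenizer to "standard", and the numeric token 1960 was generated.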
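The dropped number can be checked with the `_analyze` API. As a minimal sketch, assuming the body below is sent to `POST /_analyze` and using the sample title from the document above, the response should contain only the token `rocket`, because the `lowercase` tokenizer splits on every non-letter character and digits are not letters:

```json
{
  "tokenizer": "lowercase",
  "text": "rocket 1960"
}
```

Replacing `"lowercase"` with `"standard"` in the same request should return both `rocket` and `1960`.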

But the query still finds nothing, because in the url field the link www.nasa.com no longer generates the tokens "www nasa com"; the generated token is the entire link www.nasa.com.
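The URL behaviour can be confirmed the same way. This is only a sketch of such a check, again assuming the body is sent to `POST /_analyze`: the `standard` tokenizer does not treat a dot between letters as a word boundary, so a single `www.nasa.com` token is expected.

```json
{
  "tokenizer": "standard",
  "text": "www.nasa.com"
}
```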

The query only works if I type the full URL www.nasa.com, as shown below:

{
  "from": 0,
  "size": 10,
  "query": {
    "multi_match": {
      "query": "1960 space www.nasa.com rocket",
      "type": "cross_fields",
      "fields": [
        "title",
        "description",
        "url"
      ],
      "operator": "and"
    }
  }
}

If I define a separate analyzer with the "lowercase" tokenizer just for the url field, the link www.nasa.com is again split into the separate tokens "www nasa com".

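But then the query below finds nothing, because the url field has a different tokenizer from the title and description fields. The query below only works when I use the OR operator, but I need the AND operator.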

{
  "from": 0,
  "size": 10,
  "query": {
    "multi_match": {
      "query": "1960 space nasa rocket",
      "type": "cross_fields",
      "fields": [
        "title",
        "description",
        "url"
      ],
      "operator": "and"
    }
  }
}

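I cannot use ngrams in my mapping because I use the phrase suggester; with ngrams, the generated suggestions contain hundreds of tokens, which produces errors in the suggestions.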

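Does anyone know a solution for my mapping that generates numeric tokens in my title and description fields, keeps splitting website links in my url field into several tokens ("www nasa com") instead of one token for the whole link ("www.nasa.com"), and lets my query search all fields together with the AND operator?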
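One way to express the tokenization asked for here, offered as an untested sketch and not as part of the original post, is to replace the `lowercase` tokenizer with a `pattern` tokenizer that splits on anything that is neither a letter nor a digit. The tokenizer name `letters_and_digits` is made up for illustration. The synonym filter and the stemmer's `stopwords` entry are left out to keep the example short, and the stemmer is written with the `language` parameter. With this analyzer, `rocket 1960` should produce `rocket` and `1960`, and `www.nasa.com` should produce `www`, `nasa` and `com` (plus the shingles built from them), while all three fields keep sharing one analyzer.

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "letters_and_digits": {
          "type": "pattern",
          "pattern": "[^\\p{L}\\p{N}]+"
        }
      },
      "filter": {
        "stemmer_plural_portugues": {
          "type": "stemmer",
          "language": "minimal_portuguese"
        },
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        }
      },
      "analyzer": {
        "analyzer_customizado": {
          "tokenizer": "letters_and_digits",
          "filter": [
            "lowercase",
            "stemmer_plural_portugues",
            "asciifolding",
            "shingle_filter"
          ]
        }
      }
    }
  }
}
```

The pattern uses Unicode classes instead of the default `\W+` so that accented Portuguese letters are not treated as separators.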

Answer 1

You wrote: "If I also put "1960" in the search, the query below does not return anything."

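In the index mapping below, I removed synonym_filter. After removing it and indexing the sample document, I ran the same search query you mentioned in the question and got the desired result.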

Index mapping:

{
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "0",
      "analysis": {
        "filter": {
          "stemmer_plural_portugues": {
            "name": "minimal_portuguese",
            "stopwords": [
              "http",
              "https",
              "ftp",
              "www"
            ],
            "type": "stemmer"
          },
          "shingle_filter": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3
          }
        },
        "analyzer": {
          "analyzer_customizado": {
            "filter": [
              "lowercase",
              "stemmer_plural_portugues",
              "asciifolding",
              "shingle_filter"
            ],
            "tokenizer": "lowercase"
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "long"
      },
      "data": {
        "type": "date"
      },
      "quebrado": {
        "type": "byte"
      },
      "pgrk": {
        "type": "integer"
      },
      "url_length": {
        "type": "integer"
      },
      "title": {
        "analyzer": "analyzer_customizado",
        "type": "text",
        "fields": {
          "keyword": {
            "ignore_above": 256,
            "type": "keyword"
          }
        }
      },
      "description": {
        "analyzer": "analyzer_customizado",
        "type": "text",
        "fields": {
          "keyword": {
            "ignore_above": 256,
            "type": "keyword"
          }
        }
      },
      "url": {
        "analyzer": "analyzer_customizado",
        "type": "text",
        "fields": {
          "keyword": {
            "ignore_above": 256,
            "type": "keyword"
          }
        }
      }
    }
  }
}

Search query:

{
  "from": 0,
  "size": 10,
  "query": {
    "multi_match": {
      "query": "1960 space nasa rocket",
      "type": "cross_fields",
      "fields": [
        "title",
        "description",
        "url"
      ],
      "operator": "and"
    }
  }
}

Search result:

"hits": [
        {
            "_index": "my-index",
            "_type": "_doc",
            "_id": "1",
            "_score": 0.9370217,
            "_source": {
                "title": "rocket 1960",
                "description": "space",
                "url": "www.nasa.com"
            }
        }
    ]

As @Gibbs mentioned, I think there is some problem with synonym_filter, so it would be better if you shared synonym.txt; otherwise, the search query runs perfectly.

Update 1: (including synonym_filter)

If you want to include the synonym token filter, keep the index mapping the same as yours and make just one change in it:

"synonym_filter": {
  "type": "synonym",
  "lenient": true,
  "synonyms_path": "analysis/synonym.txt",
  "updateable": false    --> set this to false
},

You set your synonym filter to "updateable", presumably because you want to change synonyms without having to close and reopen the index, using the reload API instead. Updateable synonyms restrict the analyzer they are used in to search time only.

For the full explanation, refer to this ES discussion.

Using the same search query as above (after making the change in the mapping), you will get the result you want.

But if you still want to set "updateable" : true, you can refer to the official documentation of the Reload search analyzers API.
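The search-time restriction described above is commonly handled by giving a field two analyzers: one without the synonym filter for indexing and one with it for searching. The following is a minimal sketch under that assumption, not a tested configuration. The analyzer names `index_time_analyzer` and `search_time_analyzer` are hypothetical, only the `title` field is shown, and the stemmer and shingle filters from the question are omitted for brevity.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "lenient": true,
          "synonyms_path": "analysis/synonym.txt",
          "updateable": true
        }
      },
      "analyzer": {
        "index_time_analyzer": {
          "tokenizer": "lowercase",
          "filter": ["asciifolding"]
        },
        "search_time_analyzer": {
          "tokenizer": "lowercase",
          "filter": ["asciifolding", "synonym_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "index_time_analyzer",
        "search_analyzer": "search_time_analyzer"
      }
    }
  }
}
```

With a mapping along these lines, editing analysis/synonym.txt on each node and then sending `POST /my-index/_reload_search_analyzers` (with `my-index` being the index name from the search result above) should pick up the new synonyms without closing the index.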
