Elasticsearch + NEST: Use a token filter only for comparison, not in the analyzer's output

I want to build an analyzer in Elasticsearch that ignores the case of the input when matching, but returns the result with its case preserved.

This is my current state:

My NEST code that creates the analyzer:

{ "MySynonymFilter", new SynonymTokenFilter { SynonymsPath = "Path/SynonymFile.txt", Lenient = true} },

{
    "MySynonymizer", new CustomAnalyzer
    {
        Tokenizer = "whitespace",
        Filter = new List<string> {"lowercase", "MySynonymFilter"}
    }
},

This is what the analyzer created above looks like:

"MySynonymizer": {
    "filter": [
        "lowercase",
        "MySynonymFilter"
    ],
    "type": "custom",
    "tokenizer": "whitespace"
},

My synonym file ("Path/SynonymFile.txt"):

one, two, three, four => FIVE

Here are the actual and the desired results:

Example query:

localhost:port/index/_analyze
{
  "analyzer": "MySynonymizer",
  "text":      "Input"
}

Actual results:

Input: "one"              Output: ["five"]
Input: "One tWo THREE"    Output: ["five", "five", "five"]
Input: "one TWO foo"      Output: ["five", "five", "foo"]

Results after removing the lowercase filter:

Input: "one"              Output: ["FIVE"]
Input: "One tWo THREE"    Output: ["One", "tWo", "THREE"]
Input: "one TWO foo"      Output: ["FIVE", "TWO", "foo"]

Desired results:

Input: "one"              Output: ["FIVE"]
Input: "One tWo THREE"    Output: ["FIVE", "FIVE", "FIVE"]
Input: "one TWO foo"      Output: ["FIVE", "FIVE", "foo"]

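Note that the Analyze API analyzes your input text and returns tokens. These tokens are the output of the analyzer, not the final search output; they are what Elasticsearch uses to perform the actual search.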


What you want was possible in earlier versions of Elasticsearch, using the ignore_case parameter:

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "lenient": true,
                        "ignore_case": "true", // <-- deprecated
                        "synonyms" : ["one, two, three => FIVE"]
                    }
                }
            }
        }
    }
}

You can then analyze text without using the "lowercase" token filter:

GET /test_index/_analyze
{
  "tokenizer" : "whitespace",
  "filter" : ["synonym"] ,
  "text" : "One two three" // --> result: "FIVE", "FIVE", "FIVE"
}

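So your synonyms ignore case and the analyzer does not convert anything to lowercase... but ignore_case is deprecated. If you try this code, you will get the following message: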

Deprecation: The ignore_case option on the synonym_graph filter is deprecated. Instead, insert a lowercase filter in the filter chain before the synonym_graph filter.

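What you are trying to achieve is no longer possible (and that makes sense). If your search is case-sensitive, then your synonyms are case-sensitive too... If you want to ignore case, use the "lowercase" token filter...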
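For reference, here is a minimal sketch of the setup the deprecation message recommends: the same synonym filter without ignore_case, placed after a "lowercase" filter in a custom analyzer. The index name test_index is reused from the earlier example, and the analyzer name my_synonymizer is a hypothetical name chosen for illustration.

```json
PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "lenient": true,
                        "synonyms" : ["one, two, three => FIVE"]
                    }
                },
                "analyzer" : {
                    "my_synonymizer" : { // <-- hypothetical name
                        "type" : "custom",
                        "tokenizer" : "whitespace",
                        "filter" : ["lowercase", "synonym"] // lowercase must come before synonym
                    }
                }
            }
        }
    }
}

GET /test_index/_analyze
{
  "analyzer" : "my_synonymizer",
  "text" : "One tWo THREE"
}
```

This should behave like the "actual results" in the question: matching ignores case, but the tokens come back as "five" rather than "FIVE". As far as I know, that is because the synonym rules themselves are parsed with the filters that precede the synonym filter in the chain, so the replacement is lowercased as well.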