Handling the dot in ElasticSearch
I have a string property named summary, with analyzer set to trigrams and search_analyzer set to words.
"filter": {
    "words_splitter": {
        "type": "word_delimiter",
        "preserve_original": "true"
    },
    "english_words_filter": {
        "type": "stop",
        "stopwords": "_english_"
    },
    "trigrams_filter": {
        "type": "ngram",
        "min_gram": "2",
        "max_gram": "20"
    }
},
"analyzer": {
    "words": {
        "filter": [
            "lowercase",
            "words_splitter",
            "english_words_filter"
        ],
        "type": "custom",
        "tokenizer": "whitespace"
    },
    "trigrams": {
        "filter": [
            "lowercase",
            "words_splitter",
            "trigrams_filter",
            "english_words_filter"
        ],
        "type": "custom",
        "tokenizer": "whitespace"
    }
}
I need query strings given as input, such as React and HTML (or React, html), to match documents whose summary contains the words React, reactjs, react.js, html, or html5. The more matching keywords a document has, the higher its score should be (ideally, a document that matches only a single word, and not even at 100%, would score lower).
The problem is that, I guess, react.js is currently split into react and js, since I also get back every document that contains js. On the other hand, Reactjs returns nothing. I also believe I need words_splitter in order to ignore commas.
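For the scoring behavior described above, a plain match query with OR semantics already rewards documents that contain more of the query terms, since each matching term contributes to the score. A minimal sketch (the query structure is an illustration, not part of the original setup; sample_index is an assumed index name):

```json
GET /sample_index/_search
{
  "query": {
    "match": {
      "summary": {
        "query": "React and HTML",
        "operator": "or"
      }
    }
  }
}
```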
You can solve the problem with names like react.js by using the keyword_marker token filter and defining an analyzer that uses it. That prevents react.js from being split into react and js tokens.
Here is an example configuration for the filter:
"filter": {
    "keywords": {
        "type": "keyword_marker",
        "keywords": [
            "react.js"
        ]
    }
}
And the analyzer:
"analyzer": {
    "main_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
            "lowercase",
            "keywords",
            "synonym_filter",
            "german_stop",
            "german_stemmer"
        ]
    }
}
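Putting the two fragments together, the index settings could look like the sketch below (my_index is an example name; the synonym_filter, german_stop, and german_stemmer filters referenced above would need their own definitions and are omitted here):

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "keywords": {
          "type": "keyword_marker",
          "keywords": ["react.js"]
        }
      },
      "analyzer": {
        "main_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "keywords"]
        }
      }
    }
  }
}
```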
You can check whether your analyzer behaves as required with the analyze API:
GET /<index_name>/_analyze?analyzer=main_analyzer&text="react.js is a nice library"
This should return the following tokens, with react.js left untokenized:
{
    "tokens": [
        {
            "token": "react.js",
            "start_offset": 1,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "is",
            "start_offset": 10,
            "end_offset": 12,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "a",
            "start_offset": 13,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "nice",
            "start_offset": 15,
            "end_offset": 19,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "library",
            "start_offset": 20,
            "end_offset": 27,
            "type": "<ALPHANUM>",
            "position": 4
        }
    ]
}
For words that are similar but not identical, such as React.js and Reactjs, you could use a synonym filter. Do you have a fixed set of keywords you want to match?
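As a sketch of that idea, a synonym token filter could map the spelling variants onto a single token (the filter name and the exact variant list are assumptions for illustration):

```json
"filter": {
  "react_synonyms": {
    "type": "synonym",
    "synonyms": [
      "reactjs, react.js => react"
    ]
  }
}
```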
I found a solution.
Basically, I define the word_delimiter filter with catenate_all enabled:
"words_splitter": {
    "catenate_all": "true",
    "type": "word_delimiter",
    "preserve_original": "true"
}
and hand it to the words analyzer, which uses the keyword tokenizer:
"words": {
    "filter": [
        "words_splitter"
    ],
    "type": "custom",
    "tokenizer": "keyword"
}
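For completeness, the filter and analyzer could be combined into a single settings block when creating the index, along these lines (sample_index matches the analyze call below; the rest of the mapping is omitted):

```json
PUT /sample_index
{
  "settings": {
    "analysis": {
      "filter": {
        "words_splitter": {
          "type": "word_delimiter",
          "preserve_original": "true",
          "catenate_all": "true"
        }
      },
      "analyzer": {
        "words": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["words_splitter"]
        }
      }
    }
  }
}
```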
Calling http://localhost:9200/sample_index/_analyze?analyzer=words&pretty=true&text=react.js I get the following tokens:
{
    "tokens": [
        {
            "token": "react.js",
            "start_offset": 0,
            "end_offset": 8,
            "type": "word",
            "position": 0
        },
        {
            "token": "react",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 0
        },
        {
            "token": "reactjs",
            "start_offset": 0,
            "end_offset": 8,
            "type": "word",
            "position": 0
        },
        {
            "token": "js",
            "start_offset": 6,
            "end_offset": 8,
            "type": "word",
            "position": 1
        }
    ]
}