Elasticsearch - AT&T and Procter&Gamble cases
By default, Elasticsearch with the english analyzer splits at&t into the tokens at and t, and then removes at as a stopword.
POST _analyze
{
  "analyzer": "english",
  "text": "A word AT&T Procter&Gamble"
}
The tokens therefore look like this:
{
  "tokens" : [
    {
      "token" : "word",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "t",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "procter",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "gambl",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
I want to be able to match at&t exactly, and at the same time to be able to search for procter&gamble exactly, and also to search for just procter, for example.

So I want to build an analyzer that creates the two tokens at&t and t for the string at&t, and the tokens procter, gambl, and procter&gamble for procter&gamble.

Is there a way to create such an analyzer? Or should I create 2 index fields - one with the regular english analyzer, and another with English analysis except for tokenization on &?
Mapping: you can tokenize on whitespace and use the word delimiter filter to create the tokens for at&t:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_with_acronymns": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "acronymns"
          ]
        }
      },
      "filter": {
        "acronymns": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        }
      }
    }
  }
}
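For completeness, here is a minimal sketch of creating an index with these settings and attaching the analyzer to a text field; the index name my_index and the field name company are placeholders, not part of the original answer:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_with_acronymns": {
          "tokenizer": "whitespace",
          "filter": ["lowercase", "acronymns"]
        }
      },
      "filter": {
        "acronymns": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "analyzer": "whitespace_with_acronymns"
      }
    }
  }
}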
Tokens (the custom analyzer is defined in the index settings, so run _analyze against that index, e.g. the placeholder my_index above):
POST my_index/_analyze
{
  "analyzer": "whitespace_with_acronymns",
  "text": "A word AT&T Procter&Gamble"
}
Result: at&t is tokenized as at, t, and att, so you can search by at, t, and at&t (a query sketch follows the token listing below).
{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "word",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "at",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "att",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "t",
      "start_offset" : 10,
      "end_offset" : 11,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "procter",
      "start_offset" : 12,
      "end_offset" : 19,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "proctergamble",
      "start_offset" : 12,
      "end_offset" : 26,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "gamble",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "word",
      "position" : 5
    }
  ]
}
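As a quick check, a hedged sketch of searching the placeholder company field from the index sketch above: at search time the match query string AT&T runs through the same analyzer and produces at, t, and att, so it matches the catenated token:

GET my_index/_search
{
  "query": {
    "match": {
      "company": "AT&T"
    }
  }
}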
If you want to remove the stopword "at", you can add a stop filter; the settings below also re-add the english analyzer's possessive stemmer, keyword marker, and stemmer filters:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_with_acronymns": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "acronymns",
            "english_possessive_stemmer",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      },
      "filter": {
        "acronymns": {
          "type": "word_delimiter_graph",
          "catenate_all": true
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": ["example"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      }
    }
  }
}
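On the question's other alternative - two index fields - a multi-field can index the same text both ways without duplicating the source document. A minimal sketch, assuming the analysis settings above live in the same index body; company and company.acronym are placeholder names: queries on company use the standard english analyzer, while queries on company.acronym use the acronym-aware one:

{
  "mappings": {
    "properties": {
      "company": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "acronym": {
            "type": "text",
            "analyzer": "whitespace_with_acronymns"
          }
        }
      }
    }
  }
}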