How to build an inverted 1:n Elasticsearch index using reindex, ingest pipelines and processors
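I've started experimenting with Elasticsearch ingest pipelines and processors as a possibly faster way to build what I can only describe as an "inverted index".
Here is what I'm trying to do: I have a documents index. Each document looks something like this: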
{
"id": "DOC1",
"title": "Quiz no. 1",
"questions": [
{
"question": "Who was the first person to walk on the Moon?",
"choices": [
{ "answer": "Michael Jackson", "correct": false },
{ "answer": "Neil Armstrong", "correct": true }
]
},
{
"question": "Who wrote the Macbeth?",
"choices": [
{ "answer": "William Shakespeare", "correct": true },
{ "answer": "Dante Alighieri", "correct": false },
{ "answer": "Arthur Conan Doyle", "correct": false }
]
}
]
}
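I'd like to know whether there is some magic combination of reindex, pipelines and processors that would let me build a questions index automatically. Here is an example of that index: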
[
{
"question_id": "<randomly-generated-value-1>",
"document_id": "DOC1",
"question": "Who was the first person to walk on the Moon?",
"choices": [
{ "answer": "Michael Jackson", "correct": false },
{ "answer": "Neil Armstrong", "correct": true }
]
},
{
"question_id": "<randomly-generated-value-2>",
"document_id": "DOC1",
"question": "Who wrote the Macbeth?",
"choices": [
{ "answer": "William Shakespeare", "correct": true },
{ "answer": "Dante Alighieri", "correct": false },
{ "answer": "Arthur Conan Doyle", "correct": false }
]
}
]
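To make the intended 1:n mapping concrete, the transformation can be sketched in plain Python, outside of Elasticsearch. This is only an illustration of the desired output, not an ingest pipeline; the function name `split_into_questions` and the use of `uuid.uuid4()` for `question_id` are assumptions made for this sketch.

```python
import uuid


def split_into_questions(document):
    """Turn one quiz document into one output document per question."""
    question_docs = []
    for item in document.get("questions", []):
        question_docs.append({
            # Random ID, standing in for <randomly-generated-value-N>.
            "question_id": str(uuid.uuid4()),
            # Back-reference to the source document.
            "document_id": document["id"],
            "question": item["question"],
            "choices": item["choices"],
        })
    return question_docs


quiz = {
    "id": "DOC1",
    "title": "Quiz no. 1",
    "questions": [
        {
            "question": "Who was the first person to walk on the Moon?",
            "choices": [
                {"answer": "Michael Jackson", "correct": False},
                {"answer": "Neil Armstrong", "correct": True},
            ],
        },
        {
            "question": "Who wrote the Macbeth?",
            "choices": [
                {"answer": "William Shakespeare", "correct": True},
                {"answer": "Dante Alighieri", "correct": False},
                {"answer": "Arthur Conan Doyle", "correct": False},
            ],
        },
    ],
}

# One source document yields one line per question.
for question_doc in split_into_questions(quiz):
    print(question_doc["document_id"], "-", question_doc["question"])
```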
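The Elasticsearch documentation mentions that you can run a reindex using a specific pipeline. Looking at the simulate pipeline docs, I've been trying out some processors, including foreach, but I can't tell whether the documents produced by a pipeline are still 1:1 with the original index, or whether one source document can produce multiple destination documents (which is what I need).
Here is the simulate pipeline I'm trying: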
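For context, a reindex that routes documents through a pipeline sets `dest.pipeline` in the `_reindex` request body. A minimal sketch of building that body in Python follows; the index names `documents` and `questions` and the pipeline ID `invert-questions` are hypothetical placeholders, and the code only builds the JSON without sending it to a cluster.

```python
import json


def build_reindex_body(source_index, dest_index, pipeline_id):
    """Build the body for POST /_reindex that applies an ingest pipeline."""
    return {
        "source": {"index": source_index},
        # Every reindexed document is passed through this pipeline.
        "dest": {"index": dest_index, "pipeline": pipeline_id},
    }


# Hypothetical names, used only for illustration.
body = build_reindex_body("documents", "questions", "invert-questions")
print(json.dumps(body, indent=2))
```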
{
"pipeline": {
"description": "Inverts the documents index into a questions index",
"processors": [
{
"rename": {
"field": "id",
"target_field": "document_id",
"ignore_missing": false
}
},
{
"foreach": {
"field": "questions",
"processor": {
"rename": {
"field": "_ingest._value.question",
"target_field": "question"
}
}
}
},
{
"foreach": {
"field": "questions",
"processor": {
"rename": {
"field": "_ingest._value.choices",
"target_field": "choices"
}
}
}
},
{
"remove": {
"field": "questions"
}
}
]
}
}
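The simulate API (`POST /_ingest/pipeline/_simulate`) takes the pipeline definition together with a `docs` array of sample documents, each wrapped in `_source`. As a rough sketch, the request body can be assembled in Python like this; the helper name `build_simulate_body` is hypothetical, only the first processor is included for brevity, and nothing is sent to a cluster.

```python
import json

# Shortened pipeline: only the first rename processor from the question.
pipeline = {
    "description": "Inverts the documents index into a questions index",
    "processors": [
        {
            "rename": {
                "field": "id",
                "target_field": "document_id",
                "ignore_missing": False,
            }
        }
    ],
}

# Sample source document, trimmed to a single question.
sample_doc = {
    "id": "DOC1",
    "title": "Quiz no. 1",
    "questions": [
        {
            "question": "Who was the first person to walk on the Moon?",
            "choices": [
                {"answer": "Michael Jackson", "correct": False},
                {"answer": "Neil Armstrong", "correct": True},
            ],
        }
    ],
}


def build_simulate_body(pipeline_definition, documents):
    """Wrap each sample document in `_source`, as the simulate API expects."""
    return {
        "pipeline": pipeline_definition,
        "docs": [{"_source": doc} for doc in documents],
    }


simulate_body = build_simulate_body(pipeline, [sample_doc])
print(json.dumps(simulate_body, indent=2))
```

As far as I can tell, the simulate response lists one result per entry in `docs`, which is worth keeping in mind when reading the output.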
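This almost works. The problem with this approach is that there is only one resulting document, corresponding to the first question. The second question is missing from the simulate pipeline output, so I'm wondering whether a processor pipeline can output multiple destination documents from one source document, or whether we are forced to keep a 1:1 relationship.
This answer seems to suggest that what I want to achieve is not possible.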