How to really reindex data in Elasticsearch

Question

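I have added new mappings (mainly not_analyzed versions of existing fields), and now I have to figure out how to reindex the existing data. I tried following the guide on the Elasticsearch website, but it is just too confusing. I have also tried using plugins (elasticsearch-reindex, allegro/elasticsearch-reindex-tool). I have looked at "ElasticSearch - Reindexing your data with zero downtime", which is a similar question. I was hoping not to rely on external tools (if possible) and to use the bulk API instead (as with the original insert).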

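I could easily rebuild the whole index, since it is really read-only data, but that won't work in the long run if I want to add more fields and so on once it is in production. Does anyone know of an easy-to-understand, easy-to-follow solution or set of steps for a relative novice to ES? I'm on version 2 and using Windows.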

Answer 1

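Re-indexing means reading the data, deleting the data in Elasticsearch, and ingesting it again. There is no such thing as "change the mapping of existing data in place." All the re-indexing tools you mentioned are just wrappers around read->delete->ingest.
You can always adjust the mapping for new indices and add fields later. All new fields will be indexed according to that mapping. If you have no control over the new fields, you can also use dynamic mapping.
Have a look at "Change default mapping of string to "not analyzed" in Elasticsearch" to see how to use dynamic mapping to get not_analyzed fields for strings.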
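
As a rough sketch of the dynamic-mapping idea, assuming Elasticsearch 2.x (the version in the question): the snippet below builds an index-creation body whose dynamic template maps every newly seen string field as not_analyzed. The helper name and the template name `strings_not_analyzed` are made up for illustration. On 5.x and later the `string` type was replaced by `text`/`keyword`, so the body would need adjusting there.

```python
import json


def not_analyzed_strings_body(template_name="strings_not_analyzed"):
    """Build an index-creation body for Elasticsearch 2.x (assumed version).

    The `_default_` mapping applies to every type in the index, and the
    dynamic template maps each newly seen string field as not_analyzed.
    """
    return {
        "mappings": {
            "_default_": {
                "dynamic_templates": [
                    {
                        template_name: {
                            "match_mapping_type": "string",
                            "mapping": {"type": "string", "index": "not_analyzed"},
                        }
                    }
                ]
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON to send as the body of a PUT /<new-index> request.
    print(json.dumps(not_analyzed_strings_body(), indent=2))
```

Only fields that first appear after the index is created are affected; documents already indexed elsewhere still have to be copied into the new index.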

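Re-indexing is very expensive. A better way is to create a new index and drop the old one. To achieve this with zero downtime, use an index alias for all your clients. Think of an index called "data-version1". In steps: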

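Create your index "data-version1" and give it an alias named "data"
Use only the alias "data" in all your client applications
To update your mapping: create a new index called "data-version2" (with the new mapping) and put all your data in
To switch from version 1 to version 2: delete the alias "data" on version 1 and create the alias "data" on version 2 (or create first, then delete). In the time between these two steps, your clients will see no (or doubled) data. But the time between deleting and creating the alias should be so short that your clients shouldn't notice it.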

It is good practice to always use aliases.
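
The switch in the last step can be sketched as follows. The `_aliases` endpoint accepts several actions in one request and applies them together, so sending the remove and the add in a single call should avoid the short window described above. This is a minimal sketch using only the Python standard library; the function names and the `http://localhost:9200` address are assumptions for illustration.

```python
import json
import urllib.request


def alias_swap_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves `alias` in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def swap_alias(base_url, alias, old_index, new_index):
    """Send the swap to the cluster at base_url and return the decoded reply."""
    body = json.dumps(alias_swap_actions(alias, old_index, new_index)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/_aliases",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the names from the steps above, the call would be `swap_alias("http://localhost:9200", "data", "data-version1", "data-version2")`.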

Answer 2

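I had the same problem, but I couldn't find any resource on updating the mapping and analyzers of an existing index. My suggestion is to use the scroll and scan API and reindex your data into a new index with the new mapping and new fields.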
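
A minimal sketch of that scroll-then-bulk loop, using only the Python standard library. It assumes the destination index already exists with the new mapping, that `_source` is enabled on the source index, and that the cluster accepts `"sort": ["_doc"]` as the cheap scroll order (the replacement for the older scan search type). All function names and the batch size are illustrative, not part of any official client.

```python
import json
import urllib.request


def http_json(method, url, body=None, content_type="application/json"):
    """Send one request and decode the JSON response."""
    data = body.encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": content_type}, method=method
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def hits_to_bulk(hits, dest_index):
    """Turn scroll hits into a newline-delimited _bulk body of index actions."""
    lines = []
    for hit in hits:
        action = {"_index": dest_index, "_id": hit["_id"]}
        if "_type" in hit:  # copy the type when the hit carries one (2.x needs it)
            action["_type"] = hit["_type"]
        lines.append(json.dumps({"index": action}))
        lines.append(json.dumps(hit["_source"]))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline


def copy_index(base_url, source_index, dest_index, batch_size=500):
    """Scroll through source_index and bulk-index each batch into dest_index."""
    query = json.dumps({"size": batch_size, "sort": ["_doc"]})
    page = http_json("POST", f"{base_url}/{source_index}/_search?scroll=1m", query)
    copied = 0
    while page["hits"]["hits"]:
        bulk_body = hits_to_bulk(page["hits"]["hits"], dest_index)
        result = http_json("POST", f"{base_url}/_bulk", bulk_body, "application/x-ndjson")
        if result.get("errors"):
            raise RuntimeError("bulk request reported item failures")
        copied += len(page["hits"]["hits"])
        scroll = json.dumps({"scroll": "1m", "scroll_id": page["_scroll_id"]})
        page = http_json("POST", f"{base_url}/_search/scroll", scroll)
    return copied
```

Reusing each document's `_id` means a rerun overwrites documents in the new index instead of duplicating them.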

Answer 3

In version 2.3.4, a new API, _reindex, is available; it does exactly what it says. The basic usage is:

POST /_reindex
{
    "source": {
        "index": "currentIndex"
    },
    "dest": {
        "index": "newIndex"
    }
}
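
One thing to keep in mind: _reindex does not set up the destination index and does not copy the source index's settings or mappings, so the destination is normally created with the new mapping before the copy starts. The sketch below shows that order using only the Python standard library. The function names are illustrative, and `dest_mappings` stands for whatever mapping you have written (its exact shape depends on your Elasticsearch version).

```python
import json
import urllib.request


def reindex_requests(source_index, dest_index, dest_mappings):
    """Return the (method, path, body) requests in the order they should be sent."""
    return [
        # 1. Create the destination first so it gets the new mapping.
        ("PUT", "/" + dest_index, {"mappings": dest_mappings}),
        # 2. Copy every document from the source into the destination.
        (
            "POST",
            "/_reindex",
            {"source": {"index": source_index}, "dest": {"index": dest_index}},
        ),
    ]


def run(base_url, requests):
    """Send each request in order and collect the decoded JSON responses."""
    responses = []
    for method, path, body in requests:
        request = urllib.request.Request(
            base_url + path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method=method,
        )
        with urllib.request.urlopen(request) as response:
            responses.append(json.load(response))
    return responses
```

For the indices above this would be `run("http://localhost:9200", reindex_requests("currentIndex", "newIndex", my_mappings))`, where the address and `my_mappings` are placeholders.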

Answer 4

Example of reindexing from a remote host to the local host with Elasticsearch (updated January 2020)

# show indices on this host
curl 'localhost:9200/_cat/indices?v'

# edit elasticsearch configuration file to allow remote indexing
sudo vi /etc/elasticsearch/elasticsearch.yml

## copy the line below somewhere in the file
>>>
# --- whitelist for remote indexing ---
reindex.remote.whitelist: my-remote-machine.my-domain.com:9200
<<<

# restart elasticsearch service
sudo systemctl restart elasticsearch

# run reindex from remote machine to copy the index named filebeat-2016.12.01
curl -H 'Content-Type: application/json' -X POST '127.0.0.1:9200/_reindex?pretty' -d'{
  "source": {
    "remote": {
      "host": "http://my-remote-machine.my-domain.com:9200"
    },
    "index": "filebeat-2016.12.01"
  },
  "dest": {
    "index": "filebeat-2016.12.01"
  }
}'

# verify index has been copied
curl 'localhost:9200/_cat/indices?v'

Answer 5

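If, like me, you want a straightforward answer to this common and basic problem, which is poorly addressed by Elastic and the community as a whole, here is the code that works for me.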

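This assumes you are just debugging, not in a production environment, and that adding or removing fields is entirely legitimate because you do not care at all about downtime or latency: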

# First of all: block writes on the index so that it can be cloned
PUT /my_index/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

# clone index into a temporary index
POST /my_index/_clone/my_index-000001  

# Re-enable writes on the original index; _reindex cannot write to it while index.blocks.write is true
PUT /my_index/_settings
{
  "settings": {
    "index.blocks.write": false
  }
}

# Copy all documents back into the original index to force them to be reindexed
POST /_reindex
{
  "source": {
    "index": "my_index-000001"
  },
  "dest": {
    "index": "my_index"
  }
}

# Finally delete the temporary index
DELETE my_index-000001
