ElasticSearch IO: How to remove id from JSON document before writing

Question

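I have an Apache Beam streaming job that reads data from Kafka and writes to ElasticSearch using ElasticSearchIO.

The issue I'm having is that the messages in Kafka already have a key field, and with ElasticSearchIO.Write.withIdFn() I map that field to the document _id field in ElasticSearch.

Given the high volume of data, I don't want the key field to also be written to ElasticSearch as part of _source.

Is there an option/workaround that would allow this?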

Answer 1

Using the Ingest API and the remove processor, you can solve this fairly easily with nothing but your Elasticsearch cluster. You can also simulate an ingest pipeline and inspect the result.

I've prepared an example that should cover your case:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "remove id from incoming docs",
    "processors": [
      {"remove": {
        "field": "id",
        "ignore_failure": true
      }}
    ]
  },
  "docs": [
      {"_source":{"id":"123546", "other_field":"other value"}}
    ]
}

As you can see, the test document contains a field "id". This field no longer appears in the response/result:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_type",
        "_id" : "_id",
        "_source" : {
          "other_field" : "other value"
        },
        "_ingest" : {
          "timestamp" : "2018-12-03T16:33:33.885909Z"
        }
      }
    }
  ]
}
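The simulate call above only previews the result. To have the pipeline applied to documents actually written by the Beam job, it needs to be stored in the cluster and attached to the target index. Here is a minimal sketch in Java that builds (but does not send) the two REST requests. It assumes Elasticsearch 6.5 or later for the index.default_pipeline setting and Java 11 or later for java.net.http; the pipeline name "remove-key", the index name "my-index", and the base URL are placeholders I made up. Since the _id is set by withIdFn() before the document reaches the cluster, removing the field from _source at ingest time should not affect it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds (but does not send) the two REST requests.
// Pipeline name, index name, and base URL are hypothetical placeholders.
class IngestPipelineRequests {

    // JSON body for a pipeline with a single remove processor.
    // Assumes `field` is a plain field name that needs no JSON escaping.
    static String pipelineBody(String field) {
        return "{\"description\":\"remove " + field + " from incoming docs\","
             + "\"processors\":[{\"remove\":{\"field\":\"" + field + "\","
             + "\"ignore_failure\":true}}]}";
    }

    // PUT _ingest/pipeline/<name> stores the pipeline in the cluster.
    static HttpRequest putPipeline(String baseUrl, String name, String field) {
        return jsonPut(baseUrl + "/_ingest/pipeline/" + name, pipelineBody(field));
    }

    // PUT <index>/_settings makes the pipeline the default for that index,
    // so index and bulk requests without an explicit pipeline go through it.
    static HttpRequest setDefaultPipeline(String baseUrl, String index, String name) {
        return jsonPut(baseUrl + "/" + index + "/_settings",
                "{\"index.default_pipeline\":\"" + name + "\"}");
    }

    private static HttpRequest jsonPut(String url, String body) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        String base = "http://localhost:9200"; // assumed local cluster
        HttpRequest storePipeline = putPipeline(base, "remove-key", "key");
        HttpRequest attachToIndex = setDefaultPipeline(base, "my-index", "remove-key");
        System.out.println(storePipeline.method() + " " + storePipeline.uri());
        System.out.println(attachToIndex.method() + " " + attachToIndex.uri());
    }
}
```

The requests can be sent with java.net.http.HttpClient or pasted into the Kibana console by hand. Setting the pipeline on the index keeps the Beam job itself unchanged, which is why I went that way in this sketch.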

Answer 2

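I have created a ticket in the Apache Beam JIRA describing this issue.

Currently, the original problem cannot be solved as part of the indexing process using the Apache Beam API.

The workaround proposed by Etienne Chauchot, one of the maintainers, is to have a separate job that cleans up the indexed data afterwards.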

See, for example.

If anyone else wants to make use of such a feature in the future, you may want to follow the linked ticket.
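For illustration, one way such a cleanup job could work is an _update_by_query call that strips the field from documents that still have it. This is my own sketch, not necessarily what the maintainer had in mind. It only builds the request (Java 11 or later for java.net.http); the index name "my-index" and the base URL are hypothetical placeholders.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch only: builds (but does not send) an _update_by_query request
// that removes one field from the _source of already indexed documents.
class CleanupRequest {

    // Painless script removes the field; the exists query limits the update
    // to documents that still contain it.
    // Assumes `field` is a plain top-level field name that needs no JSON escaping.
    static String removeFieldBody(String field) {
        return "{\"script\":{\"lang\":\"painless\","
             + "\"source\":\"ctx._source.remove(params.field)\","
             + "\"params\":{\"field\":\"" + field + "\"}},"
             + "\"query\":{\"exists\":{\"field\":\"" + field + "\"}}}";
    }

    // POST <index>/_update_by_query with the body above.
    static HttpRequest removeField(String baseUrl, String index, String field) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_update_by_query"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(removeFieldBody(field)))
                .build();
    }

    public static void main(String[] args) {
        // Base URL and index name are assumed placeholders.
        HttpRequest cleanup = removeField("http://localhost:9200", "my-index", "key");
        System.out.println(cleanup.method() + " " + cleanup.uri());
        System.out.println(removeFieldBody("key"));
    }
}
```

Keep in mind that this rewrites every matching document, so on a large index it is probably best run periodically rather than after each write.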

java

elasticsearch

apache-kafka

google-cloud-dataflow

apache-beam