GCP Dataproc has Druid available in alpha. How to load segments?

Question

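The Dataproc page describing Druid support has no section on how to load data into the cluster. I have been trying to do this using Google Cloud Storage, but I don't know how to set up a working spec for it. I would expect the "firehose" section to contain some Google-specific reference to a bucket, but there are no examples of how to do this.

What is the way to load data into Druid running directly on GCP Dataproc?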

Answer 1

I haven't used the Dataproc version of Druid, but I have a small cluster running on Google Compute VMs. I ingest data from GCS using the Google Cloud Storage Druid extension: https://druid.apache.org/docs/latest/development/extensions-core/google.html

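To enable the extension, add it to the list of extensions in your Druid common.properties file (typically named common.runtime.properties in a standard Druid install):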

druid.extensions.loadList=["druid-google-extensions", "postgresql-metadata-storage"]

To ingest data from GCS, I send an HTTP POST request to http://druid-overlord-host:8081/druid/indexer/v1/task

The body of the POST request is a JSON file containing the ingestion spec (see the ["ioConfig"]["firehose"] section):

{
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "daily_xport_test",
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "MONTH",
                "queryGranularity": "NONE",
                "rollup": false
            },
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {
                        "column": "dateday",
                        "format": "auto"
                    },
                    "dimensionsSpec": {
                        "dimensions": [{
                                "type": "string",
                                "name": "id",
                                "createBitmapIndex": true
                            },
                            {
                                "type": "long",
                                "name": "clicks_count_total"
                            },
                            {
                                "type": "long",
                                "name": "ctr"
                            },
                            "deleted",
                            "device_type",
                            "target_url"
                        ]
                    }
                }
            }
        },
        "ioConfig": {
            "type": "index_parallel",
            "firehose": {
                "type": "static-google-blobstore",
                "blobs": [{
                    "bucket": "data-test",
                    "path": "/sample_data/daily_export_18092019/000000000000.json.gz"
                }],
                "filter": "*.json.gz$"
            },
            "appendToExisting": false
        },
        "tuningConfig": {
            "type": "index_parallel",
            "maxNumSubTasks": 1,
            "maxRowsInMemory": 1000000,
            "pushTimeout": 0,
            "maxRetry": 3,
            "taskStatusCheckPeriodMs": 1000,
            "chatHandlerTimeout": "PT10S",
            "chatHandlerNumRetries": 5
        }
    }
}
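
The spec above lists a single blob. When several files need to be ingested, the "blobs" array can be generated rather than written by hand. The following is a minimal sketch, assuming the static-google-blobstore firehose shape shown above; build_firehose is a hypothetical helper name (not part of Druid), and the second object path is made up for illustration.

```python
import json


def build_firehose(bucket, paths):
    """Build a static-google-blobstore firehose dict for the given GCS objects.

    Mirrors the ["ioConfig"]["firehose"] shape from the spec above.
    The helper itself is hypothetical and not part of Druid.
    """
    if not paths:
        raise ValueError("at least one object path is required")
    return {
        "type": "static-google-blobstore",
        "blobs": [{"bucket": bucket, "path": p} for p in paths],
    }


firehose = build_firehose(
    "data-test",
    [
        "/sample_data/daily_export_18092019/000000000000.json.gz",
        # Hypothetical second shard, for illustration only.
        "/sample_data/daily_export_18092019/000000000001.json.gz",
    ],
)
print(json.dumps(firehose, indent=4))
```

The resulting dict can be placed under ["spec"]["ioConfig"]["firehose"] before the spec is serialized to spec.json.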

Example cURL command to start the ingestion task in Druid (spec.json contains the JSON from the previous section):

curl -X 'POST' -H 'Content-Type:application/json' -d @spec.json http://druid-overlord-host:8081/druid/indexer/v1/task
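
The same request can also be sent from a script. The following is a minimal sketch using only the Python standard library. It assumes the Overlord is reachable at the placeholder host and port used above and that it replies with a JSON body; build_task_request and submit_task are hypothetical helper names.

```python
import json
import urllib.request

# Same endpoint as the cURL command above; the host name is a placeholder.
OVERLORD_TASK_URL = "http://druid-overlord-host:8081/druid/indexer/v1/task"


def build_task_request(spec, url=OVERLORD_TASK_URL):
    """Build (but do not send) the POST request carrying the ingestion spec."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(spec, url=OVERLORD_TASK_URL, timeout=30):
    """Send the spec to the Overlord and return the decoded JSON response."""
    request = build_task_request(spec, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a reachable Overlord):
#   with open("spec.json", encoding="utf-8") as f:
#       print(submit_task(json.load(f)))
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body without a running cluster.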

druid

google-cloud-platform

google-cloud-dataproc