How to integrate Hive (avro tables) with Schema Registry?

Hive provides two table properties that allow defining an Avro schema: avro.schema.literal and avro.schema.url, where the latter can point to an hdfs path or an http endpoint serving the schema. I would like to use the Schema Registry as my schema service, but the problem is that its endpoints return the schema wrapped in a larger json object:

Request:

GET /schemas/ids/1

Response:

HTTP/1.1 200 OK
Content-Type: application/vnd.schemaregistry.v1+json

{
  "schema": "{\"type\": \"string\"}"
}

Request:

GET /subjects/test/versions/1

Response:

HTTP/1.1 200 OK
Content-Type: application/vnd.schemaregistry.v1+json

{
  "name": "test",
  "version": 1,
  "schema": "{\"type\": \"string\"}"
}

Hive is not able to parse the responses above.
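
To make the mismatch concrete, here is a small sketch in Python (purely illustrative; the registry address localhost:8081 is an assumption) of the unwrapping step that would be needed: avro.schema.url expects the URL to serve the bare Avro schema document, not the registry's JSON envelope.

import json
import urllib.request

# Hypothetical registry location; adjust to your deployment.
REGISTRY = "http://localhost:8081"

# What the Schema Registry returns: the Avro schema as an escaped string
# inside a JSON envelope, e.g. {"schema": "{\"type\": \"string\"}"}.
with urllib.request.urlopen(f"{REGISTRY}/schemas/ids/1") as resp:
    wrapped = json.load(resp)

# What a URL referenced by avro.schema.url would have to serve instead:
# the bare schema document itself, e.g. {"type": "string"}.
plain_schema = wrapped["schema"]
print(plain_schema)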

So far my idea is to put a proxy service (serving plain avro schemas) in front of the Schema Registry and scale it with HAProxy. The Schema Registry itself seems to have a scalable architecture for reads. To be honest, I don't understand the paragraph about the avro.schema.url property in the AvroSerDe Hive documentation:

Specifies a URL to access the schema from. For http schemas, this works for testing and small-scale clusters, but as the schema will be accessed at least once from each task in the job, this can quickly turn the job into a DDOS attack against the URL provider (a web server, for instance). Use caution when using this parameter for anything other than testing.

I think my proposal is a viable solution.
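
For what it's worth, the proxy itself can be very small. A minimal sketch (Python standard library only; the registry address, listening port, and the /schemas/ids/<id> URL layout are assumptions to adapt to your setup) that unwraps the registry response and serves the bare schema could look like this:

import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical addresses; adjust to your deployment.
REGISTRY = "http://localhost:8081"
LISTEN_PORT = 8082

class PlainSchemaHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Forward the incoming path (e.g. /schemas/ids/1) to the Schema Registry,
        # pull the escaped schema string out of the JSON envelope, and return it
        # as a bare Avro schema document that avro.schema.url can point at.
        try:
            with urllib.request.urlopen(REGISTRY + self.path) as resp:
                wrapped = json.load(resp)
            schema = wrapped["schema"].encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(schema)))
            self.end_headers()
            self.wfile.write(schema)
        except Exception as exc:
            self.send_error(502, f"failed to fetch schema: {exc}")

if __name__ == "__main__":
    HTTPServer(("", LISTEN_PORT), PlainSchemaHandler).serve_forever()

Hive tables could then point avro.schema.url at the proxy, e.g. TBLPROPERTIES ('avro.schema.url'='http://<proxy-host>:8082/schemas/ids/1'), and several such instances can sit behind HAProxy as described above.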

Having the schemas in a centralized repository allows schema evolution and checking backward/forward compatibility, so it is better than pointing to an hdfs path, which is what the AvroSerDe documentation recommends.
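
As an aside, the compatibility checking mentioned above is exposed through the registry's REST API. A sketch (again assuming a registry at localhost:8081 and an already registered subject named test) of checking whether a candidate schema is compatible with the latest registered version:

import json
import urllib.request

REGISTRY = "http://localhost:8081"  # hypothetical address

# Candidate schema to validate against the latest version of subject "test".
payload = json.dumps({"schema": json.dumps({"type": "string"})}).encode("utf-8")

req = urllib.request.Request(
    f"{REGISTRY}/compatibility/subjects/test/versions/latest",
    data=payload,
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# The response reports whether the candidate schema is compatible,
# e.g. result["is_compatible"] -> True
print(result.get("is_compatible"))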

I want to do the same thing as you. I filed https://github.com/confluentinc/schema-registry/issues/629 to enhance the Schema Registry to make this easier. Hopefully the project accepts the idea. It looks like it should be a simple enhancement to implement.