如何为实时数据流配置 Apache Flink Cluster (flink-conf.yml)

How to configure Apache Flink Cluster (flink-conf.yml) for real time data stream

请帮帮我，我有一个 Apache Flink 集群（2 个作业管理器，3 个任务管理器），但我不知道在 flink-conf.yml:

中为该参数设置哪些值

jobmanager.heap.size

taskmanager.heap.size

taskmanager.numberOfTaskSlots

parallelism.default

Job Manager 机器有：8CPU，32GB RAM
任务管理器机器有：8CPU，32GB RAM

我计划运行在此集群上执行 15..20 个 Apache Flink 作业。由于隐私政策，我不能在这里写java代码，所以我会尽量用文字来表达。

1)我从 Apache Kafka broker №1 读取数据（它是 JSON 条消息）
2)反序列化POJO中的字节数组
3)使用 FilterFunction 检查 POJO 事件中的某些字段
4)通过 id-field 使用 KeyBy 运算符
5) 将 KeyedProcessFunction 与状态（valueState 或 mapState）和计时器（我正在使用 HDFS RocksDB 状态后端）
6)将POJO序列化为字节数组并发送到Apache Kafka 经纪人 №2

预计每天将有超过5000万个事件到来。所有职位都将有一个数据源。

我会考虑使用资源管理器来点赞YARN, Mesos, or Kubernetes in order to have high availability. In a nutshell, this is what they do for you:

When deploying a Flink application, Flink automatically identifies the required resources based on the application’s configured parallelism and requests them from the resource manager. In case of a failure, Flink replaces the failed container by requesting new resources. All communication to submit or control an application happens via REST calls. This eases the integration of Flink in many environments.

换句话说，他们可以将需要的集群资源提供给link引擎。并且您可以更轻松地配置您要查找的参数。

如何为实时数据流配置 Apache Flink Cluster (flink-conf.yml)

How to configure Apache Flink Cluster (flink-conf.yml) for real time data stream

java

apache-flink

flink-streaming