在 docker standalone flink 中启动任务时出现问题

Problem when starting tasks in docker standalone flink

我们在 Flink 中开发了一个系统,该系统从目录中读取文件,按客户端对它们进行分组,并根据信息类型将这些文件推送到接收器。 我们通过在我们的机器上本地安装 Flink 来做到这一点,并且它可以正常工作。但是,当我们 docker 化项目时,我们的作业已正确提交并且 UI 显示为 运行,但作业实际上从未启动过。在 UI 中,当我们进入细节时,它是这样显示的: Flink dashboard screenshot

作业中的日志显示如下:

2021-10-15 12:13:50,247 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Running initialization on master for job Decibel Processor (d7ea5533f968c1d27e06f45c94e5823a).
2021-10-15 12:13:50,247 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Successfully ran initialization on master in 0 ms.
2021-10-15 12:13:50,269 INFO  org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] - Built 1 pipelined regions in 1 ms
2021-10-15 12:13:50,279 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - No state backend has been configured, using default (HashMap) org.apache.flink.runtime.state.hashmap.HashMapStateBackend@3ae62e37
2021-10-15 12:13:50,280 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Checkpoint storage is set to 'jobmanager'
2021-10-15 12:13:50,296 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - No checkpoint found during restore.
2021-10-15 12:13:50,304 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using failover strategy org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy@3370d640 for Decibel Processor (d7ea5533f968c1d27e06f45c94e5823a).
2021-10-15 12:13:50,319 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Starting execution of job Decibel Processor (d7ea5533f968c1d27e06f45c94e5823a) under job master id 00000000000000000000000000000000.
2021-10-15 12:13:50,322 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Starting scheduling with scheduling strategy [org.apache.flink.runtime.scheduler.strategy.PipelinedRegionSchedulingStrategy]
2021-10-15 12:13:50,323 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Job Decibel Processor (d7ea5533f968c1d27e06f45c94e5823a) switched from state CREATED to RUNNING.
2021-10-15 12:13:50,333 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: Property Config Stream (1/1) (1577d0fa1357c72c49057d4cdd7e2ddd) switched from CREATED to SCHEDULED.
2021-10-15 12:13:50,333 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: Custom File Source (1/1) (73a31f7f635a874d2f8df49e7184b83d) switched from CREATED to SCHEDULED.
2021-10-15 12:13:50,333 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Collection File Input -> Map -> Collection Messages -> Filter Collection Messages -> WaterMark Filter Message (1/1) (55cf6f5740276160f44cc651c0971af8) switched from CREATED to SCHEDULED.
2021-10-15 12:13:50,333 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Session Window -> Filter -> Map -> Sink: Unnamed (1/1) (0d5400e22e1a0b8491f0a98ffcf3ac12) switched from CREATED to SCHEDULED.
2021-10-15 12:13:50,333 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Keyed Reduce -> Stats output -> Sink: Unnamed (1/1) (f2fefa969c0f63f43c817e4c981b6edf) switched from CREATED to SCHEDULED.
2021-10-15 12:13:50,334 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Sink: Storage side output (1/1) (b3204e02a57a8bc0b04a63f044050fd6) switched from CREATED to SCHEDULED.
2021-10-15 12:13:50,334 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Sink: In progress sessions side output (1/1) (9def69110da478c06dfc1636c76349aa) switched from CREATED to SCHEDULED.
2021-10-15 12:13:50,334 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Sink: In progress page views side output (1/1) (cec4378c07fe32c56c9db45ae144079b) switched from CREATED to SCHEDULED.
2021-10-15 12:13:50,361 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Connecting to ResourceManager akka.tcp://flink@jobmanager:6123/user/rpc/resourcemanager_*(00000000000000000000000000000000)
2021-10-15 12:13:50,366 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Resolved ResourceManager address, beginning registration
2021-10-15 12:13:50,369 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registering job manager 00000000000000000000000000000000@akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_2 for job d7ea5533f968c1d27e06f45c94e5823a.
2021-10-15 12:13:50,375 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager [] - Registered job manager 00000000000000000000000000000000@akka.tcp://flink@jobmanager:6123/user/rpc/jobmanager_2 for job d7ea5533f968c1d27e06f45c94e5823a.
2021-10-15 12:13:50,378 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - JobManager successfully registered at ResourceManager, leader id: 00000000000000000000000000000000.
2021-10-15 12:13:50,381 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Received resource requirements from job d7ea5533f968c1d27e06f45c94e5823a: [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, numberOfRequiredSlots=1}]

但是如果我们将一个文件添加到应该处理的文件夹中,则不会发生任何事情。

我们为此使用 docker-compose,因为它必须与与 Flink 无关的其他代码部分一起部署(我省略了不相关的部分)

short-term-processor:
    # Define the name and version of the docker image
    image: short-term-processor
    container_name: short_term_processor
    command: standalone-job --job-classname com.decibel.flink.processor.Main
    restart: always
    ports:
     - "8081:8081"
    volumes:
      - /flink/flink-conf.yaml:/opt/flink/conf/flink-conf.yaml
      - /flink/processor.properties:/opt/flink/conf/processor.properties
      - /flink/data:/data1

配置 yaml 和属性与在我们的本地集群中正常工作的那些完全相同。 知道为什么会这样吗?

经过几次尝试,我们找到了问题和解决方案:独立 docker 作业只是提交作业但从未启动。

为了解决这个问题,我们需要创建 2 个额外的容器,一个用于作业管理器,一个用于任务管理器:

service:
    # Define the name and version of the docker image
    image: service
    container_name: service
    command: "flink run /opt/flink/usrlib/artifacts/service.jar --port 9000"
    # Logs are pushed to syslog
    logging:
      driver: syslog
    restart: always
    ports:
     - "9000:9000"
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /var/log/decibel/:/opt/flink/log/
      - /conf/flink-conf.yaml:/opt/flink/conf/flink-conf.yaml
      - /conf/processor.properties:/opt/flink/conf/processor.properties
      - /data1:/data1/decibelinsightdata
    depends_on:
      - jobmanager
      - taskmanager
  jobmanager:
    image: apache/flink:1.13.0-scala_2.11-java11
    command: "jobmanager.sh start-foreground"
    ports:
      - 8081:8081
    volumes:
      - /data1:/data1/decibelinsightdata
      - /conf/flink-conf.yaml:/opt/flink/conf/flink-conf.yaml
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager
  taskmanager:
    image: apache/flink:1.13.0-scala_2.11-java11
    command: "taskmanager.sh start-foreground"
    volumes:
      - /data1:/data1/decibelinsightdata
      - /conf/flink-conf.yaml:/opt/flink/conf/flink-conf.yaml
    environment:
      - JOB_MANAGER_RPC_ADDRESS=jobmanager

在此之后,作业开始没有任何问题