Why might a Kubernetes CronJob create two jobs, or no job at all?

Question

The Cron Job Limitations section of the k8s documentation says there is no guarantee that a job will be executed exactly once:

A cron job creates a job object about once per execution time of its schedule. We say "about" because there are certain circumstances where two jobs might be created, or no job might be created. We attempt to make these rare, but do not completely prevent them. Therefore, jobs should be idempotent.

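Could anyone explain:

Why does this happen?
What are the probabilities/statistics of this happening?
Will it be fixed in k8s in some reasonable future?
Are there any workarounds to prevent this behavior (if the running job can't be implemented as idempotent)?
Do other cron-related services suffer from the same issue? Maybe it is a core cron problem?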

Answer 1

The controller:

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controller.go

Start with the code comment that lays the groundwork for an explanation:

I did not use watch or expectations. Those add a lot of corner cases, and we aren't expecting a large volume of jobs or scheduledJobs. (We are favoring correctness over scalability.)  

If we find a single controller thread is too slow because there are a lot of Jobs or CronJobs, we can parallelize by Namespace. If we find the load on the API server is too high, we can use a watch and UndeltaStore.

Just periodically list jobs and SJs, and then reconcile them.

"Periodically" means every 10 seconds:

https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cronjob/cronjob_controller.go#L105

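The documentation that follows the quoted limitation also sheds some useful light on circumstances in which two jobs, or no job, may be launched for a particular scheduled time: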

If startingDeadlineSeconds is set to a large value or left unset (the default) and if concurrencyPolicy is set to AllowConcurrent, the jobs will always run at least once.

Jobs may fail to run if the CronJob controller is not running or broken for a span of time from before the start time of the CronJob to start time plus startingDeadlineSeconds, or if the span covers multiple start times and concurrencyPolicy does not allow concurrency. For example, suppose a cron job is set to start at exactly 08:30:00 and its startingDeadlineSeconds is set to 10, if the CronJob controller happens to be down from 08:29:00 to 08:42:00, the job will not start. Set a longer startingDeadlineSeconds if starting later is better than not starting at all.
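
The 08:30:00 example in the quoted documentation can be checked with a small calculation. The following is only a toy model of that one paragraph, not the controller's actual logic: the function name job_starts and its parameters are made up for illustration, it assumes a single outage window, and it ignores the 10-second polling interval.

```python
from datetime import datetime, timedelta

def job_starts(scheduled, deadline_seconds, down_from, down_until):
    """Toy model: can a run scheduled at `scheduled` still be started if the
    controller is down during [down_from, down_until)?

    deadline_seconds=None models an unset startingDeadlineSeconds.
    """
    # The controller is up at the scheduled time, so it can start the run.
    if not (down_from <= scheduled < down_until):
        return True
    # No deadline: the run can still be started once the controller is back.
    if deadline_seconds is None:
        return True
    # Otherwise the controller must be back before the deadline expires.
    latest_start = scheduled + timedelta(seconds=deadline_seconds)
    return down_until <= latest_start

day = datetime(2024, 1, 1)
scheduled = day.replace(hour=8, minute=30)
down_from = day.replace(hour=8, minute=29)
down_until = day.replace(hour=8, minute=42)

print(job_starts(scheduled, 10, down_from, down_until))   # False: deadline passed at 08:30:10
print(job_starts(scheduled, 900, down_from, down_until))  # True: deadline is 08:45:00
```

With a deadline of 10 seconds the run is skipped, as the documentation says. With a 15-minute deadline the controller is back in time to start it late.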

At a higher level, solving exactly-once in a distributed system is hard:

https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/

Clocks and time synchronization in a distributed system are also hard:

https://8thlight.com/blog/rylan-dirksen/2013/10/04/synchronization-in-a-distributed-system.html

To the questions:

Why does this happen?

For instance, the node hosting the CronJobController fails at the time a job is supposed to run.
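
What are the probabilities/statistics of this happening?

Very unlikely for any given run. Over a large enough number of runs, it becomes very hard to avoid having to face this issue.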
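
To make "unlikely per run, hard to avoid over many runs" concrete, here is a back-of-the-envelope sketch. It assumes each run independently has the same small chance p of being skipped or duplicated. The value 1e-4 is a made-up illustration, not a measured Kubernetes figure.

```python
def prob_at_least_one_incident(p_per_run, runs):
    """Probability that at least one of `runs` independent runs is affected,
    given a per-run incident probability of `p_per_run`."""
    return 1.0 - (1.0 - p_per_run) ** runs

p = 1e-4  # hypothetical per-run probability, for illustration only

daily_for_a_year = prob_at_least_one_incident(p, 365)
every_minute_for_a_year = prob_at_least_one_incident(p, 60 * 24 * 365)

print(f"daily job, one year:        {daily_for_a_year:.4f}")
print(f"every-minute job, one year: {every_minute_for_a_year:.4f}")
```

Under these assumptions a daily job has roughly a 3.6% chance of at least one incident in a year, while a job that runs every minute is all but certain to hit one.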

Will it be fixed in k8s in some reasonable future?

There are no idempotency-related issues under the area/batch label in the k8s repo, so one would guess not.

https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Aarea%2Fbatch
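
Are there any workarounds to prevent this behavior (if the running job can't be implemented as idempotent)?

Think more about the specific definition of idempotent, and about the particular points in the job where commits happen. For instance, a job can be made to tolerate more-than-once execution if each execution saves its state to a staging area, and an election process then determines whose work wins.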
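
A minimal sketch of the staging-plus-election idea, under loud assumptions: all executions of a run share one POSIX-style filesystem with hard-link support, and the function name run_once_committed and the run_id scheme are hypothetical. In a real cluster the commit step would more likely be a unique-key insert in a database or a similar atomic primitive.

```python
import os
import tempfile

def run_once_committed(output_dir, run_id, payload):
    """Stage `payload` privately, then try to commit it as the result of `run_id`.

    Returns True if this execution won the election, False if another
    execution of the same run had already committed.
    """
    final_path = os.path.join(output_dir, f"{run_id}.result")

    # Staging area: a private temp file that no other execution touches.
    fd, staged_path = tempfile.mkstemp(dir=output_dir, prefix=f"{run_id}.", suffix=".staged")
    with os.fdopen(fd, "w") as staged:
        staged.write(payload)

    try:
        # Commit point: os.link raises FileExistsError if final_path already
        # exists, so only the first execution to get here wins.
        os.link(staged_path, final_path)
        return True
    except FileExistsError:
        return False
    finally:
        # The staged copy is no longer needed, whether we won or lost.
        os.remove(staged_path)

with tempfile.TemporaryDirectory() as out:
    # Two executions of the same scheduled run, e.g. after a duplicate launch.
    print(run_once_committed(out, "20240101T0830", "from first execution"))   # True
    print(run_once_committed(out, "20240101T0830", "from second execution"))  # False
```

The job body may run twice, but only one result is ever committed, which is usually the property that matters.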
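
Do other cron-related services suffer from the same issue? Maybe it is a core cron problem?

Yes, it is a core distributed systems problem.

For most users, the k8s documentation probably gives a more precise and nuanced answer than is necessary. If your scheduled job controls some critical medical procedure, it is really important to plan for failure cases. If it is just doing some system cleanup, missing a scheduled run does not matter much. By definition, nearly all users of k8s CronJobs fall into the latter category.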

