Azure Data Factory - Limit the number of Databricks pipelines running at the same time

Question

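I am using ADF to execute Databricks notebooks. At the moment I have 6 pipelines, which run one after another.

Specifically, once one pipeline finishes, the next one is executed from a ForEach loop with several different parameters, and so on down the chain. For example, when the first pipeline finishes, it triggers 3 instances of the second pipeline with different parameters, and each of those instances triggers several instances of the third pipeline. As a result, the deeper I go, the more pipeline runs I need.

The problem is that each pipeline, when it runs, asks Databricks to allocate a cluster for the run. However, Databricks limits the number of cores that can be used per workspace, so some pipeline instances fail to run.

My question is: is there a way to control the number of pipeline instances running at the same time, or is there another way to solve this problem?

Thanks in advance :-)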

Answer 1

Why does this issue occur?

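Note: Creating a Databricks cluster always depends on the number of cores available in the subscription.

Before creating any Databricks cluster, make sure that enough cores are available in the selected region and in the VM family's vCPU quota.

You can check the core limit of your subscription by going to Azure Portal => Subscriptions => select your subscription => Settings "Usage + quotas" => review the usage quota available for each region.

Example: If your subscription has more than 72 cores available, the ADF runs succeed; otherwise they fail with an error like this: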

Activity Validate failed: Databricks execution failed with error message: Unexpected failure while waiting for the cluster to be ready. Cause Unexpected state for cluster (job-200-run-1):  Could not launch cluster due to cloud provider failures. azure_error_code: OperationNotAllowed, azure_error_message: Operation results in exceeding quota limits of Core. Maximum allowed: 350, Current in use: 344
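The end of that message shows how close the subscription is to its limit. As a rough sketch, the two numbers can be pulled out to see the remaining headroom. The helper name `remaining_cores` is hypothetical, and the regular expression is an assumption based on the wording of this one message only.

```python
import re

# Tail of the error text from the failed activity above.
ERROR_MESSAGE = (
    "Could not launch cluster due to cloud provider failures. "
    "azure_error_code: OperationNotAllowed, azure_error_message: "
    "Operation results in exceeding quota limits of Core. "
    "Maximum allowed: 350, Current in use: 344"
)

# Assumed pattern: it matches the wording of this message, nothing more general.
QUOTA_PATTERN = re.compile(r"Maximum allowed: (\d+), Current in use: (\d+)")


def remaining_cores(message):
    """Return the unused core count reported in a quota error, or None."""
    match = QUOTA_PATTERN.search(message)
    if match is None:
        return None
    maximum, in_use = (int(group) for group in match.groups())
    return maximum - in_use


print(remaining_cores(ERROR_MESSAGE))  # 350 - 344 = 6
```

With only 6 cores free, a cluster of three 4-core nodes (12 cores) cannot start, which is why the activity fails.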

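I was trying to create 6 pipelines with Databricks clusters, each with 2 worker nodes. That means it requires

(6 pipelines) * (1 driver node + 2 worker nodes) * (4 cores) = 72 cores.

The calculation above uses the VM size Standard_DS3_v2, which has 4 cores.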
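The same arithmetic as a small sketch, useful for trying other cluster sizes. The function name `cores_required` and its parameters are made up for illustration.

```python
def cores_required(pipelines, workers_per_cluster, cores_per_node, drivers_per_cluster=1):
    """Cores needed when every pipeline run gets its own cluster."""
    nodes_per_cluster = drivers_per_cluster + workers_per_cluster
    return pipelines * nodes_per_cluster * cores_per_node


# Standard_DS3_v2 has 4 cores per node, as stated above.
STANDARD_DS3_V2_CORES = 4

# 6 pipelines * (1 driver + 2 workers) * 4 cores
print(cores_required(6, 2, STANDARD_DS3_V2_CORES))  # 72
```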

Note: Creating a Databricks Spark cluster requires more than 4 cores, i.e. a minimum of 4 cores for the driver type and 4 cores for the worker type.

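Solutions for this issue:

Increase the core limit by raising a ticket with the billing and subscription team to request a higher limit. With this option, you still pay only for the cores you actually use.
Limit your job frequency so that fewer clusters are created, or consider using a single job to copy multiple files. Either way you limit cluster creation, which is what exhausts the cores of your subscription.

To request an increase for one or more resources that support such an increase, submit an Azure support request (select "Quota" as the issue type).

Issue type: Service and subscription limits (quotas)

Reference: Total regional vCPU limit increases

Hope this helps. Let us know if you have any questions.

Please click "Mark as Answer" and upvote the post that helped you; this may be useful to other community members.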

Answer 2

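You can limit the number of activities that run in parallel at each ForEach level by setting the batch count parameter. (It is found under the Settings tab of the ForEach loop.)

batchCount: the batch count used to control the number of parallel executions (when isSequential is set to false).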

https://docs.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity
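For orientation, here is a minimal sketch of what a ForEach activity with a batch count could look like in the pipeline JSON, built as a Python dictionary and printed. The activity names, the pipeline name `SecondPipeline`, and the parameter `parameterSets` are hypothetical. The 1 to 50 range check reflects the maximum given in the linked documentation. `waitOnCompletion` is set to true on the assumption that the loop should hold a slot until the child pipeline finishes; otherwise the batch count would not throttle the child runs.

```python
import json


def for_each_activity(name, items_expression, inner_activities, batch_count):
    """Build a ForEach activity definition that runs in parallel, in batches."""
    if not 1 <= batch_count <= 50:
        raise ValueError("batchCount must be between 1 and 50")
    return {
        "name": name,
        "type": "ForEach",
        "typeProperties": {
            "isSequential": False,  # batchCount only applies to parallel loops
            "batchCount": batch_count,
            "items": {"value": items_expression, "type": "Expression"},
            "activities": inner_activities,
        },
    }


# Hypothetical inner activity: run the next pipeline and wait for it.
run_second = {
    "name": "RunSecondPipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {"referenceName": "SecondPipeline", "type": "PipelineReference"},
        "waitOnCompletion": True,
    },
}

activity = for_each_activity(
    "ForEachParameterSet",
    "@pipeline().parameters.parameterSets",
    [run_second],
    batch_count=2,
)
print(json.dumps(activity, indent=2))
```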

If you cannot set a limit at the level of the whole pipeline, try to keep the batch count as low as possible in each nested ForEach loop.
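To pick batch counts for nested loops, a rough upper bound helps: if every loop waits for its child pipeline, the number of runs active at the deepest level is at most the product of the batch counts along the chain. The sketch below checks that bound against a core budget. The function names are hypothetical, and the 12 cores per cluster assumes the 1 driver + 2 workers on 4-core VMs from the first answer.

```python
from math import prod


def max_concurrent_clusters(batch_counts):
    """Upper bound on simultaneous runs, given one batch count per nested level."""
    return prod(batch_counts)


def fits_quota(batch_counts, cores_per_cluster, available_cores):
    """True if the worst-case number of clusters stays within the core budget."""
    needed = max_concurrent_clusters(batch_counts) * cores_per_cluster
    return needed <= available_cores


CORES_PER_CLUSTER = 12  # assumed: (1 driver + 2 workers) * 4 cores

# Two nested ForEach levels with batch counts 3 and 2: at most 6 clusters, 72 cores.
print(fits_quota([3, 2], CORES_PER_CLUSTER, available_cores=72))  # True
# Raising the inner batch count to 3 gives 9 clusters, 108 cores.
print(fits_quota([3, 3], CORES_PER_CLUSTER, available_cores=72))  # False
```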
