Why does my ML model deployment in Azure Container Instance still fail with "current service state: Transitioning"?

I am using Azure Machine Learning Service to deploy an ML model as a web service, and would now like to deploy it as an ACI web service, as in the guide.

To do so, I define

from azureml.core.webservice import Webservice, AciWebservice
from azureml.core.image import ContainerImage

aciconfig = AciWebservice.deploy_configuration(cpu_cores=4, 
                      memory_gb=32, 
                      tags={"data": "text",  "method" : "NB"}, 
                      description='Predict something')

image_config = ContainerImage.image_configuration(execution_script="score.py", 
                      docker_file="Dockerfile",
                      runtime="python", 
                      conda_file="myenv.yml")

I then create an image with

image = ContainerImage.create(name = "scorer-image",
                      models = [model],
                      image_config = image_config,
                      workspace = ws
                      )

The image is created successfully:

Creating image Image creation operation finished for image scorer-image:5, operation "Succeeded"

Moreover, troubleshooting the image locally by running it on an Azure VM with

sudo docker run -p 8002:5001 myscorer0588419434.azurecr.io/scorer-image:5

works: I can successfully send (local) queries to http://localhost:8002/score.

However, the deployment with

service_name = 'scorer-svc'
service = Webservice.deploy_from_image(deployment_config = aciconfig,
                                        image = image,
                                        name = service_name,
                                        workspace = ws)

fails with

Creating service
Running.
FailedACI service creation operation finished, operation "Failed"
Service creation polling reached terminal state, current service state: Transitioning
Service creation polling reached terminal state, unexpected response received. Transitioning

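I tried setting a more generous memory_gb in aciconfig, but to no avail: the deployment stays in the Transitioning state (the Azure portal shows the same state when monitoring the service there).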

In addition, running service.get_logs() gives me

WebserviceException: Received bad response from Model Management Service: Response Code: 404

What could be the culprit?

If an ACI deployment fails, one solution is to try allocating fewer resources, e.g.

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                  memory_gb=8, 
                  tags={"data": "text",  "method" : "NB"}, 
                  description='Predict something')

Although the error message thrown is not particularly helpful, this behavior is in fact clearly stated in the documentation:

When a region is under heavy load, you may experience a failure when deploying instances. To mitigate such a deployment failure, try deploying instances with lower resource settings [...]

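The documentation also lists the maximum CPU/RAM resources available in each region (at the time of writing, requesting a deployment with memory_gb=32 would likely fail in every region because of insufficient resources).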
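As a rough illustration of that point, a request can be checked against the documented maxima before deploying. This is only a sketch: check_aci_request is a hypothetical helper, not part of the azureml SDK, and the limits used below are placeholder numbers that would need to be replaced with the values the documentation lists for your region.

```python
def check_aci_request(cpu_cores, memory_gb, max_cpu_cores, max_memory_gb):
    """Return a list of problems with the request; an empty list means it fits.

    Hypothetical helper: the maxima are supplied by the caller and should be
    taken from the ACI documentation for the target region.
    """
    problems = []
    if cpu_cores > max_cpu_cores:
        problems.append(
            f"cpu_cores={cpu_cores} exceeds the limit of {max_cpu_cores}"
        )
    if memory_gb > max_memory_gb:
        problems.append(
            f"memory_gb={memory_gb} exceeds the limit of {max_memory_gb}"
        )
    return problems


# Placeholder limits for illustration only (NOT the real values for any region).
example_limits = {"max_cpu_cores": 4, "max_memory_gb": 16}

# The original request from the question, and the reduced one from this answer.
for cpu, mem in [(4, 32), (1, 8)]:
    issues = check_aci_request(cpu, mem, **example_limits)
    print(cpu, mem, issues if issues else "fits within the given limits")
```

Passing this check does not guarantee success, since a region under heavy load may still reject a request that is within the documented limits.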

After requesting fewer resources, the deployment should succeed with

Creating service
Running......................................................
SucceededACI service creation operation finished, operation "Succeeded"
Healthy
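If it is unclear how far the request has to be reduced, the same idea can be automated by trying progressively smaller settings until one deploys. The following is a minimal sketch, not an SDK feature: deploy_with_fallback and the deploy callable are hypothetical names, and the commented-out wiring assumes the image, ws and service_name variables from the question.

```python
def deploy_with_fallback(deploy, candidates):
    """Try each (cpu_cores, memory_gb) pair in order until one ends up Healthy.

    `deploy` is a caller-supplied function (hypothetical) that performs one
    deployment attempt and returns the final service state as a string.
    Returns the first working pair (or None) plus a log of all attempts.
    """
    attempts = []
    for cpu_cores, memory_gb in candidates:
        try:
            state = deploy(cpu_cores, memory_gb)
        except Exception as exc:  # record the failure and try the next setting
            state = f"error: {exc}"
        attempts.append((cpu_cores, memory_gb, state))
        if state == "Healthy":
            return (cpu_cores, memory_gb), attempts
    return None, attempts


# One possible way to wire `deploy` to the code from the question (untested
# sketch; a failed service with the same name may need to be deleted first):
#
# def deploy(cpu_cores, memory_gb):
#     config = AciWebservice.deploy_configuration(cpu_cores=cpu_cores,
#                                                 memory_gb=memory_gb)
#     service = Webservice.deploy_from_image(deployment_config=config,
#                                            image=image,
#                                            name=service_name,
#                                            workspace=ws)
#     service.wait_for_deployment(show_output=True)
#     return service.state
#
# chosen, attempts = deploy_with_fallback(deploy, [(4, 32), (2, 16), (1, 8)])
```

Ordering the candidates from largest to smallest means the service ends up with the most generous setting that the region can currently provide.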