为什么我在 Azure 容器实例中的 ML 模型部署仍然失败并显示 "current service state: Transitioning"？

Question

我正在使用 Azure 机器学习服务将 ML 模型部署为 Web 服务。

我 and now would like to deploy it as an ACI web service as in the guide.

为此我定义

from azureml.core.webservice import Webservice, AciWebservice
from azureml.core.image import ContainerImage

aciconfig = AciWebservice.deploy_configuration(cpu_cores=4, 
                      memory_gb=32, 
                      tags={"data": "text",  "method" : "NB"}, 
                      description='Predict something')

和

image_config = ContainerImage.image_configuration(execution_script="score.py", 
                      docker_file="Dockerfile",
                      runtime="python", 
                      conda_file="myenv.yml")

并使用

创建图像

image = ContainerImage.create(name = "scorer-image",
                      models = [model],
                      image_config = image_config,
                      workspace = ws
                      )

图像创建成功

Creating image Image creation operation finished for image scorer-image:5, operation "Succeeded"

此外，通过运行在 Azure VM 上使用

在本地对图像进行故障排除

sudo docker run -p 8002:5001 myscorer0588419434.azurecr.io/scorer-image:5

允许我运行（本地）成功查询 http://localhost:8002/score。

但是，部署

service_name = 'scorer-svc'
service = Webservice.deploy_from_image(deployment_config = aciconfig,
                                        image = image,
                                        name = service_name,
                                        workspace = ws)

失败

Creating service
Running.
FailedACI service creation operation finished, operation "Failed"
Service creation polling reached terminal state, current service state: Transitioning
Service creation polling reached terminal state, unexpected response received. Transitioning

我尝试在 aciconfig 中设置更慷慨的 memory_gb，但无济于事：部署停留在 transitioning 状态（如图所示如果在 Azure 门户上进行监控，则在下方）：

此外，运行ning service.get_logs() 给了我

WebserviceException: Received bad response from Model Management Service: Response Code: 404

罪魁祸首是什么？

Answer 1

如果 ACI 部署失败，一种解决方案是尝试分配较少资源，例如

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                  memory_gb=8, 
                  tags={"data": "text",  "method" : "NB"}, 
                  description='Predict something')

虽然抛出的错误消息不是特别有用，但实际上在 documentation:

中明确说明了这一点

When a region is under heavy load, you may experience a failure when deploying instances. To mitigate such a deployment failure, try deploying instances with lower resource settings [...]

文档还说明了不同区域可用的 CPU/RAM 资源的最大值（在撰写本文时，要求使用 memory_gb=32 进行部署可能会在所有区域失败，因为资源不足）。

需要较少的资源后，部署应该会成功

Creating service
Running......................................................
SucceededACI service creation operation finished, operation
"Succeeded" Healthy

为什么我在 Azure 容器实例中的 ML 模型部署仍然失败并显示 "current service state: Transitioning"？

Why does my ML model deployment in Azure Container Instance still fail with "current service state: Transitioning"?

python

deployment

docker

azure-container-instances

azure-machine-learning-service