Why does my ML model deployment in Azure Container Instance still fail with "current service state: Transitioning"?
I am using Azure Machine Learning Service to deploy an ML model as a web service.
I would now like to deploy the model as an ACI web service, as in the guide.
For this, I define
from azureml.core.webservice import Webservice, AciWebservice
from azureml.core.image import ContainerImage

aciconfig = AciWebservice.deploy_configuration(
    cpu_cores=4,
    memory_gb=32,
    tags={"data": "text", "method": "NB"},
    description='Predict something',
)
and
image_config = ContainerImage.image_configuration(
    execution_script="score.py",
    docker_file="Dockerfile",
    runtime="python",
    conda_file="myenv.yml",
)
and create the image with
image = ContainerImage.create(
    name="scorer-image",
    models=[model],
    image_config=image_config,
    workspace=ws,
)
The image is created successfully:
Creating image Image creation operation finished for image
scorer-image:5, operation "Succeeded"
Moreover, troubleshooting the image by running it locally on an Azure VM with
sudo docker run -p 8002:5001 myscorer0588419434.azurecr.io/scorer-image:5
allows me to successfully run queries (locally) against http://localhost:8002/score.
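For reference, the local check can also be scripted rather than run by hand. The following is only a minimal sketch: the helper name `build_score_request` and the `{"data": [...]}` payload shape are assumptions for illustration, since score.py is not shown here and the real input format depends on it.

```python
import json
import urllib.request


def build_score_request(url, payload):
    """Build a JSON POST request for a scoring endpoint (nothing is sent)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload; the real shape depends on what score.py expects.
request = build_score_request("http://localhost:8002/score", {"data": ["some text"]})

# To actually send it against the running container:
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to point the same check at the ACI scoring URL once the service is up.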
However, deployment with
service_name = 'scorer-svc'
service = Webservice.deploy_from_image(
    deployment_config=aciconfig,
    image=image,
    name=service_name,
    workspace=ws,
)
fails with
Creating service
Running.
FailedACI service creation operation finished, operation "Failed"
Service creation polling reached terminal state, current service state: Transitioning
Service creation polling reached terminal state, unexpected response received. Transitioning
I tried setting a more generous memory_gb in aciconfig, but to no avail: the deployment stays in a transitioning state (this is also what the Azure portal shows when monitoring it).
Moreover, running service.get_logs() gives me
WebserviceException: Received bad response from Model Management
Service: Response Code: 404
What could be the culprit?
If an ACI deployment fails, one solution is to try allocating fewer resources, e.g.
aciconfig = AciWebservice.deploy_configuration(
    cpu_cores=1,
    memory_gb=8,
    tags={"data": "text", "method": "NB"},
    description='Predict something',
)
While the error messages thrown are not particularly informative, this is actually clearly stated in the documentation:
When a region is under heavy load, you may experience a failure when
deploying instances. To mitigate such a deployment failure, try
deploying instances with lower resource settings [...]
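The documentation also states the maximum CPU/RAM resources available in the different regions (at the time of writing, requesting a deployment with memory_gb=32 would likely fail in all regions because of insufficient resources).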
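A quick sanity check before deploying can catch a request that is over the regional maximum. This is a minimal sketch only: the function name `exceeds_limits` is hypothetical, and the maxima used below are placeholders for illustration, not real Azure figures. Look up the actual values for your region in the documentation.

```python
def exceeds_limits(cpu_cores, memory_gb, max_cpu_cores, max_memory_gb):
    """Return a list of messages for settings that exceed the given maxima."""
    problems = []
    if cpu_cores > max_cpu_cores:
        problems.append(f"cpu_cores={cpu_cores} exceeds maximum {max_cpu_cores}")
    if memory_gb > max_memory_gb:
        problems.append(f"memory_gb={memory_gb} exceeds maximum {max_memory_gb}")
    return problems


# Placeholder maxima for illustration only; these are NOT real Azure limits.
EXAMPLE_MAX_CPU_CORES = 4
EXAMPLE_MAX_MEMORY_GB = 16

problems = exceeds_limits(4, 32, EXAMPLE_MAX_CPU_CORES, EXAMPLE_MAX_MEMORY_GB)
for problem in problems:
    print(problem)
```

An empty list means the request is within the limits you supplied. It does not guarantee that the region currently has capacity.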
After requesting fewer resources, deployment should succeed with
Creating service
Running......................................................
SucceededACI service creation operation finished, operation
"Succeeded" Healthy
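The "try lower resource settings" advice can be automated as a simple fallback loop. The sketch below is generic and not part of the Azure ML SDK: `deploy_with_fallback` and `fake_deploy` are hypothetical names, and `fake_deploy` is a stand-in that only simulates a capacity failure so that the loop can be run without an Azure workspace.

```python
def deploy_with_fallback(deploy, tiers):
    """Try each (cpu_cores, memory_gb) tier in order until one deploys.

    `deploy` is any callable that takes cpu_cores and memory_gb, raises on
    failure, and returns the deployed service on success.
    """
    errors = []
    for cpu_cores, memory_gb in tiers:
        try:
            service = deploy(cpu_cores=cpu_cores, memory_gb=memory_gb)
        except Exception as exc:  # broad on purpose: the SDK error type may vary
            errors.append((cpu_cores, memory_gb, exc))
            continue
        return service, (cpu_cores, memory_gb), errors
    raise RuntimeError(f"all tiers failed: {errors}")


# Stand-in for the real deployment call, used only to demonstrate the loop.
def fake_deploy(cpu_cores, memory_gb):
    if memory_gb > 8:
        raise RuntimeError("insufficient capacity")
    return f"service({cpu_cores} cores, {memory_gb} GB)"


service, chosen, errors = deploy_with_fallback(fake_deploy, [(4, 32), (2, 16), (1, 8)])
print(chosen)
```

In practice, `deploy` would wrap the AciWebservice.deploy_configuration and Webservice.deploy_from_image calls shown earlier and raise if the service does not come up. A failed service may also need to be deleted before its name can be reused, so treat this as a starting point rather than a finished deployment script.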