Tensorflow object detection train.py fails when running on Cloud Machine Learning Engine
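I have a small example of the tensorflow object detection api working locally. Everything looks great. My goal is to use their scripts to run in Google Machine Learning Engine, which I've used extensively in the past. I am following these docs.
Declaring some relevant variables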
declare PROJECT=$(gcloud config list project --format "value(core.project)")
declare BUCKET="gs://${PROJECT}-ml"
declare MODEL_NAME="DeepMeerkatDetection"
declare FOLDER="${BUCKET}/${MODEL_NAME}"
declare JOB_ID="${MODEL_NAME}_$(date +%Y%m%d_%H%M%S)"
declare TRAIN_DIR="${FOLDER}/${JOB_ID}"
declare EVAL_DIR="${BUCKET}/${MODEL_NAME}/${JOB_ID}_eval"
declare PIPELINE_CONFIG_PATH="${FOLDER}/faster_rcnn_inception_resnet_v2_atrous_coco_cloud.config"
declare PIPELINE_YAML="/Users/Ben/Documents/DeepMeerkat/training/Detection/cloud.yml"
My yaml looks like
trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
The relevant paths are set in the config, e.g.
fine_tune_checkpoint: "gs://api-project-773889352370-ml/DeepMeerkatDetection/checkpoint/faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt"
I packaged object detection and slim using setup.py
Running
gcloud ml-engine jobs submit training "${JOB_ID}_train" \
--job-dir=${TRAIN_DIR} \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-central1 \
--config ${PIPELINE_YAML} \
-- \
--train_dir=${TRAIN_DIR} \
--pipeline_config_path= ${PIPELINE_CONFIG_PATH}
produces a tensorflow (import?) error. It's a bit cryptic.
insertId: "1inuq6gg27fxnkc"
logName: "projects/api-project-773889352370/logs/ml.googleapis.com%2FDeepMeerkatDetection_20171017_141321_train"
receiveTimestamp: "2017-10-17T21:38:34.435293164Z"
resource: {…}
severity: "ERROR"
textPayload: "The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 145, in main
model_config, train_config, input_config = get_configs_from_multiple_files()
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 127, in get_configs_from_multiple_files
text_format.Merge(f.read(), train_config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 112, in read
return pywrap_tensorflow.ReadFromStream(self._read_buf, length, status)
File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
FailedPreconditionError: .
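I've seen this error in other questions related to Machine Learning Engine prediction, which suggests that it is probably (?) not directly related to the object detection code. It feels more like the code is not being packaged correctly, or is missing dependencies? I've updated my gcloud to the latest version.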
Bens-MacBook-Pro:research ben$ gcloud --version
Google Cloud SDK 175.0.0
bq 2.0.27
core 2017.10.09
gcloud
gsutil 4.27
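It is hard to see how that is related to the problem here.
Why would the code need to be initialized differently in the cloud?
Update #1.
The curious thing is that eval.py works fine, so it can't be the path to the config file, or anything else that train.py and eval.py share. eval.py patiently sits and waits for model checkpoints to be created.
Another idea might be that the checkpoint was somehow corrupted during upload. We can test this by bypassing it and training from scratch.
In the .config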
from_detection_checkpoint: false
produces the same FailedPreconditionError, so it can't be the model.
The root cause is a slight typo:
--pipeline_config_path= ${PIPELINE_CONFIG_PATH}
There is an extra space after the equals sign. Try this:
gcloud ml-engine jobs submit training "${JOB_ID}_train" \
--job-dir=${TRAIN_DIR} \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-central1 \
--config ${PIPELINE_YAML} \
-- \
--train_dir=${TRAIN_DIR} \
--pipeline_config_path=${PIPELINE_CONFIG_PATH}
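To see why the space matters, here is a minimal sketch of how the shell splits the two variants into arguments. The show_args function and the example bucket path are hypothetical, used only for illustration; they are not part of gcloud or train.py. With the space, the flag arrives with an empty value and the path becomes a separate positional argument. That would be consistent with the traceback above, which goes through get_configs_from_multiple_files, although this sketch only demonstrates the shell behaviour, not what train.py does internally.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints how many arguments it received,
# then each argument wrapped in brackets.
show_args() {
  printf 'argc=%d\n' "$#"
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# Placeholder path for illustration only (an assumption, not the real bucket).
PIPELINE_CONFIG_PATH="gs://example-bucket/model/pipeline.config"

# Stray space: the shell passes two words, an empty-valued flag and the path.
bad_output=$(show_args --pipeline_config_path= ${PIPELINE_CONFIG_PATH})

# No space: the shell passes one word, the flag with its value attached.
good_output=$(show_args --pipeline_config_path=${PIPELINE_CONFIG_PATH})

echo "with space:"
echo "$bad_output"
echo "without space:"
echo "$good_output"
```

Quoting the expansion, as in --pipeline_config_path="${PIPELINE_CONFIG_PATH}", does not prevent this particular typo, but it does guard against paths that contain spaces.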