Error after running a job in Google Cloud ML

Question

I tried running a word-RNN model from GitHub on Google Cloud ML. After submitting the job, I got errors in the log file.

This is the training job I submitted:

gcloud ml-engine jobs submit training word_pred_7 \
    --package-path trainer \
    --module-name trainer.train \
    --runtime-version 1.0 \
    --job-dir $JOB_DIR \
    --region $REGION \
    -- \
    --data_dir gs://model-development/arpit/word-rnn-tensorflow-master/data/tinyshakespeare/real1.txt \
    --save_dir gs://model-development/arpit/word-rnn-tensorflow-master/save

This is what I got in the log file.

Answer 1

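You need to modify train.py so that it accepts the "--job-dir" command-line argument.

When you specify --job-dir in gcloud, the service passes it on to your program as an argument, so your argparser (or tf.flags, depending on which one you are using) has to be updated to accept it.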
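A minimal sketch of why the job fails and what the fix looks like, assuming the trainer uses argparse. The argument list and bucket paths below are hypothetical; I don't know the exact order in which the service appends --job-dir.

```python
import argparse


def build_parser(accept_job_dir):
    """Build a parser like the trainer's, optionally accepting --job-dir."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir")
    parser.add_argument("--save_dir")
    if accept_job_dir:
        # argparse stores this flag as args.job_dir (dash becomes underscore).
        parser.add_argument("--job-dir", default=None,
                            help="Job directory forwarded by the service")
    return parser


def try_parse(parser, argv):
    """Return the parsed namespace, or None if argparse rejects argv."""
    try:
        return parser.parse_args(argv)
    except SystemExit:
        # argparse exits on unrecognized arguments; on the service this
        # shows up as a failed job.
        return None


# Hypothetical arguments as the program might receive them.
argv = [
    "--data_dir", "gs://my-bucket/data",
    "--save_dir", "gs://my-bucket/save",
    "--job-dir", "gs://my-bucket/job",
]

without_flag = try_parse(build_parser(accept_job_dir=False), argv)
with_flag = try_parse(build_parser(accept_job_dir=True), argv)
```

Here the first parser rejects the arguments because it does not know --job-dir, while the second one parses them and exposes the value as `with_flag.job_dir`.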

Answer 2

~~Apart from adding --job-dir as an accepted argument, I think you should also move the flag after the --.~~

From the getting started guide:

Run the local train command using the --distributed option. Be sure to place the flag above the -- that separates the user arguments from the command-line arguments

~~In that case, --distributed is a command-line argument~~

Edit:

--job-dir is not a user argument, so placing it before the -- is correct.

Answer 3

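Finally, after submitting 77 jobs to Cloud ML, I was able to get the job running, and the problem was not the arguments passed when submitting the job. It was an IO error caused by the .npy files the model generates: they have to be written using file_io.FileIO and read back through a StringIO.

These IO errors are not mentioned anywhere. Anyone who gets an error saying "no such file or directory" should check for them.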
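A rough sketch of that approach, assuming NumPy is available: the array is serialized into an in-memory buffer and the raw bytes are moved through a file opener. The helper names (`save_npy`, `load_npy`, `gcs_open`) are made up for illustration. On Python 3 the in-memory buffer is `io.BytesIO` rather than StringIO, and whether `file_io.FileIO` accepts the binary modes `"wb"`/`"rb"` depends on the TensorFlow version, so treat that part as an assumption.

```python
import io

import numpy as np


def save_npy(path, array, open_fn):
    """Serialize array in .npy format and write it through open_fn."""
    buffer = io.BytesIO()
    np.save(buffer, array)  # write the .npy bytes into memory first
    with open_fn(path, "wb") as f:
        f.write(buffer.getvalue())


def load_npy(path, open_fn):
    """Read the raw bytes through open_fn and decode them from memory."""
    with open_fn(path, "rb") as f:
        buffer = io.BytesIO(f.read())
    return np.load(buffer)


def gcs_open(path, mode):
    """Opener for gs:// paths; requires TensorFlow to be installed."""
    # Imported lazily so the helpers above also work without TensorFlow.
    from tensorflow.python.lib.io import file_io
    return file_io.FileIO(path, mode)
```

Locally the built-in `open` can be passed as `open_fn`; on Cloud ML `gcs_open` would be passed instead so that paths under the bucket resolve.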

Answer 4

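I ran into the same problem. It seems that Google Cloud somehow passes --job-dir when it loads your own script (even if you put it before the -- in the gcloud command).

This is how I fixed it, following lines 153 and 183 of the official gcloud census example: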

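import argparse

parser = argparse.ArgumentParser()
# ... the script's own arguments are added here ...
parser.add_argument(
    '--job-dir',
    help='GCS location to write checkpoints and export models',
    required=True
)
args = parser.parse_args()
arguments = args.__dict__
# Remove job_dir so it is not forwarded to train_model
job_dir = arguments.pop('job_dir')

# train_model is the script's own training entry point
train_model(**arguments)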

Basically, just make your Python main program accept this --job-dir argument, even if you don't use it.
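If the flag really is unused, another option (not what the census example does) might be argparse's `parse_known_args`, which collects unrecognized arguments instead of failing. A small sketch with a hypothetical `parse_training_args` helper and made-up paths:

```python
import argparse


def parse_training_args(argv):
    """Parse the trainer's own flags and tolerate anything else."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", required=True)
    parser.add_argument("--save_dir", required=True)
    # parse_known_args returns (namespace, list_of_unrecognized_arguments)
    # instead of exiting when it meets a flag it does not know.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


# Hypothetical arguments, including the service-supplied --job-dir.
args, unknown = parse_training_args([
    "--job-dir", "gs://my-bucket/job",
    "--data_dir", "gs://my-bucket/data",
    "--save_dir", "gs://my-bucket/save",
])
```

The trade-off is that misspelled flags are silently ignored too, so declaring --job-dir explicitly is probably the safer default.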
