Google ML Engine: Unable to save model

I am submitting a training job through the REST API. The process trains fine, but when it reaches the saving step it errors out with The replica master 0 exited with a non-zero status of 1. I checked the IAM permissions of the service account, and it has the following permissions:

Here is a more detailed traceback of the actual error.

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/trainer/task.py", line 223, in <module>
    dispatch(**parse_args.__dict__)
  File "/root/.local/lib/python3.5/site-packages/trainer/task.py", line 133, in dispatch
    callbacks=callbacks)
  File "/root/.local/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.5/site-packages/keras/models.py", line 1110, in fit_generator
    initial_epoch=initial_epoch)
  File "/root/.local/lib/python3.5/site-packages/keras/legacy/interfaces.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/root/.local/lib/python3.5/site-packages/keras/engine/training.py", line 1849, in fit_generator
    callbacks.on_epoch_begin(epoch)
  File "/root/.local/lib/python3.5/site-packages/keras/callbacks.py", line 63, in on_epoch_begin
    callback.on_epoch_begin(epoch, logs)
  File "/root/.local/lib/python3.5/site-packages/trainer/task.py", line 74, in on_epoch_begin
    copy_file_to_gcs(self.job_dir, checkpoints[-1])
  File "/root/.local/lib/python3.5/site-packages/trainer/task.py", line 150, in copy_file_to_gcs
    output_f.write(input_f.read())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 126, in read
    pywrap_tensorflow.ReadFromStream(self._read_buf, length, status))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/file_io.py", line 94, in _prepare_value
    return compat.as_str_any(val)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/compat.py", line 106, in as_str_any
    return as_str(value)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/compat.py", line 84, in as_text
    return bytes_or_text.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

I am not entirely sure why this happens. The code was taken from the sample project on Google's git page, and nothing was changed. Here is my REST call:

{
    "jobId": "training_20",
    "trainingInput": {
        "scaleTier": "BASIC",
        "packageUris": ["gs://MY_BUCKET/census.tar.gz"],
        "pythonModule": "trainer.task",
        "args": [
          "--train-files", 
          "gs://MY_BUCKET/adult.data.csv", 
          "--eval-files", 
          "gs://MY_BUCKET/adult.test.csv", 
          "--job-dir", 
          "gs://MY_BUCKET/models", 
          "--train-steps",
          "100",
          "--eval-steps",
          "10"],
        "region": "europe-west1",
        "jobDir": "gs://MY_BUCKET/models",
        "runtimeVersion": "1.4",
        "pythonVersion": "3.5"
    }
}
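
For reference, one way to submit this payload from Python is through the projects.jobs.create method of the ML Engine v1 API using the google-api-python-client library. This is only a sketch: MY_PROJECT and the job_spec.json file are placeholders, and it assumes application default credentials are configured.

import json
from googleapiclient import discovery

# Load the job specification shown above (stored here in a hypothetical
# job_spec.json) and pass it to the Cloud ML Engine v1 jobs.create method.
with open("job_spec.json") as f:
    job_spec = json.load(f)

ml = discovery.build("ml", "v1")
request = ml.projects().jobs().create(parent="projects/MY_PROJECT", body=job_spec)
print(request.execute())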

Here is the part of the code that saves the model:

  # Unhappy hack to work around h5py not being able to write to GCS.
  # Force snapshots and saves to local filesystem, then copy them over to GCS.
  if job_dir.startswith("gs://"):
    census_model.save(CENSUS_MODEL)
    copy_file_to_gcs(job_dir, CENSUS_MODEL)
  else:
    census_model.save(os.path.join(job_dir, CENSUS_MODEL))

  # Convert the Keras model to TensorFlow SavedModel
  model.to_savedmodel(census_model, os.path.join(job_dir, 'export'))

# h5py workaround: copy local models over to GCS if the job_dir is GCS.
def copy_file_to_gcs(job_dir, file_path):
  with file_io.FileIO(file_path, mode='r') as input_f:
    with file_io.FileIO(os.path.join(job_dir, file_path), mode='w+') as output_f:
      output_f.write(input_f.read())

After some further research, it appears the problem is in how Google's sample reads the file it is about to copy to GCS. Originally the mode is declared as r, as shown here... with file_io.FileIO(file_path, mode='r') as input_f:. Changing the mode to rb (binary) solved the problem.
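
For completeness, here is the helper with just that one change applied (the imports are the same ones the sample already uses):

import os
from tensorflow.python.lib.io import file_io

# Reading in binary mode returns raw bytes, so file_io no longer attempts the
# UTF-8 decode that was crashing above.
def copy_file_to_gcs(job_dir, file_path):
  with file_io.FileIO(file_path, mode='rb') as input_f:
    with file_io.FileIO(os.path.join(job_dir, file_path), mode='w+') as output_f:
      output_f.write(input_f.read())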

Because the mode was set to r, Python tried to convert the byte array to a Unicode string on the assumption that it was UTF-8. When it hit the byte 0x89 at position 0, which is not a valid UTF-8 start byte, it crashed. Alfe posted a more in-depth reply about this here:
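
Separately from that reply, here is a minimal reproduction of the decode failure, assuming the checkpoint being copied is an HDF5 file written by Keras' model.save (the HDF5 signature starts with the byte 0x89):

# 0x89 can never start a valid UTF-8 sequence, so decoding the raw HDF5
# header as text fails exactly like the traceback above.
hdf5_signature = b"\x89HDF\r\n\x1a\n"
try:
    hdf5_signature.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte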