Faster RCNN model training stops on GCP but runs locally without issue

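I am trying to run a program based on the TensorFlow Object Detection API. The Faster RCNN model stops training on GCP, but it runs locally without issue. Any feedback is appreciated. As suggested in other posts, I have already tried granting the Logs Writer role to the service agent. I have not been able to find any other suggestions.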

Full error message:

The replica master 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 194, in main
    worker_job_name, is_chief, FLAGS.train_dir)
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 296, in train
    saver=saver)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 763, in train
    sess, train_op, global_step, train_step_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 487, in train_step
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
UnavailableError: Endpoint read failed

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=1086278442266&resource=ml_job%2Fjob_id%2Fuav_object_detection_1543356760&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22uav_object_detection_1543356760%22

This is what I run in the terminal to start training:

gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
   --job-dir=gs://my_gcs_bucket/train \
   --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
   --module-name object_detection.train \
   --region us-central1 \
   --config object_detection/samples/cloud/cloud.yml \
   --runtime-version=1.4 \
   -- \
   --train_dir=gs://my_gcs_bucket/train \
   --pipeline_config_path=gs://my_gcs_bucket/data/faster_rcnn_resnet101.config

This is my file structure in the GCP bucket:

+ data/
  - faster_rcnn_resnet101.config
  - model.ckpt.index
  - model.ckpt.meta
  - model.ckpt.data-00000-of-00001
  - pet_label_map.pbtxt
  - train.record
  - val.record
+ train/

This is the file structure of the folder I am running from:

+ dist/
  - object_detection-0.1.tar.gz
+ object_detection/
+ object_detection.egg-info/
+ slim/
setup.py

Config file:

# Faster R-CNN with Resnet-101 (v1) configured for the Oxford-IIIT Pet Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  faster_rcnn {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}

train_config: {
  batch_size: 1
  batch_queue_capacity: 1
  num_batch_queue_threads: 1
  prefetch_queue_capacity: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "gs://my_gcs_bucket/data/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps:2000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://my_gcs_bucket/data/data/train.record"
  }
  label_map_path: "gs://my_gcs_bucket/data/data/label_map.pbtxt"
  queue_capacity: 10
  min_after_dequeue: 5
}

eval_config: {
  num_examples: 4
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://my_gcs_bucket/data/data/val.record"
  }
  label_map_path: "gs://my_gcs_bucket/data/data/label_map.pbtxt"
  shuffle: false
  num_readers: 1
}

Change the runtime version to 1.2, both in cloud.yml and in the initial request (the --runtime-version flag of the submit command).
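
As a rough sketch of what that change might look like on the command line, the submission from the question can be re-issued with only the --runtime-version value altered. The CMD array and the DRY_RUN switch are illustrative additions, not part of the original command. The matching edit in cloud.yml is assumed to be the runtimeVersion field under trainingInput, if that file sets one.

```shell
#!/usr/bin/env bash
# Sketch: same submission as in the question, with only the runtime version changed.
# RUNTIME_VERSION, CMD and DRY_RUN are hypothetical helper names used for illustration.
RUNTIME_VERSION="1.2"   # the question used 1.4
JOB_NAME="$(whoami)_object_detection_$(date +%s)"

CMD=(gcloud ml-engine jobs submit training "${JOB_NAME}"
  --job-dir=gs://my_gcs_bucket/train
  --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz
  --module-name object_detection.train
  --region us-central1
  --config object_detection/samples/cloud/cloud.yml
  --runtime-version="${RUNTIME_VERSION}"
  --
  --train_dir=gs://my_gcs_bucket/train
  --pipeline_config_path=gs://my_gcs_bucket/data/faster_rcnn_resnet101.config)

# By default only print the command for review; set DRY_RUN=0 to actually submit.
if [ "${DRY_RUN:-1}" = "1" ]; then
  printf '%s ' "${CMD[@]}"
  echo
else
  "${CMD[@]}"
fi
```

Printing the command first makes it easy to confirm that the flag and cloud.yml agree on the version before a job is actually submitted.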