Hyperparameter tuning with ML Engine: NaN error when running with parallel trials
In my hyperparameter tuning job on Google ML Engine, some training configurations produce a NaN loss, which causes an error. I would like to be able to ignore those trials and keep tuning with different parameters.
I am using a NanTensorHook with fail_on_nan_loss=False. It runs successfully on ML Engine when there are no parallel trials (maxParallelTrials: 1), but fails with multiple parallel trials (maxParallelTrials: 3).
Has anyone run into this error? Any ideas how to fix it?
Here is my config file:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard
  workerType: standard
  parameterServerType: standard
  workerCount: 4
  parameterServerCount: 1
  hyperparameters:
    goal: MAXIMIZE
    maxTrials: 5
    maxParallelTrials: 3
    enableTrialEarlyStopping: False
    hyperparameterMetricTag: auc
    params:
      - parameterName: learning_rate
        type: DOUBLE
        minValue: 0.0001
        maxValue: 0.01
        scaleType: UNIT_LOG_SCALE
      - parameterName: optimizer
        type: CATEGORICAL
        categoricalValues:
          - Adam
          - Adagrad
          - Momentum
          - SGD
      - parameterName: batch_size
        type: DISCRETE
        discreteValues:
          - 128
          - 256
          - 512
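For context: ML Engine passes each tuned value to the trainer as a command-line flag named after its parameterName. A minimal, illustrative sketch of that argument parsing; the parser and defaults are placeholders, not my actual trainer code:

import argparse

# ML Engine supplies each tuned hyperparameter as a command-line flag whose
# name matches the parameterName entries in the config above.
parser = argparse.ArgumentParser()
parser.add_argument('--learning_rate', type=float, default=0.001)
parser.add_argument('--optimizer', type=str, default='Adam')
parser.add_argument('--batch_size', type=int, default=256)
args, _ = parser.parse_known_args()
# args.learning_rate, args.optimizer and args.batch_size are then fed into
# the model_fn shown below.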
This is how I set up the NanTensorHook:
hook = tf.train.NanTensorHook(loss, fail_on_nan_loss=False)
train_op = tf.contrib.layers.optimize_loss(
    loss=loss, global_step=tf.train.get_global_step(),
    learning_rate=lr, optimizer=optimizer)
model_fn = tf.estimator.EstimatorSpec(
    mode=mode, loss=loss, eval_metric_ops=eval_metric_ops,
    train_op=train_op, training_hooks=[hook])
The error message I get is:
Hyperparameter Tuning Trial #4 Failed before any other successful trials were completed.
The failed trial had parameters: optimizer=SGD, batch_size=128, learning_rate=0.00075073617775056709, .
The trial's error message was:

The replica worker 1 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last): [...]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 421, in train_and_evaluate
    executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 522, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 532, in run_worker
    return self._start_distributed_training()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 715, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 891, in _train_model
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1178, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 617, in after_run
    raise NanLossDuringTrainingError
NanLossDuringTrainingError: NaN loss during training.

The replica worker 3 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last): [...]
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 421, in train_and_evaluate
    executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 522, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 532, in run_worker
    return self._start_distributed_training()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/training.py", line 715, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 352, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 891, in _train_model
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1178, in run
    run_metadata=run_metadata))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 617, in after_run
    raise NanLossDuringTrainingError
NanLossDuringTrainingError: NaN loss during training.
Thanks in advance!
The different trials of a hyperparameter tuning job are isolated from one another while they run, so a hook added in one trial is not affected by hooks in other trials.
I suspect the problem is caused by the specific combination of hyperparameters in that trial. To confirm, I'd suggest running a regular (non-tuning) training job with the hyperparameter values of the failed trial and seeing whether the error occurs again.
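For example, a minimal sketch of such a reproduction run, assuming the model_fn and input functions you already submit to ML Engine (model_fn, train_input_fn, eval_input_fn and the model_dir below are placeholders):

import tensorflow as tf

# Hyperparameters copied verbatim from the failed trial (#4).
params = {
    'learning_rate': 0.00075073617775056709,
    'optimizer': 'SGD',
    'batch_size': 128,
}

# model_fn, train_input_fn and eval_input_fn stand for the functions from
# your existing trainer package.
estimator = tf.estimator.Estimator(
    model_fn=model_fn, params=params, model_dir='gs://my-bucket/nan-repro')

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)

# If the loss also becomes NaN here, the parameter combination itself is the
# cause rather than the parallel-trial setup.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)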
Please send the project number and job ID to cloudml-feedback@google.com so that we can investigate further.