Getting worse results on gcloud vs local training
I want to understand why my local results are better than my gcloud results.
Locally, I run the job like this:
gcloud ml-engine local train --module-name trainer.task --package-path trainer -- --vocabulary-file trainer/data/vocab.txt --class-files $CLASS_FILES --job-dir trainer/lr0001 --num-epochs 5000 --learning-rate 0.0001 --train-batch-size 4 --eval-batch-size 64 --export-format CSV
On gcloud, I run:
gcloud ml-engine jobs submit training $JOBNAME --job-dir gs://.../lr0001 --module-name trainer.task --package-path trainer --region us-west1 --runtime-version 1.10 -- --vocabulary-file gs://.../vocab.txt --class-files $GS_CLASS_FILES --num-epochs 5000 --learning-rate 0.0001 --train-batch-size 4 --eval-batch-size 64 --export-format CSV
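I have fixed the seed, run it multiple times, and checked Python 2 vs. Python 3, but the gcloud results are still worse than my local runs.
The last clue I have found is that the local logs look like this: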
INFO:tensorflow:loss = 0.63639945, step = 401 (0.170 sec)
INFO:tensorflow:global_step/sec: 485.821
INFO:tensorflow:loss = 0.61793035, step = 501 (0.206 sec)
INFO:tensorflow:global_step/sec: 490.795
INFO:tensorflow:loss = 0.5869169, step = 601 (0.204 sec)
INFO:tensorflow:global_step/sec: 619.825
INFO:tensorflow:loss = 0.5738391, step = 701 (0.161 sec)
INFO:tensorflow:global_step/sec: 605.698
INFO:tensorflow:loss = 0.51589084, step = 801 (0.165 sec)
whereas the gcloud logs look as if they are doubled up or something:
I master-replica-0 loss = 0.40115586, step = 2202 (0.367 sec) master-replica-0
I master-replica-0 global_step/sec: 555.434 master-replica-0
I master-replica-0 global_step/sec: 498.601 master-replica-0
I master-replica-0 loss = 0.4367655, step = 2402 (0.470 sec) master-replica-0
I master-replica-0 global_step/sec: 366.906 master-replica-0
I master-replica-0 global_step/sec: 408.556 master-replica-0
I master-replica-0 loss = 0.41198668, step = 2602 (0.492 sec) master-replica-0
I master-replica-0 global_step/sec: 388.73 master-replica-0
I master-replica-0 global_step/sec: 380.982 master-replica-0
I master-replica-0 loss = 0.35386887, step = 2802 (0.522 sec) master-replica-0
I master-replica-0 global_step/sec: 401.002 master-replica-0
I master-replica-0 global_step/sec: 465.647 master-replica-0
I master-replica-0 loss = 0.4420835, step = 3002 (0.417 sec) master-replica-0
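To make the "doubled" pattern concrete, here is a minimal sketch that measures how far the step counter advances between consecutive loss lines. It is not part of the trainer; `step_deltas` is a name made up for illustration, and the sample lines are copied from the logs above.

```python
import re

# Matches the "loss = ..., step = N" part of a log line in either format.
STEP_RE = re.compile(r"loss = ([0-9.]+), step = (\d+)")


def step_deltas(log_lines):
    """Return the step increments between consecutive loss log lines."""
    steps = [int(m.group(2)) for m in map(STEP_RE.search, log_lines) if m]
    return [later - earlier for earlier, later in zip(steps, steps[1:])]


local_log = [
    "INFO:tensorflow:loss = 0.63639945, step = 401 (0.170 sec)",
    "INFO:tensorflow:global_step/sec: 485.821",
    "INFO:tensorflow:loss = 0.61793035, step = 501 (0.206 sec)",
]
cloud_log = [
    "I master-replica-0 loss = 0.40115586, step = 2202 (0.367 sec) master-replica-0",
    "I master-replica-0 global_step/sec: 555.434 master-replica-0",
    "I master-replica-0 global_step/sec: 498.601 master-replica-0",
    "I master-replica-0 loss = 0.4367655, step = 2402 (0.470 sec) master-replica-0",
]

print(step_deltas(local_log))  # [100]
print(step_deltas(cloud_log))  # [200]
```

On these excerpts the local run advances 100 steps per loss line, while the cloud run advances 200 and prints two global_step/sec lines in between.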
Any pointers would be greatly appreciated!
So far, the only things I have found on the internet are these unanswered SO questions:
Results of training a Keras model different on Google Cloud
Differents outputs from predictions using Tensorflow from same data?
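Based on the logs, my guess is that the cloud job is picking up an existing checkpoint and resuming training (see the step numbers). Could you try the following test: run locally from scratch, making sure the output model folder does not exist, then repeat the same setup on the cloud and compare.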
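One way to run that check before each local run is sketched below. It assumes the job dir is a local path and relies on the files TensorFlow normally writes next to a model (a `checkpoint` index file and `model.ckpt-*` files). The helper name `has_checkpoint` and the temporary directories are hypothetical, used only for the demo.

```python
import glob
import os
import tempfile


def has_checkpoint(job_dir):
    """Return True if a local job dir already holds TensorFlow checkpoint files."""
    index_file = os.path.join(job_dir, "checkpoint")
    shards = glob.glob(os.path.join(job_dir, "model.ckpt-*"))
    return os.path.exists(index_file) or bool(shards)


# Demo on throwaway directories rather than a real job dir.
fresh_dir = tempfile.mkdtemp()
used_dir = tempfile.mkdtemp()
open(os.path.join(used_dir, "model.ckpt-2202.index"), "w").close()

print(has_checkpoint(fresh_dir))
print(has_checkpoint(used_dir))
```

For a `gs://` job dir this stdlib check does not apply. Inside the trainer, `tf.train.latest_checkpoint(job_dir)` should give the same information, since it returns `None` when no checkpoint is found.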
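After some digging, I think this is caused by a buggy interaction between the Keras and Estimator APIs in TensorFlow 1.10 (the current gcloud version) that is not present in >=1.11 (which I use locally).
I filed a bug report here: https://github.com/tensorflow/tensorflow/issues/24299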
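If that hypothesis holds, a trainer could warn when it runs on an older TensorFlow. The sketch below is only an illustration: `predates_fix` is a made-up helper, and the 1.11 threshold is my guess from the behaviour described here, not a confirmed fix version.

```python
def predates_fix(tf_version, first_good=(1, 11)):
    """Return True if a TensorFlow version string is older than first_good.

    first_good defaults to (1, 11), which is an assumption based on the
    observation above, not a documented fix version.
    """
    major, minor = (int(part) for part in tf_version.split(".")[:2])
    return (major, minor) < first_good


print(predates_fix("1.10.0"))  # True
print(predates_fix("1.11.0"))  # False
```

In a real trainer the string would come from `tf.__version__`.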