Google Cloud machine learning out of memory
I am running into an out-of-memory problem with the following configuration (config.yaml):
trainingInput:
  scaleTier: CUSTOM
  masterType: large_model
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 10
  parameterServerCount: 10
I am following Google's "criteo_tft" tutorial: https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/criteo_tft/config-large.yaml
That link says they were able to train on 1 TB of data, so I was keen to give it a try!
My dataset is categorical, so one-hot encoding produces a fairly large matrix (a 2-D numpy array of size 520,000 x 4,000). I can train this dataset on my local machine with 32 GB of RAM, but I cannot do the same in the cloud.
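For a rough sense of scale, a dense matrix of that size is already well over 15 GB before the framework makes any copies. A back-of-the-envelope check (the dtypes below are assumptions; the trainer's actual dtype may differ):

import numpy as np

# Back-of-the-envelope memory for a dense 520,000 x 4,000 matrix.
# The dtypes are assumptions; the trainer's real dtype may differ.
rows, cols = 520000, 4000
for dtype in (np.float64, np.float32):
    gib = rows * cols * np.dtype(dtype).itemsize / 1024.0 ** 3
    print("%s: ~%.1f GiB" % (np.dtype(dtype).name, gib))
# float64: ~15.5 GiB, float32: ~7.7 GiB -- and that is before the input
# pipeline or TensorFlow makes any additional copies.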
Here is the error:
ERROR 2017-12-18 12:57:37 +1100 worker-replica-1 Using TensorFlow
backend.
ERROR 2017-12-18 12:57:37 +1100 worker-replica-4 Using TensorFlow
backend.
INFO 2017-12-18 12:57:37 +1100 worker-replica-0 Running command:
python -m trainer.task --train-file gs://my_bucket/my_training_file.csv --
job-dir gs://my_bucket/my_bucket_20171218_125645
ERROR 2017-12-18 12:57:38 +1100 worker-replica-2 Using TensorFlow
backend.
ERROR 2017-12-18 12:57:40 +1100 worker-replica-0 Using TensorFlow
backend.
ERROR 2017-12-18 12:57:53 +1100 worker-replica-3 Command
'['python', '-m', u'trainer.task', u'--train-file',
u'gs://my_bucket/my_training_file.csv', '--job-dir',
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:57:53 +1100 worker-replica-3 Module
completed; cleaning up.
INFO 2017-12-18 12:57:53 +1100 worker-replica-3 Clean up
finished.
ERROR 2017-12-18 12:57:56 +1100 worker-replica-4 Command
'['python', '-m', u'trainer.task', u'--train-file',
u'gs://my_bucket/my_training_file.csv', '--job-dir',
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:57:56 +1100 worker-replica-4 Module
completed; cleaning up.
INFO 2017-12-18 12:57:56 +1100 worker-replica-4 Clean up
finished.
ERROR 2017-12-18 12:57:58 +1100 worker-replica-2 Command
'['python', '-m', u'trainer.task', u'--train-file',
u'gs://my_bucket/my_training_file.csv', '--job-dir',
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:57:58 +1100 worker-replica-2 Module
completed; cleaning up.
INFO 2017-12-18 12:57:58 +1100 worker-replica-2 Clean up
finished.
ERROR 2017-12-18 12:57:59 +1100 worker-replica-1 Command
'['python', '-m', u'trainer.task', u'--train-file',
u'gs://my_bucket/my_training_file.csv', '--job-dir',
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:57:59 +1100 worker-replica-1 Module
completed; cleaning up.
INFO 2017-12-18 12:57:59 +1100 worker-replica-1 Clean up finished.
ERROR 2017-12-18 12:58:01 +1100 worker-replica-0 Command
'['python', '-m', u'trainer.task', u'--train-file',
u'gs://my_bucket/my_training_file.csv', '--job-dir',
u'gs://my_bucket/my_bucket_20171218_125645']' returned non-zero exit status -9
INFO 2017-12-18 12:58:01 +1100 worker-replica-0 Module
completed; cleaning up.
INFO 2017-12-18 12:58:01 +1100 worker-replica-0 Clean up finished.
ERROR 2017-12-18 12:58:43 +1100 service The replica worker 0 ran
out-of-memory and exited with a non-zero status of 247. The replica worker 1
ran out-of-memory and exited with a non-zero status of 247. The replica
worker 2 ran out-of-memory and exited with a non-zero status of 247. The
replica worker 3 ran out-of-memory and exited with a non-zero status of 247.
The replica worker 4 ran out-of-memory and exited with a non-zero status of
247. To find out more about why your job exited please check the logs:
https://console.cloud.google.com/logs/viewer?project=a_project_id........(link to my cloud log)
INFO 2017-12-18 12:58:44 +1100 ps-replica-0 Signal 15 (SIGTERM)
was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-1 Signal 15 (SIGTERM)
was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-0 Module completed;
cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-0 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-1 Module completed;
cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-1 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-2 Signal 15
(SIGTERM) was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-2 Module completed;
cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-2 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-3 Signal 15 (SIGTERM)
was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-5 Signal 15 (SIGTERM)
was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-3 Module completed;
cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-3 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-5 Module completed;
cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-5 Clean up finished.
INFO 2017-12-18 12:58:44 +1100 ps-replica-4 Signal 15 (SIGTERM)
was caught. Terminated by service. This is normal behavior.
INFO 2017-12-18 12:58:44 +1100 ps-replica-4 Module completed;
cleaning up.
INFO 2017-12-18 12:58:44 +1100 ps-replica-4 Clean up finished.
INFO 2017-12-18 12:59:28 +1100 service Finished tearing down
TensorFlow.
INFO 2017-12-18 13:00:17 +1100 service Job failed.
Please ignore the "Using TensorFlow backend." errors; those lines also appear when training jobs on other, smaller datasets succeed.
Can anyone explain what causes the out-of-memory failure (exit status 247), and how I should write my config.yaml so that I can avoid this problem and train my data in the cloud?
I have solved this problem. I needed to do a couple of things:
Change the TensorFlow version and, in particular, how I submit the training job to the cloud (a sketch of the submit command is shown after this answer).
Switch to feature hashing instead of one-hot encoding, which creates a new column for every new item it encounters (see the hashing sketch below).
It can now train a categorical dataset with 2.5 million rows and 4,200 encoded columns.
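For reference, this is roughly the shape of the submit command I use now, pinning the runtime version explicitly. The job name, region, and runtime version below are placeholders rather than my real values; pick whatever runtime version matches your trainer code.

gcloud ml-engine jobs submit training my_job_20171218 \
    --module-name trainer.task \
    --package-path trainer/ \
    --job-dir gs://my_bucket/my_bucket_20171218_125645 \
    --region us-central1 \
    --config config.yaml \
    --runtime-version 1.4 \
    -- \
    --train-file gs://my_bucket/my_training_file.csv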
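And this is a minimal sketch of what I mean by feature hashing, using scikit-learn's FeatureHasher (the column values and the 2**12 bucket count are made up for illustration; tf.feature_column.categorical_column_with_hash_bucket is the equivalent idea on the TensorFlow side). The point is that the output width is fixed and the matrix stays sparse, so memory no longer grows with the number of distinct categories.

from sklearn.feature_extraction import FeatureHasher

# Hash each categorical value into a fixed number of buckets instead of
# giving every distinct value its own one-hot column.
hasher = FeatureHasher(n_features=2 ** 12, input_type='string')

rows = [
    ['color=red', 'country=AU', 'device=mobile'],
    ['color=blue', 'country=US', 'device=desktop'],
]
X = hasher.transform(rows)   # scipy.sparse CSR matrix, shape (2, 4096)
print(X.shape, X.nnz)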