Google Cloud Notebook VMs - High-Memory Machine, Kernel Crashing on Memory Error
I am running one of GCP's Cloud Notebook VMs. The VM has 200+ GB of RAM, and I am trying to download roughly 70 GB of data from BigQuery into memory using the BigQuery Storage API.
Once it gets to around 50 GB, the kernel crashes.
Tailing the logs with sudo tail -20 /var/log/syslog, this is what I found:
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.550367] Task in /system.slice/jupyter.service killed as a result of limit of /system.slice/jupyter.service
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.563843] memory: usage 53350876kB, limit 53350964kB, failcnt 1708893
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.570582] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.578694] kmem: usage 110900kB, limit 9007199254740988kB, failcnt 0
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.585267] Memory cgroup stats for /system.slice/jupyter.service: cache:752KB rss:53239292KB rss_huge:0KB mapped_file:60KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:53239292KB inactive_file:400KB active_file:248KB unevictable:0KB
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.612963] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.621645] [ 787] 1003 787 99396 17005 63 3 0 0 jupyter-lab
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.632295] [ 2290] 1003 2290 4996 966 14 3 0 0 bash
Dec 2 13:35:57 pytorch-20200908-152245 kernel: [60783.642309] [13143] 1003 13143 1272679 26639 156 6 0 0 python
Dec 2 13:35:58 pytorch-20200908-152245 kernel: [60783.652528] [ 5833] 1003 5833 16000467 13268794 26214 61 0 0 python
Dec 2 13:35:58 pytorch-20200908-152245 kernel: [60783.661384] [ 6813] 1003 6813 4996 936 14 3 0 0 bash
Dec 2 13:35:58 pytorch-20200908-152245 kernel: [60783.670033] Memory cgroup out of memory: Kill process 5833 (python) score 996 or sacrifice child
Dec 2 13:35:58 pytorch-20200908-152245 kernel: [60783.680823] Killed process 5833 (python) total-vm:64001868kB, anon-rss:53072876kB, file-rss:4632kB, shmem-rss:0kB
Dec 2 13:38:07 pytorch-20200908-152245 sync_gcs_service.sh[806]: GCS bucket is not specified in GCE metadata, skip GCS sync
Dec 2 13:39:03 pytorch-20200908-152245 bash[787]: [I 13:39:03.463 LabApp] Saving file at /outlog.txt
I followed this guide to allocate 100 GB of RAM, but it still crashes at around 55 GB; for example, 53350964kB is the limit shown in the log above.
How can I make use of the machine's available memory? Thanks!
The fix that worked - change this setting:
/sys/fs/cgroup/memory/system.slice/jupyter.service/memory.limit_in_bytes
to a larger number.
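Concretely, on this VM that amounts to something like the following (a sketch, assuming the cgroup v1 layout from the path above and root access; the 150 GiB value is only an illustration, pick anything below the VM's physical RAM):

# Current cap on the Jupyter cgroup (matches the 53350964kB limit in the log)
cat /sys/fs/cgroup/memory/system.slice/jupyter.service/memory.limit_in_bytes

# Raise the cap; a plain "sudo echo ... >" will not work because the redirection
# runs as your own user, so pipe through tee instead
echo $((150 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/memory/system.slice/jupyter.service/memory.limit_in_bytes

# Verify the new limit
cat /sys/fs/cgroup/memory/system.slice/jupyter.service/memory.limit_in_bytes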
From what I can see here, "Cgroup out of memory" means that the instance itself has enough memory; it is a process inside a cgroup that is being killed. This can happen with containerized workloads, where a Docker container imposes the lower limit and causes the problem.
a) Identify the cgroup
systemd-cgtop
b) Check the cgroup's limit
cat /sys/fs/cgroup/memory/[CGROUP_NAME]/memory.limit_in_bytes
c) Adjust the limit: for a Pod, edit the Pod's configuration file; for a Docker container, adjust the container's memory limit. To update the limit on the original cgroup directly (see the note on persistence below):
echo [NUMBER_OF_BYTES] > /sys/fs/cgroup/memory/[CGROUP_NAME]/memory.limit_in_bytes
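One caveat: a value written into memory.limit_in_bytes does not survive a service restart or a reboot. Since the log shows Jupyter running under the systemd unit jupyter.service, a more durable option is to set the limit through systemd instead (a sketch; MemoryLimit= is the cgroup v1 property used here, MemoryMax= is its cgroup v2 equivalent, and "infinity" simply removes the cap):

# Persistent across restarts; systemd writes a drop-in for the unit
sudo systemctl set-property jupyter.service MemoryLimit=infinity

# Or create a drop-in override by hand, then restart the service
sudo systemctl edit jupyter.service
#   [Service]
#   MemoryLimit=infinity
sudo systemctl restart jupyter.service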