Google Cloud Notebook VMs - High Memory Machine, Kernel Crashing on Memory Error

I'm working with GCP's Cloud Notebook VMs. I have a VM with 200+ GB of RAM running, and I'm trying to download about 70 GB of data from BigQuery into memory using the BigQuery Storage API.

Once it gets to around 50 GB, the kernel crashes --

Tailing the logs with sudo tail -20 /var/log/syslog, here's what I found:

Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.550367] Task in /system.slice/jupyter.service killed as a result of limit of /system.slice/jupyter.service
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.563843] memory: usage 53350876kB, limit 53350964kB, failcnt 1708893
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.570582] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.578694] kmem: usage 110900kB, limit 9007199254740988kB, failcnt 0
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.585267] Memory cgroup stats for /system.slice/jupyter.service: cache:752KB rss:53239292KB rss_huge:0KB mapped_file:60KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:53239292KB inactive_file:400KB active_file:248KB unevictable:0KB
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.612963] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.621645] [  787]  1003   787    99396    17005      63       3        0             0 jupyter-lab
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.632295] [ 2290]  1003  2290     4996      966      14       3        0             0 bash
Dec  2 13:35:57 pytorch-20200908-152245 kernel: [60783.642309] [13143]  1003 13143  1272679    26639     156       6        0             0 python
Dec  2 13:35:58 pytorch-20200908-152245 kernel: [60783.652528] [ 5833]  1003  5833 16000467 13268794   26214      61        0             0 python
Dec  2 13:35:58 pytorch-20200908-152245 kernel: [60783.661384] [ 6813]  1003  6813     4996      936      14       3        0             0 bash
Dec  2 13:35:58 pytorch-20200908-152245 kernel: [60783.670033] Memory cgroup out of memory: Kill process 5833 (python) score 996 or sacrifice child
Dec  2 13:35:58 pytorch-20200908-152245 kernel: [60783.680823] Killed process 5833 (python) total-vm:64001868kB, anon-rss:53072876kB, file-rss:4632kB, shmem-rss:0kB
Dec  2 13:38:07 pytorch-20200908-152245 sync_gcs_service.sh[806]: GCS bucket is not specified in GCE metadata, skip GCS sync
Dec  2 13:39:03 pytorch-20200908-152245 bash[787]: [I 13:39:03.463 LabApp] Saving file at /outlog.txt

I followed this guide to allocate 100 GB of RAM, but it still crashes at around 55 GB. For example, 53350964kB is the limit shown in the log above.
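
For reference, the mismatch can be seen directly by comparing the VM's total memory against the limit on the Jupyter cgroup; a minimal check, assuming the same cgroup v1 path that appears in the log (free and awk are standard tools, nothing VM-specific):

# Total memory the VM actually has, in bytes
free -b | awk '/^Mem:/ {print $2}'

# Limit applied to the Jupyter service's cgroup (53350964kB in the log, i.e. ~50.9 GiB)
cat /sys/fs/cgroup/memory/system.slice/jupyter.service/memory.limit_in_bytes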

How do I actually make use of the machine's available memory? Thanks!

What ended up working - changing this configuration setting:

/sys/fs/cgroup/memory/system.slice/jupyter.service/memory.limit_in_bytes to a larger number.
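
A minimal sketch of that change, assuming a target of roughly 100 GiB (pick whatever byte count fits your VM); note that a plain sudo echo ... > file fails because the redirection runs in the unprivileged shell, hence the tee:

# Write 100 GiB, in bytes, to the jupyter.service cgroup's memory limit as root
echo $((100 * 1024 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/memory/system.slice/jupyter.service/memory.limit_in_bytes

The new limit takes effect immediately for the running notebook kernel, but it is not persistent across a reboot.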

Seeing "Cgroup out of memory" here means the instance itself has enough memory and the process being killed belongs to a cgroup. This can happen with containerized workloads, since Docker containers, for example, can cause this issue.
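
In this particular log the killed python process sits under /system.slice/jupyter.service, i.e. a plain systemd service rather than a container; two quick checks to confirm which case applies (standard systemctl and docker commands, no VM-specific assumptions):

# If Jupyter runs as a systemd service, this shows its unit, cgroup and memory usage
systemctl status jupyter.service

# If the workload is containerized instead, list the running containers
sudo docker ps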

a) Identify the cgroup

systemd-cgtop
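
As a complementary check, the cgroup of any running process (for example the notebook's python kernel) can be read straight from /proc; [PID] is a placeholder, mirroring the bracket style used below:

# On cgroup v1 this prints one line per controller, e.g. ...:memory:/system.slice/jupyter.service
cat /proc/[PID]/cgroup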

b) Check the cgroup's limit

cat /sys/fs/cgroup/memory/[CGROUP_NAME]/memory.limit_in_bytes
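
The value comes back as raw bytes, which is awkward to compare with the kB figures in the syslog; a small sketch that prints the same limit in GiB, using the jupyter.service cgroup from the log as the example:

# Prints the current limit rounded down to whole GiB (~50 for the limit in the log above)
echo $(( $(cat /sys/fs/cgroup/memory/system.slice/jupyter.service/memory.limit_in_bytes) / 1024 / 1024 / 1024 )) GiB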

c) Adjust the limit: for a Pod, edit the Pod's configuration file; for a Docker container, change the container's memory limit. To update the limit on the raw cgroup directly:

echo [NUMBER_OF_BYTES] > /sys/fs/cgroup/memory/[CGROUP_NAME]/memory.limit_in_bytes
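
The echo above takes effect immediately but is lost when the cgroup is recreated (for example after a reboot or a service restart). One more durable option, assuming the cap is being applied by the jupyter.service unit itself (which may not be the case on every image), is a systemd drop-in; on cgroup v2 the directive would be MemoryMax= instead of MemoryLimit=:

sudo systemctl edit jupyter.service
# In the editor that opens, add:
#   [Service]
#   MemoryLimit=infinity
sudo systemctl restart jupyter.service   # note: restarting the unit also restarts the notebook kernels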