How to determine at which point in a Python script the step memory was exceeded in SLURM

Question

I have a Python script that I run on a SLURM cluster for multiple input files:

#!/bin/bash

#SBATCH -p standard
#SBATCH -A overall 
#SBATCH --time=12:00:00
#SBATCH --output=normalize_%A.out
#SBATCH --error=normalize_%A.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem=240000

HDF5_DIR=...
OUTPUT_DIR=...
NORM_SCRIPT=...

norm_func () {
  local file="$1"
  echo "$file"
  python "$NORM_SCRIPT" -data "$file" -path "$OUTPUT_DIR"
}

# Doing normalization in parallel
for file in $HDF5_DIR/*; do norm_func "$file" & done
wait

The Python script just loads a dataset (scRNAseq), normalizes it, and saves the result as .csv files. Some of the main lines of code in it are:

        f = h5py.File(path_to_file, 'r')
        rawcounts = np.array(rawcounts)

        unique_code = np.unique(split_code)
        for code in unique_code:
            mask = np.equal(split_code, code)
            curr_counts = rawcounts[:,mask]

            # Actual TMM normalization
            mtx_norm = gmn.tmm_normalization(curr_counts)

            # Writing the results into .csv file
            csv_path = path_to_save + "/" + file_name + "_" + str(code) + ".csv"
            with open(csv_path,'w', encoding='utf8') as csvfile:
                writer = csv.writer(csvfile, delimiter=',')
                writer.writerow(["", cell_ids])
                for idx, row in enumerate(mtx_norm):
                    writer.writerow([gene_symbols[idx], row])

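For datasets above 10 GB I keep getting a "step memory exceeded" error, and I am not sure why. How can I change my .slurm script or my Python code to reduce its memory usage? How can I actually identify what causes the memory problem, and is there a particular way of debugging memory in this case? Any suggestions would be greatly appreciated.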

Answer 1

You can get more fine-grained information by launching your Python script with srun:

srun python $NORM_SCRIPT -data $file -path $OUTPUT_DIR

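Slurm will then create one "step" per instance of your Python script and record accounting information (errors, return codes, memory used, etc.) independently for each step. You can query that information with the sacct command.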
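Once each instance runs as its own step, the per-step accounting can also be compared from Python. The following is only a sketch under a few assumptions: sacct is on the PATH and accounting is enabled on the cluster, MaxRSS is reported with a K/M/G/T suffix using 1024-based multipliers, and the helper names (rss_to_bytes, parse_sacct, query_steps) are made up for illustration and are not part of Slurm.

```python
import subprocess

# Assumed 1024-based multipliers for the size suffix sacct appends to MaxRSS.
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def rss_to_bytes(value):
    """Convert a MaxRSS string such as '2048K' to bytes; empty fields become 0."""
    value = value.strip()
    if not value:
        return 0
    suffix = value[-1].upper()
    if suffix in _UNITS:
        return int(float(value[:-1]) * _UNITS[suffix])
    return int(float(value))


def parse_sacct(text):
    """Parse output of 'sacct --parsable2 --noheader --format=JobID,MaxRSS,State'."""
    steps = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # --parsable2 separates fields with '|' and adds no trailing delimiter.
        job_id, max_rss, state = line.split("|")[:3]
        steps.append(
            {"step": job_id, "max_rss": rss_to_bytes(max_rss), "state": state}
        )
    return steps


def query_steps(job_id):
    """Run sacct for one job and return one record per step (hypothetical helper)."""
    cmd = [
        "sacct", "-j", str(job_id),
        "--parsable2", "--noheader",
        "--format=JobID,MaxRSS,State",
    ]
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    return parse_sacct(out)
```

Calling query_steps with the job ID printed by sbatch returns one record per step, so the step that did not complete, or whose max_rss is largest, points to the input file responsible.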

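If the administrators have configured it, you can also use the --profile option (for example --profile=task) to get a timeline of the memory usage of each step.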

Inside your Python script, you can use the memory_profiler module to get feedback on the script's memory usage.
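As a standard-library alternative to memory_profiler (a different technique, named here so the two are not confused), the script can log its own peak resident memory at labelled checkpoints. The sketch below assumes a Unix-like node where the resource module is available; peak_rss_mb and log_peak are hypothetical helper names. Because the lines go to stderr, they end up in the job's --error file, and the last label written before the step is killed shows roughly where the limit was hit.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process so far, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak(label, stream=sys.stderr):
    """Write a labelled peak-memory line and return the value in MiB."""
    mb = peak_rss_mb()
    # flush=True so the line is not lost if the process is killed right after.
    print(f"[mem] {label}: peak RSS {mb:.1f} MiB", file=stream, flush=True)
    return mb
```

In the normalization script, calls such as log_peak("after loading rawcounts") and log_peak(f"after normalizing code {code}") placed after the load, after each tmm_normalization call, and after each CSV write would narrow down which stage drives the peak.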

Tags: python, memory, slurm