Large memory footprint of integers compared with result of sys.getsizeof()

Question

Python integer objects in the range [1, 2^30) need 28 bytes, as reported by sys.getsizeof() and as explained, for example, in this SO-post.
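A quick way to see these sizes is sketched below. It assumes a CPython build and reads the digit width from sys.int_info rather than hard-coding 30 bits; the variable names are only illustrative.

```python
import sys

bits = sys.int_info.bits_per_digit       # 30 on typical 64-bit CPython builds
digit_bytes = sys.int_info.sizeof_digit  # bytes used to store one "digit"

one_digit = sys.getsizeof(1)                   # 28 on the setup in the question
two_digits = sys.getsizeof(1 << bits)          # first value needing a second digit
three_digits = sys.getsizeof(1 << (2 * bits))  # first value needing a third digit

print(one_digit, two_digits, three_digits)
```

Each additional digit should add sizeof_digit bytes to the reported size.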

However, when I measure the memory footprint with the following script:

#int_list.py:
import sys

N=int(sys.argv[1])
lst=[0]*N            # no overallocation

for i in range(N):
    lst[i]=1000+i    # ints not from integer pool

via

/usr/bin/time -fpeak_used_memory:%M python3 int_list.py <N>

I get the following peak memory values (Linux-x64, Python 3.6.2):

   N     Peak memory in KB        bytes/integer
-------------------------------------------   
   1            9220              
   1e7        404712                40.50 
   2e7        800612                40.52 
   3e7       1196204                40.52
   4e7       1591948                40.52

So it looks as if every integer object needs 40.5 bytes, i.e. 12.5 bytes more than sys.getsizeof() reports.
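The bytes/integer column can be reproduced from the peak values. The sketch below assumes that the N=1 run is subtracted as the interpreter baseline and that %M reports units of 1024 bytes; both are assumptions, but they match the numbers in the table.

```python
# Peak memory in KB, copied from the table above.
baseline_kb = 9220  # the N=1 run, taken here as interpreter overhead (assumption)
peaks_kb = {
    10_000_000: 404712,
    20_000_000: 800612,
    30_000_000: 1196204,
    40_000_000: 1591948,
}

# Memory attributable to the list and its integers, divided by the element count.
bytes_per_int = {
    n: (peak - baseline_kb) * 1024 / n for n, peak in peaks_kb.items()
}

for n, b in bytes_per_int.items():
    print(f"{n:>10d}  {b:.2f}")
```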

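The additional 8 bytes are easy to explain: the list lst does not hold the integer objects themselves but references to them, which means an additional pointer, i.e. 8 bytes, is needed per element.

But what about the other 4.5 bytes, what are they used for?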

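The following causes can be ruled out:

The size of integer objects is variable, but 10^7 is smaller than 2^30, so all integers will be 28 bytes large.
There is no overallocation in the list lst, which can easily be checked with sys.getsizeof(lst): it yields 8 times the number of elements, plus a very small overhead.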
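The overallocation check mentioned above could look roughly like this. It is a sketch that assumes CPython, where a list created with [0]*N is sized exactly; the empty list serves as an estimate of the fixed overhead.

```python
import struct
import sys

N = 10**6
lst = [0] * N  # created with exactly N slots, no spare capacity

pointer_size = struct.calcsize("P")   # 8 on a 64-bit build
fixed_overhead = sys.getsizeof([])    # list header without any slots
per_slot = (sys.getsizeof(lst) - fixed_overhead) / N

print(pointer_size, fixed_overhead, per_slot)
```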

Answer 1

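An int object needs only 28 bytes, but Python uses 8-byte alignment: memory is allocated in blocks whose size is a multiple of 8 bytes. So the memory actually used by each int object is 32 bytes. For details, see this excellent article on Python memory management.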
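The rounding described here can be sketched as follows. The helper round_up is hypothetical and only illustrates the arithmetic; it is not a CPython API.

```python
ALIGNMENT = 8  # block granularity assumed in this answer


def round_up(nbytes, alignment=ALIGNMENT):
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


block_size = round_up(28)         # size of the block that holds a 28-byte int
per_list_entry = block_size + 8   # plus the 8-byte pointer stored in the list

print(block_size, per_list_entry)
```

This accounts for 40 of the measured 40.5 bytes per integer.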

I don't have an explanation for the remaining half byte yet, but I will update this answer if I find one.

Answer 2

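@Nathan's suggestion is, surprisingly, not the solution, because of some subtle details in CPython's longint implementation. With his explanation, the memory footprint of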

...
lst[i] = (1<<30)+i

should still be 40.52, because sys.getsizeof(1<<30) is 32, but the measurements show it is 48.56. On the other hand, for

...
lst[i] = (1<<60)+i

the footprint is still 48.56, even though sys.getsizeof(1<<60) is 36.

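The reason: sys.getsizeof() does not tell us the actual memory footprint of the result of a summation, i.e. a+b, which is

32 bytes for 1000+i
36 bytes for (1<<30)+i
40 bytes for (1<<60)+i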

This happens because when two integers are added in x_add, the resulting integer at first has one more "digit", i.e. 4 bytes, than the larger of a and b:

static PyLongObject *
x_add(PyLongObject *a, PyLongObject *b)
{
    Py_ssize_t size_a = Py_ABS(Py_SIZE(a)), size_b = Py_ABS(Py_SIZE(b));
    PyLongObject *z;
    ...
    /* Ensure a is the larger of the two: */
    ...
    z = _PyLong_New(size_a+1);  
    ...

After the addition the result is normalized:

 ...
 return long_normalize(z);

};

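i.e. possible leading zeros are discarded, but the memory is not released: 4 bytes are not worth it. The source of the function can be found here.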
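The effect of _PyLong_New(size_a+1) on the allocation size can be modelled in a few lines. This is only a sketch of the accounting described above, assuming a 24-byte object header and 4-byte digits holding 30 bits each, as on the 64-bit build in the question. All helper names are hypothetical, and the sketch does not check whether a particular addition actually goes through x_add.

```python
HEADER_BYTES = 24  # assumed: refcount + type pointer + size field on 64-bit
DIGIT_BYTES = 4    # assumed: one "digit" occupies 4 bytes
DIGIT_BITS = 30    # assumed: each digit carries 30 bits of the value


def ndigits(n):
    """Number of digits needed to store abs(n), at least one."""
    return max(1, -(-abs(n).bit_length() // DIGIT_BITS))


def exact_size(n):
    """Size for exactly the digits needed, i.e. what sys.getsizeof would show."""
    return HEADER_BYTES + DIGIT_BYTES * ndigits(n)


def size_allocated_by_x_add(a, b):
    """Size requested by _PyLong_New(size_a + 1), before normalization."""
    return HEADER_BYTES + DIGIT_BYTES * (max(ndigits(a), ndigits(b)) + 1)


for a in (1000, 1 << 30, 1 << 60):
    print(a, exact_size(a + 1), size_allocated_by_x_add(a, 1))
```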

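Now we can use @Nathan's insight to explain why the footprint of (1<<30)+i is 48.56 and not 44.xy: the py_malloc allocator in use hands out memory blocks aligned to 8 bytes, which means that 36 bytes are stored in a block of size 40, the same as the result of (1<<60)+i (keep the additional 8 bytes for the pointers in mind).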

To explain the remaining 0.5 bytes we need to dig deeper into the details of the py_malloc allocator. A good overview is the source code itself; my last attempt to describe it can be found in this .

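In a nutshell, the allocator manages memory in arenas of 256 KB each. When an arena is allocated, the memory is reserved but not committed. We count memory as "used" only once a so-called pool has been touched. A pool is 4 KB large (POOL_SIZE) and is used only for memory blocks of the same size, in our case 32 bytes. That means the resolution of peak_used_memory is 4 KB, which cannot be responsible for those 0.5 bytes.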

However, these pools must be managed, which leads to additional overhead: py_malloc needs a pool_header per pool:

/* Pool for small blocks. */
struct pool_header {
    union { block *_padding;
            uint count; } ref;          /* number of allocated blocks    */
    block *freeblock;                   /* pool's free list head         */
    struct pool_header *nextpool;       /* next pool of this size class  */
    struct pool_header *prevpool;       /* previous pool       ""        */
    uint arenaindex;                    /* index into arenas of base adr */
    uint szidx;                         /* block size class index        */
    uint nextoffset;                    /* bytes to virgin block         */
    uint maxnextoffset;                 /* largest valid nextoffset      */
};

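The size of this structure is 48 bytes (called POOL_OVERHEAD) on my Linux_64 machine. This pool_header is part of the pool (a quite clever way to avoid an additional allocation via the C runtime memory allocator) and takes the place of two 32-byte blocks, which means a pool has room for 126 32-byte integers: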

/* Return total number of blocks in pool of size index I, as a uint. */
#define NUMBLOCKS(I) ((uint)(POOL_SIZE - POOL_OVERHEAD) / INDEX2SIZE(I))
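The 48-byte figure can be cross-checked with a ctypes mirror of the struct quoted above. This is only a sketch: the Python class names are made up for illustration, block * and the pool pointers are modelled as generic pointers, and the layout is whatever the local C ABI gives.

```python
import ctypes


class Ref(ctypes.Union):
    # union { block *_padding; uint count; } ref;
    _fields_ = [("_padding", ctypes.c_void_p), ("count", ctypes.c_uint)]


class PoolHeader(ctypes.Structure):
    _fields_ = [
        ("ref", Ref),
        ("freeblock", ctypes.c_void_p),
        ("nextpool", ctypes.c_void_p),
        ("prevpool", ctypes.c_void_p),
        ("arenaindex", ctypes.c_uint),
        ("szidx", ctypes.c_uint),
        ("nextoffset", ctypes.c_uint),
        ("maxnextoffset", ctypes.c_uint),
    ]


header_size = ctypes.sizeof(PoolHeader)
print(header_size)  # four pointer-sized members plus four uints
```

With 8-byte pointers and 4-byte uints this comes to 4*8 + 4*4 = 48 bytes.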

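This leads to:

4 KB/126 = 32.51 bytes of footprint for 1000+i, plus an additional 8 bytes for the pointer.
(1<<30)+i needs 40 bytes, which means 4 KB has room for 102 blocks, of which one is used for the pool_header (16 bytes remain in the pool after it is divided into 40-byte blocks, and they can be used for the pool_header as well), which leads to 4 KB/101 = 40.55 bytes (plus the 8-byte pointer).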
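The two figures above follow directly from the NUMBLOCKS macro. A minimal sketch of the arithmetic, taking POOL_SIZE and POOL_OVERHEAD from the values quoted in this answer (the function names are illustrative):

```python
POOL_SIZE = 4096     # 4 KB pool, as stated above
POOL_OVERHEAD = 48   # size of pool_header on the author's 64-bit Linux machine
POINTER_BYTES = 8    # reference stored in the list for each integer


def numblocks(block_size):
    """Python version of NUMBLOCKS: usable blocks of block_size per pool."""
    return (POOL_SIZE - POOL_OVERHEAD) // block_size


def bytes_per_object(block_size):
    """Pool size spread over the blocks that can actually hold objects."""
    return POOL_SIZE / numblocks(block_size)


for block_size in (32, 40):
    per_object = bytes_per_object(block_size)
    print(block_size, numblocks(block_size),
          round(per_object, 2), round(per_object + POINTER_BYTES, 2))
```

Adding the pointer gives about 40.51 and 48.55 bytes per list entry, close to the measured 40.52 and 48.56.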

We can also see that there is some additional overhead, responsible for ca. 0.01 bytes per integer, which is not big enough for me to care about.
