Why does a masked array seem to be smaller than an unmasked array?
I want to understand the size difference between a numpy masked array and a plain array with nans.
import numpy as np
g = np.random.random((5000,5000))
indx = np.random.randint(0,4999,(500,2))
mask = np.full((5000,5000),False,dtype=bool)
mask[indx] = True
g_mask = np.ma.array(g,mask=mask)
I am using the following answer to calculate the size of an object:
import sys
from types import ModuleType, FunctionType
from gc import get_referents

# Custom objects know their class.
# Function objects seem to know way too much, including modules.
# Exclude modules as well.
BLACKLIST = type, ModuleType, FunctionType

def getsize(obj):
    """sum size of object & members."""
    if isinstance(obj, BLACKLIST):
        raise TypeError('getsize() does not take argument of type: '+ str(type(obj)))
    seen_ids = set()
    size = 0
    objects = [obj]
    while objects:
        need_referents = []
        for obj in objects:
            if not isinstance(obj, BLACKLIST) and id(obj) not in seen_ids:
                seen_ids.add(id(obj))
                size += sys.getsizeof(obj)
                need_referents.append(obj)
        objects = get_referents(*need_referents)
    return size
This gives me the following results:
getsize(g)
>>>200000112
getsize(g_mask)
>>>25000924
Why is the unmasked array bigger than the masked array? How can I estimate the actual size of a masked array versus an unmasked one?
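numpy.ndarray does not implement tp_traverse, so it is incompatible with the getsize function you are trying to use. The GC system cannot see the references owned by the ndarray part of a masked array. In particular, the base of g_mask is not included in your output.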
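A minimal sketch of the effect, using a smaller array than the one in the question to keep it cheap. It assumes the default copy=False behaviour of np.ma.array, so the masked array wraps the existing buffer instead of owning one; the variable names owner_size, wrapper_size and shares are only illustrative.

```python
import sys
import numpy as np

g = np.random.random((1000, 1000))      # owns an 8,000,000-byte data buffer
mask = np.zeros(g.shape, dtype=bool)
g_mask = np.ma.array(g, mask=mask)      # wraps g's buffer, no copy by default

# sys.getsizeof includes the data buffer only for an array that owns it.
owner_size = sys.getsizeof(g)
wrapper_size = sys.getsizeof(g_mask)

# The masked array's data is backed by the same memory as g.
shares = np.shares_memory(g, g_mask.data)

print(owner_size, wrapper_size, shares)
```

Here the wrapper reports only its small object header, which is why a traversal built on sys.getsizeof and get_referents never adds the float buffer to the total for g_mask.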
In [23]: g = np.random.random((5000,5000))
...: indx = np.random.randint(0,4999,(500,2))
...: mask = np.full((5000,5000),False,dtype=bool)
...: mask[indx] = True
...: g_mask = np.ma.array(g,mask=mask)
Comparing the g array with the _data attribute of g_mask, we see that the latter is just a view of the former:
In [24]: g.__array_interface__
Out[24]:
{'data': (139821997776912, False),
'strides': None,
'descr': [('', '<f8')],
'typestr': '<f8',
'shape': (5000, 5000),
'version': 3}
In [25]: g_mask._data.__array_interface__
Out[25]:
{'data': (139821997776912, False),
'strides': None,
'descr': [('', '<f8')],
'typestr': '<f8',
'shape': (5000, 5000),
'version': 3}
They share the same data buffer, but their ids differ:
In [26]: id(g)
Out[26]: 139822758212672
In [27]: id(g_mask._data)
Out[27]: 139822386925440
The same goes for the mask:
In [28]: mask.__array_interface__
Out[28]:
{'data': (139822298669072, False),
'strides': None,
'descr': [('', '|b1')],
'typestr': '|b1',
'shape': (5000, 5000),
'version': 3}
In [29]: g_mask._mask.__array_interface__
Out[29]:
{'data': (139822298669072, False),
'strides': None,
'descr': [('', '|b1')],
'typestr': '|b1',
'shape': (5000, 5000),
'version': 3}
In fact, with this construction, _mask is the very same array object:
In [30]: id(mask)
Out[30]: 139822385963056
In [31]: id(g_mask._mask)
Out[31]: 139822385963056
The __array_interface__ of the masked array itself is that of its ._data attribute:
In [32]: g_mask.__array_interface__
Out[32]:
{'data': (139821997776912, False),
nbytes is the size of an array's data buffer:
In [34]: g_mask.data.nbytes
Out[34]: 200000000
In [35]: g_mask.mask.nbytes
Out[35]: 25000000
A boolean array uses 1 byte per element; float64 uses 8 bytes per element.
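To estimate the footprint of a masked array, one option is to add the two nbytes values. The following is only a sketch: masked_nbytes is a hypothetical helper, not part of numpy, and it counts the buffers whether or not they are shared with other arrays.

```python
import numpy as np

def masked_nbytes(arr):
    """Approximate buffer footprint: data buffer plus mask buffer (if any)."""
    data = np.ma.getdata(arr)
    mask = np.ma.getmask(arr)
    # np.ma.nomask is the placeholder used when no mask array was allocated.
    mask_bytes = 0 if mask is np.ma.nomask else mask.nbytes
    return data.nbytes + mask_bytes

g = np.random.random((1000, 1000))      # float64: 8 bytes per element
mask = np.zeros(g.shape, dtype=bool)    # bool: 1 byte per element
mask[::10, ::10] = True
g_mask = np.ma.array(g, mask=mask)

total = masked_nbytes(g_mask)                   # data + mask
unmasked_total = masked_nbytes(np.ma.array(g))  # no mask allocated
print(total, unmasked_total)
```

With a full boolean mask, the masked array therefore needs about 9 bytes per element, compared with 8 bytes per element for a plain float64 array that marks missing values with nan.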