Why can itertools.groupby group the NaNs in lists but not in numpy arrays?
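I'm having a hard time debugging a problem in which the float nan in a list and the nan in a numpy.array are handled differently when they are passed to itertools.groupby.
Given the following list and array: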
from itertools import groupby
import numpy as np
lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
arr = np.array(lst)
When I iterate over the list, the contiguous nans are grouped:
>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan, nan, nan] <class 'float'>
nan [nan] <class 'float'>
However, if I use the array, it puts successive nans in different groups:
>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
The same happens even if I convert the array back to a list:
>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
I'm using:
numpy 1.11.3
python 3.5
I know that generally nan != nan, so why do these operations give different results? And how is it possible that groupby can group nans at all?
Python lists are just arrays of pointers to objects in memory. In particular, lst holds pointers to the object np.nan:
>>> [id(x) for x in lst]
[139832272211880, # nan
139832272211880, # nan
139832272211880, # nan
139832133974296,
139832270325408,
139832133974296,
139832133974464,
139832133974320,
139832133974296,
139832133974440,
139832272211880, # nan
139832133974296]
(np.nan is at 139832272211880 on my computer.)
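NumPy arrays, on the other hand, are just contiguous regions of memory; they are regions of bits and bytes that NumPy interprets as a sequence of values (floats, ints, etc.).
The trouble is that when you ask Python to iterate over a NumPy array holding floating-point values (at a for loop or groupby level), Python needs to box these bytes into a proper Python object. It creates a brand new Python object in memory for each single value in the array as it iterates.
For example, you can see that distinct objects are created for each nan value when .tolist() is called: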
>>> [id(x) for x in arr.tolist()]
[4355054616, # nan
4355054640, # nan
4355054664, # nan
4355054688,
4355054712,
4355054736,
4355054760,
4355054784,
4355054808,
4355054832,
4355054856, # nan
4355054880]
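The same boxing effect can be reproduced without NumPy. The sketch below is only an illustration, not part of the original answer: it uses the standard library's array module as a stand-in for a NumPy array, on the assumption that it behaves analogously (raw C doubles that are boxed into fresh float objects when read back).

```python
from array import array
from itertools import groupby

nan = float("nan")

# A list stores references, so all three entries are the very same object.
as_list = [nan, nan, nan]

# array('d') stores raw C doubles; reading them back boxes each value
# into a newly created float object.
boxed = list(array("d", as_list))

print(as_list[0] is as_list[1])   # True
print(boxed[0] is boxed[1])       # False

# Count the size of each group that groupby produces.
list_groups = [len(list(group)) for _, group in groupby(as_list)]
boxed_groups = [len(list(group)) for _, group in groupby(boxed)]
print(list_groups)    # [3]
print(boxed_groups)   # [1, 1, 1]
```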
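itertools.groupby is able to group on np.nan for the Python list because it checks for identity first when it compares Python objects. Because these pointers to nan all point at the same np.nan object, grouping is possible.
However, iteration over the NumPy array does not allow this initial identity check to succeed, so Python falls back to checking for equality, and nan != nan, as you say.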
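The identity-before-equality behaviour is not specific to groupby. As a small illustration (the variable names here are made up for the example), CPython's list containment and list equality show the same effect with plain floats:

```python
nan = float("nan")
other = float("nan")   # a second, distinct NaN object

print(nan == nan)        # False: NaN never compares equal, even to itself
print(nan in [nan])      # True:  same object, identity check succeeds
print([nan] == [nan])    # True:  element comparison also checks identity first
print(nan in [other])    # False: different objects, falls back to ==
print([nan] == [other])  # False
```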
I am not sure whether this is the reason, but I just noticed this about the nan in lst and arr:
>>> lst[0] == lst[1], arr[0] == arr[1]
(False, False)
>>> lst[0] is lst[1], arr[0] is arr[1]
(True, False)
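I.e., while all nan are unequal, the regular np.nan (of type float) are all the same instance, while the nan in arr are different instances of type numpy.float64. So my guess is that if no key function is given, groupby tests for identity before doing the more expensive equality check.
This is also consistent with the observation that the nans are not grouped in arr.tolist() either: even though those nan are float again, they are no longer the same instance.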
>>> atl = arr.tolist()
>>> atl[0] is atl[1]
False
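The other answers are correct: the nans in the list have the same id, while they have different ids when they are "iterated over" in the numpy array.
This answer is meant as a supplement to those answers.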
>>> from itertools import groupby
>>> import numpy as np
>>> lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
>>> arr = np.array(lst)
>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274500321192 [1274500321192, 1274500321192, 1274500321192]
nan 1274500321192 [1274500321192]
>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130480 [1274537130480]
nan 1274537130504 [1274537130504]
nan 1274537130480 [1274537130480]
nan 1274537130480 [1274537130480] # same id as before but these are not consecutive
>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130336 [1274537130336]
nan 1274537130408 [1274537130408]
nan 1274500320904 [1274500320904]
nan 1274537130168 [1274537130168]
The problem is that Python uses the PyObject_RichCompare operation when comparing values, which only tests for object identity if == fails because it's not implemented. itertools.groupby, on the other hand, uses PyObject_RichCompareBool (see Source: 1, 2), which tests for object identity first, before == is tested.
This can be verified with a small cython snippet:
%load_ext cython
%%cython
from cpython.object cimport PyObject_RichCompareBool, PyObject_RichCompare, Py_EQ
def compare(a, b):
    return PyObject_RichCompare(a, b, Py_EQ), PyObject_RichCompareBool(a, b, Py_EQ)
>>> compare(np.nan, np.nan)
(False, True)
The source code for PyObject_RichCompareBool reads like this:
/* Perform a rich comparison with object result.  This wraps do_richcompare()
   with a check for NULL arguments and a recursion check. */

/* Perform a rich comparison with integer result.  This wraps
   PyObject_RichCompare(), returning -1 for error, 0 for false, 1 for true. */
int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    /**********************That's the difference!****************/
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    res = PyObject_RichCompare(v, w, op);
    if (res == NULL)
        return -1;
    if (PyBool_Check(res))
        ok = (res == Py_True);
    else
        ok = PyObject_IsTrue(res);
    Py_DECREF(res);
    return ok;
}
The object identity test (if (v == w)) is indeed done before the normal Python comparison PyObject_RichCompare(v, w, op); is used, and this is mentioned in its documentation:
Note :
If o1 and o2 are the same object, PyObject_RichCompareBool()
will always return 1 for Py_EQ and 0 for Py_NE.
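For readers without a Cython setup, the difference can be approximated in pure Python. The helper below, rich_compare_bool_eq, is a hypothetical name for a rough sketch of what PyObject_RichCompareBool does for Py_EQ; it is not the actual implementation and ignores error handling.

```python
def rich_compare_bool_eq(a, b):
    """Rough Python sketch of PyObject_RichCompareBool(a, b, Py_EQ)."""
    if a is b:
        # Identity shortcut: the same object is always treated as equal.
        return True
    # Otherwise use the regular == result and reduce it to a bool.
    return bool(a == b)

nan = float("nan")

print(nan == nan)                               # False
print(rich_compare_bool_eq(nan, nan))           # True
print(rich_compare_bool_eq(nan, float("nan")))  # False
```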