Why can itertools.groupby group the NaNs in lists but not in numpy arrays

Question

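I'm having a hard time debugging a problem in which a float nan in a list and a nan in a numpy.array are handled differently when passed to itertools.groupby.

Given the following list and array: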

from itertools import groupby
import numpy as np

lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
arr = np.array(lst)

When I iterate over the list, the consecutive nans are grouped:

>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan, nan, nan] <class 'float'>
nan [nan] <class 'float'>

However, if I use the array, it puts the consecutive nans in different groups:

>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>
nan [nan] <class 'numpy.float64'>

The same happens even if I convert the array back to a list:

>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, list(group), type(key))
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>
nan [nan] <class 'float'>

I'm using:

numpy 1.11.3
python 3.5

I know that normally nan != nan, so why do these operations give different results? And how is it possible for groupby to group nans at all?

Answer 1

Python lists are just arrays of pointers to objects in memory. In particular, lst holds pointers to the object np.nan:

>>> [id(x) for x in lst]
[139832272211880, # nan
 139832272211880, # nan
 139832272211880, # nan
 139832133974296,
 139832270325408,
 139832133974296,
 139832133974464,
 139832133974320,
 139832133974296,
 139832133974440,
 139832272211880, # nan
 139832133974296]

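(np.nan is at 139832272211880 on my computer.)

NumPy arrays, on the other hand, are just contiguous regions of memory: regions of bits and bytes that NumPy interprets as a sequence of values (floats, ints, etc.).

The trouble is that when you ask Python to iterate over a NumPy array holding floating-point values (in a for loop or at the groupby level), Python needs to box these bytes into a proper Python object. It creates a brand new Python object in memory for every single value in the array as it iterates.

For example, you can see that a distinct object is created for each nan value when .tolist() is called: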

>>> [id(x) for x in arr.tolist()]
[4355054616, # nan
 4355054640, # nan
 4355054664, # nan
 4355054688,
 4355054712,
 4355054736,
 4355054760,
 4355054784,
 4355054808,
 4355054832,
 4355054856, # nan
 4355054880]

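itertools.groupby is able to group the np.nan values in the Python list because it checks for identity first when it compares Python objects. Because these pointers to nan all point at the same np.nan object, grouping is possible.

Iterating over the NumPy array, however, does not allow this initial identity check to succeed, so Python falls back to checking for equality, and nan != nan, as you say.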
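
The same effect can be reproduced without NumPy. The following is a minimal sketch, assuming CPython, where every call to float("nan") creates a new float object; the names shared and distinct are purely illustrative.

```python
from itertools import groupby

nan = float("nan")

# One NaN object referenced three times (like np.nan repeated in lst).
shared = [nan, nan, nan]

# Three separate NaN objects (like the values boxed while iterating arr).
distinct = [float("nan") for _ in range(3)]

shared_groups = [list(group) for _, group in groupby(shared)]
distinct_groups = [list(group) for _, group in groupby(distinct)]

print(len(shared_groups))    # 1: the identity check succeeds
print(len(distinct_groups))  # 3: identity fails, and nan == nan is False
```

The only difference between the two lists is whether the elements are the same object, which is what decides the grouping here.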

Answer 2

I am not sure whether this is the reason, but I just noticed this about the nans in lst and arr:

>>> lst[0] == lst[1], arr[0] == arr[1]
(False, False)
>>> lst[0] is lst[1], arr[0] is arr[1]
(True, False)

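That is, while all nans compare unequal, the regular np.nan values (of type float) are all the same instance, whereas the nans in arr are different instances of type numpy.float64. So my guess is that, if no key function is given, groupby tests for identity before doing the more expensive equality check.

This is also consistent with the observation that the nans are not grouped in arr.tolist() either: even though those nans are float again, they are no longer the same instance.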

>>> atl = arr.tolist()
>>> atl[0] is atl[1]
False
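
The identity-before-equality shortcut is not specific to groupby. As a small illustration (assuming CPython; the variable names are made up for this example), built-in list containment and list equality treat an object as equal to itself as well:

```python
nan = float("nan")
other = float("nan")  # a second, separate NaN object

print(nan == nan)        # False: NaN never compares equal by value
print(nan in [nan])      # True: same object, so identity short-circuits
print([nan] == [nan])    # True: element comparison checks identity first
print(other in [nan])    # False: different objects, falls back to ==
print([other] == [nan])  # False
```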

Answer 3

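The other answers are correct: it is because the nans in the list have the same id, while they have different ids when they are "iterated over" in the numpy array.

This answer is meant as a supplement to those answers.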

>>> from itertools import groupby
>>> import numpy as np

>>> lst = [np.nan, np.nan, np.nan, 0.16, 1, 0.16, 0.9999, 0.0001, 0.16, 0.101, np.nan, 0.16]
>>> arr = np.array(lst)

>>> for key, group in groupby(lst):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274500321192 [1274500321192, 1274500321192, 1274500321192]
nan 1274500321192 [1274500321192]

>>> for key, group in groupby(arr):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130480 [1274537130480]
nan 1274537130504 [1274537130504]
nan 1274537130480 [1274537130480]
nan 1274537130480 [1274537130480]  # same id as before but these are not consecutive

>>> for key, group in groupby(arr.tolist()):
...     if np.isnan(key):
...         print(key, id(key), [id(item) for item in group])
nan 1274537130336 [1274537130336]
nan 1274537130408 [1274537130408]
nan 1274500320904 [1274500320904]
nan 1274537130168 [1274537130168]

The problem is that Python uses the PyObject_RichCompare operation when comparing values, which only tests for object identity if == fails because it's not implemented. itertools.groupby, on the other hand, uses PyObject_RichCompareBool (see Source: 1, 2), which tests for object identity first, before == is tested.

This can be verified with a small Cython snippet:

%load_ext cython
%%cython

from cpython.object cimport PyObject_RichCompareBool, PyObject_RichCompare, Py_EQ

def compare(a, b):
    return PyObject_RichCompare(a, b, Py_EQ), PyObject_RichCompareBool(a, b, Py_EQ)

>>> compare(np.nan, np.nan)
(False, True)

The source code for PyObject_RichCompareBool reads like this:

/* Perform a rich comparison with object result.  This wraps do_richcompare()
   with a check for NULL arguments and a recursion check. */

/* Perform a rich comparison with integer result.  This wraps
   PyObject_RichCompare(), returning -1 for error, 0 for false, 1 for true. */
int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    /**********************That's the difference!****************/
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    res = PyObject_RichCompare(v, w, op);
    if (res == NULL)
        return -1;
    if (PyBool_Check(res))
        ok = (res == Py_True);
    else
        ok = PyObject_IsTrue(res);
    Py_DECREF(res);
    return ok;
}

The object identity test (if (v == w)) is indeed done before the normal Python comparison PyObject_RichCompare(v, w, op); is used, and this is mentioned in its documentation:

Note:

If o1 and o2 are the same object, PyObject_RichCompareBool() will always return 1 for Py_EQ and 0 for Py_NE.
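
For readers without Cython, the behaviour can be approximated in pure Python. This is only a rough sketch: rich_compare_bool_eq and group_runs are hypothetical helpers written for this illustration, not part of CPython or itertools, and they model only the Py_EQ case with no key function.

```python
from itertools import groupby

def rich_compare_bool_eq(a, b):
    """Rough model of PyObject_RichCompareBool(a, b, Py_EQ)."""
    if a is b:             # the identity shortcut
        return True
    return bool(a == b)    # otherwise the ordinary == comparison

def group_runs(iterable):
    """Simplified stand-in for groupby(iterable) without a key function."""
    runs = []
    for item in iterable:
        # Compare against the first element of the current run (its key).
        if runs and rich_compare_bool_eq(runs[-1][0], item):
            runs[-1].append(item)
        else:
            runs.append([item])
    return runs

nan = float("nan")
data = [nan, nan, 1.0, 1.0, float("nan"), float("nan")]

model_lengths = [len(run) for run in group_runs(data)]
real_lengths = [len(list(group)) for _, group in groupby(data)]

print(model_lengths)  # [2, 2, 1, 1]
print(real_lengths)   # [2, 2, 1, 1]
```

The repeated nan object forms one run, while the two freshly created NaN objects each end up in their own run, matching what groupby does.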
