How can I simplify this process of splitting/counting numpy subarrays?

Question

I have a large dataset for which I would like to implement an efficient numpy solution. As a simpler example, consider a small set of numbers.

import numpy as np 
arr = np.linspace(1, 10, 10)

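The code below comes very close to my ideal solution, but I have hit a roadblock. First, I create a boolean mask marking the indices at which the array value is greater than a predefined lower bound and less than a predefined upper bound. Then I split the boolean mask into subarrays, each consisting of identical values at consecutive indices. For example, [0, 0, 0, 1, 1, 0, 0, 1, 1, 1] is split into [0, 0, 0], [1, 1], [0, 0], [1, 1, 1]. Finally, I want to split every subarray that consists only of ones into single-element subarrays. For example, [1, 1, 1] should be split into [1], [1], [1].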
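For concreteness, the run-splitting step on the example mask above can be sketched as follows. This is only a minimal illustration of the idea, and `mask` and `runs` are illustrative names.

```python
import numpy as np

mask = np.array([0, 0, 0, 1, 1, 0, 0, 1, 1, 1])

# a new run starts wherever two neighbouring values differ
cut_points = np.where(np.diff(mask) != 0)[0] + 1
runs = np.split(mask, cut_points)

print([r.tolist() for r in runs])  # [[0, 0, 0], [1, 1], [0, 0], [1, 1, 1]]
```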

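The code below does most of what I want, but not in a convenient way. I would like all subarrays stored in a single array, from which I can count the number of subarrays and the number of elements in each one. Unfortunately, this is tricky for me because the function outputs are numpy arrays that print as array(...) rather than just (...). I suspect there is a way to do this with np.ndarray.T, taking the True/False values and applying the axis kwarg, though I have not had any success implementing that approach so far. How can I simplify this process?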

def get_groups_by_difference(array, difference):
    """ This function splits arrays into subarrays in which every element is identical. """
    return np.split(array[:], np.where(abs(np.diff(array)) != difference)[0] + 1)

def check_consecutive_nested_arrays(array, repeated_value):
    """ This function returns a boolean array mask - True if all elements of a subarray contain the repeated value; False otherwise. """
    return np.array([np.all(subarray == repeated_value) for subarray in array])

def get_solution(array, lbound, ubound):
    # get boolean mask for array values within bounds
    bool_cnd = np.logical_and(array>lbound, array<ubound)
    # convert True/False into 1/0
    bool_cnd = bool_cnd * 1
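    # split array into subarrays of identical values by consecutive index
    # (dtype=object because the subarrays differ in length; newer NumPy releases raise an error otherwise)
    stay_idx = np.array(get_groups_by_difference(bool_cnd, 0), dtype=object)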
    # find indices of subarrays of ones
    bool_chk = check_consecutive_nested_arrays(stay_idx, 1)
    # get full subarrays of ones 
    ones_sub = stay_idx[bool_chk]
    return bool_cnd, stay_idx, bool_chk, ones_sub

bool_cnd, stay_idx, bool_chk, ones_sub = get_solution(arr, 3, 7)
print(bool_cnd)
>> [0 0 0 1 1 1 0 0 0 0]
print(stay_idx)
>> [array([0, 0, 0]) array([1, 1, 1]) array([0, 0, 0, 0])]
print(bool_chk)
>> [False  True False]
print(ones_sub)
>> [array([1, 1, 1])]

My goal is to obtain an array result like the following:

[[0 0 0]
[1]
[1]
[1]
[0 0 0 0]]

This way, I can find the number of elements in each subarray and the number of subarrays (i.e., 5 subarrays with lengths [3, 1, 1, 1, 4]).

Answer 1

Couldn't you then post-process your result like this:

ret = []
for idx, check in zip(stay_idx, bool_chk):
    if check:
        ret += idx.tolist()
    else:
        ret.append(idx)
# the entries differ in shape, so store them in an object array
ret = np.array(ret, dtype=object)

Not especially pretty, but it may be good enough for your particular needs.
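Note that the loop above stores each 1 as a plain scalar rather than as a one-element array. If every entry should support `len`, a self-contained variant might look like the sketch below. It assumes the single ones should remain arrays and that the ragged result is stored with `dtype=object`. Both are assumptions, and the names `mask`, `runs`, `out` and `lengths` are illustrative.

```python
import numpy as np

arr = np.linspace(1, 10, 10)
mask = ((arr > 3) & (arr < 7)).astype(int)   # [0 0 0 1 1 1 0 0 0 0]

# runs of identical consecutive values, as in get_groups_by_difference(mask, 0)
runs = np.split(mask, np.where(np.diff(mask) != 0)[0] + 1)

ret = []
for run in runs:
    if np.all(run == 1):
        # break a run of ones into one-element arrays
        ret.extend(np.array([v]) for v in run)
    else:
        ret.append(run)

# the rows differ in length, so an object array is needed
out = np.array(ret, dtype=object)

lengths = [len(sub) for sub in out]
print(len(out))   # 5
print(lengths)    # [3, 1, 1, 1, 4]
```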

Answer 2

If I understand correctly,

np.split(a, 1 + np.where(a[1:]|a[:-1])[0])

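should do what you want. Here a is the vector of ones and zeros.

This exploits the fact that your final result can be obtained by splitting to the left and right of every 1.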
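As a quick check, here is a minimal sketch applying that one-liner to the mask from the question. The names `cuts`, `pieces`, `lengths` and `lengths_direct` are illustrative, and `a` is assumed to have an integer or boolean dtype, since `|` is not defined for floats.

```python
import numpy as np

# mask from the question: 1 where 3 < value < 7
arr = np.linspace(1, 10, 10)
a = np.logical_and(arr > 3, arr < 7) * 1     # [0 0 0 1 1 1 0 0 0 0]

# cut between positions i and i+1 whenever either neighbour is a 1
cuts = 1 + np.where(a[1:] | a[:-1])[0]
pieces = np.split(a, cuts)

# number of subarrays and the length of each one
lengths = [len(p) for p in pieces]
print(len(pieces))   # 5
print(lengths)       # [3, 1, 1, 1, 4]

# the lengths also follow from the cut positions alone, without splitting
lengths_direct = np.diff(np.concatenate(([0], cuts, [len(a)])))
print(lengths_direct)  # [3 1 1 1 4]
```

If only the counts are needed, the last two lines avoid building the list of subarrays at all, which may matter for a large dataset.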

arrays

boolean

numpy

cluster-analysis

python-3.x