合并间隔而不使用循环和经典 python 代码

merge intervals without using loops and classic python code

我的问题是合并有重叠的区间 示例:

input: 
[(4,8),(6,10),(11,12),(15,20),(20,25)] 
output: 
[(4, 10),(11,12), (15, 25)]
input: 
([(4,8),(6,10),(11,12),(15,20)])
output: 
[(4, 10),(11,12), (15, 20)]

我用经典的 python 代码(使用循环,if 条件) 但是我想在几行中使用 python 库(pandas、numpy..) 有什么建议吗? 提前致谢

我不确定这一定是最佳选择,因为它需要 O(max_interval_value * num_intervals) 时间和内存,但这是使用 NumPy 的直接实现:

import numpy as np

def simplify_intervals(intervals):
    # Make intervals into an array
    intervals = np.asarray(intervals)
    # Make array for zero to the greatest interval end (plus bounds values
    r = np.arange(intervals[:, 1].max() + 3)[:, np.newaxis]
    # Check what elements of the array are within each interval
    m = (r >= intervals[:, 0] + 1) & (r <= intervals[:, 1] + 1)
    # Collapse belonging test for each value
    ind = m.any(1).astype(np.int8)
    # Find where the belonging test changes
    d = np.diff(ind)
    # Find interval bounds
    start = np.where(d > 0)[0]
    end = np.where(d < 0)[0] - 1
    # Make final intervals array
    return np.stack((start, end), axis=1)

print(simplify_intervals([(4, 8), (6, 10), (11, 12), (15, 20), (20, 25)]))
# [[ 4 12]
#  [15 25]]
print(simplify_intervals(([(4,8),(6,10),(11,12),(15,20)])))
# [[ 4 12]
#  [15 20]]

注意:这假定为正区间值。它可以适应支持负范围,实际上优化了一点,只考虑从最小值到最大值。

编辑:

如果您想对大量间隔或边界使用此方法,您可能会受益于使用 Numba:

import numpy as np
import numba as nb

@nb.njit
def simplify_intervals_nb(intervals):
    n = 0
    for _, end in intervals:
        n = max(n, end)
    r = np.arange(n + 3)
    m = np.zeros(n + 3, dtype=np.bool_)
    for start, end in intervals:
        m |= (r >= start + 1) & (r <= end + 1)
    ind = m.astype(np.int8)
    # Find where the belonging test changes
    d = np.diff(ind)
    # Find interval bounds
    start = np.where(d > 0)[0]
    end = np.where(d < 0)[0] - 1
    # Make final intervals array
    return np.stack((start, end), axis=1)

IPython 中的快速测试:

import random

random.seed(100)
start = [random.randint(0, 10000) for _ in range(300)]
end = start = [s + random.randint(0, 3000) for s in start]
intervals = list(zip(start, end))
print(np.all(simplify_intervals(intervals) == simplify_intervals_nb(intervals)))
# True
%timeit simplify_intervals(intervals)
# 15.2 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit simplify_intervals_nb(intervals)
# 9.54 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

假设您的输入元组按照示例中的方式排序,则可以执行以下操作:

p = [(4, 8), (6, 10), (11, 12), (15, 20), (20, 25)] 

ind = np.where(np.diff(np.array(p).flatten()) <= 0)[0]
np.delete(p, [ind, ind+1]).reshape(-1, 2)

输出:

array([[ 4, 10],
       [11, 12],
       [15, 25]])

然后您可以使用例如将其转换为 [(4, 10), (11, 12), (15, 25)] list(map(tuple, ...)).


Edit:只有当每个元组 (x_i, y_i) 对所有 i 都满足 x_i <= x_{i+1}y_i <= y_{i+1} 时,以上才有效,如原始示例。

要使其在所有 i 的唯一条件 x_i <= y_i 下工作,您必须预处理列表:

# Example from comments (modified by subtracting the min value and removing duplicates)
p = [(0, 90), (72, 81), (87, 108), (459, 606)]

p = list(zip(sorted([tup[0] for tup in p]), sorted([tup[1] for tup in p])))

ind = np.where(np.diff(np.array(p).flatten()) <= 0)[0]
ind = ind[ind % 2 == 1]  # this is needed for cases when x_i = y_i
np.delete(p, [ind, ind+1]).reshape(-1, 2)

输出:

array([[  0, 108],
       [459, 606]])