合并间隔而不使用循环和经典 python 代码
merge intervals without using loops and classic python code
我的问题是合并有重叠的区间
示例:
input:
[(4,8),(6,10),(11,12),(15,20),(20,25)]
output:
[(4, 10),(11,12), (15, 25)]
input:
([(4,8),(6,10),(11,12),(15,20)])
output:
[(4, 10),(11,12), (15, 20)]
我用经典的 python 代码(使用循环,if 条件)
但是我想在几行中使用 python 库(pandas、numpy..)
有什么建议吗?
提前致谢
我不确定这一定是最佳选择,因为它需要 O(max_interval_value * num_intervals)
时间和内存,但这是使用 NumPy 的直接实现:
import numpy as np
def simplify_intervals(intervals):
# Make intervals into an array
intervals = np.asarray(intervals)
# Make array for zero to the greatest interval end (plus bounds values
r = np.arange(intervals[:, 1].max() + 3)[:, np.newaxis]
# Check what elements of the array are within each interval
m = (r >= intervals[:, 0] + 1) & (r <= intervals[:, 1] + 1)
# Collapse belonging test for each value
ind = m.any(1).astype(np.int8)
# Find where the belonging test changes
d = np.diff(ind)
# Find interval bounds
start = np.where(d > 0)[0]
end = np.where(d < 0)[0] - 1
# Make final intervals array
return np.stack((start, end), axis=1)
print(simplify_intervals([(4, 8), (6, 10), (11, 12), (15, 20), (20, 25)]))
# [[ 4 12]
# [15 25]]
print(simplify_intervals(([(4,8),(6,10),(11,12),(15,20)])))
# [[ 4 12]
# [15 20]]
注意:这假定为正区间值。它可以适应支持负范围,实际上优化了一点,只考虑从最小值到最大值。
编辑:
如果您想对大量间隔或边界使用此方法,您可能会受益于使用 Numba:
import numpy as np
import numba as nb
@nb.njit
def simplify_intervals_nb(intervals):
n = 0
for _, end in intervals:
n = max(n, end)
r = np.arange(n + 3)
m = np.zeros(n + 3, dtype=np.bool_)
for start, end in intervals:
m |= (r >= start + 1) & (r <= end + 1)
ind = m.astype(np.int8)
# Find where the belonging test changes
d = np.diff(ind)
# Find interval bounds
start = np.where(d > 0)[0]
end = np.where(d < 0)[0] - 1
# Make final intervals array
return np.stack((start, end), axis=1)
IPython 中的快速测试:
import random
random.seed(100)
start = [random.randint(0, 10000) for _ in range(300)]
end = start = [s + random.randint(0, 3000) for s in start]
intervals = list(zip(start, end))
print(np.all(simplify_intervals(intervals) == simplify_intervals_nb(intervals)))
# True
%timeit simplify_intervals(intervals)
# 15.2 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit simplify_intervals_nb(intervals)
# 9.54 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
假设您的输入元组按照示例中的方式排序,则可以执行以下操作:
p = [(4, 8), (6, 10), (11, 12), (15, 20), (20, 25)]
ind = np.where(np.diff(np.array(p).flatten()) <= 0)[0]
np.delete(p, [ind, ind+1]).reshape(-1, 2)
输出:
array([[ 4, 10],
[11, 12],
[15, 25]])
然后您可以使用例如将其转换为 [(4, 10), (11, 12), (15, 25)]
list(map(tuple, ...))
.
Edit:只有当每个元组 (x_i, y_i)
对所有 i 都满足 x_i <= x_{i+1}
和 y_i <= y_{i+1}
时,以上才有效,如原始示例。
要使其在所有 i
的唯一条件 x_i <= y_i
下工作,您必须预处理列表:
# Example from comments (modified by subtracting the min value and removing duplicates)
p = [(0, 90), (72, 81), (87, 108), (459, 606)]
p = list(zip(sorted([tup[0] for tup in p]), sorted([tup[1] for tup in p])))
ind = np.where(np.diff(np.array(p).flatten()) <= 0)[0]
ind = ind[ind % 2 == 1] # this is needed for cases when x_i = y_i
np.delete(p, [ind, ind+1]).reshape(-1, 2)
输出:
array([[ 0, 108],
[459, 606]])
我的问题是合并有重叠的区间 示例:
input:
[(4,8),(6,10),(11,12),(15,20),(20,25)]
output:
[(4, 10),(11,12), (15, 25)]
input:
([(4,8),(6,10),(11,12),(15,20)])
output:
[(4, 10),(11,12), (15, 20)]
我用经典的 python 代码(使用循环,if 条件) 但是我想在几行中使用 python 库(pandas、numpy..) 有什么建议吗? 提前致谢
我不确定这一定是最佳选择,因为它需要 O(max_interval_value * num_intervals)
时间和内存,但这是使用 NumPy 的直接实现:
import numpy as np
def simplify_intervals(intervals):
# Make intervals into an array
intervals = np.asarray(intervals)
# Make array for zero to the greatest interval end (plus bounds values
r = np.arange(intervals[:, 1].max() + 3)[:, np.newaxis]
# Check what elements of the array are within each interval
m = (r >= intervals[:, 0] + 1) & (r <= intervals[:, 1] + 1)
# Collapse belonging test for each value
ind = m.any(1).astype(np.int8)
# Find where the belonging test changes
d = np.diff(ind)
# Find interval bounds
start = np.where(d > 0)[0]
end = np.where(d < 0)[0] - 1
# Make final intervals array
return np.stack((start, end), axis=1)
print(simplify_intervals([(4, 8), (6, 10), (11, 12), (15, 20), (20, 25)]))
# [[ 4 12]
# [15 25]]
print(simplify_intervals(([(4,8),(6,10),(11,12),(15,20)])))
# [[ 4 12]
# [15 20]]
注意:这假定为正区间值。它可以适应支持负范围,实际上优化了一点,只考虑从最小值到最大值。
编辑:
如果您想对大量间隔或边界使用此方法,您可能会受益于使用 Numba:
import numpy as np
import numba as nb
@nb.njit
def simplify_intervals_nb(intervals):
n = 0
for _, end in intervals:
n = max(n, end)
r = np.arange(n + 3)
m = np.zeros(n + 3, dtype=np.bool_)
for start, end in intervals:
m |= (r >= start + 1) & (r <= end + 1)
ind = m.astype(np.int8)
# Find where the belonging test changes
d = np.diff(ind)
# Find interval bounds
start = np.where(d > 0)[0]
end = np.where(d < 0)[0] - 1
# Make final intervals array
return np.stack((start, end), axis=1)
IPython 中的快速测试:
import random
random.seed(100)
start = [random.randint(0, 10000) for _ in range(300)]
end = start = [s + random.randint(0, 3000) for s in start]
intervals = list(zip(start, end))
print(np.all(simplify_intervals(intervals) == simplify_intervals_nb(intervals)))
# True
%timeit simplify_intervals(intervals)
# 15.2 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit simplify_intervals_nb(intervals)
# 9.54 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
假设您的输入元组按照示例中的方式排序,则可以执行以下操作:
p = [(4, 8), (6, 10), (11, 12), (15, 20), (20, 25)]
ind = np.where(np.diff(np.array(p).flatten()) <= 0)[0]
np.delete(p, [ind, ind+1]).reshape(-1, 2)
输出:
array([[ 4, 10],
[11, 12],
[15, 25]])
然后您可以使用例如将其转换为 [(4, 10), (11, 12), (15, 25)]
list(map(tuple, ...))
.
Edit:只有当每个元组 (x_i, y_i)
对所有 i 都满足 x_i <= x_{i+1}
和 y_i <= y_{i+1}
时,以上才有效,如原始示例。
要使其在所有 i
的唯一条件 x_i <= y_i
下工作,您必须预处理列表:
# Example from comments (modified by subtracting the min value and removing duplicates)
p = [(0, 90), (72, 81), (87, 108), (459, 606)]
p = list(zip(sorted([tup[0] for tup in p]), sorted([tup[1] for tup in p])))
ind = np.where(np.diff(np.array(p).flatten()) <= 0)[0]
ind = ind[ind % 2 == 1] # this is needed for cases when x_i = y_i
np.delete(p, [ind, ind+1]).reshape(-1, 2)
输出:
array([[ 0, 108],
[459, 606]])