How to efficiently check whether an integer falls within any of multiple ranges in Python

If a counter (i.e. idx) falls within any one of several ranges, the objective is to run a specific procedure.

In this case, the ranges come from a df, as follows:

df=pd.DataFrame(dict(rbot=[4,7,20],rtop=[8,10,25]))

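For example, some activity should be triggered if the counter's integer value lies within a range such as 4-8 or 20-25 (bounds inclusive).

The code below achieves this objective: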

import pandas as pd

df=pd.DataFrame(dict(rbot=[4,7,20],rtop=[8,10,25]))

r_bot=df['rbot'].values.tolist()
r_top=df['rtop'].values.tolist()
for idx in range (120):
    h=[True for x,y in zip(r_bot,r_top) if x <= idx <=y ]

    if True in h:
        print(f'Do some operation with  {idx}')

This produces the following output:

Do some operation with  4
Do some operation with  5
Do some operation with  6
Do some operation with  7
Do some operation with  8
Do some operation with  9
Do some operation with  10
Do some operation with  20
Do some operation with  21
Do some operation with  22
Do some operation with  23
Do some operation with  24
Do some operation with  25

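In the real implementation there can be hundreds of range pairs, and the counter can run into the hundreds of thousands. So I would like to know: is there a more efficient way to do this?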

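You can try using numpy broadcasting to create a boolean mask that is True for the indices falling between any pair of rbot and rtop values. Then multiply it by the range to get the relevant values. Finally, use flatnonzero to select the nonzero (True) entries: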

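import numpy as np
arr = np.arange(120)
# df[['rbot']] keeps a 2-D shape (n_ranges, 1), so the comparisons broadcast against arr
msk = ((df[['rbot']].to_numpy() <= arr) & (arr <= df[['rtop']].to_numpy())).sum(axis=0)
# Caveat: msk*arr is 0 at index 0, so a value of 0 is never reported even if a range contains it
out = np.flatnonzero(msk*arr)
for idx in out:
    print(f'Do some operation with  {idx}')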

Output (with the two ranges 4-8 and 20-25):

Do some operation with  4
Do some operation with  5
Do some operation with  6
Do some operation with  7
Do some operation with  8
Do some operation with  20
Do some operation with  21
Do some operation with  22
Do some operation with  23
Do some operation with  24
Do some operation with  25
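
A small variant of the mask above, sketched under the assumption that the ranges are inclusive on both ends: reducing with any(axis=0) gives a true boolean mask, and indexing arr with it returns the matching values directly, so a value of 0 is not lost the way it is when multiplying by arr. The helper name values_in_ranges and the plain lists rbot and rtop are made up for illustration; they mirror the df columns of the question.

```python
import numpy as np

# Same bounds as the rbot/rtop columns in the question (inclusive on both ends).
rbot = [4, 7, 20]
rtop = [8, 10, 25]

def values_in_ranges(arr, lows, highs):
    """Return the elements of arr that lie in at least one [low, high] range."""
    lows = np.asarray(lows)[:, None]     # shape (n_ranges, 1) for broadcasting
    highs = np.asarray(highs)[:, None]
    mask = ((lows <= arr) & (arr <= highs)).any(axis=0)  # shape (len(arr),)
    return arr[mask]

out = values_in_ranges(np.arange(120), rbot, rtop)
for idx in out:
    print(f'Do some operation with  {idx}')
```

Like the original, this still builds an intermediate array of shape (n_ranges, len(arr)), so memory use grows with the product of the two sizes.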

There are many ways to solve this; here is one:

df = pd.DataFrame(dict(rbot=[4,20],rtop=[8,25]))
df.rtop += 1
for idx in range(120):
    if any(idx in range(*df.iloc[x]) for x in df.index):
        print(f'Do some operation with  {idx}')

Output:

Do some operation with  4
Do some operation with  5
Do some operation with  6
Do some operation with  7
Do some operation with  8
Do some operation with  20
Do some operation with  21
Do some operation with  22
Do some operation with  23
Do some operation with  24
Do some operation with  25
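
The same idea can be sketched without touching the DataFrame inside the loop: build the range objects once, then test membership, which for integers is a constant-time arithmetic check rather than a scan. The names bounds, ranges and hits are illustrative only, and the bounds are assumed to be inclusive.

```python
# Inclusive (low, high) pairs, matching the rbot/rtop values used above.
bounds = [(4, 8), (20, 25)]

# Build each range once; +1 makes the upper bound inclusive.
ranges = [range(low, high + 1) for low, high in bounds]

# Keep every counter value that belongs to at least one range.
hits = [idx for idx in range(120) if any(idx in r for r in ranges)]
for idx in hits:
    print(f'Do some operation with  {idx}')
```

This still checks every range for every counter value, so the work is proportional to the number of counters times the number of ranges.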

One option is to use pandas with an IntervalIndex (and, for non-overlapping ranges, pd.cut):

arr = np.arange(120)
intervals = pd.IntervalIndex.from_arrays(df.rbot, df.rtop, closed='both')

for idx in arr:
    if intervals.contains(idx).any():
        print(f'Do some operation with  {idx}')


Do some operation with  4
Do some operation with  5
Do some operation with  6
Do some operation with  7
Do some operation with  8
Do some operation with  20
Do some operation with  21
Do some operation with  22
Do some operation with  23
Do some operation with  24
Do some operation with  25

With the updated example, the code above still works:


arr = np.arange(120)
intervals = pd.IntervalIndex.from_arrays(df.rbot, df.rtop, closed='both')

for idx in arr:
    if intervals.contains(idx).any():
        print(f'Do some operation with  {idx}')

Do some operation with  4
Do some operation with  5
Do some operation with  6
Do some operation with  7
Do some operation with  8
Do some operation with  9
Do some operation with  10
Do some operation with  20
Do some operation with  21
Do some operation with  22
Do some operation with  23
Do some operation with  24
Do some operation with  25

Thanks to @user2246849 for the tests; I think you should look at them to see whether this meets your needs.

FYI, if you only want to perform one operation per valid index and do not plan any later aggregation that needs pandas, this is faster and more memory-efficient:

import pandas as pd

main_range = (0, 120)  # the upper bound is treated as inclusive below

df=pd.DataFrame(dict(rbot=[4,20],rtop=[8,25]))

intervals = zip(df['rbot'], df['rtop'])
for i in intervals:
    # Clip each sub-range to the main range; an empty range is simply skipped
    overlap = range(max(main_range[0], i[0]), min(main_range[1], i[-1])+1)
    for idx in overlap:
        print(f'Do some operation with  {idx}')

It simply computes the overlap of the main range with each sub-range. Output:

Do some operation with  4
Do some operation with  5
Do some operation with  6
Do some operation with  7
Do some operation with  8
Do some operation with  20
Do some operation with  21
Do some operation with  22
Do some operation with  23
Do some operation with  24
Do some operation with  25
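
One caveat worth sketching: if the sub-ranges overlap (for instance 4-8 and 7-10), iterating over each overlap separately visits the shared values twice. A possible workaround, assuming inclusive integer bounds, is to merge overlapping or adjacent ranges first. The helper merge_ranges is a hypothetical name and is not part of the original answer.

```python
def merge_ranges(pairs):
    """Merge overlapping or touching inclusive (low, high) integer pairs."""
    merged = []
    for low, high in sorted(pairs):
        if merged and low <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], high)
        else:
            merged.append([low, high])
    return [tuple(m) for m in merged]

main_range = (0, 120)  # treated as inclusive on both ends
pairs = [(4, 8), (7, 10), (20, 25)]

visited = []
for low, high in merge_ranges(pairs):
    # Clip each merged range to the main range; an empty range is skipped.
    for idx in range(max(main_range[0], low), min(main_range[1], high) + 1):
        visited.append(idx)
        print(f'Do some operation with  {idx}')
```

After merging, each valid index is visited exactly once and in ascending order.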

Runtimes with a larger dataset (measured with IPython's %timeit):

import pandas as pd
import numpy as np

rbot = [i*1000 for i in range(10000)]
rtop = [(i+1)*1000-2 for i in range(10000)]
main_range = (0, 120000)

df = pd.DataFrame({'rbot': rbot, 'rtop': rtop})

def python():
    intervals = zip(df['rbot'], df['rtop'])
    for i in intervals:
        overlap = range(max(main_range[0], i[0]), min(main_range[1], i[-1])+1)
        for idx in overlap:
            pass#print(idx)

# 5.03 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit python()

def pandas():
    arr = np.arange(*main_range)
    
    intervals = pd.IntervalIndex.from_arrays(df.rbot, df.rtop, closed='both')

    out = pd.cut(arr, intervals)

    out = arr[pd.notna(out)]
    
    for idx in out:
        pass#print(idx)

# 67 ms ± 467 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pandas()


def numpy():
    arr = np.arange(*main_range)
    msk = ((df[['rbot']].to_numpy() <= arr) & (arr <= df[['rtop']].to_numpy())).sum(axis=0)
    out = np.flatnonzero(msk*arr)
    for idx in out:
        pass#print(idx)

# 2.77 s ± 7.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    
%timeit numpy()
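
If the counter values arrive one at a time rather than as a known contiguous range, a binary search over the sorted lower bounds is another option. The sketch below uses the standard-library bisect module and assumes the ranges are inclusive, sorted and non-overlapping (as in the benchmark data); in_any_range is an illustrative name, and no timing claim is made for it.

```python
from bisect import bisect_right

# Same shape as the benchmark data: sorted, non-overlapping inclusive ranges.
rbot = [i * 1000 for i in range(10000)]
rtop = [(i + 1) * 1000 - 2 for i in range(10000)]

def in_any_range(idx, lows, highs):
    """True if idx lies in some [lows[k], highs[k]]; lows must be sorted ascending."""
    k = bisect_right(lows, idx) - 1   # last range whose lower bound is <= idx
    return k >= 0 and idx <= highs[k]

print(in_any_range(998, rbot, rtop))   # True: 998 is the top of the first range
print(in_any_range(999, rbot, rtop))   # False: 999 falls in the gap before 1000
```

Each lookup costs a logarithmic number of comparisons in the number of ranges. Overlapping ranges would need to be merged beforehand for the single-candidate check to be valid.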