How to efficiently check whether an integer falls within any of multiple ranges in Python
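The objective is to run a specific procedure whenever a counter (i.e. idx) falls within any one of multiple ranges. In this case the ranges come from a df, as follows: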
df=pd.DataFrame(dict(rbot=[4,7,20],rtop=[8,10,25]))
For example, if the counter is an integer within 4-8 or 20-25 (bounds inclusive), some activity is triggered.
The code below achieves this objective:
import pandas as pd

df = pd.DataFrame(dict(rbot=[4, 7, 20], rtop=[8, 10, 25]))
r_bot = df['rbot'].values.tolist()
r_top = df['rtop'].values.tolist()

for idx in range(120):
    # One True per range that contains idx (bounds inclusive).
    h = [True for x, y in zip(r_bot, r_top) if x <= idx <= y]
    if True in h:
        print(f'Do some operation with {idx}')
This produces the following output:
Do some operation with 4
Do some operation with 5
Do some operation with 6
Do some operation with 7
Do some operation with 8
Do some operation with 9
Do some operation with 10
Do some operation with 20
Do some operation with 21
Do some operation with 22
Do some operation with 23
Do some operation with 24
Do some operation with 25
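In the actual implementation there can be hundreds of range pairs, and the counter can run into the hundreds of thousands. Is there a more efficient way to do this?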
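You could try using numpy broadcasting to create a boolean mask that is True for the indices that fall between each pair of rbot and rtop values. Then multiply it by the range to get the relevant values. Finally, use flatnonzero to select the True values: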
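import numpy as np
import pandas as pd

# Two-range example that the output below corresponds to.
df = pd.DataFrame(dict(rbot=[4, 20], rtop=[8, 25]))

arr = np.arange(120)
# Column vectors of bounds broadcast against arr; sum counts the ranges that contain each value.
msk = ((df[['rbot']].to_numpy() <= arr) & (arr <= df[['rtop']].to_numpy())).sum(axis=0)
# Note: msk*arr is 0 at position 0, so idx 0 is never reported by this step.
out = np.flatnonzero(msk*arr)

for idx in out:
    print(f'Do some operation with {idx}')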
Output:
Do some operation with 4
Do some operation with 5
Do some operation with 6
Do some operation with 7
Do some operation with 8
Do some operation with 20
Do some operation with 21
Do some operation with 22
Do some operation with 23
Do some operation with 24
Do some operation with 25
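A possible variant of the mask approach, as a minimal sketch: because arr is np.arange(120), positions and values coincide, so the mask can go straight into np.flatnonzero. Reducing with any(axis=0) keeps the mask boolean, and skipping the multiplication avoids losing index 0 when a range starts at 0. The data below, including the range starting at 0, is an illustrative assumption and not taken from the question.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): the first range starts at 0 on purpose
# to show that index 0 is kept.
df = pd.DataFrame(dict(rbot=[0, 7, 20], rtop=[2, 10, 25]))

arr = np.arange(120)
lo = df["rbot"].to_numpy()[:, None]  # shape (n_ranges, 1)
hi = df["rtop"].to_numpy()[:, None]  # shape (n_ranges, 1)

# Broadcasting yields an (n_ranges, 120) boolean grid; any(axis=0) collapses
# it to one flag per counter value.
msk = ((lo <= arr) & (arr <= hi)).any(axis=0)

# arr is 0..119, so the positions of True are the matching values themselves.
out = np.flatnonzero(msk)

for idx in out:
    print(f"Do some operation with {idx}")
```

Note that the intermediate grid still has one row per range, so memory grows with the number of ranges times the number of counter values.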
There are many ways to solve this; here is one:
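df = pd.DataFrame(dict(rbot=[4, 20], rtop=[8, 25]))
df.rtop += 1  # range() excludes its stop value, so shift the upper bound by one

for idx in range(120):
    # Each row unpacks to (rbot, rtop), which become the start and stop of a range.
    if any(idx in range(*df.iloc[x]) for x in df.index):
        print(f'Do some operation with {idx}')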
Output:
Do some operation with 4
Do some operation with 5
Do some operation with 6
Do some operation with 7
Do some operation with 8
Do some operation with 20
Do some operation with 21
Do some operation with 22
Do some operation with 23
Do some operation with 24
Do some operation with 25
One option is a pandas IntervalIndex:
arr = np.arange(120)
intervals = pd.IntervalIndex.from_arrays(df.rbot, df.rtop, closed='both')

for idx in arr:
    # contains() returns one boolean per interval.
    if intervals.contains(idx).any():
        print(f'Do some operation with {idx}')
Do some operation with 4
Do some operation with 5
Do some operation with 6
Do some operation with 7
Do some operation with 8
Do some operation with 20
Do some operation with 21
Do some operation with 22
Do some operation with 23
Do some operation with 24
Do some operation with 25
With the updated example, the code above still works:
arr = np.arange(120)
intervals = pd.IntervalIndex.from_arrays(df.rbot, df.rtop, closed='both')

for idx in arr:
    if intervals.contains(idx).any():
        print(f'Do some operation with {idx}')
Do some operation with 4
Do some operation with 5
Do some operation with 6
Do some operation with 7
Do some operation with 8
Do some operation with 9
Do some operation with 10
Do some operation with 20
Do some operation with 21
Do some operation with 22
Do some operation with 23
Do some operation with 24
Do some operation with 25
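Thanks to @user2246849 for the tests; I think you should take a look and see whether it fits your needs.
FYI, if you only want to perform one operation per valid index and do not plan any later aggregation that needs pandas, this is faster and more memory-efficient: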
import pandas as pd

main_range = (0, 120)
df = pd.DataFrame(dict(rbot=[4, 20], rtop=[8, 25]))

# Pair each lower bound with its upper bound.
intervals = zip(df['rbot'], df['rtop'])
for i in intervals:
    # Clip the sub-range to the main range; +1 makes the upper bound inclusive.
    overlap = range(max(main_range[0], i[0]), min(main_range[1], i[-1]) + 1)
    for idx in overlap:
        print(f'Do some operation with {idx}')
It simply computes the overlap of the main range with each sub-range. Output:
Do some operation with 4
Do some operation with 5
Do some operation with 6
Do some operation with 7
Do some operation with 8
Do some operation with 20
Do some operation with 21
Do some operation with 22
Do some operation with 23
Do some operation with 24
Do some operation with 25
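With overlapping ranges such as 4-8 and 7-10 from the updated question, iterating each overlap separately would visit 7 and 8 twice. One way around that, sketched here under the assumption that each index should be handled only once, is to merge overlapping ranges before iterating. The helper name merge_ranges is hypothetical.

```python
import pandas as pd


def merge_ranges(pairs):
    """Merge inclusive (bottom, top) integer pairs that overlap or touch."""
    merged = []
    for bot, top in sorted(pairs):
        if merged and bot <= merged[-1][1] + 1:
            # Overlaps or is adjacent to the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], top)
        else:
            merged.append([bot, top])
    return [tuple(m) for m in merged]


df = pd.DataFrame(dict(rbot=[4, 7, 20], rtop=[8, 10, 25]))
main_range = (0, 120)  # treated as inclusive on both ends, as in the loop above

visited = []
pairs = zip(df["rbot"].tolist(), df["rtop"].tolist())
for bot, top in merge_ranges(pairs):
    # Clip each merged range to the main range, then walk it once.
    for idx in range(max(main_range[0], bot), min(main_range[1], top) + 1):
        visited.append(idx)
        print(f"Do some operation with {idx}")
```

Sorting makes the merge O(n log n) in the number of ranges, which should be negligible next to the iteration itself for a few hundred pairs.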
Runtimes with a larger dataset:
import pandas as pd
import numpy as np

rbot = [i*1000 for i in range(10000)]
rtop = [(i+1)*1000-2 for i in range(10000)]
main_range = (0, 120000)
df = pd.DataFrame({'rbot': rbot, 'rtop': rtop})

def python():
    intervals = zip(df['rbot'], df['rtop'])
    for i in intervals:
        overlap = range(max(main_range[0], i[0]), min(main_range[1], i[-1])+1)
        for idx in overlap:
            pass  # print(idx)

# 5.03 ms ± 58 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit python()

def pandas():
    arr = np.arange(*main_range)
    intervals = pd.IntervalIndex.from_arrays(df.rbot, df.rtop, closed='both')
    out = pd.cut(arr, intervals)
    out = arr[pd.notna(out)]
    for idx in out:
        pass  # print(idx)

# 67 ms ± 467 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pandas()

def numpy():
    arr = np.arange(*main_range)
    msk = ((df[['rbot']].to_numpy() <= arr) & (arr <= df[['rtop']].to_numpy())).sum(axis=0)
    out = np.flatnonzero(msk*arr)
    for idx in out:
        pass  # print(idx)

# 2.77 s ± 7.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit numpy()
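If the counter values arrive one at a time instead of as a full range, a sorted lookup may be a reasonable alternative to scanning every range. This is only a sketch: it assumes the ranges are inclusive, already merged so that they do not overlap, and sorted by lower bound. It uses the standard-library bisect module, and in_any_range is a hypothetical helper name.

```python
from bisect import bisect_right

# Assumed input: inclusive, non-overlapping ranges sorted by lower bound.
ranges = [(4, 10), (20, 25)]
bottoms = [bot for bot, _ in ranges]


def in_any_range(idx):
    """Return True if idx lies inside one of the sorted, disjoint ranges."""
    # Position of the last range whose lower bound is <= idx (-1 if none).
    pos = bisect_right(bottoms, idx) - 1
    return pos >= 0 and idx <= ranges[pos][1]


hits = [idx for idx in range(120) if in_any_range(idx)]
```

Each lookup is a binary search over the lower bounds, so its cost grows with the logarithm of the number of ranges.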