Indices of parts of data containing null values

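I am looking for an algorithm that lets me search a Series for all gaps (NaNs) and get their indices, where the indices mark the start and end of each "partition". I could not find a solution, so I ended up writing my own code. Everything works, except that both of my approaches seem a bit slow. I would like to know whether there is a way to optimize the code.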

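I tried two approaches. The first runs a simple for loop over the NaN indices and checks them for continuity. The other drops the NaN values and then uses a list comprehension to check whether the remaining indices are contiguous. The latter approach is faster.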

I wonder whether there is a better way to improve the speed, or whether I missed something that is already built in. Thanks.


Data:

import numpy as np
import pandas as pd
from time import time  # used by the timing code in both approaches below

# Create an object with sample data
w = pd.Series(np.sin(2*np.pi*np.linspace(0,1,2880)))
# Insert a few gaps with missing values
for i in np.arange(0, 1500, 200):
    w.loc[w.index[0]+i:w.index[0]+i+100] = np.nan
w.loc[2880-100:] = np.nan


First approach:

# Get indices
# `l_nans` stores the first and the last index of each gap
t0 = time()
for c in range(1000):
    i_nans = w[w.isnull()].index.to_numpy()
    len_nans = i_nans.shape[0]
    f, l, p, n = np.nan, np.nan, np.nan, np.nan
    l_nans = list()
    i = 0
    for i, e in enumerate(i_nans.tolist()):
        if not np.isnan(n):
            p = n
        n = e
        if np.isnan(f):
            f = e
        if (n-p) > 1:
            l = p
            l_nans.append((f, l))
            f, l = e, np.nan
        if i == len_nans-1:
            l = n
            l_nans.append((f, l))
print(l_nans)
print(time() - t0)

[(0, 100), (200, 300), (400, 500), (600, 700), (800, 900), (1000, 1100), (1200, 1300), (1400, 1500), (2780, 2879)]
3.1106319427490234


Second approach:

# Get indices
# `l_nans` stores the first and the last index of each gap
t0 = time()
for c in range(1000):
    v = w.drop(w[w.isnull()].index, axis=0)
    l_nans = [(e[0]+1, e[1]-1) for e in zip(v.index[:-1], v.index[1:]) if e[1]-e[0] > 1]
    if not any(v.index.isin([w.index[0]])):
        l_nans.insert(0, (0, v.first_valid_index()-1))
    if not any(v.index.isin([w.index[-1]])):
        l_nans.append((v.last_valid_index()+1, w.index[-1]))
print(l_nans)
print(time() - t0)

[(0, 100), (200, 300), (400, 500), (600, 700), (800, 900), (1000, 1100), (1200, 1300), (1400, 1500), (2780, 2879)]
1.8505527973175049

Edit:

I realized that some parts of my real data contain single NaN values, so the sample data becomes:

import numpy as np
import pandas as pd

# Create an object with sample data
w = pd.Series(np.sin(2*np.pi*np.linspace(0,1,2880)))
# Insert a few gaps with missing values
for i in np.arange(0, 1500, 200):
    w.loc[w.index[0]+i:w.index[0]+i+100] = np.nan
w.loc[2880-100:] = np.nan
w.loc[1600] = np.nan
w.loc[1700] = np.nan

You can make this loop faster.

import pandas as pd
import numpy as np
import time

df = pd.Series(np.sin(2*np.pi*np.linspace(0,1,2880)))
for i in np.arange(0, 1500, 200):
    df.loc[df.index[0]+i:df.index[0]+i+100] = np.nan
df.loc[2880-100:] = np.nan


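start_time = time.time()
# Indices of the NaN values, plus a large sentinel that closes the final run
data = df.index[df.isnull()].tolist() + [10**6]

nan_range = []
start = data[0]
for i in range(len(data)-1):
    # A jump of more than 1 means the current run ends at data[i]
    if data[i+1] - data[i] > 1:
        end = data[i]
        nan_range.append((start, end))
        start = data[i+1]

end_time = time.time()
print(nan_range)
print('time = %f' % (end_time-start_time))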

Output:

[(0, 100), (200, 300), (400, 500), (600, 700), (800, 900), (1000, 1100), (1200, 1300), (1400, 1500), (2780, 2879)]

time = 0.000942
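
The same boundary detection can also be written without a Python-level loop. The following is only a minimal sketch, not part of the answer above: the function name `nan_ranges` and the small demo series are hypothetical, and it assumes the ranges should be reported as index labels.

```python
import numpy as np
import pandas as pd

def nan_ranges(series):
    """Return (first, last) index labels of each run of consecutive NaNs."""
    mask = series.isna().to_numpy()
    # Pad with False on both sides so runs touching either end are detected
    padded = np.concatenate(([False], mask, [False]))
    changes = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(changes == 1)     # position where a run begins
    ends = np.flatnonzero(changes == -1) - 1  # position where a run ends (inclusive)
    idx = series.index
    return list(zip(idx[starts].tolist(), idx[ends].tolist()))

# Small illustrative series (hypothetical data, not the data from the question)
s = pd.Series([np.nan, np.nan, 1.0, np.nan, 2.0, 3.0, np.nan])
print(nan_ranges(s))  # [(0, 1), (3, 3), (6, 6)]
```

Because the runs are found on the boolean mask rather than on the index values, this sketch does not need a sentinel and also reports runs of length one.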

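Here is another version. In short, we find the index values that hold NaN (one line), and then we find the start and end points of each run of consecutive NaNs.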

import numpy as np
import pandas as pd
import time

# Create an object with sample data
w = pd.Series(np.sin(2*np.pi*np.linspace(0,1,2880)))
# Insert a few gaps with missing values
for i in np.arange(0, 1500, 200):
    w.loc[w.index[0]+i:w.index[0]+i+100] = np.nan
w.loc[2880-100:] = np.nan

Most of the code is print statements:

start_time = time.time()

# find index such that w is NaN
idx = w[ w.isna() ].index

# find the break-points
# idx[1:] is the index (except the first value)
# idx[:-1] is the index (except the last value)
# this allows us to calculate distance from current to previous

print(f'[({idx[0]}, ', end='')

for curr, prev in zip(idx[1:], idx[:-1]):
    diff = curr - prev
    if diff > 1:
        print(f'{prev}),')
        print(f'({curr}, ', end='')
print(f'{idx[-1]})]')

end_time = time.time()
print('time = %f' % (end_time-start_time))

[(0, 100),
(200, 300),
(400, 500),
(600, 700),
(800, 900),
(1000, 1100),
(1200, 1300),
(1400, 1500),
(2780, 2879)]
time = 0.002937
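
If the ranges are needed as data instead of printed text, one possible pandas-only variant is to label each run of NaNs and take the first and last index of every label. This is a sketch under assumptions: the function name `nan_runs` and the demo series are hypothetical, and the approach (a cumulative sum over the non-NaN mask followed by `groupby`) is a swapped-in technique, not the code from the answer above.

```python
import numpy as np
import pandas as pd

def nan_runs(series):
    """Return (first, last) index labels of each run of consecutive NaNs."""
    mask = series.isna()
    if not mask.any():
        return []
    # Every non-NaN value increments the counter, so all NaNs in one
    # uninterrupted run share the same counter value
    run_id = (~mask).cumsum()[mask]
    labels = series.index[mask].to_series()
    # First and last index label within each run
    bounds = labels.groupby(run_id.to_numpy()).agg(["first", "last"])
    return list(bounds.itertuples(index=False, name=None))

# Small illustrative series (hypothetical data)
demo = pd.Series([1.0, np.nan, np.nan, 2.0, np.nan])
print(nan_runs(demo))
```

For the demo series, the NaNs at positions 1 and 2 form one run and the NaN at position 4 forms a second run of length one.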

You can use the intervals_extract recipe from https://www.geeksforgeeks.org/python-make-a-list-of-intervals-with-sequential-numbers/:

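import itertools

def intervals_extract(iterable):
    # Remove duplicates and sort so that consecutive values are adjacent
    iterable = sorted(set(iterable))
    # Value minus position is constant within a run of consecutive integers
    for key, group in itertools.groupby(enumerate(iterable),
                                        lambda t: t[1] - t[0]):
        group = list(group)
        # Yield the first and last value of the run
        yield [group[0][1], group[-1][1]]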

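itertools.groupby groups the data together for as long as the key function returns the same value. The key function here is the difference between each value and its position in the sorted sequence: it stays constant as long as the values are consecutive, and it changes when there is a gap. This is also why we use a set and sort it: to avoid duplicates or wrongly ordered values. We therefore get one iterator per interval (group). All that remains is to consume each iterator with the list function and yield its first and last values. For this case, printing the values directly would be a bit simpler, but written like this it is more general.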
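
To make the grouping key concrete, here is a small sketch on a made-up list (the list `values` is hypothetical). It prints the key that the lambda in the recipe computes for each element, and then the groups that `itertools.groupby` forms from it.

```python
import itertools

# Hypothetical sorted input with three runs of consecutive integers
values = [3, 4, 5, 9, 10, 20]

# Key used by the recipe: value minus its position in the sequence
keys = [v - i for i, v in enumerate(values)]
print(keys)  # [3, 3, 3, 6, 6, 15]

# Elements with the same key are adjacent, so groupby yields one group per run
runs = [
    [v for _, v in group]
    for _, group in itertools.groupby(enumerate(values), lambda t: t[1] - t[0])
]
print(runs)  # [[3, 4, 5], [9, 10], [20]]
```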

As input, simply use the indices where you have NaNs:

In [72]: list(intervals_extract(w[w.isna()].index))                                                                                                                                                        
Out[72]: 
[[0, 100],
 [200, 300],
 [400, 500],
 [600, 700],
 [800, 900],
 [1000, 1100],
 [1200, 1300],
 [1400, 1500],
 [1600, 1600],
 [1700, 1700],
 [2780, 2879]]
In [73]: %timeit list(intervals_extract(w[w.isna()].index))                                                                                                                                                
485 µs ± 5.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Edit: explained the idea behind the intervals_extract function.

Just a more optimized version of the loop:

w2 = w.index[w.isna()].tolist()
s = e = w2[0]
l_nans = []
for i in range(1, len(w2)):
    if w2[i] != 1 + e:
        l_nans.append((s, e))
        s = w2[i]
    e = w2[i]
if e - s >= 1:
    l_nans.append((s, e))

Output:

[(0, 100),
 (200, 300),
 (400, 500),
 (600, 700),
 (800, 900),
 (1000, 1100),
 (1200, 1300),
 (1400, 1500),
 (2780, 2879)]

Performance (%%timeit):

392 µs ± 9.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

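You have a start pointer and an end pointer, s and e. Simply assign ind[i] to e (e will always equal the last element you checked; ind is the list called w2 in the code). If ind[i] - e > 1, you have moved on to another range, so append the current range and set s to ind[i]; then repeat until the end.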

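Finally, since the loop can finish without the last range having been appended, check after the loop whether the last index minus the start is at least 1 (e - s >= 1 in the code); if so, the trailing indices form a range, so add it to the list.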
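
Note that the final check skips a trailing run that consists of a single NaN (s == e), while single-NaN runs in the middle are still appended inside the loop. Below is a minimal sketch of a variant that appends the last run unconditionally, so that a single trailing NaN is kept too. The function name `nan_index_ranges` is hypothetical, and the input is assumed to be a sorted list of integer indices such as `w.index[w.isna()].tolist()`.

```python
def nan_index_ranges(indices):
    """Group sorted integer indices into (start, end) runs of consecutive values."""
    if not indices:
        return []
    ranges = []
    s = e = indices[0]
    for i in indices[1:]:
        if i != e + 1:
            # Gap found: close the current run and start a new one
            ranges.append((s, e))
            s = i
        e = i
    # Always close the last run, even if it holds a single index
    ranges.append((s, e))
    return ranges

print(nan_index_ranges([0, 1, 2, 7, 10, 11, 15]))
# [(0, 2), (7, 7), (10, 11), (15, 15)]
```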