Indices of parts of data containing null values
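I am looking for an algorithm that lets me find the indices of all gaps (NaNs) in a Series, where the indices refer to the start and end of each "partition". I could not find an existing solution, so I ended up writing my own code. Everything works, except that both approaches seem a bit slow. I would like to know whether the code can be optimized.
I tried two approaches. The first runs a simple for loop over all the NaN indices and checks for continuity. The other drops the NaN values and then uses a list comprehension to check for continuity. The latter is faster.
I would like to know whether there is a better way to improve the speed, or whether I missed something that is already built in. Thanks.
Data: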
import numpy as np
import pandas as pd
# Create an object with sample data
w = pd.Series(np.sin(2*np.pi*np.linspace(0,1,2880)))
# Insert a few gaps with missing values
for i in np.arange(0, 1500, 200):
    w.loc[w.index[0]+i:w.index[0]+i+100] = np.nan
w.loc[2880-100:] = np.nan
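First approach:
from time import time
# Get indices
# `l_nans` stores the first and the last index of each gap
t0 = time()
for c in range(1000):
    i_nans = w[w.isnull()].index.to_numpy()
    len_nans = i_nans.shape[0]
    # f = first index of the current gap, l = last index of the current gap,
    # p = previous NaN index, n = current NaN index
    f, l, p, n = np.nan, np.nan, np.nan, np.nan
    l_nans = list()
    i = 0
    for i, e in enumerate(i_nans.tolist()):
        if not np.isnan(n):
            p = n
        n = e
        if np.isnan(f):
            f = e
        # A jump of more than 1 closes the current gap and opens a new one
        if (n-p) > 1:
            l = p
            l_nans.append((f, l))
            f, l = e, np.nan
        # The last NaN index closes the final gap
        if i == len_nans-1:
            l = n
            l_nans.append((f, l))
print(l_nans)
print(time() - t0)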
[(0, 100), (200, 300), (400, 500), (600, 700), (800, 900), (1000, 1100), (1200, 1300), (1400, 1500), (2780, 2879)]
3.1106319427490234
Second approach:
# Get indices
# `l_nans` stores the first and the last index of each gap
t0 = time()
for c in range(1000):
    # Keep only the valid values; gaps show up as jumps in the remaining index
    v = w.drop(w[w.isnull()].index, axis=0)
    l_nans = [(e[0]+1, e[1]-1) for e in zip(v.index[:-1], v.index[1:]) if e[1]-e[0] > 1]
    # Handle a gap at the very beginning of the Series
    if not any(v.index.isin([w.index[0]])):
        l_nans.insert(0, (0, v.first_valid_index()-1))
    # Handle a gap at the very end of the Series
    if not any(v.index.isin([w.index[-1]])):
        l_nans.append((v.last_valid_index()+1, w.index[-1]))
print(l_nans)
print(time() - t0)
[(0, 100), (200, 300), (400, 500), (600, 700), (800, 900), (1000, 1100), (1200, 1300), (1400, 1500), (2780, 2879)]
1.8505527973175049
Edit.
I realized that some parts of my real data contain single NaN values, so the sample data becomes:
import numpy as np
import pandas as pd
# Create an object with sample data
w = pd.Series(np.sin(2*np.pi*np.linspace(0,1,2880)))
# Insert a few gaps with missing values
for i in np.arange(0, 1500, 200):
    w.loc[w.index[0]+i:w.index[0]+i+100] = np.nan
w.loc[2880-100:] = np.nan
w.loc[1600] = np.nan
w.loc[1700] = np.nan
You can make this loop faster.
import pandas as pd
import numpy as np
import time
df = pd.Series(np.sin(2*np.pi*np.linspace(0,1,2880)))
for i in np.arange(0, 1500, 200):
    df.loc[df.index[0]+i:df.index[0]+i+100] = np.nan
df.loc[2880-100:] = np.nan
start_time = time.time()
# NaN indices plus a large sentinel, so the last range is closed inside the loop
# (assumes every real index is well below 10**6)
data = df.index[df.isnull()].tolist() + [10**6]
nan_range = []
start = data[0]
for i in range(len(data)-1):
    # A jump of more than 1 ends the current range
    if data[i] + 1 < data[i+1]:
        end = data[i]
        nan_range.append((start, end))
        start = data[i+1]
end_time = time.time()
print(nan_range)
print('time = %f' % (end_time-start_time))
Output:
[(0, 100), (200, 300), (400, 500), (600, 700), (800, 900), (1000, 1100), (1200, 1300), (1400, 1500), (2780, 2879)]
time = 0.000942
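Here is another version. In short, we find the index values where the Series is NaN (one line), and then we find the start and end points of each run of consecutive NaNs.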
import numpy as np
import pandas as pd
import time
# Create an object with sample data
w = pd.Series(np.sin(2*np.pi*np.linspace(0,1,2880)))
# Insert a few gaps with missing values
for i in np.arange(0, 1500, 200):
    w.loc[w.index[0]+i:w.index[0]+i+100] = np.nan
w.loc[2880-100:] = np.nan
Most of the code is print statements:
start_time = time.time()
# find index such that w is NaN
idx = w[ w.isna() ].index
# find the break-points
# idx[1:] is the index (except the first value)
# idx[:-1] is the index (except the last value)
# this allows us to calculate distance from current to previous
print(f'[({idx[0]}, ', end='')
for curr, prev in zip(idx[1:], idx[:-1]):
    diff = curr - prev
    if diff > 1:
        print(f'{prev}),')
        print(f'({curr}, ', end='')
print(f'{idx[-1]})]')
end_time = time.time()
print('time = %f' % (end_time-start_time))
[(0, 100),
(200, 300),
(400, 500),
(600, 700),
(800, 900),
(1000, 1100),
(1200, 1300),
(1400, 1500),
(2780, 2879)]
time = 0.002937
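The same break-point idea can be written without a Python-level loop over the NaN positions. The following is only a sketch, assuming the Series index is unique; the function name `nan_runs` is made up for illustration and is not part of pandas or NumPy. It returns the ranges as a list instead of printing them.

```python
import numpy as np
import pandas as pd

def nan_runs(series):
    """Return (first, last) index labels of each run of consecutive NaNs."""
    # 0-based positions of the missing values
    pos = np.flatnonzero(series.isna().to_numpy())
    if pos.size == 0:
        return []
    # A run ends wherever the step to the next NaN position exceeds 1
    breaks = np.flatnonzero(np.diff(pos) > 1)
    starts = np.concatenate(([pos[0]], pos[breaks + 1]))
    ends = np.concatenate((pos[breaks], [pos[-1]]))
    labels = series.index
    # Map positions back to index labels as plain Python scalars
    return list(zip(labels[starts].tolist(), labels[ends].tolist()))

# Sample data from the question, plus one isolated NaN
w = pd.Series(np.sin(2 * np.pi * np.linspace(0, 1, 2880)))
for i in np.arange(0, 1500, 200):
    w.loc[i:i + 100] = np.nan
w.loc[2880 - 100:] = np.nan
w.loc[1600] = np.nan

runs = nan_runs(w)
print(runs)
```

Because it works on positions and only maps back to labels at the end, an isolated NaN simply comes out as a range whose start and end are equal.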
You can use the intervals_extract recipe from https://www.geeksforgeeks.org/python-make-a-list-of-intervals-with-sequential-numbers/
import itertools
def intervals_extract(iterable):
    iterable = sorted(set(iterable))
    for key, group in itertools.groupby(enumerate(iterable),
                                        lambda t: t[1] - t[0]):
        group = list(group)
        yield [group[0][1], group[-1][1]]
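itertools.groupby groups the data together as long as the key function returns the same value. Here the key function is the difference between each value and its position in the sorted sequence (t[1] - t[0] on the pairs produced by enumerate). That difference stays constant as long as the values are consecutive and grows whenever there is a jump. This is also why we build a set and sort it: to avoid duplicated or wrongly ordered values. As a result we get one iterator per interval (group). All that remains is to consume each iterator with the list function and yield its first and last value. For this case, printing the values directly would be slightly simpler, but this way it is more general.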
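To make the key function concrete, here is a small sketch on a hand-made list (the list `values` is made up for illustration). It prints the key computed for every element and then the intervals that groupby builds from those keys.

```python
import itertools

values = [3, 4, 5, 9, 10, 20]  # already sorted and free of duplicates

# Key for each element: the value minus its position in the sequence
keys = [v - i for i, v in enumerate(values)]
print(keys)  # [3, 3, 3, 6, 6, 15]

# Neighbouring elements that share a key belong to the same interval
intervals = []
for _, group in itertools.groupby(enumerate(values), lambda t: t[1] - t[0]):
    group = list(group)
    intervals.append((group[0][1], group[-1][1]))
print(intervals)  # [(3, 5), (9, 10), (20, 20)]
```

A value with no neighbours, such as 20 here, ends up alone in its group, which is why single NaNs are reported as intervals like [1600, 1600].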
As input, simply use the index where you have NaNs:
In [72]: list(intervals_extract(w[w.isna()].index))
Out[72]:
[[0, 100],
[200, 300],
[400, 500],
[600, 700],
[800, 900],
[1000, 1100],
[1200, 1300],
[1400, 1500],
[1600, 1600],
[1700, 1700],
[2780, 2879]]
In [73]: %timeit list(intervals_extract(w[w.isna()].index))
485 µs ± 5.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Edit: explained the idea behind the intervals_extract function.
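Just a more optimized version of the loop:
w2 = w.index[w.isna()].tolist()
# s = start of the current range, e = last index seen so far
s = e = w2[0]
l_nans = []
for i in range(1, len(w2)):
    # A non-consecutive index closes the current range and starts a new one
    if w2[i] != 1 + e:
        l_nans.append((s, e))
        s = w2[i]
    e = w2[i]
# Append the final range, which is still open when the loop ends
l_nans.append((s, e))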
Output:
[(0, 100),
(200, 300),
(400, 500),
(600, 700),
(800, 900),
(1000, 1100),
(1200, 1300),
(1400, 1500),
(2780, 2879)]
Performance (%%timeit):
392 µs ± 9.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
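You keep a start pointer and an end pointer, s and e. Each element w2[i] is assigned to e, so e always equals the last element you checked. If w2[i] - e > 1, you have moved on to another range, so you add the current range to the list and set s to w2[i]. Repeat until the end.
Finally, because the loop only appends a range when the next one begins, the last range is still pending when the loop ends, so it has to be added to the list after the loop.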
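The two-pointer idea described above can be wrapped in a small function so it is easy to try on hand-made input. This is only a sketch: the name `nan_ranges` is made up for illustration, and it assumes the input is a sorted list of integer indices without duplicates.

```python
def nan_ranges(indices):
    """Collapse sorted integer indices into inclusive (start, end) ranges."""
    if not indices:
        return []
    ranges = []
    # s = start of the current range, e = last index seen so far
    s = e = indices[0]
    for x in indices[1:]:
        if x != e + 1:
            # x is not adjacent to e, so the current range is complete
            ranges.append((s, e))
            s = x
        e = x
    # The last range is still open when the loop ends
    ranges.append((s, e))
    return ranges

result = nan_ranges([0, 1, 2, 7, 10, 11, 15])
print(result)  # [(0, 2), (7, 7), (10, 11), (15, 15)]
```

With the Series from the question it would be called as `nan_ranges(w.index[w.isna()].tolist())`.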