Python:滑动窗口均值,忽略缺失数据
Python: Sliding windowed mean, ignoring missing data
我目前正在尝试处理一个包含缺失值的实验性时间序列数据集。我想计算这个数据集随时间的滑动 windowed 平均值,同时处理 nan 值。我这样做的正确方法是在每个 window 内计算有限元素的总和并将其除以它们的数量。这种非线性迫使我使用非卷积方法来面对这个问题,因此我在这部分过程中遇到了严重的时间瓶颈。作为我试图完成的代码示例,我提出以下内容:
import numpy as np
#Construct sample data
n = 50
n_miss = 20
win_size = 3
data= np.random.random(50)
data[np.random.randint(0,n-1, n_miss)] = None
#Compute mean
result = np.zeros(data.size)
for count in range(data.size):
part_data = data[max(count - (win_size - 1) / 2, 0): min(count + (win_size + 1) / 2, data.size)]
mask = np.isfinite(part_data)
if np.sum(mask) != 0:
result[count] = np.sum(part_data[mask]) / np.sum(mask)
else:
result[count] = None
print 'Input:\t',data
print 'Output:\t',result
输出:
Input: [ 0.47431791 0.17620835 0.78495647 0.79894688 0.58334064 0.38068788
0.87829696 nan 0.71589171 nan 0.70359557 0.76113969
0.13694387 0.32126573 0.22730891 nan 0.35057169 nan
0.89251851 0.56226354 0.040117 nan 0.37249799 0.77625334
nan nan nan nan 0.63227417 0.92781944
0.99416471 0.81850753 0.35004997 nan 0.80743783 0.60828597
nan 0.01410721 nan nan 0.6976317 nan
0.03875394 0.60924066 0.22998065 nan 0.34476729 0.38090961
nan 0.2021964 ]
Output: [ 0.32526313 0.47849424 0.5867039 0.72241466 0.58765847 0.61410849
0.62949242 0.79709433 0.71589171 0.70974364 0.73236763 0.53389305
0.40644977 0.22850617 0.27428732 0.2889403 0.35057169 0.6215451
0.72739103 0.49829968 0.30119027 0.20630749 0.57437567 0.57437567
0.77625334 nan nan 0.63227417 0.7800468 0.85141944
0.91349722 0.7209074 0.58427875 0.5787439 0.7078619 0.7078619
0.31119659 0.01410721 0.01410721 0.6976317 0.6976317 0.36819282
0.3239973 0.29265842 0.41961066 0.28737397 0.36283845 0.36283845
0.29155301 0.2021964 ]
是否可以在不使用 for 循环的情况下通过 numpy 操作生成此结果?
您可以使用 Pandas 的 rolling
函数来做到这一点:
import numpy as np
import pandas as pd
#Construct sample data
n = 50
n_miss = 20
win_size = 3
data = np.random.random(n)
data[np.random.randint(0, n-1, n_miss)] = None
windowed_mean = pd.Series(data).rolling(window=win_size, min_periods=1).mean()
print(pd.DataFrame({'Data': data, 'Windowed mean': windowed_mean}) )
输出:
Data Windowed mean
0 0.589376 0.589376
1 0.639173 0.614274
2 0.343534 0.524027
3 0.250329 0.411012
4 0.911952 0.501938
5 NaN 0.581141
6 0.224964 0.568458
7 NaN 0.224964
8 0.508419 0.366692
9 0.215418 0.361918
10 NaN 0.361918
11 0.638118 0.426768
12 0.587478 0.612798
13 0.097037 0.440878
14 0.688689 0.457735
15 0.858593 0.548107
16 0.408903 0.652062
17 0.448993 0.572163
18 NaN 0.428948
19 0.877453 0.663223
20 NaN 0.877453
21 NaN 0.877453
22 0.021798 0.021798
23 0.482054 0.251926
24 0.092387 0.198746
25 0.251766 0.275402
26 0.093854 0.146002
27 NaN 0.172810
28 NaN 0.093854
29 NaN NaN
30 0.965669 0.965669
31 0.695999 0.830834
32 NaN 0.830834
33 NaN 0.695999
34 NaN NaN
35 0.613727 0.613727
36 0.837533 0.725630
37 NaN 0.725630
38 0.782295 0.809914
39 NaN 0.782295
40 0.777429 0.779862
41 0.401355 0.589392
42 0.491709 0.556831
43 0.127813 0.340292
44 0.781625 0.467049
45 0.960466 0.623301
46 0.637618 0.793236
47 0.651264 0.749782
48 0.154911 0.481264
49 0.159145 0.321773
这是一个基于卷积的方法,使用 np.convolve
-
mask = np.isnan(data)
K = np.ones(win_size,dtype=int)
out = np.convolve(np.where(mask,0,data), K)/np.convolve(~mask,K)
请注意,这将在两侧各有一个额外的元素。
如果您正在处理 2D
数据,我们可以使用 Scipy's 2D convolution
。
接近 -
def original_app(data, win_size):
#Compute mean
result = np.zeros(data.size)
for count in range(data.size):
part_data = data[max(count - (win_size - 1) / 2, 0): \
min(count + (win_size + 1) / 2, data.size)]
mask = np.isfinite(part_data)
if np.sum(mask) != 0:
result[count] = np.sum(part_data[mask]) / np.sum(mask)
else:
result[count] = None
return result
def numpy_app(data, win_size):
mask = np.isnan(data)
K = np.ones(win_size,dtype=int)
out = np.convolve(np.where(mask,0,data), K)/np.convolve(~mask,K)
return out[1:-1] # Slice out the one-extra elems on sides
样本运行-
In [118]: #Construct sample data
...: n = 50
...: n_miss = 20
...: win_size = 3
...: data= np.random.random(50)
...: data[np.random.randint(0,n-1, n_miss)] = np.nan
...:
In [119]: original_app(data, win_size = 3)
Out[119]:
array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,
nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,
0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,
0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,
0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,
0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,
0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,
0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,
0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,
0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])
In [120]: numpy_app(data, win_size = 3)
__main__:36: RuntimeWarning: invalid value encountered in divide
Out[120]:
array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,
nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,
0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,
0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,
0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,
0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,
0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,
0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,
0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,
0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])
运行时测试 -
In [122]: #Construct sample data
...: n = 50000
...: n_miss = 20000
...: win_size = 3
...: data= np.random.random(n)
...: data[np.random.randint(0,n-1, n_miss)] = np.nan
...:
In [123]: %timeit original_app(data, win_size = 3)
1 loops, best of 3: 1.51 s per loop
In [124]: %timeit numpy_app(data, win_size = 3)
1000 loops, best of 3: 1.09 ms per loop
In [125]: import pandas as pd
# @jdehesa's pandas solution
In [126]: %timeit pd.Series(data).rolling(window=3, min_periods=1).mean()
100 loops, best of 3: 3.34 ms per loop
我目前正在尝试处理一个包含缺失值的实验性时间序列数据集。我想计算这个数据集随时间的滑动 windowed 平均值,同时处理 nan 值。我这样做的正确方法是在每个 window 内计算有限元素的总和并将其除以它们的数量。这种非线性迫使我使用非卷积方法来面对这个问题,因此我在这部分过程中遇到了严重的时间瓶颈。作为我试图完成的代码示例,我提出以下内容:
import numpy as np
#Construct sample data
n = 50
n_miss = 20
win_size = 3
data= np.random.random(50)
data[np.random.randint(0,n-1, n_miss)] = None
#Compute mean
result = np.zeros(data.size)
for count in range(data.size):
part_data = data[max(count - (win_size - 1) / 2, 0): min(count + (win_size + 1) / 2, data.size)]
mask = np.isfinite(part_data)
if np.sum(mask) != 0:
result[count] = np.sum(part_data[mask]) / np.sum(mask)
else:
result[count] = None
print 'Input:\t',data
print 'Output:\t',result
输出:
Input: [ 0.47431791 0.17620835 0.78495647 0.79894688 0.58334064 0.38068788
0.87829696 nan 0.71589171 nan 0.70359557 0.76113969
0.13694387 0.32126573 0.22730891 nan 0.35057169 nan
0.89251851 0.56226354 0.040117 nan 0.37249799 0.77625334
nan nan nan nan 0.63227417 0.92781944
0.99416471 0.81850753 0.35004997 nan 0.80743783 0.60828597
nan 0.01410721 nan nan 0.6976317 nan
0.03875394 0.60924066 0.22998065 nan 0.34476729 0.38090961
nan 0.2021964 ]
Output: [ 0.32526313 0.47849424 0.5867039 0.72241466 0.58765847 0.61410849
0.62949242 0.79709433 0.71589171 0.70974364 0.73236763 0.53389305
0.40644977 0.22850617 0.27428732 0.2889403 0.35057169 0.6215451
0.72739103 0.49829968 0.30119027 0.20630749 0.57437567 0.57437567
0.77625334 nan nan 0.63227417 0.7800468 0.85141944
0.91349722 0.7209074 0.58427875 0.5787439 0.7078619 0.7078619
0.31119659 0.01410721 0.01410721 0.6976317 0.6976317 0.36819282
0.3239973 0.29265842 0.41961066 0.28737397 0.36283845 0.36283845
0.29155301 0.2021964 ]
是否可以在不使用 for 循环的情况下通过 numpy 操作生成此结果?
您可以使用 Pandas 的 rolling
函数来做到这一点:
import numpy as np
import pandas as pd
#Construct sample data
n = 50
n_miss = 20
win_size = 3
data = np.random.random(n)
data[np.random.randint(0, n-1, n_miss)] = None
windowed_mean = pd.Series(data).rolling(window=win_size, min_periods=1).mean()
print(pd.DataFrame({'Data': data, 'Windowed mean': windowed_mean}) )
输出:
Data Windowed mean
0 0.589376 0.589376
1 0.639173 0.614274
2 0.343534 0.524027
3 0.250329 0.411012
4 0.911952 0.501938
5 NaN 0.581141
6 0.224964 0.568458
7 NaN 0.224964
8 0.508419 0.366692
9 0.215418 0.361918
10 NaN 0.361918
11 0.638118 0.426768
12 0.587478 0.612798
13 0.097037 0.440878
14 0.688689 0.457735
15 0.858593 0.548107
16 0.408903 0.652062
17 0.448993 0.572163
18 NaN 0.428948
19 0.877453 0.663223
20 NaN 0.877453
21 NaN 0.877453
22 0.021798 0.021798
23 0.482054 0.251926
24 0.092387 0.198746
25 0.251766 0.275402
26 0.093854 0.146002
27 NaN 0.172810
28 NaN 0.093854
29 NaN NaN
30 0.965669 0.965669
31 0.695999 0.830834
32 NaN 0.830834
33 NaN 0.695999
34 NaN NaN
35 0.613727 0.613727
36 0.837533 0.725630
37 NaN 0.725630
38 0.782295 0.809914
39 NaN 0.782295
40 0.777429 0.779862
41 0.401355 0.589392
42 0.491709 0.556831
43 0.127813 0.340292
44 0.781625 0.467049
45 0.960466 0.623301
46 0.637618 0.793236
47 0.651264 0.749782
48 0.154911 0.481264
49 0.159145 0.321773
这是一个基于卷积的方法,使用 np.convolve
-
mask = np.isnan(data)
K = np.ones(win_size,dtype=int)
out = np.convolve(np.where(mask,0,data), K)/np.convolve(~mask,K)
请注意,这将在两侧各有一个额外的元素。
如果您正在处理 2D
数据,我们可以使用 Scipy's 2D convolution
。
接近 -
def original_app(data, win_size):
#Compute mean
result = np.zeros(data.size)
for count in range(data.size):
part_data = data[max(count - (win_size - 1) / 2, 0): \
min(count + (win_size + 1) / 2, data.size)]
mask = np.isfinite(part_data)
if np.sum(mask) != 0:
result[count] = np.sum(part_data[mask]) / np.sum(mask)
else:
result[count] = None
return result
def numpy_app(data, win_size):
mask = np.isnan(data)
K = np.ones(win_size,dtype=int)
out = np.convolve(np.where(mask,0,data), K)/np.convolve(~mask,K)
return out[1:-1] # Slice out the one-extra elems on sides
样本运行-
In [118]: #Construct sample data
...: n = 50
...: n_miss = 20
...: win_size = 3
...: data= np.random.random(50)
...: data[np.random.randint(0,n-1, n_miss)] = np.nan
...:
In [119]: original_app(data, win_size = 3)
Out[119]:
array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,
nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,
0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,
0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,
0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,
0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,
0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,
0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,
0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,
0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])
In [120]: numpy_app(data, win_size = 3)
__main__:36: RuntimeWarning: invalid value encountered in divide
Out[120]:
array([ 0.88356487, 0.86829731, 0.85249541, 0.83776219, nan,
nan, 0.61054015, 0.63111926, 0.63111926, 0.65169837,
0.1857301 , 0.58335324, 0.42088104, 0.5384565 , 0.31027752,
0.40768907, 0.3478563 , 0.34089655, 0.55462903, 0.71784816,
0.93195716, nan, 0.41635575, 0.52211653, 0.65053379,
0.76762282, 0.72888574, 0.35250449, 0.35250449, 0.14500637,
0.06997668, 0.22582318, 0.18621848, 0.36320784, 0.19926647,
0.24506199, 0.09983572, 0.47595439, 0.79792941, 0.5982114 ,
0.42389375, 0.28944089, 0.36246113, 0.48088139, 0.71105449,
0.60234163, 0.40012839, 0.45100475, 0.41768466, 0.41768466])
运行时测试 -
In [122]: #Construct sample data
...: n = 50000
...: n_miss = 20000
...: win_size = 3
...: data= np.random.random(n)
...: data[np.random.randint(0,n-1, n_miss)] = np.nan
...:
In [123]: %timeit original_app(data, win_size = 3)
1 loops, best of 3: 1.51 s per loop
In [124]: %timeit numpy_app(data, win_size = 3)
1000 loops, best of 3: 1.09 ms per loop
In [125]: import pandas as pd
# @jdehesa's pandas solution
In [126]: %timeit pd.Series(data).rolling(window=3, min_periods=1).mean()
100 loops, best of 3: 3.34 ms per loop