Smoothing a series of weighted values in numpy/pandas
I have a pandas DataFrame of measurements and corresponding weights:
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': np.random.randn(1000), 'w': np.random.rand(1000)})
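I want to smooth the measurement values (x) while taking the element-wise weights (w) into account. This is independent of the weighting of the sliding window, which I also want to apply (e.g. a triangle window, or something fancier). So, to compute the smoothed value within each window, the function should weight the sliced elements of x not only by the window function (e.g. triangle) but also by the corresponding elements in w.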
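To make the requirement concrete, here is a minimal sketch for a single window of width 3. The numbers are made up for illustration and are not from the real data; the point is only that the window weights and the per-element weights multiply before averaging.

```python
import numpy as np

# One window of three measurements and their per-element weights
# (illustrative values only).
x_win = np.array([1.0, 2.0, 4.0])
w_win = np.array([0.5, 1.0, 0.25])

# Triangle window of the same width, peaking at the centre element.
triangle = np.array([0.5, 1.0, 0.5])

# Both sets of weights are combined element-wise, then used in one average.
combined = triangle * w_win
smoothed = np.average(x_win, weights=combined)
print(smoothed)  # 2.0
```

The same value comes out of (combined * x_win).sum() / combined.sum(), which is the form that matters for speeding things up.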
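As far as I can tell, pd.rolling_apply won't do this because it applies the given function to x and w separately. Similarly, pd.rolling_window doesn't take the element-wise weights of the source DataFrame into account; the weighted window (e.g. 'triangle') can be user-defined, but it is fixed in advance.
Here's my slow implementation: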
def rolling_weighted_triangle(x, w, window_size):
    """Smooth with triangle window, also using per-element weights."""
    # Simplify slicing
    wing = window_size // 2
    # Pad both arrays with mirror-image values at edges
    xp = np.r_[x[wing-1::-1], x, x[:-wing-1:-1]]
    wp = np.r_[w[wing-1::-1], w, w[:-wing-1:-1]]
    # Generate a (triangular) window of weights to slide.
    # Integer steps keep the ramp length exact; a float step in np.arange
    # can yield one element too many (e.g. for wing == 2).
    incr = 1. / (wing + 1)
    ramp = np.arange(1, wing + 1) * incr
    triangle = np.r_[ramp, 1.0, ramp[::-1]]
    # Apply both sets of weights over each window
    slices = (slice(i - wing, i + wing + 1) for i in range(wing, len(x) + wing))
    out = (np.average(xp[slc], weights=triangle * wp[slc]) for slc in slices)
    return np.fromiter(out, x.dtype)
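How can I speed this up with numpy/scipy/pandas?
The DataFrames can already occupy a significant portion of RAM (10k to 200M rows), so e.g. pre-allocating a 2D array of window-weights-per-element is too much. I'm trying to minimize the use of temporary arrays, perhaps using np.lib.stride_tricks.as_strided and np.apply_along_axis or np.convolve, but I haven't found anything that fully replicates the above.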
Here's the equivalent with a uniform window rather than a triangle; it's close, but not quite what I need:
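from numpy.lib.stride_tricks import as_strided

def get_sliding_window(a, width):
    """Sliding window over a 2D array.

    Source:
    """
    # NB: a = df.values or np.vstack([x, y]).T
    s0, s1 = a.strides
    m, n = a.shape
    # Reusing s0 for the first two axes makes consecutive windows overlap
    return as_strided(a,
                      shape=(m-width+1, width, n),
                      strides=(s0, s0, s1))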
def rolling_weighted_average(x, w, window_size):
    """Rolling weighted average with a uniform 'boxcar' window."""
    wing = window_size // 2
    window_size = 2 * wing + 1
    xp = np.r_[x[wing-1::-1], x, x[:-wing-1:-1]]
    wp = np.r_[w[wing-1::-1], w, w[:-wing-1:-1]]
    x_w = np.vstack([xp, wp]).T
    wins = get_sliding_window(x_w, window_size)
    # TODO - apply triangle window weights - multiply over wins[:,:,1]?
    result = np.average(wins[:,:,0], axis=1, weights=wins[:,:,1])
    return result
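One way the TODO above might be filled in is sketched below. This is an assumption-laden variant, not the original code: it swaps as_strided for np.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later), uses np.pad with mode="symmetric" for the mirror padding, and lets np.einsum reduce over the window axis. The function name rolling_weighted_triangle_view is made up for this example. As far as I understand, einsum contracts the views directly and should not build the full (n, width) product array, but that has not been measured here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20


def rolling_weighted_triangle_view(x, w, window_size):
    """Triangle-window smoothing with per-element weights via strided views."""
    wing = window_size // 2
    width = 2 * wing + 1
    # Mirror-pad both arrays at the edges (edge value included in the mirror)
    xp = np.pad(np.asarray(x, dtype=float), wing, mode="symmetric")
    wp = np.pad(np.asarray(w, dtype=float), wing, mode="symmetric")
    # Triangle weights: 1/(wing+1), ..., 1, ..., 1/(wing+1)
    ramp = np.arange(1, wing + 1) / (wing + 1)
    triangle = np.r_[ramp, 1.0, ramp[::-1]]
    # Read-only overlapping views of shape (len(x), width); no data is copied
    x_wins = sliding_window_view(xp, width)
    w_wins = sliding_window_view(wp, width)
    # Weighted sums over the window axis for numerator and denominator
    num = np.einsum("ij,ij,j->i", x_wins, w_wins, triangle)
    den = np.einsum("ij,j->i", w_wins, triangle)
    return num / den
```

Usage would look like rolling_weighted_triangle_view(df['x'].values, df['w'].values, 7).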
You can simply use convolution there, like so:
def rolling_weighted_triangle_conv(x, w, window_size):
    """Smooth with triangle window, also using per-element weights."""
    # Simplify slicing
    wing = window_size // 2
    # Pad both arrays with mirror-image values at edges
    xp = np.concatenate(( x[wing-1::-1], x, x[:-wing-1:-1] ))
    wp = np.concatenate(( w[wing-1::-1], w, w[:-wing-1:-1] ))
    # Generate a (triangular) window of weights to slide.
    # Integer steps keep the ramp length exact for every window size.
    incr = 1. / (wing + 1)
    ramp = np.arange(1, wing + 1) * incr
    triangle = np.r_[ramp, 1.0, ramp[::-1]]
    # mode='valid' keeps only full overlaps of the window with the padded
    # arrays, i.e. one output per input element (also for even window_size)
    D = np.convolve(wp*xp, triangle, mode='valid')
    N = np.convolve(wp, triangle, mode='valid')
    return D/N
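The reason this works: the weighted average in each window is a ratio of two sliding sums, sum(t*w*x) / sum(t*w), and each of those is a convolution of a padded array with the window t. The same idea should carry over to other windows ("something fancier"). Below is a minimal sketch under a few assumptions: the function name rolling_weighted_window is made up here, the window is passed in as a 1D array of odd length, np.pad with mode="symmetric" stands in for the manual mirror padding, and np.correlate is used instead of np.convolve so that an asymmetric window is not flipped.

```python
import numpy as np


def rolling_weighted_window(x, w, window):
    """Smooth x using per-element weights w and an arbitrary odd-length window."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    window = np.asarray(window, dtype=float)
    if window.size % 2 != 1:
        raise ValueError("window length must be odd")
    wing = window.size // 2
    # Mirror-pad both arrays so every element gets a full window
    xp = np.pad(x, wing, mode="symmetric")
    wp = np.pad(w, wing, mode="symmetric")
    # Sliding sums of window * w * x (numerator) and window * w (denominator);
    # 'valid' returns exactly len(x) values
    num = np.correlate(wp * xp, window, mode="valid")
    den = np.correlate(wp, window, mode="valid")
    return num / den
```

For example, rolling_weighted_window(x, w, np.hamming(7)) would smooth with a Hamming window. Windows whose weights can make the denominator zero in some window would need extra care.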
Runtime test
In [265]: x = np.random.randn(1000)
...: w = np.random.rand(1000)
...: WSZ = 7
...:
In [266]: out1 = rolling_weighted_triangle(x, w, window_size=WSZ)
...: out2 = rolling_weighted_triangle_conv(x, w, window_size=WSZ)
...: print(np.allclose(out1, out2))
...:
True
In [267]: %timeit rolling_weighted_triangle(x, w, window_size=WSZ)
...: %timeit rolling_weighted_triangle_conv(x, w, window_size=WSZ)
...:
100 loops, best of 3: 10.2 ms per loop
10000 loops, best of 3: 32.9 µs per loop
A 300x+ speedup there!
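On the memory concern from the question: the convolution approach still allocates a handful of full-length temporaries (the padded arrays, their product, and both convolution outputs). If that is too much at 200M rows, one option is to process the data in chunks that overlap by the half-window. The sketch below is only an illustration under assumptions: the name rolling_weighted_triangle_chunked and the chunk parameter are hypothetical, and the default chunk size is arbitrary rather than tuned.

```python
import numpy as np


def rolling_weighted_triangle_chunked(x, w, window_size, chunk=1_000_000):
    """Chunked triangle smoothing; temporaries stay around `chunk` elements."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    n = len(x)
    wing = window_size // 2
    ramp = np.arange(1, wing + 1) / (wing + 1)
    triangle = np.r_[ramp, 1.0, ramp[::-1]]
    out = np.empty(n)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # Borrow up to `wing` real neighbours on each side of the chunk
        lo, hi = max(start - wing, 0), min(stop + wing, n)
        # Mirror-pad only where neighbours are missing (true data edges)
        pad = (wing - (start - lo), wing - (hi - stop))
        xs = np.pad(x[lo:hi], pad, mode="symmetric")
        ws = np.pad(w[lo:hi], pad, mode="symmetric")
        # Same ratio of sliding sums as above, restricted to this chunk
        num = np.convolve(ws * xs, triangle, mode="valid")
        den = np.convolve(ws, triangle, mode="valid")
        out[start:stop] = num / den
    return out
```

The output array itself is still full length; only the intermediate arrays shrink.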