Looking for fastest way to find values between thresholds in xarray of 3600x20x20
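I want to count the number of days on which the mean temperature lies between two values (say 293 K and 303 K). This has to be computed for a large array of roughly 10000x20x20 (time, latitude, longitude). At that size, the efficiency of the code becomes an issue. I know loops are inefficient, but I have not been able to think of another way to write it.
In short, I am looking for something more efficient than the code I have inserted below. Any tips or references are welcome!
(Besides the above, I am fairly new to Python, so any other feedback is appreciated as well!)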
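At first I had three nested loops (for i .. for j ... for k ..), but that took roughly 100 times as long. Using 1*(boolean) proved to be more efficient. I am now trying to get rid of my last loop (for i ..). Speed matters a lot because this script will be part of an interactive web application.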
import xarray as xr
import numpy as np
import time
# Firstly construct a data array of temperatures with dimensions latitude, longitude, time
da_t1 = xr.DataArray([[290, 295, 300, 305, 295],
[295, 295, 305, 295, 290],
[300, 300, 300, 305, 295],
[290, 295, 300, 305, 295],
[290, 295, 300, 305, 295]],
dims=['lat', 'lon'],
coords={'lat': [-5, -2.5, 0, 2.5, 5], 'lon': [33, 35, 37, 39, 41]})
da_t2 = xr.DataArray([[295, 295, 305, 295, 295],
[295, 295, 305, 295, 290],
[300, 300, 300, 305, 295],
[290, 300, 300, 305, 305],
[290, 285, 285, 285, 295]],
dims=['lat', 'lon'],
coords={'lat': [-5, -2.5, 0, 2.5, 5], 'lon': [33, 35, 37, 39, 41]})
da = xr.concat([da_t1, da_t2], 'time')
# Create an array of zeros to keep track of number of days within certain temperature range for each cell
zeros = da[0]
zeros.values = np.zeros((da.sizes['lat'], da.sizes['lon']))
# Loop through the timesteps and the cells to count for each cell the number of days in the temperature range
trange = (293,303)
# Here's the part that could use faster performance
start = time.time()
for i in range(0, (len(da.time))):
    int_array = 1*(da.values[i] >= trange[0]) * (da.values[i] <= trange[1])
    zeros = zeros + int_array
end = time.time()
print('time elapsed: ',end-start)
print(zeros.values)
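The result is an array showing, for each cell, the number of days within the specified temperature range over the selected period. In this case: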
zeros =
[[1. 2. 1. 1. 2.]
[2. 2. 0. 2. 0.]
[2. 2. 2. 0. 2.]
[0. 2. 2. 0. 1.]
[0. 1. 1. 0. 2.]]
Just use element-wise boolean/logical indexing, for example:
in_between = np.logical_and(da.values[i] >= trange[0], da.values[i] <= trange[1])
sum_in_between = np.count_nonzero(in_between) # True = 1, False = 0
https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#boolean-array-indexing
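Note that the snippet above still works on a single time slice (da.values[i]), and np.count_nonzero without an axis argument returns one total for that slice rather than a count per cell. A minimal sketch of how the same idea might be applied to the whole array at once, assuming time is the first axis; the names temps, lo and hi are illustrative and not taken from the question:

```python
import numpy as np

# Two time steps on a 2x2 grid, shaped (time, lat, lon); values are illustrative.
temps = np.array([[[290, 295],
                   [300, 305]],
                  [[295, 295],
                   [305, 290]]])
lo, hi = 293, 303

# Element-wise mask over the full array, with no Python-level loop.
in_between = np.logical_and(temps >= lo, temps <= hi)

# Counting along axis 0 (time) yields one count per grid cell.
days_in_range = np.count_nonzero(in_between, axis=0)
print(days_in_range)
# [[1 2]
#  [1 0]]
```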
My approach would be:
((da >= trange[0]) & (da <= trange[1])).sum(axis=0)
Result:
# <xarray.DataArray (lat: 5, lon: 5)>
# array([[1, 2, 1, 1, 2],
# [2, 2, 0, 2, 0],
# [2, 2, 2, 0, 2],
# [0, 2, 2, 0, 1],
# [0, 1, 1, 0, 2]])
# Coordinates:
# * lat (lat) float64 -5.0 -2.5 0.0 2.5 5.0
# * lon (lon) int32 33 35 37 39 41
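Because the data is an xarray object, the reduction can also name the dimension instead of relying on its position, which keeps working if the axis order changes. A small sketch, assuming the dimension is called 'time' as in the question; in_range and da_small are hypothetical names used only for illustration:

```python
import numpy as np
import xarray as xr

def in_range(darr, lower, upper, dim="time"):
    """Count, per grid cell, the entries along `dim` with lower <= value <= upper."""
    mask = (darr >= lower) & (darr <= upper)
    return mask.sum(dim=dim)

# Small illustrative array: 3 time steps on a 2x2 grid.
data = np.array([[[290, 295], [300, 305]],
                 [[295, 295], [305, 290]],
                 [[293, 303], [292, 304]]])
da_small = xr.DataArray(data, dims=["time", "lat", "lon"])

counts = in_range(da_small, 293, 303)
print(counts.values)
# [[2 3]
#  [1 0]]
```

Both bounds are inclusive here, matching the >= and <= comparisons in the question.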
Edit: timing measurements using %timeit in an IPython console:
import xarray as xr
import numpy as np
da_big = xr.DataArray(np.random.randint(290, 305, (10000, 5, 5)),
dims=['time', 'lat', 'lon'],
coords={'lat': [-5, -2.5, 0, 2.5, 5], 'lon': [33, 35, 37, 39, 41]})
def OP(darr, trange = (293,303)):
    zeros = darr[0]
    zeros.values = np.zeros((darr.sizes['lat'], darr.sizes['lon']))
    for i in range(0, (len(darr.time))):
        int_array = 1*(darr.values[i] >= trange[0]) * (darr.values[i] <= trange[1])
        zeros = zeros + int_array
    return zeros.values
def SumAxis(darr, trange = (293,303)):
    return ((darr >= trange[0]) & (darr <= trange[1])).sum(axis=0)
%timeit -n10 OP(da_big)
%timeit -n10 SumAxis(da_big)
# 466 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 1.89 ms ± 151 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
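The %timeit magic is only available in IPython. In a plain script, the standard-library timeit module can stand in for it, and it is worth confirming that the two versions agree before comparing their speed. The sketch below restates both approaches on a plain NumPy array; loop_count and vectorized_count are illustrative names, and the timings will differ from machine to machine:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
temps = rng.integers(290, 305, size=(10000, 5, 5))  # (time, lat, lon)

def loop_count(arr, lower=293, upper=303):
    # One Python-level iteration per time step, as in the question.
    total = np.zeros(arr.shape[1:], dtype=int)
    for step in arr:
        total += (step >= lower) & (step <= upper)
    return total

def vectorized_count(arr, lower=293, upper=303):
    # Single boolean mask over the whole array, reduced along time.
    return ((arr >= lower) & (arr <= upper)).sum(axis=0)

# Check that the results match before timing anything.
np.testing.assert_array_equal(loop_count(temps), vectorized_count(temps))

t_loop = timeit.timeit(lambda: loop_count(temps), number=5)
t_vec = timeit.timeit(lambda: vectorized_count(temps), number=5)
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s (5 runs each)")
```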