What's a potentially better algorithm to solve this Python nested for loop than the one I'm using?

Question

I have a nested loop that has to iterate over a large amount of data.

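Assume a data frame of random values with 1,000,000 rows, where each row is an X, Y position in 2D space. A window of length 10 moves over all 1M rows one at a time until all the calculations are done.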

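What the code is supposed to do:

Each row represents a coordinate in the X-Y plane.
r_test contains the diameters of the different investigation circles in our 2D plane (the X-Y plane).
For every 10 points/rows, and for every diameter in r_test, we compare the distance from each point to the remaining 9 points, and if the value is less than that diameter (r_test[ind] in the code) we add 2 to H. Then we calculate H/(N**2) and store it in c_10 at the index of that diameter.
Once the loop has gone through all the diameters in r_test for the first 10 points, we read the slope of the fitted line and save it to S_wind[ii]. The first 10 entries of S_wind (indices 0 to 9) therefore have no calculated value, so they are left as np.inf to be distinguishable later.
Then the window moves down one point and the process repeats until S_wind is complete.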

Is there a better algorithm for this problem than the one I'm using, in Python 3.x?

Many thanks in advance!

import numpy as np
import pandas as pd
####generating input data frame
df = pd.DataFrame(data = np.random.randint(2000, 6000, (1000000, 2)))
df.columns= ['X','Y']


####====creating upper and lower bound for the diameter of the investigation circles    
x_range =max(df['X']) - min(df['X']) 
y_range = max(df['Y']) - min(df['Y'])
R = max(x_range,y_range)/20
d = 2
N = 10 #### Number of points in each window
#r1 = 2*R*(1/N)**(1/d)  
#r2 = (R)/(1+d)
#r_test = np.arange(r1, r2, 0.05)
##===avoiding generation of empty r_test
r1 = 80
r2= 800  
r_test = np.arange(r1, r2, 5) 

S_wind = np.zeros(len(df['X'])) + np.inf

for ii in range(10, len(df['X'])):  # maybe the code runs slower because of using len() instead of a number
    c_10 = np.zeros(len(r_test)) + np.inf
    H = 0
    C = 0
    N = 10  # maybe I should also remove this
    for ind in range(len(r_test)):
        for i in range(ii - 10, ii):
            for j in range(ii - 10, ii):
                dd = r_test[ind] - np.sqrt((df['X'][i] - df['X'][j])**2 + (df['Y'][i] - df['Y'][j])**2)
                if dd > 0:
                    H += 2  # add 2 per matching pair, as described above
        c_10[ind] = H / (N**2)

    S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]

Answer 1

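You can use numpy broadcasting to get rid of all the inner loops. I'm not sure if there's an easy way to get rid of the outermost loop, but the others are not too hard to avoid.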

The inner loops compare ten 2D points against each other in pairs. That is a natural fit for a 10x10x2 numpy array:

# replacing the `for ind` loop and its contents:
points = np.hstack((np.asarray(df['X'])[ii-10:ii, None], np.asarray(df['Y'])[ii-10:ii, None]))
differences = np.subtract(points[None, :, :],  points[:, None, :]) # broadcast to 10x10x2
squared_distances = (differences * differences).sum(axis=2)
within_range = squared_distances[None,:,:] < (r_test*r_test)[:, None, None]  # compare squares
c_10 = within_range.sum(axis=(1,2)).cumsum() * 2 / (N**2)

S_wind[ii] = np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0] # this is unchanged...
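
A quick way to check the replacement against the original loops is to run both on a small sample. The sketch below assumes the loop adds 2 to H per matching pair and never resets H between radii, as in the question; the helper names `slope_loop` and `slope_broadcast` and the 15-row sample are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
xy = rng.integers(2000, 6000, size=(15, 2))  # small stand-in for the X, Y columns
r_test = np.arange(80, 800, 5)
N = 10

def slope_loop(xy, ii):
    """Reference: the question's nested loops (H += 2, H never reset per radius)."""
    c_10 = np.empty(len(r_test))
    H = 0
    for ind in range(len(r_test)):
        for i in range(ii - N, ii):
            for j in range(ii - N, ii):
                dist = np.sqrt((xy[i, 0] - xy[j, 0]) ** 2 + (xy[i, 1] - xy[j, 1]) ** 2)
                if r_test[ind] - dist > 0:
                    H += 2
        c_10[ind] = H / N**2
    return np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]

def slope_broadcast(xy, ii):
    """Same quantity via broadcasting over a 10x10x2 array of differences."""
    points = xy[ii - N:ii]
    differences = points[None, :, :] - points[:, None, :]
    squared_distances = (differences * differences).sum(axis=2)
    within_range = squared_distances[None, :, :] < (r_test * r_test)[:, None, None]
    c_10 = within_range.sum(axis=(1, 2)).cumsum() * 2 / N**2
    return np.polyfit(np.log10(r_test), np.log10(c_10), 1)[0]

loop_slopes = [slope_loop(xy, ii) for ii in range(N, len(xy))]
fast_slopes = [slope_broadcast(xy, ii) for ii in range(N, len(xy))]
print(np.allclose(loop_slopes, fast_slopes))
```

Comparing squared distances against squared radii avoids the square root, and with integer coordinates and integer radii the two comparisons select the same pairs.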

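I'm not very pandas savvy, so there is probably a better way to get the X and Y values into a single 2D numpy array. You generated the random data in the format I would find most useful, then converted it into something less immediately useful for numeric operations!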
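
One candidate, assuming pandas 0.24 or later, is `DataFrame.to_numpy()`, which returns the selected columns as a single 2D array. The sketch below uses a smaller random frame than the question (1,000 rows, an arbitrary choice) and shows that slicing the result gives the same points as the `np.hstack` expression above.

```python
import numpy as np
import pandas as pd

# Smaller stand-in for the question's data frame (row count chosen arbitrarily).
df = pd.DataFrame(np.random.randint(2000, 6000, (1000, 2)), columns=['X', 'Y'])

# Convert once, outside the loop: shape (1000, 2), column 0 is X, column 1 is Y.
xy = df[['X', 'Y']].to_numpy()

ii = 10
points = xy[ii - 10:ii]  # the ten rows of the current window, shape (10, 2)

# The hstack construction from the answer, for comparison.
points_hstack = np.hstack((np.asarray(df['X'])[ii - 10:ii, None],
                           np.asarray(df['Y'])[ii - 10:ii, None]))
print(np.array_equal(points, points_hstack))
```

Doing the conversion once before the loop also avoids the per-iteration cost of indexing into the pandas columns.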

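Note that this code matches the output of your loop code. I'm not sure that is actually what you want it to do, as there are a few slightly odd things in your current code. For example, you may not want the cumsum in my code, which corresponds to H being reinitialized to zero only in the outermost loop. If you don't want the matches for smaller values of r_test to be counted again for the larger values, you can skip that sum (or, equivalently, move the H = 0 line to between the for ind and for i loops in the original code).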
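
As a rough sketch of how the outermost loop might also be reduced, the windows can be processed in chunks with `numpy.lib.stride_tricks.sliding_window_view` (this assumes NumPy 1.20 or later). The function name `window_slopes`, the `cumulative` flag, and the `chunk` size are illustrative choices, not part of the original code; `cumulative=False` corresponds to skipping the cumsum. It relies on `np.polyfit` accepting a 2D `y` with one data set per column.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def window_slopes(xy, r_test, N=10, cumulative=True, chunk=2048):
    """Slope of log10(c_10) against log10(r_test) for every window of N rows.

    Entry ii uses rows ii-N .. ii-1; the first N entries stay at np.inf,
    mirroring S_wind in the question. Illustrative helper only.
    """
    xy = np.asarray(xy, dtype=np.float64)
    r_test = np.asarray(r_test, dtype=np.float64)
    r_sq = r_test * r_test
    log_r = np.log10(r_test)
    out = np.full(len(xy), np.inf)
    n_win = len(xy) - N                      # windows for ii = N .. len(xy) - 1
    if n_win <= 0:
        return out
    windows = sliding_window_view(xy, N, axis=0)  # shape (len(xy) - N + 1, 2, N)
    for start in range(0, n_win, chunk):
        w = windows[start:min(start + chunk, n_win)]    # (m, 2, N)
        diff = w[:, :, :, None] - w[:, :, None, :]      # (m, 2, N, N)
        sq = (diff * diff).sum(axis=1)                  # (m, N, N) squared distances
        # Count pairs inside each radius: result has shape (m, len(r_test)).
        counts = (sq[:, None, :, :] < r_sq[None, :, None, None]).sum(axis=(2, 3))
        if cumulative:
            counts = counts.cumsum(axis=1)   # H carried over between radii
        c = counts * 2 / N**2
        # Fit all windows of the chunk at once: y has one column per window.
        slopes = np.polyfit(log_r, np.log10(c).T, 1)[0]
        out[start + N:start + N + len(w)] = slopes
    return out

rng = np.random.default_rng(1)
xy_demo = rng.integers(2000, 6000, size=(50, 2))
S_demo = window_slopes(xy_demo, np.arange(80, 800, 5), chunk=16)
```

The chunk size bounds memory: the boolean comparison array holds `chunk * len(r_test) * N * N` entries, so a full 1M-row pass in one shot would not fit in memory on most machines.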

for-loop

nested-loops

python-3.x