Shuffling rows of large pandas DataFrame and correlation with a series

I need to shuffle each row of a large pandas DataFrame (typical shape (10000, 1000)) independently, several times, and then estimate the correlation of each row with a given series.

The most efficient (i.e. fastest) way I have found while staying within pandas is the following:

import numpy as np

for i in range(N):  # the larger N is, the better
    # df is my large DataFrame, with 10K rows and 1K columns
    df_sh = df.apply(np.random.permutation, axis=1)

    # s is the provided series (shape of s = (1000,))
    corr = df_sh.corrwith(s, axis=1)

The two tasks take roughly the same time (namely, about 30 s each). I tried converting my DataFrame to a numpy.array and running a for loop over the rows of the array, where for each row I first perform the permutation and then measure the correlation with scipy.stats.pearsonr. Unfortunately, I only managed to speed my two tasks up by a factor of 2. Are there other viable options to speed the tasks up further? (Note: I already parallelize my code with Joblib, up to the maximum factor the machine I use allows.)
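The per-row NumPy/scipy loop described above might look like the following sketch. The array sizes here are small stand-ins (the real case is 10000 × 1000), and the variable names are hypothetical:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
a = rng.random((100, 50))   # stand-in for df.values (real case: 10000 x 1000)
s_ar = a[0].copy()          # stand-in for the provided series s.values

# Shuffle each row independently, then correlate it with the series
corr = np.empty(a.shape[0])
for i, row in enumerate(a):
    shuffled = rng.permutation(row)        # returns a shuffled copy of the row
    corr[i], _ = pearsonr(shuffled, s_ar)  # Pearson r against the series
```

Each `pearsonr` call carries Python-level overhead, which is likely why this only gained about 2x over pandas; the vectorized answer below removes the per-row calls entirely.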

Correlation between a 2D matrix/array and a 1D array/vector:

We can adapt corr2_coeff_rowwise to get the correlation between a 2D array/matrix and a 1D array/vector, like so -

import numpy as np

def corr2_coeff_2d_1d(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1, keepdims=True)
    B_mB = B - B.mean()

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = B_mB.dot(B_mB)

    # Finally get corr coeff
    return A_mA.dot(B_mB) / np.sqrt(ssA * ssB)
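As a quick sanity check on tiny random inputs (the sizes here are hypothetical), the vectorized result matches a per-row `np.corrcoef`:

```python
import numpy as np

def corr2_coeff_2d_1d(A, B):
    # Vectorized Pearson correlation of each row of A against B
    A_mA = A - A.mean(1, keepdims=True)
    B_mB = B - B.mean()
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = B_mB.dot(B_mB)
    return A_mA.dot(B_mB) / np.sqrt(ssA * ssB)

rng = np.random.default_rng(42)
A = rng.random((5, 20))
B = rng.random(20)

vec = corr2_coeff_2d_1d(A, B)
ref = np.array([np.corrcoef(row, B)[0, 1] for row in A])
assert np.allclose(vec, ref)
```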

To shuffle each row, note that np.random.shuffle works along the first axis only and applies a single permutation to the whole array, so on its own (even fed the transposed input) it would shuffle every row the same way; it also shuffles in-place, so if the original DataFrame is needed elsewhere, make a copy before processing. To draw an independent permutation per row instead, the solution below uses the rand+argsort trick.

So, let's use it to solve our case -

# Extract underlying array data for faster NumPy processing in the loop later on
a = df.values  
s_ar = s.values

# Setup array for row-indexing with NumPy's advanced indexing later on
r = np.arange(a.shape[0])[:,None]

for i in range(N):
    # Get shuffled indices per row with the `rand+argsort/argpartition` trick
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    corr = corr2_coeff_2d_1d(shuffled_a, s_ar)
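Why `rand+argsort` gives an independent shuffle per row: sorting a row of i.i.d. uniform draws yields a uniformly random ordering of that row's indices, so each row of `idx` is its own permutation of `0..n-1`. A tiny self-contained demo (not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)
idx = rng.random((4, 6)).argsort(1)   # one random permutation per row

# Every row is a permutation of 0..5, drawn independently of the other rows
for row in idx:
    assert sorted(row) == list(range(6))
```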

Optimized version #1

Now, we can precompute the parts that stay the same across iterations: everything that involves only the series, plus the row means of a (a row's mean is unchanged by shuffling). The further optimized version would then look like this -

a = df.values  
s_ar = s.values
r = np.arange(a.shape[0])[:,None]

B = s_ar
B_mB = B - B.mean()
ssB = B_mB.dot(B_mB)

A = a
A_mean = A.mean(1,keepdims=1)

for i in range(N):
    # Get shuffled indices per row with the `rand+argsort/argpartition` trick
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    A = shuffled_a
    A_mA = A - A_mean
    ssA = np.einsum('ij,ij->i',A_mA,A_mA)
    corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
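Hoisting `A_mean` is valid because a row's mean is invariant under shuffling, so the row means computed once on `a` are still correct for every `shuffled_a`. A small equivalence check (hypothetical sizes) against the unoptimized function:

```python
import numpy as np

def corr2_coeff_2d_1d(A, B):
    A_mA = A - A.mean(1, keepdims=True)
    B_mB = B - B.mean()
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = B_mB.dot(B_mB)
    return A_mA.dot(B_mB) / np.sqrt(ssA * ssB)

rng = np.random.default_rng(1)
a = rng.random((8, 30))
s_ar = rng.random(30)
r = np.arange(a.shape[0])[:, None]

# Hoisted pieces, as in the optimized loop above
B_mB = s_ar - s_ar.mean()
ssB = B_mB.dot(B_mB)
A_mean = a.mean(1, keepdims=True)   # row means survive shuffling unchanged

idx = rng.random(a.shape).argsort(1)
shuffled_a = a[r, idx]

A_mA = shuffled_a - A_mean
ssA = np.einsum('ij,ij->i', A_mA, A_mA)
corr_fast = A_mA.dot(B_mB) / np.sqrt(ssA * ssB)

assert np.allclose(corr_fast, corr2_coeff_2d_1d(shuffled_a, s_ar))
```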

Benchmarking

Setup of inputs with the actual use-case shapes/sizes -

In [302]: df = pd.DataFrame(np.random.rand(10000,1000))

In [303]: s = pd.Series(df.iloc[0])

1. Original approach

In [304]: %%timeit
     ...: df_sh = df.apply(np.random.permutation, axis=1)
     ...: corr = df_sh.corrwith(s, axis = 1)
1 loop, best of 3: 1.99 s per loop

2. Proposed approach

Pre-processing part (done only once before starting the loop, hence not included in the timings) -

In [305]: a = df.values  
     ...: s_ar = s.values
     ...: r = np.arange(a.shape[0])[:,None]
     ...: 
     ...: B = s_ar
     ...: B_mB = B - B.mean()
     ...: ssB = B_mB.dot(B_mB)
     ...: 
     ...: A = a
     ...: A_mean = A.mean(1,keepdims=1)

Part of the proposed solution that runs in the loop -

In [306]: %%timeit
     ...: idx = np.random.rand(*a.shape).argsort(1)
     ...: shuffled_a = a[r, idx]
     ...: 
     ...: A = shuffled_a
     ...: A_mA = A - A_mean
     ...: ssA = np.einsum('ij,ij->i',A_mA,A_mA)
     ...: corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
1 loop, best of 3: 675 ms per loop

Thus, we are seeing a ~3x speedup here!
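If the point of the N shuffles is a permutation test, the per-iteration `corr` vectors can be stacked into a null distribution. The sketch below (small hypothetical sizes, not part of the original answer) shows one way to do that:

```python
import numpy as np

def corr2_coeff_2d_1d(A, B):
    A_mA = A - A.mean(1, keepdims=True)
    B_mB = B - B.mean()
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = B_mB.dot(B_mB)
    return A_mA.dot(B_mB) / np.sqrt(ssA * ssB)

rng = np.random.default_rng(0)
a = rng.random((50, 40))
s_ar = a[0].copy()
r = np.arange(a.shape[0])[:, None]
N = 20

# One row of `null` per shuffle iteration
null = np.empty((N, a.shape[0]))
for i in range(N):
    idx = rng.random(a.shape).argsort(1)
    null[i] = corr2_coeff_2d_1d(a[r, idx], s_ar)

# Empirical two-sided p-value per row against the unshuffled correlation
obs = corr2_coeff_2d_1d(a, s_ar)
pvals = (np.abs(null) >= np.abs(obs)).mean(0)
```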