Shuffling rows of large pandas DataFrame and correlation with a series
I need to independently shuffle the rows of a large pandas DataFrame several times (a typical shape is (10000, 1000)) and then estimate the correlation of each row with a given series.
The most efficient (= fast) way I found to do this while staying within pandas is the following:
import numpy

for i in range(N):  # the larger N is, the better
    df_sh = df.apply(numpy.random.permutation, axis=1)
    # df is my large DataFrame, with 10K rows and 1K columns
    corr = df_sh.corrwith(s, axis=1)
    # s is the provided series (shape of s = (1000,))
The two tasks take roughly the same time (namely about 30 secs each). I tried converting my DataFrame to a numpy.array, running a for loop over the array, and for each row first performing the permutation and then measuring the correlation with scipy.stats.pearsonr. Unfortunately, I only managed to speed up my two tasks by a factor of 2.
Are there other viable options to speed the tasks up even further? (Note: I already parallelize my code with Joblib up to the maximum factor allowed by the machine I use.)
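For reference, the numpy + scipy.stats.pearsonr attempt described above might look like the following sketch (the function name and structure are illustrative, not the asker's actual code):

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def shuffle_and_correlate(df, s):
    # Work on raw arrays; copy so the original DataFrame is untouched
    a = df.to_numpy().copy()
    s_ar = s.to_numpy()
    corr = np.empty(a.shape[0])
    for i in range(a.shape[0]):
        np.random.shuffle(a[i])            # in-place shuffle of row i
        corr[i], _ = pearsonr(a[i], s_ar)  # Pearson r of row i vs. s
    return corr
```

This per-row Python loop is exactly what limits the speedup to ~2x: each pearsonr call carries interpreter and validation overhead, which the vectorised answer below avoids.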
Correlation between a 2D matrix/array and a 1D array/vector:
We can adapt corr2_coeff_rowwise to compute the correlation between a 2D array/matrix and a 1D array/vector, like so -
def corr2_coeff_2d_1d(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1, keepdims=1)
    B_mB = B - B.mean()

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = B_mB.dot(B_mB)

    # Finally get corr coeff
    return A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
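As a quick sanity check (my addition, not part of the original answer), the vectorised function can be compared row by row against np.corrcoef:

```python
import numpy as np

def corr2_coeff_2d_1d(A, B):
    # Row-wise demeaned A, demeaned B
    A_mA = A - A.mean(1, keepdims=True)
    B_mB = B - B.mean()
    # Sums of squares
    ssA = np.einsum('ij,ij->i', A_mA, A_mA)
    ssB = B_mB.dot(B_mB)
    return A_mA.dot(B_mB) / np.sqrt(ssA * ssB)

A = np.random.rand(6, 10)
B = np.random.rand(10)
vectorised = corr2_coeff_2d_1d(A, B)
# Reference: Pearson correlation of each row with B, one np.corrcoef call per row
reference = np.array([np.corrcoef(row, B)[0, 1] for row in A])
```

The two results agree up to floating-point error, since both compute the same Pearson coefficient.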
To shuffle each row, and to do this for all rows, we could use np.random.shuffle. That shuffle function works along the first axis, so we would need to feed in a transposed view; note also that this shuffling is done in place, so make a copy beforehand if the original DataFrame is needed elsewhere. (Shuffling the transpose, however, applies one and the same permutation to every row; the rand+argsort trick used below instead draws an independent permutation per row, which is what this problem calls for.)
So, let's use it to solve our case -
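To see what the rand+argsort trick does, here is a minimal demonstration on a tiny array (a sketch with illustrative data):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)            # rows: [0..3], [4..7], [8..11]
idx = np.random.rand(*a.shape).argsort(1)  # an independent permutation per row
r = np.arange(a.shape[0])[:, None]         # column of row indices for pairing
shuffled = a[r, idx]                       # advanced indexing applies each row's order
```

Sorting each row of `shuffled` recovers the original rows, confirming that every row is a permutation of its source row.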
# Extract underlying array data for faster NumPy processing in loop later on
a = df.values
s_ar = s.values

# Setup array for row-indexing with NumPy's advanced indexing later on
r = np.arange(a.shape[0])[:,None]

for i in range(N):
    # Get shuffled indices per row with the `rand+argsort/argpartition` trick
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    corr = corr2_coeff_2d_1d(shuffled_a, s_ar)
Optimized version #1
Now, we can pre-compute the parts that stay the same across iterations: everything derived from the series, and also the row means of a (permuting a row does not change its mean, so A_mean stays valid for the shuffled rows). The further-optimized version would then look like this -
a = df.values
s_ar = s.values
r = np.arange(a.shape[0])[:,None]

B = s_ar
B_mB = B - B.mean()
ssB = B_mB.dot(B_mB)

A = a
A_mean = A.mean(1,keepdims=1)

for i in range(N):
    # Get shuffled indices per row with the `rand+argsort/argpartition` trick
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    A = shuffled_a
    A_mA = A - A_mean
    ssA = np.einsum('ij,ij->i',A_mA,A_mA)
    corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
Benchmarking
Setting up inputs with actual use-case shapes/sizes -
In [302]: df = pd.DataFrame(np.random.rand(10000,1000))
In [303]: s = pd.Series(df.iloc[0])
1. Original approach
In [304]: %%timeit
...: df_sh = df.apply(np.random.permutation, axis=1)
...: corr = df_sh.corrwith(s, axis = 1)
1 loop, best of 3: 1.99 s per loop
2. Proposed approach
Pre-processing part (done only once before starting the loop, hence not included in the timings) -
In [305]: a = df.values
...: s_ar = s.values
...: r = np.arange(a.shape[0])[:,None]
...:
...: B = s_ar
...: B_mB = B - B.mean()
...: ssB = B_mB.dot(B_mB)
...:
...: A = a
...: A_mean = A.mean(1,keepdims=1)
Part of the proposed solution that runs inside the loop -
In [306]: %%timeit
...: idx = np.random.rand(*a.shape).argsort(1)
...: shuffled_a = a[r, idx]
...:
...: A = shuffled_a
...: A_mA = A - A_mean
...: ssA = np.einsum('ij,ij->i',A_mA,A_mA)
...: corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
1 loop, best of 3: 675 ms per loop
Thus, we are seeing a speedup of about 3x here!
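As a closing aside (my addition, not part of the original answer): on NumPy 1.20+, np.random.Generator.permuted shuffles each row independently in a single call and can stand in for the rand+argsort trick:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.arange(12).reshape(3, 4)

# axis=1 shuffles the entries within each row, independently per row,
# and returns a shuffled copy (the in-place variant is rng.shuffle)
shuffled = rng.permuted(a, axis=1)
```

Whether this beats rand+argsort for a (10000, 1000) array is worth benchmarking on your own machine; it avoids the O(n log n) argsort per row.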