如何向量化 Python 中二维和一维 NumPy 数组的迭代操作?
How to vectorize an iterative operation on a 2D and 1D NumPy array in Python?
我正在尝试加速我的代码以充分利用 NumPy 的矢量化。我已经能够对大部分代码(clr/vlr
函数)中的所有计算进行矢量化,我认为这是 NumPy 的最佳选择。我不相信这些可以再加速,因为 np.einsum
并不真正适用。我不知道如何向量化 rho
函数。特别是,正是这个嵌套的 for 循环给我带来了麻烦:1 - (ratios[i,j]/ (variances[i] + variances[j]))
。我试过使用 np.meshgrid
来尝试扩大方差,但这没有用。我一直在尝试在 np.einsum
上关注 this tutorial 但是,它仍然有点令人困惑,我仍然不确定它是否适用于这种情况,因为它涉及的操作比点积和矩阵乘法更多.
有人知道如何向量化这个函数吗?
另外,有什么建议可以加快我的前 2 个函数的代码吗?
import numpy as np
import pandas as pd
def clr(X):
index = None
labels = None
if isinstance(X, pd.DataFrame):
index = X.index
labels = X.columns
X = X.values
X_log = np.log(X)
geometric_mean = X_log.mean(axis=1)
X_clr = X_log - geometric_mean[:,np.newaxis]
if labels is not None:
X_clr = pd.DataFrame(X_clr, index=index, columns=labels)
return X_clr
def vlr(X):
labels = None
if isinstance(X, pd.DataFrame):
labels = X.columns
X = X.values
n,m = X.shape
X_log = np.log(X)
covariance = np.cov(X_log.T)
diagonal = np.diagonal(covariance)
output = -2*covariance + diagonal[:,np.newaxis] + diagonal
if labels is not None:
output = pd.DataFrame(output, index=labels, columns=labels)
return output
def rho(X):
n, m = X.shape
labels = None
if isinstance(X, pd.DataFrame):
labels = X.columns
X = X.values
ratios = vlr(X)
X_clr = clr(X)
variances = np.var(X_clr, axis=0)
# How to vectorize?
rhos = np.ones_like(ratios)
for i in range(n):
for j in range(i+1,m):
coef = 1 - (ratios[i,j]/ (variances[i] + variances[j]))
rhos[i,j] = rhos[j,i] = coef
if labels is not None:
rhos = pd.DataFrame(rhos, index=labels, columns=labels)
return rhos
# Load data
X_iris = pd.read_csv("https://pastebin.com/raw/dR59vTD4", sep="\t", index_col=0)
# Calculation
print(rho(X_iris))
# sepal_length sepal_width petal_length petal_width
# sepal_length 1.000000 0.855012 -0.796224 -0.796770
# sepal_width 0.855012 1.000000 -0.670387 -0.964775
# petal_length -0.796224 -0.670387 1.000000 0.493560
# petal_width -0.796770 -0.964775 0.493560 1.000000
您可以替换循环:
for i in range(n):
for j in range(i+1,m):
coef = 1 - (ratios[i,j]/ (variances[i] + variances[j]))
rhos[i,j] = rhos[j,i] = coef
与:
rhos = 1 - ratios / np.add.outer(variances, variances)
np.add.outer(variances, variances)
取 variances
的外和。例如:
np.add.outer(range(3), range(3))
>>> array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
鉴于数组的形状和循环数学,我们可以使用上面的代码一次完成 ratios / variances[i] + variances[j]
。
确认一下:
# original loop
for i in range(n):
for j in range(i+1,m):
coef = 1 - (ratios[i,j]/ (variances[i] + variances[j]))
rhos[i,j] = rhos[j,i] = coef
# array math replacement
rhos2 = 1-ratios / np.add.outer(variances, variances)
np.allclose(rhos, rhos2)
>>> True
我正在尝试加速我的代码以充分利用 NumPy 的矢量化。我已经能够对大部分代码(clr/vlr
函数)中的所有计算进行矢量化,我认为这是 NumPy 的最佳选择。我不相信这些可以再加速,因为 np.einsum
并不真正适用。我不知道如何向量化 rho
函数。特别是,正是这个嵌套的 for 循环给我带来了麻烦:1 - (ratios[i,j]/ (variances[i] + variances[j]))
。我试过使用 np.meshgrid
来尝试扩大方差,但这没有用。我一直在尝试在 np.einsum
上关注 this tutorial 但是,它仍然有点令人困惑,我仍然不确定它是否适用于这种情况,因为它涉及的操作比点积和矩阵乘法更多.
有人知道如何向量化这个函数吗?
另外,有什么建议可以加快我的前 2 个函数的代码吗?
import numpy as np
import pandas as pd
def clr(X):
index = None
labels = None
if isinstance(X, pd.DataFrame):
index = X.index
labels = X.columns
X = X.values
X_log = np.log(X)
geometric_mean = X_log.mean(axis=1)
X_clr = X_log - geometric_mean[:,np.newaxis]
if labels is not None:
X_clr = pd.DataFrame(X_clr, index=index, columns=labels)
return X_clr
def vlr(X):
labels = None
if isinstance(X, pd.DataFrame):
labels = X.columns
X = X.values
n,m = X.shape
X_log = np.log(X)
covariance = np.cov(X_log.T)
diagonal = np.diagonal(covariance)
output = -2*covariance + diagonal[:,np.newaxis] + diagonal
if labels is not None:
output = pd.DataFrame(output, index=labels, columns=labels)
return output
def rho(X):
n, m = X.shape
labels = None
if isinstance(X, pd.DataFrame):
labels = X.columns
X = X.values
ratios = vlr(X)
X_clr = clr(X)
variances = np.var(X_clr, axis=0)
# How to vectorize?
rhos = np.ones_like(ratios)
for i in range(n):
for j in range(i+1,m):
coef = 1 - (ratios[i,j]/ (variances[i] + variances[j]))
rhos[i,j] = rhos[j,i] = coef
if labels is not None:
rhos = pd.DataFrame(rhos, index=labels, columns=labels)
return rhos
# Load data
X_iris = pd.read_csv("https://pastebin.com/raw/dR59vTD4", sep="\t", index_col=0)
# Calculation
print(rho(X_iris))
# sepal_length sepal_width petal_length petal_width
# sepal_length 1.000000 0.855012 -0.796224 -0.796770
# sepal_width 0.855012 1.000000 -0.670387 -0.964775
# petal_length -0.796224 -0.670387 1.000000 0.493560
# petal_width -0.796770 -0.964775 0.493560 1.000000
您可以替换循环:
for i in range(n):
for j in range(i+1,m):
coef = 1 - (ratios[i,j]/ (variances[i] + variances[j]))
rhos[i,j] = rhos[j,i] = coef
与:
rhos = 1 - ratios / np.add.outer(variances, variances)
np.add.outer(variances, variances)
取 variances
的外和。例如:
np.add.outer(range(3), range(3))
>>> array([[0, 1, 2],
[1, 2, 3],
[2, 3, 4]])
鉴于数组的形状和循环数学,我们可以使用上面的代码一次完成 ratios / variances[i] + variances[j]
。
确认一下:
# original loop
for i in range(n):
for j in range(i+1,m):
coef = 1 - (ratios[i,j]/ (variances[i] + variances[j]))
rhos[i,j] = rhos[j,i] = coef
# array math replacement
rhos2 = 1-ratios / np.add.outer(variances, variances)
np.allclose(rhos, rhos2)
>>> True