How to avoid Kronecker product in NumPy computations
Background
After generating a list of random weights:
sizes = [784,30,10]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1],sizes[1:])]
foo (of shape (900, 23520)) is created using NumPy's Kronecker product:
foo = np.kron(np.identity(30),weights[0])
The matrix foo is then multiplied by a slice of data, i.e.
bar = np.dot(foo,data[0])
where data[0].shape is (23520,) and data[0].dtype is float32.
Problem
Creating foo is rather wasteful. How can weights[0], which has shape (30,784), be used more cleverly for the multiplication with data[0]?
More generally, data[0] is a slice of an array of shape (1666,23520), so the multiplication needs to be performed 1666 times. Moreover, the data array is close to sparse, with less than 20% of its entries being nonzero.
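To make the waste concrete (a minimal sketch, not part of the original question), foo is block-diagonal, so at most 1/30 of its entries can be nonzero while the full dense array still has to be stored:
import numpy as np
w0 = np.random.randn(30, 784)
foo = np.kron(np.identity(30), w0)
print(foo.shape)                          # (900, 23520)
print(foo.nbytes / 1e6)                   # ~169 MB for float64
print(np.count_nonzero(foo) / foo.size)   # ~0.033, i.e. 1/30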
Here is the loop I tried:
for i in range(len(data)):
    foo = np.kron(np.identity(30),weights[0])
    bar = np.dot(foo,data[i])
The trick is to reshape data into a 3D tensor and then use np.tensordot against weights[0], thereby bypassing the creation of foo, like so -
k = 30 # kernel size
data3D = data.reshape(data.shape[0],k,-1)
out = np.tensordot(data3D, weights[0], axes=(2,1)).reshape(-1,k**2)
Under the hood, tensordot uses transposed axes, reshapes the arrays and then calls np.dot. So, doing all of that manual labour ourselves to avoid the function call to tensordot, we would have something like this -
out = data.reshape(-1,data.shape[1]//k).dot(weights[0].T).reshape(-1,k**2)
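As a quick sanity check (this snippet is not part of the original answer), the manual reshape-plus-dot formulation can be compared against both the tensordot version and the kron-based reference on small random inputs:
import numpy as np
k = 30
w0 = np.random.randn(k, 784)
data = np.random.randn(5, k*784)
# kron-based reference for the first row
ref = np.dot(np.kron(np.identity(k), w0), data[0])
# tensordot version
out_td = np.tensordot(data.reshape(data.shape[0],k,-1), w0, axes=(2,1)).reshape(-1,k**2)
# manual reshape + dot version
out_dot = data.reshape(-1,data.shape[1]//k).dot(w0.T).reshape(-1,k**2)
print(np.allclose(out_td[0], ref))   # True
print(np.allclose(out_td, out_dot))  # True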
Sample run
Let's walk through a toy example to explain what's going on for anyone who might not follow the problem:
In [68]: # Toy setup and code run with original codes
...: k = 3 # kernel size, which is 30 in the original case
...:
...: data = np.random.rand(4,6)
...: w0 = np.random.rand(3,2) # this is weights[0]
...: foo = np.kron(np.identity(k), w0)
...: output_first_row = foo.dot(data[0])
So, the problem is to get rid of the foo creation step and still arrive at output_first_row, and to do this for all rows of data.
The proposed solution is:
...: data3D = data.reshape(data.shape[0],k,-1)
...: vectorized_out = np.tensordot(data3D, w0, axes=(2,1)).reshape(-1,k**2)
Let's verify the results:
In [69]: output_first_row
Out[69]: array([ 0.11, 0.13, 0.34, 0.67, 0.53, 1.51, 0.17, 0.16, 0.44])
In [70]: vectorized_out
Out[70]:
array([[ 0.11, 0.13, 0.34, 0.67, 0.53, 1.51, 0.17, 0.16, 0.44],
[ 0.43, 0.23, 0.73, 0.43, 0.38, 1.05, 0.64, 0.49, 1.41],
[ 0.57, 0.45, 1.3 , 0.68, 0.51, 1.48, 0.45, 0.28, 0.85],
[ 0.41, 0.35, 0.98, 0.4 , 0.24, 0.75, 0.22, 0.28, 0.71]])
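For completeness (this check is not in the original post), the agreement can also be confirmed programmatically; the first row of vectorized_out should match output_first_row exactly:
print(np.allclose(vectorized_out[0], output_first_row))  # expected: True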
Runtime test with all the proposed approaches -
In [30]: import numpy as np
In [31]: sizes = [784,30,10]
In [32]: weights = [np.random.rand(y, x) for x, y in zip(sizes[:-1],sizes[1:])]
In [33]: data = np.random.rand(1666,23520)
In [37]: k = 30 # kernel size
# @Paul Panzer's soln
In [38]: %timeit (weights[0] @ data.reshape(-1, 30, 784).swapaxes(1, 2)).swapaxes(1, 2)
1 loops, best of 3: 707 ms per loop
In [39]: %timeit np.tensordot(data.reshape(data.shape[0],k,-1), weights[0], axes=(2,1)).reshape(-1,k**2)
10 loops, best of 3: 114 ms per loop
In [40]: %timeit data.reshape(-1,data.shape[1]//k).dot(weights[0].T).reshape(-1,k**2)
10 loops, best of 3: 118 ms per loop
And the comments below might help in understanding how tensordot works better with tensors.
You are effectively doing matrix-matrix multiplication, where the first factor is weights[0] and the second factor is data[i], chopped into 30 equal slices that form the columns.
import numpy as np
sizes = [784,30,10]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1],sizes[1:])]
k = 2
# create sparse data
data = np.maximum(np.random.uniform(-100, 1, (k, 23520)), 0)
foo = np.kron(np.identity(30),weights[0])
# This is the original loop as a list comprehension
bar = [np.dot(foo,d) for d in data]
# This is the equivalent using matrix multiplication.
# We can take advantage of the fact that the '@' operator
# can do batch matrix multiplication (it uses the last two
# dimensions as the matrix and all others as batch index).
# The reshape does the chopping up but gives us rows where columns
# are required, hence the first swapaxes.
# The second swapaxes is to make the result directly comparable to
# the `np.kron` based result.
bar2 = (weights[0] @ data.reshape(k, 30, 784).swapaxes(1, 2)).swapaxes(1, 2)
# Instead of letting numpy do the batching we can glue all the
# columns of all the second factors together into one matrix
bar3 = (weights[0] @ data.reshape(-1, 784).T).T.reshape(k, -1)
# This last formulation works more or less unchanged on sparse data
from scipy import sparse
dsp = sparse.csr_matrix(data.reshape(-1, 784))
bar4 = (weights[0] @ dsp.T).T.reshape(k, -1)
print(np.allclose(bar, bar2.reshape(k, -1)))
print(np.allclose(bar, bar3))
print(np.allclose(bar, bar4))
This prints:
True
True
True
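As a rough extension (a sketch, not from the original answer, assuming the full data array of shape (1666, 23520) mentioned in the question), the same sparse formulation scales to the whole batch with a single multiplication:
import numpy as np
from scipy import sparse
n, k, m = 1666, 30, 784
weights0 = np.random.randn(k, m)
# mostly-zero data, mimicking the <20% density mentioned in the question
data = np.maximum(np.random.uniform(-100, 1, (n, k*m)), 0)
# glue all 784-wide slices of all rows into one sparse matrix, multiply once,
# then reshape back so each data row yields one output row of length k*k
dsp = sparse.csr_matrix(data.reshape(-1, m))
out = (weights0 @ dsp.T).T.reshape(n, -1)
print(out.shape)  # (1666, 900)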