物理转置一个大的非方形 numpy 矩阵
Physically transposing a large non-square numpy matrix
有没有比 array.transpose.copy() 更快的物理转置大型 2D numpy 矩阵的方法?是否有任何例程可以有效地使用内存?
无论您做什么,都需要 O(n^2)
时间和记忆。我认为 .transpose
和 .copy
(用 C 语言编写)将是您应用程序最有效的选择。
编辑:假设您确实需要复制矩阵
可能值得看看转置的作用,这样我们就清楚您所说的 'physically tranposing' 是什么意思。
从一个小的 (4,3) 数组开始:
In [51]: arr = np.array([[1,2,3],[10,11,12],[22,23,24],[30,32,34]])
In [52]: arr
Out[52]:
array([[ 1, 2, 3],
[10, 11, 12],
[22, 23, 24],
[30, 32, 34]])
这是用一维数据缓冲区存储的,我们可以用ravel
显示:
In [53]: arr.ravel()
Out[53]: array([ 1, 2, 3, 10, 11, 12, 22, 23, 24, 30, 32, 34])
和 strides
告诉它按 8 个字节步进列,按 24 (3*8) 步进行:
In [54]: arr.strides
Out[54]: (24, 8)
我们可以弄清楚“F”顺序 - 顺着顺序排列:
In [55]: arr.ravel(order='F')
Out[55]: array([ 1, 10, 22, 30, 2, 11, 23, 32, 3, 12, 24, 34])
虽然 [53] 是 view
,但 [55] 是副本。
现在转置:
In [57]: arrt=arr.T
In [58]: arrt
Out[58]:
array([[ 1, 10, 22, 30],
[ 2, 11, 23, 32],
[ 3, 12, 24, 34]])
这个view
;我们可以遍历 [53] 数据缓冲区,以 8 字节的步长向下移动行。使用 arrt
进行计算基本上与使用 arr
一样快。使用 strided
迭代,订单 'F' 与订单 'C'.
一样快
In [59]: arrt.strides
Out[59]: (8, 24)
原订单:
In [60]: arrt.ravel(order='F')
Out[60]: array([ 1, 2, 3, 10, 11, 12, 22, 23, 24, 30, 32, 34])
但是做 'C' ravel 会创建一个副本,与 [55]
相同
In [61]: arrt.ravel(order='C')
Out[61]: array([ 1, 10, 22, 30, 2, 11, 23, 32, 3, 12, 24, 34])
复制转置会生成一个按 'C' 顺序转置的数组。这是你的 'physical transpose':
In [62]: arrc = arrt.copy()
In [63]: arrc.strides
Out[63]: (32, 8)
像 [61] 那样重塑转置确实会制作一个副本,但通常我们不需要明确制作副本。我认为这样做的唯一原因是为了避免在以后的计算中出现多次冗余。
我假设如果行在内存中是连续的,并且您没有足够的可用内存来制作副本,那么您需要执行更有效地使用 CPU 缓存的逐行操作。
维基百科在 in-place matrix transposition 上有一篇文章。事实证明,这样的换位是不平凡的。这是那里描述的循环算法:
import numpy as np
from numba import njit
@njit # comment this line for debugging
def transpose_in_place(a):
"""In-place matrix transposition for a rectangular matrix.
Parameter:
- a: 2D array. Unless it's a square matrix, it will be scrambled
in the process.
Return:
- transposed array, using the same in-memory data storage as the
input array.
This algorithm is typically 10x slower than a.T.copy().
Only use it if you are short on memory.
"""
if a.shape == (1, 1):
return a # special case
n, m = a.shape
# find max length L of permutation cycle by starting at a[0,1].
# k is the index in the flat buffer; i, j are the indices in
# a.
L = 0
k = 1
while True:
j = k % m
i = k // m
k = n*j + i
L += 1
if k == 1:
break
permut = np.zeros(L, dtype=np.int32)
# Now do the permutations, one cycle at a time
seen = np.full(n*m, False)
aflat = a.reshape(-1) # flat view
for k0 in range(1, n*m-1):
if seen[k0]:
continue
# construct cycle
k = k0
permut[0] = k0
q = 1 # size of permutation array
while True:
seen[k] = True
# note that this is slightly faster than the formula
# on Wikipedia, k = n*k % (n*m-1)
i = k // m
j = k - i*m
k = n*j + i
if k == k0:
break
permut[q] = k
q += 1
# apply cyclic permutation
tmp = aflat[permut[q-1]]
aflat[permut[1:q]] = aflat[permut[:q-1]]
aflat[permut[0]] = tmp
aT = aflat.reshape(m, n)
return aT
def test_transpose(n, m):
a = np.arange(n*m).reshape(n, m)
aT = a.T.copy()
assert np.all(transpose_in_place(a) == aT)
def roundtrip_inplace(a):
a = transpose_in_place(a)
a = transpose_in_place(a)
def roundtrip_copy(a):
a = a.T.copy()
a = a.T.copy()
if __name__ == '__main__':
test_transpose(1, 1)
test_transpose(3, 4)
test_transpose(5, 5)
test_transpose(1, 5)
test_transpose(5, 1)
test_transpose(19, 29)
即使我在这里使用 numba.njit
以便编译转置函数中的循环,它仍然比复制转置慢很多。
n, m = 1000, 10000
a_big = np.arange(n*m, dtype=np.float64).reshape(n, m)
%timeit -r2 -n10 roundtrip_copy(a_big)
54.5 ms ± 153 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
%timeit -r2 -n1 roundtrip_inplace(a_big)
614 ms ± 141 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
有没有比 array.transpose.copy() 更快的物理转置大型 2D numpy 矩阵的方法?是否有任何例程可以有效地使用内存?
无论您做什么,都需要 O(n^2)
时间和记忆。我认为 .transpose
和 .copy
(用 C 语言编写)将是您应用程序最有效的选择。
编辑:假设您确实需要复制矩阵
可能值得看看转置的作用,这样我们就清楚您所说的 'physically tranposing' 是什么意思。
从一个小的 (4,3) 数组开始:
In [51]: arr = np.array([[1,2,3],[10,11,12],[22,23,24],[30,32,34]])
In [52]: arr
Out[52]:
array([[ 1, 2, 3],
[10, 11, 12],
[22, 23, 24],
[30, 32, 34]])
这是用一维数据缓冲区存储的,我们可以用ravel
显示:
In [53]: arr.ravel()
Out[53]: array([ 1, 2, 3, 10, 11, 12, 22, 23, 24, 30, 32, 34])
和 strides
告诉它按 8 个字节步进列,按 24 (3*8) 步进行:
In [54]: arr.strides
Out[54]: (24, 8)
我们可以弄清楚“F”顺序 - 顺着顺序排列:
In [55]: arr.ravel(order='F')
Out[55]: array([ 1, 10, 22, 30, 2, 11, 23, 32, 3, 12, 24, 34])
虽然 [53] 是 view
,但 [55] 是副本。
现在转置:
In [57]: arrt=arr.T
In [58]: arrt
Out[58]:
array([[ 1, 10, 22, 30],
[ 2, 11, 23, 32],
[ 3, 12, 24, 34]])
这个view
;我们可以遍历 [53] 数据缓冲区,以 8 字节的步长向下移动行。使用 arrt
进行计算基本上与使用 arr
一样快。使用 strided
迭代,订单 'F' 与订单 'C'.
In [59]: arrt.strides
Out[59]: (8, 24)
原订单:
In [60]: arrt.ravel(order='F')
Out[60]: array([ 1, 2, 3, 10, 11, 12, 22, 23, 24, 30, 32, 34])
但是做 'C' ravel 会创建一个副本,与 [55]
相同In [61]: arrt.ravel(order='C')
Out[61]: array([ 1, 10, 22, 30, 2, 11, 23, 32, 3, 12, 24, 34])
复制转置会生成一个按 'C' 顺序转置的数组。这是你的 'physical transpose':
In [62]: arrc = arrt.copy()
In [63]: arrc.strides
Out[63]: (32, 8)
像 [61] 那样重塑转置确实会制作一个副本,但通常我们不需要明确制作副本。我认为这样做的唯一原因是为了避免在以后的计算中出现多次冗余。
我假设如果行在内存中是连续的,并且您没有足够的可用内存来制作副本,那么您需要执行更有效地使用 CPU 缓存的逐行操作。
维基百科在 in-place matrix transposition 上有一篇文章。事实证明,这样的换位是不平凡的。这是那里描述的循环算法:
import numpy as np
from numba import njit
@njit # comment this line for debugging
def transpose_in_place(a):
"""In-place matrix transposition for a rectangular matrix.
Parameter:
- a: 2D array. Unless it's a square matrix, it will be scrambled
in the process.
Return:
- transposed array, using the same in-memory data storage as the
input array.
This algorithm is typically 10x slower than a.T.copy().
Only use it if you are short on memory.
"""
if a.shape == (1, 1):
return a # special case
n, m = a.shape
# find max length L of permutation cycle by starting at a[0,1].
# k is the index in the flat buffer; i, j are the indices in
# a.
L = 0
k = 1
while True:
j = k % m
i = k // m
k = n*j + i
L += 1
if k == 1:
break
permut = np.zeros(L, dtype=np.int32)
# Now do the permutations, one cycle at a time
seen = np.full(n*m, False)
aflat = a.reshape(-1) # flat view
for k0 in range(1, n*m-1):
if seen[k0]:
continue
# construct cycle
k = k0
permut[0] = k0
q = 1 # size of permutation array
while True:
seen[k] = True
# note that this is slightly faster than the formula
# on Wikipedia, k = n*k % (n*m-1)
i = k // m
j = k - i*m
k = n*j + i
if k == k0:
break
permut[q] = k
q += 1
# apply cyclic permutation
tmp = aflat[permut[q-1]]
aflat[permut[1:q]] = aflat[permut[:q-1]]
aflat[permut[0]] = tmp
aT = aflat.reshape(m, n)
return aT
def test_transpose(n, m):
a = np.arange(n*m).reshape(n, m)
aT = a.T.copy()
assert np.all(transpose_in_place(a) == aT)
def roundtrip_inplace(a):
a = transpose_in_place(a)
a = transpose_in_place(a)
def roundtrip_copy(a):
a = a.T.copy()
a = a.T.copy()
if __name__ == '__main__':
test_transpose(1, 1)
test_transpose(3, 4)
test_transpose(5, 5)
test_transpose(1, 5)
test_transpose(5, 1)
test_transpose(19, 29)
即使我在这里使用 numba.njit
以便编译转置函数中的循环,它仍然比复制转置慢很多。
n, m = 1000, 10000
a_big = np.arange(n*m, dtype=np.float64).reshape(n, m)
%timeit -r2 -n10 roundtrip_copy(a_big)
54.5 ms ± 153 µs per loop (mean ± std. dev. of 2 runs, 10 loops each)
%timeit -r2 -n1 roundtrip_inplace(a_big)
614 ms ± 141 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)