Why is a Cython loop slower than the equivalent Python loop?

Question

I am trying to speed up my Python code using Cython. My Python code consists of the py_child and py_parent classes and the py_backup function, as follows:

import random
import numpy as np
from libc.string cimport memcmp  # Cython-only: requires a .pyx file or a %%cython cell
## python code #################################################
class py_child:
    def __init__(self, move):
        self.move = move
        self.Q = 0
        self.N = 0

class py_parent:
    def __init__(self):
        self.children = []
    def add_children(self, moves):
        for move in moves:
            self.children.append(py_child(move))

def py_backup(parent, white_rave, black_rave):
    for point in white_rave:
        for ch in parent.children:
            if ch.move == point:
                ch.Q += 1
                ch.N += 1

    for point in black_rave:
        for ch in parent.children:
            if ch.move == point:
                ch.Q += 1
                ch.N += 1

Here is the same implementation in Cython, using memoryviews for some of the variables:

## cython ######################################################

cdef class cy_child:
    cdef public:
        int[:] move
        int Q
        int N
    def __init__(self, move):
        self.move = move
        self.Q = 0
        self.N = 0

cdef class cy_parent:
    cdef public:
        list children
        int[:, :] moves
    def __init__(self):
        self.children = []
    def add_children(self, moves):
        cdef int i = 0
        cdef int N = len(moves)
        for i in range(N):
            self.children.append(cy_child(moves[i]))

cpdef cy_backup(cy_parent parent_node, int[:, :] white_rave,int[:, :] black_rave):
    cdef int[:] move
    cdef cy_child ch
    for move in white_rave:
        for ch in parent_node.children:
            if memcmp(&move[0], &ch.move[0], move.nbytes) == 0:
                ch.Q += 1
                ch.N += 1

    for move in black_rave:
        for ch in parent_node.children:
            if memcmp(&move[0], &ch.move[0], move.nbytes) == 0:
                ch.Q += 1
                ch.N += 1

Now I want to compare the speed of cy_backup and py_backup, so I use this code:

### Setup variables #########################################
size = 11
board = np.random.randint(2, size=(size, size), dtype=np.int32)

# Collect the coordinates of each colour's points
black_rave = []
white_rave = []

for x in range(board.shape[0]):
    for y in range(board.shape[1]):
        if board[x,y] == 0:
            black_rave.append((x,y))
        else:
            white_rave.append((x,y))

py_temp = []
for i in range(size):
    for j in range(size):
        py_temp.append((i,j))

#### python arguments #######################################

py = py_parent()
py.add_children(py_temp)
# also py_temp, black_rave, white_rave

#### cython arguments #######################################
cy_temp = np.asarray(py_temp, dtype=np.int32)
cy_black_rave = np.asarray(black_rave, dtype=np.int32)
cy_white_rave = np.asarray(white_rave, dtype=np.int32)
cy = cy_parent()
cy.add_children(cy_temp)

#### Speed test #################################################
%timeit py_backup(py, black_rave, white_rave)
%timeit cy_backup(cy, cy_black_rave, cy_white_rave)

When I ran this, I was surprised by the results (Python first, then Cython):

1000 loops, best of 3: 759 µs per loop
100 loops, best of 3: 6.38 ms per loop

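I expected Cython to be much faster than Python, especially when using memoryviews.
Why does the loop in Cython run slower than the loop in Python?
If anyone has suggestions for speeding up the Cython code, they would be much appreciated.
Apologies in advance for including so much code in my question.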

Answer 1

Cython memoryviews are really only optimized for one thing: accessing single elements or slices (usually in a loop).

# e.g.
cdef int i
cdef int[:] mview = # something
for i in range(mview.shape[0]):
   mview[i] # do some work with this....

This kind of code can be translated directly into efficient C code. For almost every other operation, the memoryview is treated as a Python object.

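Unfortunately, almost none of your code takes advantage of the one thing memoryviews are good at, so you get no real speed-up. Instead, it is actually worse, because you have added an extra layer, and a whole load of tiny length-2 memoryviews is going to perform very badly.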

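My advice is really just to use lists. They are actually pretty good for this kind of thing, and it is not at all clear to me how to rewrite your code to get a real speed-up from Cython.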
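One way to stay with plain Python containers, as a rough sketch rather than anything taken from the question: keep the list of children, and add a dict that maps each move to its child, so every rave point costs one lookup instead of a scan over all children. The names Child, Parent, by_move and backup are hypothetical, and the sketch assumes moves are hashable tuples and that no two children share a move (otherwise it would not match the behaviour of py_backup).

```python
class Child:
    """Same fields as py_child in the question."""
    def __init__(self, move):
        self.move = move
        self.Q = 0
        self.N = 0


class Parent:
    def __init__(self):
        self.children = []
        self.by_move = {}  # hypothetical index: move tuple -> Child

    def add_children(self, moves):
        for move in moves:
            child = Child(move)
            self.children.append(child)
            self.by_move[move] = child


def backup(parent, white_rave, black_rave):
    # One dict lookup per rave point instead of scanning every child.
    for rave in (white_rave, black_rave):
        for point in rave:
            child = parent.by_move.get(point)
            if child is not None:
                child.Q += 1
                child.N += 1


size = 3  # small board for illustration; the question uses 11
parent = Parent()
parent.add_children([(i, j) for i in range(size) for j in range(size)])
white = [(0, 0), (1, 1)]
black = [(2, 2), (5, 5)]  # (5, 5) is not a child move, so it is skipped
backup(parent, white, black)
touched = sorted(ch.move for ch in parent.children if ch.N == 1)
print(touched)
```

This changes the algorithm rather than the language, so any gain comes from doing less work per rave point, not from Cython.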

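Some small optimizations I have found: you can get a good idea of how well optimized your Cython code is by looking at the highlighted HTML file generated by cython -a. You will see that general iteration over a memoryview is slow (i.e. pure Python). You get an improvement by changing it to: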

# instead of:
# for move in white_rave:
for i in range(white_rave.shape[0]):
    move = white_rave[i,:]

This lets Cython iterate over the memoryview efficiently.

You can gain a little more speed by turning off some of the safety checks for the memcmp line:

with cython.boundscheck(False), cython.initializedcheck(False):
    if memcmp(&move[0], &ch.move[0], move.nbytes) == 0:

(You will need to cimport cython.) If you do this and you have not initialized ch.move, or either memoryview does not have at least one element, your program may crash.

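I realise this is not a very helpful answer, but as long as you want to keep child as a Python class (even a cdef one) there really is not much you can do to speed it up. You might consider changing it to a C struct (which you could have a C array of), but then you lose all the benefits of working with Python (i.e. you have to manage your own memory, and you cannot easily access it from Python code).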
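To get a feel for that trade-off without leaving Python, here is a rough sketch (not part of the original code) that drops the child objects entirely and keeps the Q and N counters in flat integer arrays, one slot per board point. It assumes every move is an (x, y) coordinate on a size-by-size board, as in the test setup of the question; backup_flat is a hypothetical name.

```python
from array import array

size = 3  # small board for illustration; the question uses 11

# One C int per board point, addressed as x * size + y.
Q = array("i", [0]) * (size * size)
N = array("i", [0]) * (size * size)


def backup_flat(Q, N, size, white_rave, black_rave):
    # No child objects and no search: the move itself is the index.
    for rave in (white_rave, black_rave):
        for x, y in rave:
            idx = x * size + y
            Q[idx] += 1
            N[idx] += 1


backup_flat(Q, N, size, [(0, 0), (1, 2)], [(2, 1)])
print(list(N))  # [1, 0, 0, 0, 0, 1, 0, 1, 0]
```

The cost is the one described above in miniature: there is no longer a child object to attach extra attributes or methods to, only raw counters.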
