Parallelize code which is doing bit wise operation

Question

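I have this code, which reduces memory consumption by packing every 8 elements of the AU matrix into one byte of A, so that a 100k*200k matrix takes less space. As you might expect, this code takes forever to run, and I also plan to increase the number of rows to 200k. I am running it on a fairly powerful instance (CPU and GPU) that I can scale up, so can anyone help parallelize this code to make it faster?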

import numpy as np
import time
colm = int(2000000/8)
rows = 1000000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
start_time = time.time()

A = np.empty((rows,colm), dtype=np.uint8)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        A[i,j] = 0
        for k in range(8):
            if AU[i,(j*8)+k] == 1:
                A[i,j] = A[i,j] | (1<<(7-k))

Answer 1

Python solution:

import numpy as np
import time

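def compute(A, AU):
    A[:,:] = 0
    # Put every 8 columns in AU into A.
    # Iterate over the columns of AU (not A) so that every input bit is visited.
    # Bit i % 8 is used, so the first column of each group of 8 lands in the
    # least significant bit (the loop in the question uses the opposite order).
    for i in range(AU.shape[1]):
        A[:, i//8] = np.bitwise_or(A[:, i//8], np.left_shift(AU[:, i], i % 8))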

colm = int(20000/8)
rows = 10000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)

A = np.empty((rows,colm), dtype=np.uint8)

start_time = time.time()

compute(A, AU)
    
end_time = time.time()
print(end_time - start_time)

This packs the bits in about half a second.
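One thing to watch: `compute()` shifts by `i % 8`, so the first of every eight columns ends up in the least significant bit, whereas the loop in the question puts it in the most significant bit (`1<<(7-k)`). A minimal sketch of the difference, using the `bitorder` argument of `np.packbits` (NumPy 1.17 or later) as a stand-in for the two conventions:

```python
import numpy as np

# One row of eight 0/1 values: only the first two columns are set.
bits = np.array([[1, 1, 0, 0, 0, 0, 0, 0]], dtype=np.uint8)

# MSB-first: the first column becomes bit 7, as in the question's loop.
msb_first = np.packbits(bits, axis=-1, bitorder="big")

# LSB-first: the first column becomes bit 0, as in compute() above.
lsb_first = np.packbits(bits, axis=-1, bitorder="little")

print(msb_first, lsb_first)
```

Either order works as long as the code that later unpacks or queries A uses the same one.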

The same code in C:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char* argv[]) {

    int colm = 200000/8;
    int rows = 10000;
    int cols = colm*8;
    unsigned char *A = (unsigned char *)malloc(rows * colm * sizeof(unsigned char)); 
    unsigned char *AU = (unsigned char *)malloc(rows * cols * sizeof(unsigned char)); 
    int i, j;
    clock_t begin;
    clock_t end;
    double time_spent;

    begin = clock();
        
    // Create AU
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            *(AU + i*cols + j) = (unsigned char) (rand() & 0x01);  
            
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%lf seconds to create AU\n", time_spent);
            
    begin = clock();
    
    // Create a zeroed out A
    for (i = 0; i < rows; i++)
        for (j = 0; j < colm; j++)
            *(A + i*colm + j) = (unsigned char) 0;  

    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%lf seconds to create A\n", time_spent);

    begin = clock();
            
    // Pack into bits
    for (i = 0; i < rows; i++)
        for (j = 0; j < colm; j++) {
            int au_idx = i*cols + j*8;
            for (int k=0; k<8; k++)
                *(A + i*colm + j) |= *(AU + au_idx + k) << k;
        }
            
    end = clock();
    time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
    printf("%lf seconds to pack\n", time_spent);
            

    free(A); 
    free(AU);
    return 0;
}

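Tested with colm=200,000. The bit packing takes 0.27 seconds, versus 0.64 seconds for the optimized Python version provided by Jérôme Richard. The calls to rand() are quite expensive and add considerably to the overall run time. In terms of memory, the C version peaks at 2GB, versus 4.2GB for Python. Further code optimization and parallelization would certainly reduce the run time.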

Julia version:

using Random
colm = 200000÷8
rows = 30000
cols = colm*8

AU = zeros(UInt8, (rows, cols))

rand!(AU)
AU .&= 0x01

A = zeros(UInt8, (rows, colm))

function compute(A, AU)
    for i in 1:size(A)[2]
        start_col = (i-1) << 3
        @views A[:, i] .=  AU[:, start_col + 1] .| 
                   (AU[:, start_col + 2] .<< 1) .|
                   (AU[:, start_col + 3] .<< 2) .|
                   (AU[:, start_col + 4] .<< 3) .|
                   (AU[:, start_col + 5] .<< 4) .|
                   (AU[:, start_col + 6] .<< 5) .|
                   (AU[:, start_col + 7] .<< 6) .|
                   (AU[:, start_col + 8] .<< 7)        
    end
end

@time compute(A, AU)

Julia shines in terms of performance. Results with colm=25,000 and rows=30,000:

Language  Total Run Time (secs)   Bit Packing Time (secs)  Peak Memory (GB)
Python    22.1                    3.0                      6
Julia     11.7                    1.2                      6

Answer 2

Warning: you are trying to allocate a huge amount of memory: about 2 TB, which you probably do not have.
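The 2 TB figure follows directly from the sizes in the question. A quick back-of-the-envelope check, assuming one byte per element for the int8 matrix AU and one bit per element for the packed matrix A:

```python
rows = 1_000_000
cols = 2_000_000

unpacked_bytes = rows * cols        # AU: one int8 (1 byte) per element
packed_bytes = rows * (cols // 8)   # A: eight elements per uint8

print(f"AU: {unpacked_bytes / 1e12:.2f} TB")
print(f"A:  {packed_bytes / 1e9:.0f} GB")
```

So even the packed result needs about 250 GB at this size; the unpacked input is eight times larger.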

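Assuming you have enough memory, or can reduce the size of the dataset, you can write a much faster implementation using Numba's JIT. In addition, you can parallelize the code and replace the slow conditional with a branchless implementation, which speeds up the computation significantly; this is possible because AU contains only binary values. Finally, you can unroll the inner loop over k to make the code faster still. Here is the final implementation: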

import numpy as np
import numba as nb
colm = int(2000000/8)
rows = 1000000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
A = np.empty((rows,colm), dtype=np.uint8)

@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def compute(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = AU[i,offset] << 7
            res |= AU[i,offset+1] << 6
            res |= AU[i,offset+2] << 5
            res |= AU[i,offset+3] << 4
            res |= AU[i,offset+4] << 3
            res |= AU[i,offset+5] << 2
            res |= AU[i,offset+6] << 1
            res |= AU[i,offset+7]
            A[i,j] = res

compute(A, AU)
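The branchless substitution in the Numba version relies on every element of AU being exactly 0 or 1: in that case `bit << s` already equals what the `if` would have OR-ed in. A small pure-Python sketch of that equivalence (the helper names `pack_with_branch` and `pack_branchless` are made up for illustration):

```python
def pack_with_branch(bits):
    """Pack 8 bits (MSB first) using a conditional, as in the question."""
    byte = 0
    for k in range(8):
        if bits[k] == 1:
            byte |= 1 << (7 - k)
    return byte


def pack_branchless(bits):
    """Same result without a conditional; valid only if every bit is 0 or 1."""
    byte = 0
    for k in range(8):
        byte |= bits[k] << (7 - k)
    return byte


# Check all 256 possible 8-bit patterns.
for value in range(256):
    bits = [(value >> (7 - k)) & 1 for k in range(8)]
    assert pack_with_branch(bits) == pack_branchless(bits) == value
```

If AU could ever hold values other than 0 and 1, the two functions would diverge and the conditional (or a mask with `& 1`) would be needed again.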

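On my machine, this code is 37851 times faster than the original implementation on a smaller dataset (colm=int(20000/8) and rows=10000). The original implementation took 6min3s; the optimized one takes 9.6ms.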

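This code is memory-bound on my machine. With the current inputs it is close to optimal, since it spends most of its time reading the AU input matrix. A good additional optimization would be to "compress" the AU matrix into a smaller matrix (if possible).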
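If AU is only an intermediate, one way to act on that last suggestion is to never materialize it in full. A rough sketch, assuming AU can be produced one block of rows at a time (true for the random test data here, not necessarily for real data). `pack_in_blocks` and `block_rows` are hypothetical names, and `np.packbits` is swapped in for the Numba kernel to keep the example free of extra dependencies:

```python
import numpy as np


def pack_in_blocks(rows, cols, block_rows, rng):
    """Build the packed matrix A without holding the full AU in memory.

    Hypothetical helper: assumes AU can be generated (or loaded) one row
    block at a time and that cols is a multiple of 8.
    """
    A = np.empty((rows, cols // 8), dtype=np.uint8)
    for start in range(0, rows, block_rows):
        stop = min(start + block_rows, rows)
        # Only this slice of AU exists at any one time.
        block = rng.integers(0, 2, size=(stop - start, cols), dtype=np.uint8)
        A[start:stop] = np.packbits(block, axis=-1)  # MSB-first packing
    return A


rng = np.random.default_rng(0)
A = pack_in_blocks(rows=1000, cols=2000, block_rows=128, rng=rng)
print(A.shape, A.dtype)
```

Peak memory is then roughly the size of A plus one block of AU, rather than A plus all of AU.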

Answer 3

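After first reading your post yesterday, I was actually planning to write a routine myself using the just-in-time compiler numba, but Jérôme beat me to it and gave you an excellent solution. I do have an alternative to offer, though: why reinvent the wheel when a numpy function already exists that does exactly the same thing: numpy.packbits.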

import numpy as np
A = np.packbits(AU,axis=-1)

That does the job. From my tests it appears to be a good deal slower than Jérôme's version, but it is still far faster than your initial version.
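To convince yourself that `np.packbits` matches the original loop, here is a small check on a toy matrix (the 16x64 size is arbitrary, chosen only so the triple loop finishes instantly). It also shows `np.unpackbits` recovering the original values:

```python
import numpy as np

rng = np.random.default_rng(0)
AU = rng.integers(0, 2, size=(16, 64), dtype=np.int8)

# Reference result: the triple loop from the question (MSB-first).
expected = np.zeros((AU.shape[0], AU.shape[1] // 8), dtype=np.uint8)
for i in range(expected.shape[0]):
    for j in range(expected.shape[1]):
        for k in range(8):
            if AU[i, j * 8 + k] == 1:
                expected[i, j] |= 1 << (7 - k)

# packbits uses the same MSB-first order by default.
A = np.packbits(AU, axis=-1)
assert np.array_equal(A, expected)

# Round trip: unpacking gives back the original 0/1 values.
restored = np.unpackbits(A, axis=-1)
assert np.array_equal(restored, AU)
```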

Tags: python, parallel-processing, performance, multithreading, dask