Deserialization of large numpy arrays using pickle is an order of magnitude slower than using numpy
I am deserializing large numpy arrays (500 MB in this example), and I find the results vary by orders of magnitude between methods. Below are the 3 approaches I timed.
I receive the data from the multiprocessing.shared_memory package, so it arrives as a memoryview object. But for these simple examples, I just create a byte array up front to run the tests.
I wonder if there are any mistakes in these approaches, or if there are other techniques I didn't try. Deserialization in Python is a real pickle of a problem if you want to move data fast and not lock the GIL just for the IO. A good explanation as to why these approaches vary so much would also be a good answer.
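Since the data already lives in multiprocessing.shared_memory, one option worth noting (a minimal sketch of my setup assumption, not a claim about the original code) is skipping deserialization entirely: point np.ndarray at the shared buffer, so the reader side gets a zero-copy view.

```python
# Sketch: zero-copy reconstruction from multiprocessing.shared_memory.
# Assumes shape and dtype are communicated separately (here hard-coded).
import numpy as np
from multiprocessing import shared_memory

shape, dtype = (1024,), np.uint8

# Writer side: allocate a shared block and fill it through a numpy view.
shm = shared_memory.SharedMemory(create=True, size=int(np.prod(shape)))
src = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
src[:] = np.arange(shape[0]) % 256

# Reader side: attach by name and wrap the same memory -- no copy, no pickle.
shm_reader = shared_memory.SharedMemory(name=shm.name)
view = np.ndarray(shape, dtype=dtype, buffer=shm_reader.buf)
ok = bool(np.array_equal(view, np.arange(shape[0]) % 256))

# Views must be released before closing the shared memory block.
del src, view
shm_reader.close()
shm.close()
shm.unlink()
```

This sidesteps the serialization question for same-machine transfers, though it still leaves the shape/dtype bookkeeping to you.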
""" Deserialization speed test """
import numpy as np
import pickle
import time
import io
sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8) # 500 MB data
serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)
result = None
print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize from bytes...')
t0 = time.time()
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))
Results:
Deserialize using pickle...
Time: 0.2509949207 sec
Deserialize from bytes...
Time: 0.0204288960 sec
Deserialize using numpy load from BytesIO...
Time: 28.9850852489 sec
The second option is the fastest, but it is noticeably less elegant, because I need to serialize the shape and dtype information explicitly.
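One way to keep the speed of the raw-bytes approach without hand-managing the metadata (a sketch of my own workaround, not something from the question) is to prepend a tiny struct-packed header carrying dtype and shape, then rebuild with np.frombuffer:

```python
# Sketch: raw array bytes with a minimal self-describing header.
import struct
import numpy as np

def pack(arr: np.ndarray) -> bytes:
    dt = arr.dtype.str.encode()  # e.g. b'|u1' for uint8
    header = struct.pack('<B', len(dt)) + dt
    header += struct.pack('<B', arr.ndim)
    header += struct.pack(f'<{arr.ndim}q', *arr.shape)
    return header + arr.tobytes()

def unpack(buf: bytes) -> np.ndarray:
    n = buf[0]
    dt = np.dtype(buf[1:1 + n].decode())
    off = 1 + n
    ndim = buf[off]; off += 1
    shape = struct.unpack_from(f'<{ndim}q', buf, off); off += 8 * ndim
    # frombuffer is zero-copy over the payload region (read-only view)
    return np.frombuffer(buf, dtype=dt, offset=off).reshape(shape)

a = np.random.randint(0, 255, size=(4, 5), dtype=np.uint8)
b = unpack(pack(a))
assert np.array_equal(a, b)
```

The header cost is a few dozen bytes, so the deserialization time should stay close to the plain tobytes/np.ndarray path.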
I found your question useful; I was looking for the best numpy serialization myself and can confirm that np.load() is the best, except that it is beaten by pyarrow in my additional test below. Arrow is now a hugely popular data serialization framework for distributed computing (e.g. Spark, ...).
""" Deserialization speed test """
import numpy as np
import pickle
import time
import io
import pyarrow as pa
sz = 524288000
sample = np.random.randint(0, 255, size=sz, dtype=np.uint8) # 500 MB data
pa_buf = pa.serialize(sample).to_buffer()
serialized_sample = pickle.dumps(sample)
serialized_bytes = sample.tobytes()
serialized_bytesio = io.BytesIO()
np.save(serialized_bytesio, sample, allow_pickle=False)
serialized_bytesio.seek(0)
result = None
print('Deserialize using pickle...')
t0 = time.time()
result = pickle.loads(serialized_sample)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize from bytes...')
t0 = time.time()
result = np.ndarray(shape=sz, dtype=np.uint8, buffer=serialized_bytes)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize using numpy load from BytesIO...')
t0 = time.time()
result = np.load(serialized_bytesio, allow_pickle=False)
print('Time: {:.10f} sec'.format(time.time() - t0))
print('Deserialize pyarrow')
t0 = time.time()
restored_data = pa.deserialize(pa_buf)
print('Time: {:.10f} sec'.format(time.time() - t0))
Results on an i3.2xlarge on Databricks Runtime 8.3ML, Python 3.8, Numpy 1.19.2, Pyarrow 1.0.1:
Deserialize using pickle...
Time: 0.4069395065 sec
Deserialize from bytes...
Time: 0.0281322002 sec
Deserialize using numpy load from BytesIO...
Time: 0.3059172630 sec
Deserialize pyarrow
Time: 0.0031735897 sec
Your BytesIO result is 100x slower than mine, and I don't know why.
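One caveat about the pyarrow numbers above: pa.serialize / pa.deserialize were deprecated in pyarrow 2.0 (the test uses 1.0.1, where they still work). The suggested successor for this use case is pickle protocol 5 with out-of-band buffers, which likewise avoids copying the array body through the pickle stream. A minimal sketch, using a smaller array for brevity:

```python
# Sketch: pickle protocol 5 (PEP 574) out-of-band buffers for numpy arrays.
# The array payload travels as a separate PickleBuffer rather than being
# embedded in the pickle byte stream, so the copy overhead largely disappears.
import pickle
import numpy as np

sample = np.random.randint(0, 255, size=1_000_000, dtype=np.uint8)

buffers = []
payload = pickle.dumps(sample, protocol=5, buffer_callback=buffers.append)

restored = pickle.loads(payload, buffers=buffers)
assert np.array_equal(restored, sample)
```

This needs Python 3.8+ and a numpy recent enough to support protocol 5; the out-of-band buffers must be transported alongside the payload (e.g. over shared memory) and handed back to pickle.loads.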