Does pickle randomly fail with OSError on large files?
Problem statement
I'm using Python 3 and trying to pickle a dictionary of IntervalTrees which weighs in at around 2 to 3 GB. This is my console output:
10:39:25 - project: INFO - Checking if motifs file was generated by pickle...
10:39:25 - project: INFO - - Motifs file does not seem to have been generated by pickle, proceeding to parse...
10:39:38 - project: INFO - - Parse complete, constructing IntervalTrees...
11:04:05 - project: INFO - - IntervalTree construction complete, saving pickle file for next time.
Traceback (most recent call last):
File "/Users/alex/Documents/project/src/project.py", line 522, in dict_of_IntervalTree_from_motifs_file
save_as_pickled_object(motifs, output_dir + 'motifs_IntervalTree_dictionary.pickle')
File "/Users/alex/Documents/project/src/project.py", line 269, in save_as_pickled_object
def save_as_pickled_object(object, filepath): return pickle.dump(object, open(filepath, "wb"))
OSError: [Errno 22] Invalid argument
The line at which I try to save is:
def save_as_pickled_object(object, filepath): return pickle.dump(object, open(filepath, "wb"))
The error appears roughly 15 minutes after the call to save_as_pickled_object (at 11:20).
I tried this with a much smaller subsection of the motifs file and it worked fine, with exactly the same code, so it must be an issue of scale. Are there any known bugs with pickle in Python 3.6 related to the size of what you are trying to pickle? Are there known bugs with pickling large files in general? Are there any known workarounds?
Thanks!
Update: this question may be a duplicate of Python 3 - Can pickle handle byte objects larger than 4GB?
Solution
Here is the code I switched to.
import os
import pickle

def save_as_pickled_object(obj, filepath):
    """
    This is a defensive way to write pickle.dump, allowing for very large files on all platforms
    """
    max_bytes = 2**31 - 1
    # Pickle to an in-memory bytes object, then write it out in chunks
    # small enough to avoid the single-write size limit.
    bytes_out = pickle.dumps(obj)
    n_bytes = len(bytes_out)  # len(), not sys.getsizeof(), which adds object overhead
    with open(filepath, 'wb') as f_out:
        for idx in range(0, n_bytes, max_bytes):
            f_out.write(bytes_out[idx:idx + max_bytes])
def try_to_load_as_pickled_object_or_None(filepath):
    """
    This is a defensive way to write pickle.load, allowing for very large files on all platforms
    """
    max_bytes = 2**31 - 1
    try:
        input_size = os.path.getsize(filepath)
        bytes_in = bytearray(0)
        with open(filepath, 'rb') as f_in:
            # Read the file back in chunks of at most max_bytes each.
            for _ in range(0, input_size, max_bytes):
                bytes_in += f_in.read(max_bytes)
        obj = pickle.loads(bytes_in)
    except Exception:  # narrower than a bare except: don't swallow KeyboardInterrupt
        return None
    return obj
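As a quick sanity check, the chunked write/read round trip can be exercised on a small object (hypothetical payload and file name; the chunking only matters once the pickle exceeds 2**31 - 1 bytes, but the logic is identical at any size):

```python
import os
import pickle
import tempfile

# Hypothetical small payload standing in for the IntervalTrees dictionary.
data = {"chr1": [(10, 20, "motif_a"), (30, 40, "motif_b")]}

max_bytes = 2**31 - 1
bytes_out = pickle.dumps(data)

path = os.path.join(tempfile.mkdtemp(), "motifs.pickle")
with open(path, "wb") as f_out:
    # Write in chunks of at most max_bytes.
    for idx in range(0, len(bytes_out), max_bytes):
        f_out.write(bytes_out[idx:idx + max_bytes])

# Read back in chunks of the same size, then unpickle.
bytes_in = bytearray()
with open(path, "rb") as f_in:
    for _ in range(0, os.path.getsize(path), max_bytes):
        bytes_in += f_in.read(max_bytes)

restored = pickle.loads(bytes_in)
assert restored == data
```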
Alex, if I'm not mistaken, this bug report describes your problem exactly:
http://bugs.python.org/issue24658
As a workaround, I think you can use pickle.dumps instead of pickle.dump, then write your file out in chunks smaller than 2**31 bytes.
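A minimal sketch of that workaround (function name is mine; the only constraint assumed is keeping each write() call under 2**31 bytes). memoryview is used so that slicing the potentially multi-GB pickle buffer does not copy it chunk by chunk:

```python
import pickle

def chunked_pickle_dump(obj, filepath, max_bytes=2**31 - 1):
    """pickle.dumps + chunked writes, per the workaround described above.

    memoryview slices share the underlying buffer, so each chunk is
    written without copying the full pickle byte string.
    """
    buf = memoryview(pickle.dumps(obj))
    with open(filepath, "wb") as f:
        for idx in range(0, len(buf), max_bytes):
            f.write(buf[idx:idx + max_bytes])
```

Since the file contents are byte-for-byte identical to a normal pickle.dump, the result can be read back with a plain pickle.load once the underlying OS write bug is irrelevant on the reading side.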