Most efficient way of storing a dictionary of DataFrames
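I have a dictionary of DataFrames.
dictionary = {"key1": df1,
"key2": df2, and so on...}
A few Stack Overflow posts and Reddit threads suggest the json module or the pickle module.
Which approach is the most efficient, and why?
When I pickled a small dictionary, the resulting file showed a size of 0 KB, and loading it raised EOFError: Ran out of input, which is explained in Why do I get "Pickle - EOFError: Ran out of input" reading an empty file?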
If you prefer a compact file format, I would suggest pickle.
# import packages
import os
import pickle

import numpy as np
import pandas as pd

# create a dictionary of DataFrames filled with random floats
nrows, ncols, ndataframes = 1_000, 50, 100
my_dict = {f'df_{n}': pd.DataFrame(np.random.rand(nrows, ncols)) for n in range(ndataframes)}

# save the dictionary as a pickle file; the with-block closes the file
with open('my_dict.pickle', 'wb') as pickle_out:
    pickle.dump(my_dict, pickle_out)

# create a new dictionary from the pickle file
with open('my_dict.pickle', 'rb') as pickle_in:
    new_dict = pickle.load(pickle_in)

# print the file size in MB
print('File size pickle file is', round(os.path.getsize('my_dict.pickle') / (1024**2), 1), 'MB')

# sample: first five rows and columns of one DataFrame
new_dict['df_10'].iloc[:5, :5]
Result:
File size pickle file is 38.2 MB
          0         1         2         3         4
0  0.338838  0.501158  0.406240  0.693233  0.567305
1  0.092142  0.569312  0.952694  0.083705  0.006950
2  0.684314  0.373091  0.550300  0.391419  0.877889
3  0.117929  0.597653  0.726894  0.763094  0.466603
4  0.530755  0.472033  0.553457  0.863435  0.906389
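On the EOFError from the question: pickle.load raises it whenever the file it reads is empty. One common cause (an assumption here, not something confirmed for the asker's case) is a file that was opened for writing but never closed, so nothing was flushed to disk. A minimal sketch of a safer pattern follows; save_pickle and load_pickle are hypothetical helper names, and a plain dictionary stands in for the DataFrames so the example needs only the standard library.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Write obj to path; leaving the with-block closes and flushes the file."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle(path):
    """Read one pickled object, failing early with a clearer message on an empty file."""
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} is empty; it was probably never written or flushed")
    with open(path, "rb") as fh:
        return pickle.load(fh)


with tempfile.TemporaryDirectory() as tmp:
    # Reproduce the error: create a 0-byte file and try to unpickle it.
    empty_path = os.path.join(tmp, "empty.pickle")
    open(empty_path, "wb").close()
    try:
        with open(empty_path, "rb") as fh:
            pickle.load(fh)
        empty_error = None
    except EOFError as exc:
        empty_error = str(exc)

    # Round trip through the helpers.
    good_path = os.path.join(tmp, "good.pickle")
    save_pickle({"key1": [1, 2, 3]}, good_path)
    restored = load_pickle(good_path)

print("Error on empty file:", empty_error)
print("Restored object:", restored)
```

The same helpers should work unchanged with a dictionary of DataFrames, since pickle serializes the whole object in one call.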
Another option could be HDFStore, a dict-like object that reads and writes pandas objects using the high-performance HDF5 format. More details are here: http://pandas-docs.github.io/pandas-docs-travis/user_guide/io.html#hdf5-pytables
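A rough sketch of how the HDFStore route might look, assuming the optional PyTables package (tables) is installed, since pandas needs it for HDF5 support. The helper names save_frames_hdf and load_frames_hdf and the file name are made up for illustration. Each DataFrame is stored under its dictionary key, and the leading "/" that HDFStore adds to keys is stripped when reading back.

```python
import pandas as pd


def save_frames_hdf(frames, path):
    """Store each DataFrame under its dictionary key in a single HDF5 file."""
    # mode="w" starts a fresh file; opening the store requires PyTables.
    with pd.HDFStore(path, mode="w") as store:
        for key, df in frames.items():
            store.put(key, df)


def load_frames_hdf(path):
    """Rebuild the dictionary; HDFStore reports keys with a leading '/'."""
    with pd.HDFStore(path, mode="r") as store:
        return {key.lstrip("/"): store[key] for key in store.keys()}
```

Calling save_frames_hdf(my_dict, 'my_dict.h5') and then load_frames_hdf('my_dict.h5') should give back a dictionary with the same keys. Unlike a single pickle, one DataFrame can also be read on its own by key without loading the rest.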