使用 dask 将数据帧划分保存为镶木地板
Saving dataframe divisions to parquet with dask
我目前正在尝试将信息从 dask 保存和读取到 parquet 文件。但是,当尝试使用 dask "to_parquet" 保存数据帧并随后使用 "read_parquet" 再次加载它时,似乎分区信息丢失了。
>>df.divisions
(Timestamp('2014-10-01 17:25:17.928000'), Timestamp('2014-10-01 17:27:18.000860'), Timestamp('2014-10-01 17:29:19.000860'), Timestamp('2014-10-01 17:31:19.000860'), Timestamp('2014-10-01 17:33:20.000860'), Timestamp('2014-10-01 17:35:20.763000'), Timestamp('2014-10-01 17:36:12.992860'))
>>df.to_parquet(folder)
>>del df
>>df = dask.dataframe.read_parquet(folder)
>>df.divisions
(None, None, None, None, None, None, None)
这是故意的吗?
我目前的解决方法是加载后重新设置索引,但这需要很多时间。
>> df = dask.dataframe.read_parquet(folder,index=False).set_index('timestamp', sorted=True)
>> df.divisions
(Timestamp('2014-10-01 17:25:17.928000'), Timestamp('2014-10-01 17:27:18.000860'), Timestamp('2014-10-01 17:29:19.000860'), Timestamp('2014-10-01 17:31:19.000860'), Timestamp('2014-10-01 17:33:20.000860'), Timestamp('2014-10-01 17:35:20.763000'), Timestamp('2014-10-01 17:36:12.992860'))
或者我在保存和加载时是否遗漏了一些选项?
使用 fastparquet 后端进行测试,似乎有效:
> import pandas.util.testing as tm
> df = tm.makeTimeDataFrame()
> df
A B C D
2000-01-03 -0.414197 0.459438 1.105962 -0.791487
2000-01-04 -0.875873 0.987601 0.881839 -1.339756
2000-01-05 0.552543 3.415769 1.008780 0.127757
...
> d = dd.from_pandas(df, 2)
> d.to_parquet('temp.parq')
> dd.read_parquet('temp.parq').divisions
(Timestamp('2000-01-03 00:00:00'),
Timestamp('2000-01-24 00:00:00'),
Timestamp('2000-02-11 00:00:00'))
我目前正在尝试将信息从 dask 保存和读取到 parquet 文件。但是,当尝试使用 dask "to_parquet" 保存数据帧并随后使用 "read_parquet" 再次加载它时,似乎分区信息丢失了。
>>df.divisions
(Timestamp('2014-10-01 17:25:17.928000'), Timestamp('2014-10-01 17:27:18.000860'), Timestamp('2014-10-01 17:29:19.000860'), Timestamp('2014-10-01 17:31:19.000860'), Timestamp('2014-10-01 17:33:20.000860'), Timestamp('2014-10-01 17:35:20.763000'), Timestamp('2014-10-01 17:36:12.992860'))
>>df.to_parquet(folder)
>>del df
>>df = dask.dataframe.read_parquet(folder)
>>df.divisions
(None, None, None, None, None, None, None)
这是故意的吗? 我目前的解决方法是加载后重新设置索引,但这需要很多时间。
>> df = dask.dataframe.read_parquet(folder,index=False).set_index('timestamp', sorted=True)
>> df.divisions
(Timestamp('2014-10-01 17:25:17.928000'), Timestamp('2014-10-01 17:27:18.000860'), Timestamp('2014-10-01 17:29:19.000860'), Timestamp('2014-10-01 17:31:19.000860'), Timestamp('2014-10-01 17:33:20.000860'), Timestamp('2014-10-01 17:35:20.763000'), Timestamp('2014-10-01 17:36:12.992860'))
或者我在保存和加载时是否遗漏了一些选项?
使用 fastparquet 后端进行测试,似乎有效:
> import pandas.util.testing as tm
> df = tm.makeTimeDataFrame()
> df
A B C D
2000-01-03 -0.414197 0.459438 1.105962 -0.791487
2000-01-04 -0.875873 0.987601 0.881839 -1.339756
2000-01-05 0.552543 3.415769 1.008780 0.127757
...
> d = dd.from_pandas(df, 2)
> d.to_parquet('temp.parq')
> dd.read_parquet('temp.parq').divisions
(Timestamp('2000-01-03 00:00:00'),
Timestamp('2000-01-24 00:00:00'),
Timestamp('2000-02-11 00:00:00'))