Save/load pandas dataframe with custom attributes

Question

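I have a pandas.DataFrame to which I have attached some meta information in the form of an attribute. I would like to save/restore the df with this attribute intact, but it gets dropped in the saving process: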

import pandas as pd
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

df.my_attribute = 'can I recover this attribute after saving?'
df.to_pickle('test.pkl')
new_df = pd.read_pickle('test.pkl')
new_df.my_attribute

# AttributeError: 'DataFrame' object has no attribute 'my_attribute'

Other file formats seem even worse: csv and json discard type, index, or column information if you are not careful. Maybe create a new class that extends DataFrame? Open to ideas.

Answer 1

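There is no universal standard here, or even anything close to one, but there are a few options.

1) General advice - I would not use pickle for anything other than the shortest-term serialization (e.g. <1 day)

2) Arbitrary metadata can be packed into the two binary formats that pandas supports, msgpack and HDF5, granted in an ad-hoc manner. You could also do this with CSV, etc., but it becomes even more ad-hoc.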

# msgpack (note: to_msgpack/read_msgpack were deprecated in pandas 0.25 and removed in 1.0)
data = {'df': df, 'my_attribute': df.my_attribute}
pd.to_msgpack('tmp.msg', data)
pd.read_msgpack('tmp.msg')['my_attribute']
# Out[70]: 'can I recover this attribute after saving?'

# hdf
with pd.HDFStore('tmp.h5') as store:
    store.put('df', df)
    store.get_storer('df').attrs.my_attribute = df.my_attribute    
with pd.HDFStore('tmp.h5') as store:
    df = store.get('df')
    df.my_attribute = store.get_storer('df').attrs.my_attribute

df.my_attribute
# Out[79]: 'can I recover this attribute after saving?'
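
To make the "even more ad-hoc" CSV route concrete, here is a minimal sketch using only the standard library. It assumes a made-up convention: the first line of the file is a `#`-prefixed JSON object holding the metadata, followed by ordinary CSV. The helper names `write_csv_with_meta` and `read_csv_with_meta` are hypothetical, not part of pandas or the csv module.

```python
import csv
import io
import json


def write_csv_with_meta(fh, header, rows, meta):
    """Write `meta` as one '#'-prefixed JSON line, then ordinary CSV.

    The '# {...}' first line is an ad-hoc convention assumed for this sketch.
    """
    fh.write("# " + json.dumps(meta) + "\n")
    writer = csv.writer(fh)
    writer.writerow(header)
    writer.writerows(rows)


def read_csv_with_meta(fh):
    """Return (meta, header, rows) from a file written by the helper above."""
    first = fh.readline()
    if not first.startswith("# "):
        raise ValueError("missing metadata line")
    meta = json.loads(first[2:])
    reader = csv.reader(fh)
    header = next(reader)
    rows = list(reader)  # note: CSV values come back as strings
    return meta, header, rows


# Round trip through an in-memory buffer instead of a real file.
buf = io.StringIO()
write_csv_with_meta(
    buf,
    ["sepal length (cm)", "sepal width (cm)"],
    [[5.1, 3.5], [4.9, 3.0]],
    {"my_attribute": "can I recover this attribute after saving?"},
)
buf.seek(0)
meta, header, rows = read_csv_with_meta(buf)
print(meta["my_attribute"])
```

The type information is still lost (every value is read back as a string), which is the caveat about CSV raised in the question. With pandas, the data part of such a file could presumably be read with `pd.read_csv(..., skiprows=1)`, while the metadata line would still need to be parsed by hand.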

3) xarray, which is an n-d extension of pandas, supports storing to the NetCDF file format, which has a much more built-in notion of metadata

import xarray
ds = xarray.Dataset.from_dataframe(df)
ds.attrs['my_attribute'] = df.my_attribute

ds.to_netcdf('test.cdf')
ds = xarray.open_dataset('test.cdf')
ds
Out[8]: 
<xarray.Dataset>
Dimensions:            (index: 150)
Coordinates:
  * index              (index) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ...
Data variables:
    sepal length (cm)  (index) float64 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 ...
    sepal width (cm)   (index) float64 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 ...
    petal length (cm)  (index) float64 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 ...
    petal width (cm)   (index) float64 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 ...
Attributes:
    my_attribute:  can I recover this attribute after saving?

Answer 2

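When I asked this question, I wanted to store a small amount of metadata with my dataframe. Monkey-patching the information onto it was probably the worst option :). If I ran into this problem today, I would probably do one of the following: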

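Use plaintext/markdown (a readme file; simplest and preferable)
Use json if I want a little structure (simple, minimal change to the process)
Use a "production-grade" tool (e.g. sqlite/hdf5/parquet) if this gets more serious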
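
For the sqlite option, one possible shape is a small key/value table keyed by the path of the saved dataframe. This is only a sketch under assumptions: the table name `frame_meta` and the helpers `set_meta` and `get_meta` are hypothetical, and an in-memory database stands in for a real file.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; a file path would be used in practice.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE frame_meta ("
    "path TEXT, key TEXT, value TEXT, PRIMARY KEY (path, key))"
)


def set_meta(conn, path, key, value):
    """Insert or overwrite one metadata entry for the dataframe saved at `path`."""
    conn.execute(
        "INSERT OR REPLACE INTO frame_meta VALUES (?, ?, ?)", (path, key, value)
    )
    conn.commit()


def get_meta(conn, path):
    """Return all metadata for `path` as a plain dict (empty if none is stored)."""
    cur = conn.execute(
        "SELECT key, value FROM frame_meta WHERE path = ?", (path,)
    )
    return dict(cur.fetchall())


set_meta(conn, "test.pkl", "my_attribute", "can I recover this attribute after saving?")
print(get_meta(conn, "test.pkl"))
```

The metadata lives next to the data rather than inside it, so the dataframe itself can be saved in whatever format is convenient.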

json is particularly nice as a format that is readable/editable by both humans and machines. One option is to store a json metadata file:

metadata.json:

[
    {
        "path": "df.pkl",
        "metadata": "some editable metadata string"
    },
    {
        "path": "some/path/to/df2.pkl",
        "metadata": "metadata for df2"
    }
]

You can even parse it into a df:

df_meta = pd.read_json("metadata.json")
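
If pandas is not needed for the lookup, the same file can be written and queried with the standard library alone. The sketch below assumes the list-of-entries layout shown in metadata.json; `save_metadata` and `lookup_metadata` are hypothetical helper names, and a temporary directory stands in for the real project folder.

```python
import json
import os
import tempfile


def save_metadata(meta_path, entries):
    """Write the list of {"path": ..., "metadata": ...} entries as JSON."""
    with open(meta_path, "w", encoding="utf-8") as fh:
        json.dump(entries, fh, indent=4)


def lookup_metadata(meta_path, data_path):
    """Return the metadata string stored for `data_path`, or None if absent."""
    with open(meta_path, encoding="utf-8") as fh:
        entries = json.load(fh)
    for entry in entries:
        if entry["path"] == data_path:
            return entry["metadata"]
    return None


entries = [
    {"path": "df.pkl", "metadata": "some editable metadata string"},
    {"path": "some/path/to/df2.pkl", "metadata": "metadata for df2"},
]

with tempfile.TemporaryDirectory() as tmp:
    meta_path = os.path.join(tmp, "metadata.json")
    save_metadata(meta_path, entries)
    found = lookup_metadata(meta_path, "df.pkl")
    missing = lookup_metadata(meta_path, "no/such/file.pkl")

print(found)
```

Writing the file with `json.dump` rather than by hand also avoids syntax slips such as a trailing comma, which strict JSON parsers reject.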
