Attach data source info to pandas series

Question

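Is there a way to attach information about the data source to a pandas series? At the moment I just add columns to the data frame to indicate the source of each variable...

Many thanks for your thoughts and suggestions!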

Answer 1

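As with most Python objects, you can add attributes using dot (.) syntax. However, take care that your attribute names do not conflict with index labels. Here is a demonstration: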

import pandas as pd

s = pd.Series(list(range(3)), index=list('abc'))
s.a = 10
s.d = 20

print(s.a, s.d)

10 20

print(s)

a    10
b     1
c     2
dtype: int64

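As shown above, you may inadvertently overwrite the value of the label a when you actually wanted to add an attribute named a. One way to mitigate this is to perform a simple check first: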

if 'a' not in s:
    s.a = 100
else:
    print('Attempt to overwrite label when setting attribute aborted!')
    # or raise a custom error
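The check above can be wrapped in a small helper. This is only a sketch: the function name `set_attribute_safely` and the `source` value are hypothetical, and the helper guards against index labels only, not against names of existing Series methods.

```python
import pandas as pd


def set_attribute_safely(series, name, value):
    """Set an ad-hoc attribute, refusing names that are index labels.

    Hypothetical helper for illustration; not part of pandas.
    """
    if name in series.index:
        raise AttributeError(
            f"{name!r} is an index label; setting it would overwrite data"
        )
    setattr(series, name, value)


s = pd.Series(list(range(3)), index=list("abc"))

# 'source' is not a label, so it becomes a plain attribute.
set_attribute_safely(s, "source", "sensor_log.csv")  # hypothetical source name

# 'a' is a label, so the helper refuses and the data stays untouched.
try:
    set_attribute_safely(s, "a", 100)
except AttributeError as exc:
    print(exc)
```

Raising an error rather than printing a message makes the conflict hard to miss in a longer script.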

Note that operations on data frames such as GroupBy, pivot, etc. may return a copy of your data with the attributes removed.
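A minimal sketch of that caveat, assuming a small made-up data frame and a hypothetical `source` attribute: the attribute lives on the original object only, and the object returned by `groupby(...).sum()` does not carry it.

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y"], "value": [1, 2, 3]})
df.source = "survey_2020.csv"  # hypothetical ad-hoc attribute

# groupby returns a new object built from the data, not from df's attributes.
totals = df.groupby("key").sum()

print(hasattr(df, "source"))      # the original still has the attribute
print(hasattr(totals, "source"))  # the result is not expected to have it
```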

Finally, for storing data frames or series with attached metadata, you may wish to consider HDF5.

Answer 2

From the official pandas documentation:

To let original data structures have additional properties, you should let pandas know what properties are added. pandas maps unknown properties to data names overriding __getattribute__. Defining original properties can be done in one of 2 ways:

Define _internal_names and _internal_names_set for temporary properties which WILL NOT be passed to manipulation results.

Define _metadata for normal properties which will be passed to manipulation results.

Below is an example to define two original properties, “internal_cache” as a temporary property and “added_property” as a normal property:

import pandas as pd

class SubclassedDataFrame2(pd.DataFrame):

    # temporary properties
    _internal_names = pd.DataFrame._internal_names + ['internal_cache']
    _internal_names_set = set(_internal_names)

    # normal properties
    _metadata = ['added_property']

    # return the subclass so that manipulation results keep this type
    @property
    def _constructor(self):
        return SubclassedDataFrame2

>>> df = SubclassedDataFrame2({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
>>> df
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

>>> df.internal_cache = 'cached'
>>> df.added_property = 'property'

>>> df.internal_cache
cached
>>> df.added_property
property

# properties defined in _internal_names is reset after manipulation
>>> df[['A', 'B']].internal_cache
AttributeError: 'SubclassedDataFrame2' object has no attribute 'internal_cache'

# properties defined in _metadata are retained
>>> df[['A', 'B']].added_property
property

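As you can see, the benefit of defining custom properties via _metadata is that they are propagated automatically during (most) one-to-one data frame operations. Be aware, though, that during many-to-one data frame operations (such as merge() or concat()), your custom properties will still be lost.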
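Since the question asks about a series rather than a data frame, the same `_metadata` approach can be sketched for a Series subclass. The class name `SourcedSeries`, the attribute `source`, and the file name are hypothetical; extending `pd.Series._metadata` instead of replacing it is a precaution so that the series name keeps propagating.

```python
import pandas as pd


class SourcedSeries(pd.Series):
    """Hypothetical Series subclass that carries a `source` attribute."""

    # Names listed in _metadata are copied onto results of many operations.
    _metadata = pd.Series._metadata + ["source"]

    @property
    def _constructor(self):
        # Make pandas build results as SourcedSeries rather than plain Series.
        return SourcedSeries


prices = SourcedSeries([1.0, 2.0, 3.0], index=list("abc"))
prices.source = "vendor_feed.csv"  # hypothetical data source label

# One-to-one operations: the attribute is carried over.
first_two = prices.head(2)
duplicate = prices.copy()
print(first_two.source, duplicate.source)

# Many-to-one operation: the attribute is not expected to survive,
# so read it with a default instead of assuming it exists.
combined = pd.concat([prices, prices])
print(getattr(combined, "source", None))
```

Reading the attribute with `getattr(..., None)` after a merge or concat avoids an AttributeError when the metadata has been dropped.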
