Attach data source info to pandas Series
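Is there a way to attach information about the data source to a pandas Series? At the moment I just add columns to the DataFrame to indicate the source of each variable...
Many thanks for your ideas and suggestions!
Like most Python objects, a Series lets you add attributes with the dot (.) syntax. However, you need to make sure your attribute names do not clash with index labels. Here is a demo: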
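import pandas as pd
s = pd.Series(list(range(3)), index=list('abc'))
s.a = 10  # 'a' is an existing label, so this overwrites the value stored at 'a'
s.d = 20  # 'd' is not a label, so this creates a new attribute
print(s.a, s.d)
10 20
print(s)
a    10
b     1
c     2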
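As shown above, you can inadvertently overwrite the value of a label when you actually meant to add an attribute named a. One way to mitigate this is to perform a simple check first: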
if 'a' not in s:
    s.a = 100
else:
    print('Attempt to overwrite label when setting attribute aborted!')
    # or raise a custom error
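The check above could be wrapped in a small helper so it is applied consistently. This is only a sketch: `set_attr_safely` and the attribute name `source` are hypothetical names chosen for illustration, not part of pandas.

```python
import pandas as pd


def set_attr_safely(series, name, value):
    """Hypothetical helper: set an attribute unless `name` is an index label."""
    if name in series.index:
        # Assigning via dot syntax would overwrite the data at this label.
        raise AttributeError(f"{name!r} is an index label; refusing to overwrite it")
    setattr(series, name, value)


s = pd.Series(list(range(3)), index=list("abc"))

set_attr_safely(s, "source", "sensor_A")  # fine: 'source' is not a label
print(s.source)

try:
    set_attr_safely(s, "a", 100)  # rejected: 'a' is a label
except AttributeError as exc:
    print(exc)
```

Raising an error rather than printing a message makes the clash hard to miss, but either approach works.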
Note that operations on DataFrames such as GroupBy and pivot may return a copy of the data with your attributes dropped.
Finally, for storing DataFrames or Series with attached metadata, you may want to consider HDF5.
From the official pandas documentation:
To let original data structures have additional properties, you should let pandas know what properties are added. pandas maps unknown properties to data names overriding __getattribute__. Defining original properties can be done in one of 2 ways:
Define _internal_names and _internal_names_set for temporary properties which WILL NOT be passed to manipulation results.
Define _metadata for normal properties which will be passed to manipulation results.
Below is an example to define two original properties, "internal_cache" as a temporary property and "added_property" as a normal property
class SubclassedDataFrame2(pd.DataFrame):
    # temporary properties
    _internal_names = pd.DataFrame._internal_names + ['internal_cache']
    _internal_names_set = set(_internal_names)
    # normal properties
    _metadata = ['added_property']
    @property
    def _constructor(self):
        return SubclassedDataFrame2
>>> df = SubclassedDataFrame2({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
>>> df
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
>>> df.internal_cache = 'cached'
>>> df.added_property = 'property'
>>> df.internal_cache
cached
>>> df.added_property
property
# properties defined in _internal_names are reset after manipulation
>>> df[['A', 'B']].internal_cache
AttributeError: 'SubclassedDataFrame2' object has no attribute 'internal_cache'
# properties defined in _metadata are retained
>>> df[['A', 'B']].added_property
property
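As you can see, the benefit of defining custom properties via _metadata is that they are propagated automatically during (most) one-to-one DataFrame operations. Be aware, though, that during many-to-one DataFrame operations (e.g. merge() or concat()) your custom properties will still be lost.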
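Since the question is about a Series, the same idea could be applied to a Series subclass. The sketch below is an assumption-laden illustration rather than a recommended design: `SourcedSeries` and the `source` property are hypothetical names, and the class extends `pd.Series._metadata` instead of replacing it so that the Series name keeps propagating.

```python
import pandas as pd


class SourcedSeries(pd.Series):
    # Hypothetical subclass: 'source' is an illustrative property name.
    # Extend the existing list so the built-in name handling is kept.
    _metadata = pd.Series._metadata + ["source"]

    @property
    def _constructor(self):
        return SourcedSeries


s = SourcedSeries([1, 2, 3], index=list("abc"), name="temperature")
s.source = "sensor_A"

# One-to-one operation: the property is carried over to the result.
doubled = s * 2
print(doubled.source)

# Many-to-one operation: the property is not carried over.
combined = pd.concat([s, s])
print(getattr(combined, "source", None))
```

If the source has to survive concat() or merge(), it probably needs to be reattached by hand afterwards, or kept in a column as the question already does.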