Unexplained behavior with Pandas Split (group) + Apply + Rejoin (concat), but only when sorting
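I am trying to compute the distance between a column and its lag (shift) for groups in a Pandas dataframe. The groups need to be sorted so that the shift is one time step earlier. The standard way to do this is to .groupby() (aka Split), then .apply() the distance function to each group, then rejoin with .concat(). This works fine, but only when I do not explicitly sort the grouped dataframe. When I sort the grouped dataframe, I get an error in the rejoin step.
Here is my example code, with which I can reproduce the unexpected behavior: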
import pandas as pd

def dist_apply(group):
    # when commented out, this code will run to completion (!)
    group.sort_values(by='T',inplace=True)
    group['shift'] = group['Y'].shift()
    group['dist'] = group['Y'] - group['shift']
    return group

df = pd.DataFrame(pd.DataFrame({'X': ['A', 'B', 'A', 'B', 'A', 'B'], 'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7], 'Y': [7, 1, 8, 3, 9, 5]}))
print(df)

# split
df_g = df.groupby(['X'])
# apply
df_g = df_g.apply(dist_apply)
print(df_g)
# rejoin
df = pd.concat([df,df_g],axis=1)
print(df)
With the line that sorts the grouped dataframe commented out, the code prints this, which is as expected:
X T Y
0 A 0.9 7
1 B 0.8 1
2 A 0.7 8
3 B 0.9 3
4 A 0.8 9
5 B 0.7 5
X T Y shift dist
0 A 0.9 7 NaN NaN
1 B 0.8 1 NaN NaN
2 A 0.7 8 7.0 1.0
3 B 0.9 3 1.0 2.0
4 A 0.8 9 8.0 1.0
5 B 0.7 5 3.0 2.0
X T Y X T Y shift dist
0 A 0.9 7 A 0.9 7 NaN NaN
1 B 0.8 1 B 0.8 1 NaN NaN
2 A 0.7 8 A 0.7 8 7.0 1.0
3 B 0.9 3 B 0.9 3 1.0 2.0
4 A 0.8 9 A 0.8 9 8.0 1.0
5 B 0.7 5 B 0.7 5 3.0 2.0
With the sorting line in place, the traceback looks like this:
Traceback (most recent call last):
  File "test.py", line 19, in <module>
    df = pd.concat([df,df_g],axis=1)
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 229, in concat
    return op.get_result()
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 420, in get_result
    indexers[ax] = obj_labels.reindex(new_labels)[1]
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 2236, in reindex
    target = MultiIndex.from_tuples(target)
  File "/Users/me/miniconda3/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 396, in from_tuples
    arrays = list(lib.tuples_to_object_array(tuples).T)
  File "pandas/_libs/lib.pyx", line 2287, in pandas._libs.lib.tuples_to_object_array
TypeError: object of type 'int' has no len()
Sorting, but not running the concat, prints this df_g for me:
     X    T  Y  shift  dist
X
A 2  A  0.7  8    NaN   NaN
  4  A  0.8  9    8.0   1.0
  0  A  0.9  7    9.0  -2.0
B 5  B  0.7  5    NaN   NaN
  1  B  0.8  1    5.0  -4.0
  3  B  0.9  3    1.0   2.0
This shows that it is grouped differently from the printout of the unsorted df_g (above), but it is not clear how concat breaks in this case.
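The printout above suggests one likely cause: the sorted result carries a two-level index (group key plus original row label), while df has a flat integer index, so concat cannot line the labels up. The following is a minimal sketch of that reading, not a confirmed diagnosis of the traceback. The helper name dist_sorted and the choice to select the ['T', 'Y'] columns before .apply() are assumptions made for illustration; the sketch drops the group-key level and joins on the remaining labels.

```python
import pandas as pd

df = pd.DataFrame({
    'X': ['A', 'B', 'A', 'B', 'A', 'B'],
    'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
    'Y': [7, 1, 8, 3, 9, 5],
})

def dist_sorted(group):
    # Hypothetical helper: work on a sorted copy rather than sorting in place.
    out = group.sort_values(by='T')
    out['shift'] = out['Y'].shift()
    out['dist'] = out['Y'] - out['shift']
    return out

# Selecting the value columns keeps the grouping column out of the
# frames that are handed to the function.
df_g = df.groupby('X')[['T', 'Y']].apply(dist_sorted)

print(df.index.nlevels)    # levels in the original frame's index
print(df_g.index.nlevels)  # levels in the grouped, sorted result

# Drop the group-key level so the remaining labels are the original row
# labels, then join on those labels instead of concatenating.
aligned = df_g.droplevel(0)
result = df.join(aligned[['shift', 'dist']])
print(result)
```

Because the join is by label, each shift and dist value lands on the row it was computed from, regardless of the order in which the groups were sorted.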
Update: I thought I had solved it by renaming the offending column ('X' in this case) and calling .reset_index() on the grouped dataframe before merging.
df_g.columns = ['X_g','T','Y','shift','dist']
df = pd.concat([df,df_g.reset_index()],axis=1)
This runs as expected, and prints:
X T Y X level_1 X_g T Y shift dist
0 A 0.9 7 A 2 A 0.7 8 NaN NaN
1 B 0.8 1 A 4 A 0.8 9 8.0 1.0
2 A 0.7 8 A 0 A 0.9 7 9.0 -2.0
3 B 0.9 3 B 5 B 0.7 5 NaN NaN
4 A 0.8 9 B 1 B 0.8 1 5.0 -4.0
5 B 0.7 5 B 3 B 0.9 3 1.0 2.0
But on closer inspection, this row shows that the merge is incorrect:
1 B 0.8 1 A 4 A 0.8 9 8.0 1.0
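The mismatched row is consistent with how pd.concat(..., axis=1) pairs rows: by index label, not by content. After .reset_index() the sorted rows are relabelled 0 to 5 in their sorted order, so they are paired with whatever row of df happens to sit at the same position. A toy sketch of the difference, using made-up frames named left and right (not from the question):

```python
import pandas as pd

left = pd.DataFrame({'Y': [7, 1, 8]}, index=[0, 1, 2])
# The same values stored in a different row order, as after a sort,
# but still carrying their original row labels.
right = pd.DataFrame({'Y_sorted': [8, 7, 1]}, index=[2, 0, 1])

# Labels are kept: each value is paired with the row it came from.
by_label = pd.concat([left, right], axis=1)

# Labels are discarded: rows are paired purely by position.
by_position = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(by_label)
print(by_position)
```

In by_label the two columns agree on every row; in by_position they do not, which is the same symptom as the row shown above.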
I am using Mac OSX with Python 3.7.6 | packaged by conda-forge | (default, Jan 7 2020, 22:05:27) and Pandas 0.24.2 + Numpy 1.17.3. I also tried upgrading to Pandas 0.25.3 and Numpy 1.17.5, with the same result.
This works for now.
Rename the columns to avoid duplicates:
df_g.columns = ['X_g','T','Y','shift','dist']
Reset the index:
df_g = df_g.reset_index(level=[0,1])
Then do a simple merge. If you want to keep the sorted group order, put df_g first:
df = pd.merge(df_g,df)
This gives me:
X level_1 X_g T Y shift dist
0 A 2 A 0.7 8 NaN NaN
1 A 4 A 0.8 9 8.0 1.0
2 A 0 A 0.9 7 9.0 -2.0
3 B 5 B 0.7 5 NaN NaN
4 B 1 B 0.8 1 5.0 -4.0
5 B 3 B 0.9 3 1.0 2.0
Full code:
import pandas as pd

def dist_apply(group):
    group.sort_values(by='T',inplace=True)
    group['shift'] = group['Y'].shift()
    group['dist'] = group['Y'] - group['shift']
    return group

df = pd.DataFrame(pd.DataFrame({'X': ['A', 'B', 'A', 'B', 'A', 'B'], 'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7], 'Y': [7, 1, 8, 3, 9, 5]}))
print(df)
df_g = df.groupby(['X'])
df_g = df_g.apply(dist_apply)
#print(df_g)
df_g.columns = ['X_g','T','Y','shift','dist']
df_g = df_g.reset_index(level=[0,1])
#print(df_g)
df = pd.merge(df_g,df)
print(df)
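Note that pd.merge(df_g, df) with no `on` argument matches on the columns the two frames share (here X, T and Y), so it relies on those value combinations being unique. As a possible alternative, not part of the answer above, the same shift and dist columns can be computed without .apply() or any rejoin step: sort once, take a grouped shift, and let column assignment align the result by row label. A minimal sketch, where the variable name ordered is an assumption:

```python
import pandas as pd

df = pd.DataFrame({
    'X': ['A', 'B', 'A', 'B', 'A', 'B'],
    'T': [0.9, 0.8, 0.7, 0.9, 0.8, 0.7],
    'Y': [7, 1, 8, 3, 9, 5],
})

# Sort by time once; groupby keeps this row order inside each group.
ordered = df.sort_values('T')

# The grouped shift keeps the original row labels, so assigning it back
# to df aligns every value with the row it belongs to.
df['shift'] = ordered.groupby('X')['Y'].shift()
df['dist'] = df['Y'] - df['shift']

print(df)
```

This keeps df in its original row order; call df.sort_values(['X', 'T']) afterwards if the sorted group order is wanted.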