Pandas groupby() different output with versions 0.23.4 and 1.3.4
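I have 2 codebases with identical code; the only difference is the pandas version in use:
- The old environment uses pandas version 0.23.4
- The new environment uses pandas version 1.3.4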
I debugged down to this line of code, after which result differs:
result = df.groupby(group_items, as_index=as_index, sort=sort)[sum_items].sum()
The variables df, group_items, as_index, sort and sum_items are exactly the same in the old and new environments.
However, the returned result is slightly different in the new version. Specifically, the output looks like this:
New environment:
df.groupby(group_items, as_index=as_index, sort=sort)[sum_items].sum()
SST_ADJ_TYPE SST_ADJ_RULE ... NCI AMOUNT
0 0 SST22a,SST22b, ... 1874757.0
1 0 SST22a,SST22b, ... 5945263.0
2 0 SST22a,SST22b, ... 4303110.0
3 0 SST22a,SST22b, ... 5342991.0
4 0 SST22a,SST22b, ... 9245478.0
... ... ... ... .. ...
133674 3 SST22b,SST07, ... 4164305.0
133675 3 SST22b,SST07, ... 7280203.0
133676 3 SST22b,SST07, ... 1235752.0
133677 3 SST22b,SST07, ... 3115825.0
133678 3 SST22b,SST07, ... 1436891.0
[133679 rows x 16 columns]
Old environment:
df.groupby(group_items, as_index=as_index, sort=sort)[sum_items].sum()
SST_ADJ_TYPE SST_ADJ_RULE ... NCI AMOUNT
0 0 SST22a,SST22b, ... 1874757.0
1 0 SST22a,SST22b, ... 5945263.0
2 0 SST22a,SST22b, ... 4303110.0
3 0 SST22a,SST22b, ... 5342991.0
4 0 SST22a,SST22b, ... 9245478.0
5 0 SST22a,SST22b, ... 4016202.0
6 0 SST22a,SST22b, ... 8799969.0
7 0 SST22a,SST22b, ... 1503269.0
8 0 SST22a,SST22b, ... 6385991.0
9 0 SST22a,SST22b, ... 1686520.0
10 0 SST22a,SST22b, ... 5287114.0
11 0 SST22a,SST22b, ... 2648534.0
12 0 SST22a,SST22b, ... 6159017.0
13 0 SST22a,SST22b, ... 5959591.0
14 0 SST22a,SST22b, ... 5809998.0
15 0 SST22a,SST22b, ... 4929077.0
16 0 SST22a,SST22b, ... 9166004.0
17 0 SST22a,SST22b, ... 2124498.0
18 0 SST22a,SST22b, ... 3051659.0
19 0 SST22a,SST22b, ... 1859001.0
20 0 SST22a,SST22b, ... 8522834.0
21 0 SST22a,SST22b, ... 7803526.0
22 0 SST22a,SST22b, ... 4067546.0
23 0 SST22a,SST22b, ... 9218486.0
24 0 SST22a,SST22b, ... 1453153.0
25 0 SST22a,SST22b, ... 7411706.0
26 0 SST22a,SST22b, ... 9160444.0
27 0 SST22a,SST22b, ... 6255426.0
28 0 SST22a,SST22b, ... 6007841.0
29 0 SST22a,SST22b, ... 4744588.0
... ... ... ... .. ...
133649 3 SST22b,SST07, ... 6487572.0
133650 3 SST22b,SST07, ... 3593805.0
133651 3 SST22b,SST07, ... 9192954.0
133652 3 SST22b,SST07, ... 2394981.0
133653 3 SST22b,SST07, ... 9398971.0
133654 3 SST22b,SST07, ... 5536294.0
133655 3 SST22b,SST07, ... 8759613.0
133656 3 SST22b,SST07, ... 2012212.0
133657 3 SST22b,SST07, ... 7930551.0
133658 3 SST22b,SST07, ... 3407871.0
133659 3 SST22b,SST07, ... 3071541.0
133660 3 SST22b,SST07, ... 1863129.0
133661 3 SST22b,SST07, ... 8439646.0
133662 3 SST22b,SST07, ... 1518097.0
133663 3 SST22b,SST07, ... 7396702.0
133664 3 SST22b,SST07, ... 8470274.0
133665 3 SST22b,SST07, ... 8363095.0
133666 3 SST22b,SST07, ... 1115614.0
133667 3 SST22b,SST07, ... 6317772.0
133668 3 SST22b,SST07, ... 2645613.0
133669 3 SST22b,SST07, ... 6555039.0
133670 3 SST22b,SST07, ... 5274987.0
133671 3 SST22b,SST07, ... 5779789.0
133672 3 SST22b,SST07, ... 6974948.0
133673 3 SST22b,SST07, ... 6370779.0
133674 3 SST22b,SST07, ... 4164305.0
133675 3 SST22b,SST07, ... 7280203.0
133676 3 SST22b,SST07, ... 1235752.0
133677 3 SST22b,SST07, ... 1436891.0
133678 3 SST22b,SST07, ... 3115825.0
[133679 rows x 16 columns]
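As you can see, the number of rows and columns is the same.
The columns of the two results are also exactly the same.
However, when you check the AMOUNT column, you can see that the result from the NEW environment has mixed-up values: for example, the values in the last row (133678) and the row before it (133677) are swapped relative to the old environment.
Any idea why this happens?
PS: Unfortunately, I can't provide a DataFrame you could load, because the DataFrame I'm using contains a lot of data. I'm mostly looking for a theoretical answer explaining what changed between the pandas versions mentioned above, and which parameter to use in the new environment to get exactly the same result as in the old one.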
Well, your data is not sorted, and it seems that pandas returns the rows in a different order in those two versions.
From the pandas documentation:
sort in group by just sort the group keys and this does not influence the order of observations within each group.
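To make the quoted behaviour concrete, here is a minimal sketch on a tiny made-up DataFrame (the column name key and the values are assumptions for illustration, not the asker's data). It only shows how the sort argument affects the order of the group keys in the result; it does not reproduce the version difference itself.

```python
import pandas as pd

# Hypothetical toy data; "key" stands in for the real group_items.
df = pd.DataFrame({
    "key": ["b", "a", "b", "a"],
    "AMOUNT": [1.0, 2.0, 3.0, 4.0],
})

# sort=True: the group keys in the result are sorted.
sorted_keys = df.groupby("key", as_index=False, sort=True)[["AMOUNT"]].sum()

# sort=False: groups appear in order of first appearance in df.
unsorted_keys = df.groupby("key", as_index=False, sort=False)[["AMOUNT"]].sum()

print(sorted_keys)
print(unsorted_keys)
```

The sums per key are identical in both calls; only the row order of the result differs.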
You can sort the data after the groupby with .sort_values(); see the docs for details.
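A minimal sketch of that suggestion, assuming the result should be ordered by the grouping columns. The helper name normalize_order and the sample data are hypothetical; group_items and sum_items mirror the variable names from the question. Since each combination of group keys occurs exactly once after a groupby, sorting by all of the group keys should give the same row order regardless of the order in which pandas returned the groups.

```python
import pandas as pd

def normalize_order(result, key_columns):
    """Sort result by key_columns and renumber the index from 0."""
    return result.sort_values(by=key_columns).reset_index(drop=True)

# Hypothetical sample data, not the asker's DataFrame.
df = pd.DataFrame({
    "SST_ADJ_TYPE": [3, 0, 3, 0],
    "SST_ADJ_RULE": ["SST22b,SST07,", "SST22a,SST22b,",
                     "SST22b,SST07,", "SST22a,SST22b,"],
    "AMOUNT": [1.0, 2.0, 3.0, 4.0],
})
group_items = ["SST_ADJ_TYPE", "SST_ADJ_RULE"]
sum_items = ["AMOUNT"]

# With sort=False the groups come back in order of first appearance.
result = df.groupby(group_items, as_index=False, sort=False)[sum_items].sum()

# Explicitly sorting afterwards makes the row order independent of that.
stable = normalize_order(result, group_items)
print(stable)
```

The same helper could be applied to the result in both environments before comparing them, for example with pandas.testing.assert_frame_equal.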