How to correctly use pandas agg function when running groupby on a column of type timestamp/datetime/datetime64?

Question

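I am trying to understand why calling count() directly on a group returns the correct answer (in this example, 2 rows in that group), but calling count() through a lambda in agg returns a timestamp at the start of the epoch ("1970-01-01 00:00:00.000000002").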

import numpy as np
from pandas import DataFrame

# Using groupby(lambda x: True) in the code below just as an illustrative example.
# It will always create a single group.
x = DataFrame({'time': [np.datetime64('2005-02-25'), np.datetime64('2006-03-30')]}).groupby(lambda x: True)

display(x.count())
>>time
>>True  2

display(x.agg(lambda x: x.count()))
>>time
>>True  1970-01-01 00:00:00.000000002

Could this be a bug in pandas? I am using pandas version 0.16.1, IPython version 3.1.0, and numpy version 1.9.2.

I get the same result whether I use the standard Python datetime, np.datetime64, or the pandas Timestamp.

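Edit (per the accepted answer from @jeff, it looks like I may need to cast the column to dtype object before applying aggregation functions that do not return a datetime type):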

import datetime

dt = [datetime.datetime(2012, 5, 1)] * 2
x = DataFrame({'time': dt})
x['time2'] = x['time'].astype(object)
display(x)
y = x.groupby(lambda x: True)
y.agg(lambda x: x.count())

>>time  time2
>>True  1970-01-01 00:00:00.000000002   2

Answer 1

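Here x is the original frame from above (not your groupby). Passing a UDF, e.g. a lambda, calls it on each Series of the group. So this is the result of the function: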

In [35]: x.count()
Out[35]: 
time    2
dtype: int64

That result is then coerced back to the original dtype of the Series. So the integer 2 becomes:

In [36]: Timestamp(2)
Out[36]: Timestamp('1970-01-01 00:00:00.000000002')

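Which is exactly what you are seeing. The point of the coercion back to the original dtype is to preserve that dtype whenever possible. Not doing this would make the groupby results even more magical.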
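To make the above concrete, here is a minimal sketch of two ways to get an integer count instead of a coerced timestamp. It assumes a small example frame built with pd.to_datetime; the names df, builtin_count, and udf_count are illustrative and not part of the original question. Newer pandas releases may no longer apply this coercion, so the sketch only relies on behavior that should hold either way.

```python
import pandas as pd

# Example frame with a datetime64 column (illustrative data).
df = pd.DataFrame({"time": pd.to_datetime(["2005-02-25", "2006-03-30"])})

# An integer passed to Timestamp is read as nanoseconds since the epoch,
# which is why a count of 2 shows up as 1970-01-01 00:00:00.000000002.
two_ns = pd.Timestamp(2)
epoch_plus_two_ns = pd.Timestamp("1970-01-01") + pd.Timedelta(2, unit="ns")
print(two_ns == epoch_plus_two_ns)

# Option 1: use the built-in groupby count, which returns integers.
builtin_count = df.groupby(lambda idx: True).count()

# Option 2: cast the datetime column to object before aggregating with a UDF,
# so there is no datetime dtype for the result to be coerced back to.
as_object = df.assign(time=df["time"].astype(object))
udf_count = as_object.groupby(lambda idx: True).agg(lambda s: s.count())

print(builtin_count)
print(udf_count)
```

When the aggregation is one pandas already provides (count, size, nunique, and so on), the built-in method is usually the simpler choice; the object cast is mainly useful for custom functions that return something other than a datetime.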
