How to apply aggregations (sum, mean, max, min, etc.) across columns in pydatatable?

Question

I have a datatable Frame:


import datatable as dt
from datatable import f, by

DT_X = dt.Frame({
    'issue': ['cs-1', 'cs-2', 'cs-3', 'cs-1', 'cs-3', 'cs-2'],
    'speech': [1, 1, 1, 0, 1, 1],
    'narrative': [1, 0, 1, 1, 1, 0],
    'thought': [0, 1, 1, 0, 1, 1]
})

which looks like this:

Out[5]: 
   | issue  speech  narrative  thought
-- + -----  ------  ---------  -------
 0 | cs-1        1          1        0
 1 | cs-2        1          0        1
 2 | cs-3        1          1        1
 3 | cs-1        0          1        0
 4 | cs-3        1          1        1
 5 | cs-2        1          0        1

[6 rows x 4 columns]

I now group by issue and sum the values of the three columns:

DT_X[:,{'speech': dt.sum(f.speech),
        'narrative': dt.sum(f.narrative),
        'thought': dt.sum(f.thought)},
        by(f.issue)]

This produces the output:

Out[6]: 
   | issue  speech  narrative  thought
-- + -----  ------  ---------  -------
 0 | cs-1        1          2        0
 1 | cs-2        2          0        2
 2 | cs-3        2          2        2

[3 rows x 4 columns]

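Here I spelled out each field name and the aggregation function (dt.sum) by hand. With only 3 columns that is easy enough, but what if I have to work with 10, 20, or more fields?

Is there another way to do this?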

For reference, R data.table offers the same functionality with:

DT[,lapply(.SD,sum),by=.(issue),.SDcols=c('speech','narrative','thought')]

Answer 1

This is one of the solutions recommended by @Erez.

DT_X[:,{name: dt.sum(getattr(f, name)) for name in ['speech', 'narrative', 'thought']},
by(f.issue)]

And the output:

Out[7]: 
   | issue  speech  narrative  thought
-- + -----  ------  ---------  -------
 0 | cs-1        1          2        0
 1 | cs-2        2          0        2
 2 | cs-3        2          2        2

[3 rows x 4 columns]

Answer 2

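Most functions in datatable, including sum(), are automatically applied across all columns if given a multi-column set as an argument. Thus R's lapply(.SD, sum) becomes simply sum(.SD), except that there is no .SD in Python; instead we use the f symbol and its combinations. In your case, f[:] selects all columns other than the groupby columns, so it is basically equivalent to .SD.

Secondly, all unary functions (i.e. functions that act on a single column, as opposed to binary functions such as + or corr) pass through the names of their columns. Thus sum(f[:]) produces a set of columns with the same names as in f[:].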

Putting it all together:

>>> from datatable import by, sum, f, dt

>>> DT_X[:, sum(f[:]), by(f.issue)]
   | issue  speech  narrative  thought
-- + -----  ------  ---------  -------
 0 | cs-1        1          2        0
 1 | cs-2        2          0        2
 2 | cs-3        2          2        2

[3 rows x 4 columns]


python

py-datatable