Use dplyr's summarise and summarise_each together?

Question

I'd like to apply both dplyr::summarise and dplyr::summarise_each to a grouped data frame in the same step. Is that possible?

My data looks like this:

mydf <- data.frame(
    id = c(rep(1,2), rep(2, 3), rep(3, 4)), 
    amount = c(rep(1,4), rep(2,5)), 
    type1 = c(rep(1, 2), rep(0, 7)),
    type2 = c(rep(0, 4), rep(1, 5))
)
mydf
#  id amount type1 type2
#1  1      1     1     0
#2  1      1     1     0
#3  2      1     0     0
#4  2      1     0     0
#5  2      2     0     1
#6  3      2     0     1
#7  3      2     0     1
#8  3      2     0     1
#9  3      2     0     1

I want to group by id, sum the amount variable, and take the maximum of each type variable. I know I can do it like this:

mydf %>% 
    group_by(id) %>% 
    summarise(amount = sum(amount), type1 = max(type1), type2 = max(type2))

However, I have many type variables, so I would prefer something like the following (but with the sum of amount included as well):

mydf %>%
    group_by(id) %>%
    summarise_each(funs(max), matches("type"))

Answer 1

I'm not sure what the idiomatic way is with dplyr, but with data.table this is very idiomatic:

library(data.table)
setDT(mydf)[, c(amount = sum(amount), 
                lapply(.SD[, grep("type", names(mydf), value = TRUE), with = FALSE], max)),
            by = id]
#    id amount type1 type2
# 1:  1      2     1     0
# 2:  2      4     0     1
# 3:  3      8     0     1

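Basically, we combine the two operations with c(): lapply(.SD, max) plays the role of dplyr's summarise_each, and matches is just grep (as its source code clearly shows). with = FALSE makes data.table treat the grep result as a character vector of column names when subsetting .SD (which stands for SubData), instead of evaluating it as an expression inside the data.table's frame.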
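The same combination can also be written without subsetting .SD by hand. The following is only a sketch, assuming a data.table version that supports the .SDcols argument; the names dt, type_cols and res_dt are hypothetical and not part of the original answer.

```r
library(data.table)

# Same data as in the question, built directly as a data.table
# (dt is a hypothetical name, used so that mydf is left untouched).
dt <- data.table(
  id     = c(rep(1, 2), rep(2, 3), rep(3, 4)),
  amount = c(rep(1, 4), rep(2, 5)),
  type1  = c(rep(1, 2), rep(0, 7)),
  type2  = c(rep(0, 4), rep(1, 5))
)

# Names of all columns containing "type".
type_cols <- grep("type", names(dt), value = TRUE)

# .SDcols restricts .SD to the type columns; amount is still visible in j.
# c() joins the single sum with the list of per-column maxima.
res_dt <- dt[, c(list(amount = sum(amount)), lapply(.SD, max)),
             by = id, .SDcols = type_cols]
print(res_dt)
```

Restricting .SD through .SDcols avoids building the full .SD for every group, which is one reason this form is often preferred; whether it matters for a given data set is an assumption to check.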

Answer 2

Using dplyr:

library(dplyr)

mydf %>% 
     group_by(id) %>% 
     mutate(amount = sum(amount)) %>% 
     mutate_each(funs(max), matches("type")) %>%
     unique

#Source: local data table [3 x 4]

#  id amount type1 type2
#1  1      2     1     0
#2  2      4     0     1
#3  3      8     0     1

Or simply, as @HongOoi suggested:

mydf %>% 
     group_by(id) %>% 
     mutate(amount=sum(amount)) %>% 
     summarise_each(funs(max))

Answer 3

A more general approach with dplyr might be:

mydf %>%
  group_by(id) %>%
  mutate_each('sum', amount) %>%
  mutate_each('max', matches("type")) %>%
  summarise_each('first', amount, matches("type"))

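This shares the advantage of Veerendra Gadekar's original answer: exactly one aggregate function is applied to each column. That matters if we need sd or something similar instead of max. Hong Ooi's solution breaks down in that case, because summarise_each would apply the same function to the already-summed amount column as well. His solution also breaks if there are character columns. A third advantage is that columns that are not part of the calculation are dropped.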
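For readers on a recent dplyr, the "one function per column" idea can be sketched in a single summarise call. This assumes dplyr 1.0.0 or later, where across() is available and summarise_each and funs are deprecated; it is not part of the original answers, and res is a hypothetical name.

```r
library(dplyr)

# Same data as in the question.
mydf <- data.frame(
  id     = c(rep(1, 2), rep(2, 3), rep(3, 4)),
  amount = c(rep(1, 4), rep(2, 5)),
  type1  = c(rep(1, 2), rep(0, 7)),
  type2  = c(rep(0, 4), rep(1, 5))
)

res <- mydf %>%
  group_by(id) %>%
  summarise(
    amount = sum(amount),          # one explicit summary for amount
    across(matches("type"), max)   # max applied to every column matching "type"
  )
print(res)
```

Because amount and the type columns are summarised by separate expressions, replacing max with sd here would only affect the type columns.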
