Aggregating in R

Question

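I have a data frame with two columns of interest. I would like to add two more columns to the data set, holding counts based on aggregation.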

df <- structure(list(ID = c(1045937900, 1045937900), 
SMS.Type = c("DF1", "WCB14"), 
SMS.Date = c("12/02/2015 19:51", "13/02/2015 08:38"), 
Reply.Date = c("", "13/02/2015 09:52")
), row.names = 4286:4287, class = "data.frame")

I would simply like to count the number of instances where SMS.Type and Reply.Date are not null. So for the toy example above, I would get 2 for SMS.Type and 1 for Reply.Date.

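I then want to add these to the data frame as total counts (I am aware the values will be repeated across the rows of the original data set, but that is fine).

I have been playing around with the aggregate and count functions, but to no avail.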

mytempdf <-aggregate(cbind(testtrain$SMS.Type,testtrain$Response.option)~testtrain$ID,
                  train, 
                  function(x) length(unique(which(!is.na(x)))))

mytempdf <- aggregate(testtrain$Reply.Date~testtrain$ID,
                  testtrain, 
                  function(x) length(which(!is.na(x))))

Can anyone help?

Thank you for your time.

Answer 1

Using data.table you could do the following (I've added a real NA to your original data). I'm also not sure whether you are really looking for length(unique()) or just length.

library(data.table)
cols <- c("SMS.Type", "Reply.Date")
setDT(df)[, paste0(cols, ".count") := 
                  lapply(.SD, function(x) length(unique(na.omit(x)))), 
                  .SDcols = cols, 
            by = ID]
#            ID SMS.Type         SMS.Date       Reply.Date SMS.Type.count Reply.Date.count
# 1: 1045937900      DF1 12/02/2015 19:51               NA              2                1
# 2: 1045937900    WCB14 13/02/2015 08:38 13/02/2015 09:52              2                1

In the development version (v >= 1.9.5) you could also use the uniqueN function.
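As a rough sketch of what the uniqueN variant might look like (assuming a data.table version that ships uniqueN; the toy data is rebuilt here with a real NA, and `dt` is just an illustrative name):

```r
library(data.table)

# Toy data from the question, with the empty string replaced by a real NA
dt <- data.table(
  ID         = c(1045937900, 1045937900),
  SMS.Type   = c("DF1", "WCB14"),
  Reply.Date = c(NA, "13/02/2015 09:52")
)

cols <- c("SMS.Type", "Reply.Date")

# uniqueN() counts distinct values; NAs are dropped first so they are not counted
dt[, paste0(cols, ".count") := lapply(.SD, function(x) uniqueN(x[!is.na(x)])),
   .SDcols = cols, by = ID]
```

This should give the same counts as the length(unique(na.omit(x))) version.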

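Explanation

This is a general solution that will work on any number of columns. All you need to do is put the column names into cols.

lapply(.SD, calls a function over the columns specified in .SDcols = cols
paste0(cols, ".count") creates the new column names by appending .count to the names in cols
:= performs assignment by reference, meaning it updates the newly created columns in place with the output of lapply(.SD,
the by argument specifies the column(s) to aggregate by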
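To make the length versus length(unique()) distinction concrete, here is a small sketch on made-up data with two IDs and a repeated SMS.Type (the data and the .distinct column suffix are assumptions for illustration, not from the question):

```r
library(data.table)

# Hypothetical data: two IDs, a repeated SMS.Type and some missing replies
dt <- data.table(
  ID         = c(1, 1, 1, 2),
  SMS.Type   = c("DF1", "DF1", "WCB14", "DF1"),
  Reply.Date = c(NA, "13/02/2015 09:52", "14/02/2015 10:00", NA)
)

cols <- c("SMS.Type", "Reply.Date")

# Non-missing values per ID (repeats are counted every time)
dt[, paste0(cols, ".count") := lapply(.SD, function(x) sum(!is.na(x))),
   .SDcols = cols, by = ID]

# Distinct non-missing values per ID (repeats are counted once)
dt[, paste0(cols, ".distinct") := lapply(.SD, function(x) length(unique(na.omit(x)))),
   .SDcols = cols, by = ID]
```

For ID 1, SMS.Type.count is 3 while SMS.Type.distinct is 2, because "DF1" appears twice; which of the two you want depends on what "count" means for your data.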

Answer 2

After converting the empty strings to NA:

library(dplyr)
mutate(df, SMS.Type.count   = sum(!is.na(SMS.Type)),
           Reply.Date.count = sum(!is.na(Reply.Date)))
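Note that the mutate call above counts over the whole data frame. If you have several IDs and want the counts per ID, something along these lines might work; it is only a sketch, which also does the empty-string conversion with na_if (`result` is just an illustrative name):

```r
library(dplyr)

# Toy data from the question (Reply.Date still contains an empty string)
df <- data.frame(
  ID         = c(1045937900, 1045937900),
  SMS.Type   = c("DF1", "WCB14"),
  SMS.Date   = c("12/02/2015 19:51", "13/02/2015 08:38"),
  Reply.Date = c("", "13/02/2015 09:52"),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Turn empty strings into real NA values
  mutate(Reply.Date = na_if(Reply.Date, "")) %>%
  # Count within each ID rather than over the whole data frame
  group_by(ID) %>%
  mutate(SMS.Type.count   = sum(!is.na(SMS.Type)),
         Reply.Date.count = sum(!is.na(Reply.Date))) %>%
  ungroup()
```

With a single ID, as in the toy data, the grouped and ungrouped versions give the same counts.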
