R group by and count

Question

I am working with the following dataset:

      Id     Date           Color
      10     2008-11-17     Red
      10     2008-11-17     Red
      10     2008-11-17     Blue
      10     2010-01-26     Red
      10     2010-01-26     Green
      10     2010-01-26     Green
      10     2010-01-26     Red
      29     2007-07-31     Red
      29     2007-07-31     Red
      29     2007-07-31     Blue
      29     2007-07-31     Green
      29     2007-07-31     Red

My goal is to create a dataset like this:

     Color      Representation      Count            Min   Max
     Red        1 + 1 + 1 = 3       2 + 2 + 3 = 7    2     3
     Blue       1 + 1     = 2       1 + 1     = 2    1     1
     Green      1 + 1     = 2       2 + 1     = 3    1     2

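Representation

The value in the 1st row, 2nd column (Representation) is 3 because Red appears in three unique combinations of Id and Date. Rows 1 and 2 share Id 10 and Date 2008-11-17, so that combination is represented once (1_{(10, 2008-11-17)}). Rows 4 and 7 share Id 10 and Date 2010-01-26, so that unique combination is also represented once (1_{(10, 2010-01-26)}). Rows 8, 9 and 12 all share Id 29 and Date 2007-07-31 and are likewise represented once (1_{(29, 2007-07-31)}). So the value in row 1, column 2 is 3.

1_{(10, 2008-11-17)} + 1_{(10, 2010-01-26)} + 1_{(29, 2007-07-31)} = 3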

Count

The value in the 1st row, 3rd column (Count) is 7 because Red is mentioned twice by Id 10 on 2008-11-17 (2_{(10, 2008-11-17)}), twice by Id 10 on 2010-01-26 (2_{(10, 2010-01-26)}), and three times by Id 29 on 2007-07-31 (3_{(29, 2007-07-31)}).

2_{(10, 2008-11-17)} + 2_{(10, 2010-01-26)} + 3_{(29, 2007-07-31)} = 7
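
To make the two definitions concrete, the Red row can be checked by hand. This is only a minimal base R sketch: it rebuilds the data shown above, and the names `red` and `per_group` are introduced here purely for illustration.

```r
# Same data as in the table above
Id    <- c(10, 10, 10, 10, 10, 10, 10, 29, 29, 29, 29, 29)
Date  <- c("2008-11-17", "2008-11-17", "2008-11-17",
           "2010-01-26", "2010-01-26", "2010-01-26", "2010-01-26",
           "2007-07-31", "2007-07-31", "2007-07-31", "2007-07-31", "2007-07-31")
Color <- c("Red", "Red", "Blue", "Red", "Green", "Green", "Red",
           "Red", "Red", "Blue", "Green", "Red")
df <- data.frame(Id, Date, Color)

# Keep only the Red rows (illustrative helper objects, not part of the question)
red <- df[df$Color == "Red", ]

# Number of Red rows within each unique (Id, Date) combination
per_group <- table(paste(red$Id, red$Date))
per_group

length(per_group)  # Representation: number of unique (Id, Date) combinations
sum(per_group)     # Count: total number of Red rows
range(per_group)   # Min and Max of the per-combination counts
```

For Red this gives per-combination counts of 2, 2 and 3, which is where the values 3, 7, 2 and 3 in the target table come from.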

Any help with building this unique frequency/count table is much appreciated.

Dataset

Id   = c(10,10,10,10,10,10,10,29,29,29,29,29) 
Date = c("2008-11-17", "2008-11-17", "2008-11-17","2010-01-26","2010-01-26","2010-01-26","2010-01-26",
         "2007-07-31","2007-07-31","2007-07-31","2007-07-31","2007-07-31") 
Color = c("Red", "Red", "Blue", "Red", "Green", "Green", "Red", "Red", "Red", "Blue", "Green", "Red") 
df = data.frame(Id, Date, Color)

Answer 1

With dplyr:

library(dplyr)
df %>% group_by(Color) %>%
    summarize(Representation = n_distinct(Id, Date), Count = n())
# # A tibble: 3 × 3
#    Color Representation Count
#   <fctr>          <int> <int>
# 1   Blue              2     2
# 2  Green              2     3
# 3    Red              3     7

Answer 2

You can use the aggregate() function:

# Make a new column joining Date and Id (what you want to base the counts on)
df$DateId <- paste(df$Date, df$Id)

# Get the representation values
Representation <- aggregate(DateId ~ Color, data=df,FUN=function(x){length(unique(x))})
Representation
#>   Color DateId
#> 1  Blue      2
#> 2 Green      2
#> 3   Red      3

# Get the Count values
Count <- aggregate(DateId ~ Color, data=df,FUN=length)
Count
#>   Color DateId
#> 1  Blue      2
#> 2 Green      3
#> 3   Red      7
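
If a single table is wanted, the two results can probably be combined with merge(). The following is a minimal sketch in base R that repeats the steps above on the question's data; the object name `result` and the renamed columns are choices made here, not part of the original answer.

```r
# Same data as in the question
Id    <- c(10, 10, 10, 10, 10, 10, 10, 29, 29, 29, 29, 29)
Date  <- c("2008-11-17", "2008-11-17", "2008-11-17",
           "2010-01-26", "2010-01-26", "2010-01-26", "2010-01-26",
           "2007-07-31", "2007-07-31", "2007-07-31", "2007-07-31", "2007-07-31")
Color <- c("Red", "Red", "Blue", "Red", "Green", "Green", "Red",
           "Red", "Red", "Blue", "Green", "Red")
df <- data.frame(Id, Date, Color)
df$DateId <- paste(df$Date, df$Id)

# The two aggregations from the answer above
Representation <- aggregate(DateId ~ Color, data = df,
                            FUN = function(x) length(unique(x)))
Count <- aggregate(DateId ~ Color, data = df, FUN = length)

# Give each value column a descriptive name, then join on Color
names(Representation)[2] <- "Representation"
names(Count)[2] <- "Count"
result <- merge(Representation, Count, by = "Color")
result
```

Renaming before the merge avoids the `DateId.x` / `DateId.y` suffixes that merge() would otherwise add to the two identically named columns.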

Answer 3

Another option is data.table:

library(data.table)
setDT(df)[, .(Representation = uniqueN(paste(Id, Date)), Count = .N) , by = Color]
#     Color Representation Count
#1:   Red              3     7
#2:  Blue              2     2
#3: Green              2     3

Update

For the second question (the Min and Max columns), we can try:

library(matrixStats)
# factor() keeps every color level in each table, so sapply() returns a matrix
m1 <- sapply(split(factor(df[["Color"]]), list(df$Id, df$Date), drop = TRUE),  function(x) table(x))
v1 <- (NA^!m1) * m1
df1 <- data.frame(Color = row.names(m1), Representation = rowSums(m1!=0), 
   Count = rowSums(m1), Min = rowMins(v1, na.rm=TRUE),
    Max = rowMaxs(v1, na.rm=TRUE))
row.names(df1) <- NULL
df1
#   Color Representation Count Min Max
#1  Blue              2     2   1   1
#2 Green              2     3   1   2
#3   Red              3     7   2   3
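
The `(NA^!m1) * m1` step is terse, so here is a small sketch of what it appears to do: it turns zero counts into NA so that they are ignored when taking the row minimum. The matrix `m` below is a made-up example (not the question's data), and base `apply()` is used in place of matrixStats so the snippet has no dependencies.

```r
# Hypothetical count matrix: rows are colors, columns are (Id, Date) groups
m <- matrix(c(1, 2, 0, 2, 1, 3), nrow = 2,
            dimnames = list(c("Blue", "Red"), c("g1", "g2", "g3")))

# !m is TRUE where the count is 0; NA^TRUE is NA and NA^FALSE is 1,
# so the product keeps non-zero counts and replaces zeros with NA
v <- (NA^!m) * m
v

# A more explicit way to express the same replacement
v2 <- m
v2[v2 == 0] <- NA

# Row-wise minimum and maximum, ignoring the NA (originally zero) cells
row_min <- apply(v, 1, min, na.rm = TRUE)
row_max <- apply(v, 1, max, na.rm = TRUE)
row_min
row_max
```

Without this step the minimum for a color that is absent from some (Id, Date) group would be 0 rather than the smallest non-zero count.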
