R: group by and count
I am working with a dataset like the following:
Id Date Color
10 2008-11-17 Red
10 2008-11-17 Red
10 2008-11-17 Blue
10 2010-01-26 Red
10 2010-01-26 Green
10 2010-01-26 Green
10 2010-01-26 Red
29 2007-07-31 Red
29 2007-07-31 Red
29 2007-07-31 Blue
29 2007-07-31 Green
29 2007-07-31 Red
My goal is to create a dataset like this:
Color   Representation   Count           Min   Max
Red     1 + 1 + 1 = 3    2 + 2 + 3 = 7   2     3
Blue    1 + 1 = 2        1 + 1 = 2       1     1
Green   1 + 1 = 2        2 + 1 = 3       1     2
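Representation
The value in row 1, column 2 (Representation) is 3 because Red appears in three unique combinations of Id and Date. Rows 1 and 2 share the same Id (10) and Date (2008-11-17), so that combination is represented once: 1(10, 2008-11-17). Rows 4 and 7 share the same Id (10) and Date (2010-01-26), so that combination is also represented once: 1(10, 2010-01-26). Rows 8, 9 and 12 all share the same Id (29) and Date (2007-07-31) and are likewise represented once: 1(29, 2007-07-31). The value in row 1, column 2 is therefore 3.
1(10, 2008-11-17) + 1(10, 2010-01-26) + 1(29, 2007-07-31) = 3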
Count
The value in row 1, column 3 (Count) is 7 because Red is mentioned twice by Id 10 on 2008-11-17, 2(10, 2008-11-17), twice by Id 10 on 2010-01-26, 2(10, 2010-01-26), and three times by Id 29 on 2007-07-31, 3(29, 2007-07-31).
2(10, 2008-11-17) + 2(10, 2010-01-26) + 3(29, 2007-07-31) = 7
Any help with building this unique frequency/count table is much appreciated.
Dataset
Id = c(10,10,10,10,10,10,10,29,29,29,29,29)
Date = c("2008-11-17", "2008-11-17", "2008-11-17","2010-01-26","2010-01-26","2010-01-26","2010-01-26",
"2007-07-31","2007-07-31","2007-07-31","2007-07-31","2007-07-31")
Color = c("Red", "Red", "Blue", "Red", "Green", "Green", "Red", "Red", "Red", "Blue", "Green", "Red")
df = data.frame(Id, Date, Color)
With dplyr:
library(dplyr)
df %>% group_by(Color) %>%
summarize(Representation = n_distinct(Id, Date), Count = n())
# # A tibble: 3 × 3
# Color Representation Count
# <fctr> <int> <int>
# 1 Blue 2 2
# 2 Green 2 3
# 3 Red 3 7
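The question's target table also has Min and Max columns, which the summary above does not produce. A minimal sketch of one way to get them with dplyr, assuming a version that supports the `name` argument of `count()`: first count rows per Color within each Id/Date pair, then summarise those per-pair counts by Color. The object name `res` is only illustrative.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  Id = c(10, 10, 10, 10, 10, 10, 10, 29, 29, 29, 29, 29),
  Date = c(rep("2008-11-17", 3), rep("2010-01-26", 4), rep("2007-07-31", 5)),
  Color = c("Red", "Red", "Blue", "Red", "Green", "Green", "Red",
            "Red", "Red", "Blue", "Green", "Red")
)

res <- df %>%
  # one row per Color/Id/Date combination, n = rows in that combination
  count(Color, Id, Date, name = "n") %>%
  group_by(Color) %>%
  summarise(
    Representation = n(),   # number of distinct Id/Date pairs for the color
    Count = sum(n),         # total rows for the color
    Min = min(n),           # smallest per-pair count
    Max = max(n)            # largest per-pair count
  )

res
```

Because every Id/Date pair is reduced to a single row before the second summary, Representation is simply the number of rows left per color, so `n_distinct()` is no longer needed.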
You can use the aggregate() function:
# Make a new column joining Date and Id (the key the counts are based on)
df$DateId <- paste(df$Date, df$Id)
# Get the representation values
Representation <- aggregate(DateId ~ Color, data=df,FUN=function(x){length(unique(x))})
Representation
#> Color DateId
#> 1 Blue 2
#> 2 Green 2
#> 3 Red 3
# Get the Count values
Count <- aggregate(DateId ~ Color, data=df,FUN=length)
Count
#> Color DateId
#> 1 Blue 2
#> 2 Green 3
#> 3 Red 7
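The two aggregate() results come back as separate data frames that both name their value column DateId. A small sketch of how they might be combined into one table with base R's merge(), after renaming the value columns; the name `combined` is just for illustration.

```r
# Same data as in the question
df <- data.frame(
  Id = c(10, 10, 10, 10, 10, 10, 10, 29, 29, 29, 29, 29),
  Date = c(rep("2008-11-17", 3), rep("2010-01-26", 4), rep("2007-07-31", 5)),
  Color = c("Red", "Red", "Blue", "Red", "Green", "Green", "Red",
            "Red", "Red", "Blue", "Green", "Red")
)
df$DateId <- paste(df$Date, df$Id)

Representation <- aggregate(DateId ~ Color, data = df,
                            FUN = function(x) length(unique(x)))
Count <- aggregate(DateId ~ Color, data = df, FUN = length)

# Give each value column a meaningful name before joining
names(Representation)[2] <- "Representation"
names(Count)[2] <- "Count"

# Join the two summaries on Color
combined <- merge(Representation, Count, by = "Color")
combined
```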
Another option is data.table:
library(data.table)
setDT(df)[, .(Representation = uniqueN(paste(Id, Date)), Count = .N) , by = Color]
# Color Representation Count
#1: Red 3 7
#2: Blue 2 2
#3: Green 2 3
Update
For the second question (the Min and Max columns), we can try:
library(matrixStats)
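# Count each color within every Id/Date pair; factor() keeps all color levels
# in every group so sapply() returns a matrix (colors in rows, pairs in columns)
m1 <- sapply(split(factor(df$Color), list(df$Id, df$Date), drop = TRUE), table)
# Turn zero counts into NA so they are ignored by rowMins()/rowMaxs()
v1 <- (NA^!m1) * m1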
df1 <- data.frame(Color = row.names(m1), Representation = rowSums(m1!=0),
Count = rowSums(m1), Min = rowMins(v1, na.rm=TRUE),
Max = rowMaxs(v1, na.rm=TRUE))
row.names(df1) <- NULL
df1
# Color Representation Count Min Max
#1 Blue 2 2 1 1
#2 Green 2 3 1 2
#3 Red 3 7 2 3
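If adding matrixStats as a dependency is not desirable, the same table can probably be built with base R alone. The following is a sketch under that assumption, not the answer's original method: it cross-tabulates Color against the pasted Id/Date key and summarises each row. The names `tab` and `df1` are illustrative.

```r
# Same data as in the question
df <- data.frame(
  Id = c(10, 10, 10, 10, 10, 10, 10, 29, 29, 29, 29, 29),
  Date = c(rep("2008-11-17", 3), rep("2010-01-26", 4), rep("2007-07-31", 5)),
  Color = c("Red", "Red", "Blue", "Red", "Green", "Green", "Red",
            "Red", "Red", "Blue", "Green", "Red")
)

# Rows are colors, columns are Id/Date pairs, cells are row counts
tab <- table(df$Color, paste(df$Id, df$Date))

df1 <- data.frame(
  Color = rownames(tab),
  Representation = rowSums(tab > 0),                # pairs where the color occurs
  Count = rowSums(tab),                             # total rows for the color
  Min = apply(tab, 1, function(x) min(x[x > 0])),   # smallest non-zero count
  Max = apply(tab, 1, max),                         # largest count
  row.names = NULL
)
df1
```

Zero cells are excluded from Min here by subsetting to `x > 0`, which plays the same role as the NA masking in the matrixStats version.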