R: counting the frequency of a categorical variable (conditional on date)
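I have three columns: "Name", "Success" (a dummy variable), and "Date". For each name, I want to check that name's past success.
So, for example, if the name "Peter" appears three times, then for each occurrence I want to count the rows where Name is "Peter", Success == 1, and the Date is earlier.
Here is an example of the output I need in the "Past Success" column: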
Name   Success  Date  Past Success
David  1        2018  1
Peter  0        2017  3
Peter  1        2016  2
David  1        2017  0
Peter  1        2015  1
Peter  0        2010  1
Peter  1        2005  0
Peter  NA       2004  0
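Is there a quick way to do this?
I also need it to be very fast, because my data is large.
What I did was sort the data by name and date, then check each observation against the previous 100 observations (since a name appears at most 100 times).
Please advise if there is a better way.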
Try this data.table approach:
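library(data.table)

data <- data.table(
  Name    = rep(c("David", "Peter", "David", "Peter"), c(1, 2, 1, 4)),
  Success = c(1, 0, 1, 1, 1, 0, 1, NA),
  Date    = c(2018, 2017, 2016, 2017, 2015, 2010, 2005, 2004)
)
# Sort chronologically so the cumulative sum runs from oldest to newest
data <- data[order(Date)]
# Running count of successes per name; rows where Success is 0 or NA are left as NA
data[Success == 1, "Past Success" := cumsum(Success), by = 'Name']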
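Note that the "Past Success" column in the question counts only successes at strictly earlier dates, whereas a plain cumsum also counts the current row. A minimal sketch of one way to get the strictly-earlier count is to subtract the current row's value from the running total. The names `dt`, `s` and `past_success` are just illustrative, and this assumes that a missing Success should count as 0 and that dates are unique within a name (ties would need extra handling).

```r
library(data.table)

# Same toy data as in the question.
dt <- data.table(
  Name    = rep(c("David", "Peter", "David", "Peter"), c(1, 2, 1, 4)),
  Success = c(1, 0, 1, 1, 1, 0, 1, NA),
  Date    = c(2018, 2017, 2016, 2017, 2015, 2010, 2005, 2004)
)

# Sort chronologically within each name (setorder sorts by reference).
setorder(dt, Name, Date)

# Helper column: treat a missing outcome as "no success" (an assumption).
dt[, s := replace(Success, is.na(Success), 0)]

# Running total minus the current row = successes strictly before this row.
dt[, past_success := cumsum(s) - s, by = Name]

# Drop the helper column; the original Success (including NA) is untouched.
dt[, s := NULL]
dt
```

On the toy data this should reproduce the values from the question's table, just listed in Name/Date order rather than the original row order.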
Here are two approaches. One is almost the same as @FALL Gora's, but the other uses base R.
# these two steps are assuming you have data.table
# modify them accordingly if you have data.frame
data <- data[order(Name, Date)]
data[is.na(Success), Success := 0]
### tapply
data$past_success <- unlist(with(data, tapply(Success, Name, cumsum)))
### data.table
data[, past_success_dt := cumsum(Success), by = Name]
data
Name Success Date past_success past_success_dt
1: David 1 2017 1 1
2: David 1 2018 2 2
3: Peter 0 2004 0 0
4: Peter 1 2005 1 1
5: Peter 0 2010 1 1
6: Peter 1 2015 2 2
7: Peter 1 2016 3 3
8: Peter 0 2017 3 3
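As a side note, `unlist(tapply(...))` returns the values group by group, so it only lines up with the rows because the data was sorted by Name first. A rough base R alternative, assuming a plain data.frame (called `df` here purely for illustration), is `ave()`, which writes each group's result back into that group's own rows, so sorting by Date alone is enough:

```r
# Same toy data as in the question, as a plain data.frame.
df <- data.frame(
  Name    = rep(c("David", "Peter", "David", "Peter"), c(1, 2, 1, 4)),
  Success = c(1, 0, 1, 1, 1, 0, 1, NA),
  Date    = c(2018, 2017, 2016, 2017, 2015, 2010, 2005, 2004)
)

# Sort by date only; rows of different names may stay interleaved.
df <- df[order(df$Date), ]

# Treat a missing outcome as 0, as in the answer above.
df$Success[is.na(df$Success)] <- 0

# ave() applies cumsum within each Name and returns the values in row order.
df$past_success <- ave(df$Success, df$Name, FUN = cumsum)
df
```

Like the code above, this counts the current row as well; subtracting `df$Success` from the result would give the count of strictly earlier successes.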
For the record: a dplyr approach for data frames
library(tidyverse)

data <- data %>%
  arrange(Name, Date) %>%
  group_by(Name) %>%
  mutate(Success = replace_na(Success, 0),
         PastSuccess = cumsum(Success))
data
# A tibble: 8 x 4
# Groups: Name [2]
Name Success Date PastSuccess
<fct> <dbl> <dbl> <dbl>
1 David 1 2017 1
2 David 1 2018 2
3 Peter 0 2004 0
4 Peter 1 2005 1
5 Peter 0 2010 1
6 Peter 1 2015 2
7 Peter 1 2016 3
8 Peter 0 2017 3