r - 生成开始日期和结束日期之间的累计总和、总和和唯一标识符
r - Generating cumulative sum, total sum, and unique identifiers between start and end dates
我想做什么
我有一个美国抗议事件的数据集。有些事件是独立事件,而另一些事件则日复一日地持续存在(“多日事件”)。我的数据集是按日构建的,因此为期三天的多日活动分布在三行中。
我想完成以下任务:
- 在任何给定的多日活动中创建迄今为止天数的累计总和。具体来说,我想计算任何链接事件的“第一天”和“最后一天” 之间的天数 。
- 将每个多事件的总天数作为变量
- 通过将抗议发生的州和每个州从 1 开始并向上延伸的连续标识号连接起来,为每个多日事件“命名”。
数据
这是一个可重现的例子:
# Library
library(tidyverse) # Brings in dplyr
# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15",
"2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
"2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
"2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))
# Now create some lags and leads
test <- test %>%
group_by(state) %>%
mutate(date_lag = lag(date),
date_lead = lead(date),
days_last = date - date_lag,
days_next = date_lead - date,
link_last = if_else(days_last <= 1, 1, 0),
link_next = if_else(days_next <= 1, 1, 0),
sequence = if_else(link_last == 0 & link_next == 1, "First day",
if_else(is.na(link_last) == TRUE & link_next == 1, "First day",
if_else(link_last == 1 & link_next == 1, "Ongoing",
if_else(link_last == 1 & link_next == 0, "Last day",
if_else(link_last == 1 & is.na(link_next)==TRUE, "Last day", "Not linked"))))))
这会生成以下数据帧:
state date date_lag date_lead days_last days_next link_last link_next sequence
<chr> <date> <date> <date> <drtn> <drtn> <dbl> <dbl> <chr>
1 Washington 2021-01-01 NA 2021-01-03 NA days 2 days NA 0 NA
2 Washington 2021-01-03 2021-01-01 2021-01-04 2 days 1 days 0 1 First day
3 Washington 2021-01-04 2021-01-03 2021-01-10 1 days 6 days 1 0 Last day
4 Washington 2021-01-10 2021-01-04 2021-01-15 6 days 5 days 0 0 Not linked
5 Washington 2021-01-15 2021-01-10 2021-01-16 5 days 1 days 0 1 First day
6 Washington 2021-01-16 2021-01-15 2021-01-17 1 days 1 days 1 1 Ongoing
7 Washington 2021-01-17 2021-01-16 2021-01-18 1 days 1 days 1 1 Ongoing
8 Washington 2021-01-18 2021-01-17 2021-01-19 1 days 1 days 1 1 Ongoing
9 Washington 2021-01-19 2021-01-18 2021-01-28 1 days 9 days 1 0 Last day
10 Washington 2021-01-28 2021-01-19 NA 9 days NA days 0 NA NA
11 Idaho 2021-01-12 NA 2021-01-13 NA days 1 days NA 1 NA
12 Idaho 2021-01-13 2021-01-12 2021-01-14 1 days 1 days 1 1 Ongoing
13 Idaho 2021-01-14 2021-01-13 2021-02-01 1 days 18 days 1 0 Last day
14 Idaho 2021-02-01 2021-01-14 2021-02-03 18 days 2 days 0 0 Not linked
15 Idaho 2021-02-03 2021-02-01 2021-02-04 2 days 1 days 0 1 First day
16 Idaho 2021-02-04 2021-02-03 2021-02-05 1 days 1 days 1 1 Ongoing
17 Idaho 2021-02-05 2021-02-04 2021-02-08 1 days 3 days 1 0 Last day
18 Idaho 2021-02-08 2021-02-05 2021-02-10 3 days 2 days 0 0 Not linked
19 Idaho 2021-02-10 2021-02-08 2021-02-14 2 days 4 days 0 0 Not linked
20 Idaho 2021-02-14 2021-02-10 NA 4 days NA days 0 NA NA
我要创建的内容:
state date date_lag date_lead days_last days_next link_last link_next sequence cumulative duration name
<chr> <date> <date> <date> <drtn> <drtn> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 Washington 2021-01-01 NA 2021-01-03 NA days 2 days NA 0 NA NA 0 NA
2 Washington 2021-01-03 2021-01-01 2021-01-04 2 days 1 days 0 1 First day 1 2 Washington.1
3 Washington 2021-01-04 2021-01-03 2021-01-10 1 days 6 days 1 0 Last day 2 2 Washington.1
4 Washington 2021-01-10 2021-01-04 2021-01-15 6 days 5 days 0 0 Not linked NA 0 NA
5 Washington 2021-01-15 2021-01-10 2021-01-16 5 days 1 days 0 1 First day 1 5 Washington.2
6 Washington 2021-01-16 2021-01-15 2021-01-17 1 days 1 days 1 1 Ongoing 2 5 Washington.2
7 Washington 2021-01-17 2021-01-16 2021-01-18 1 days 1 days 1 1 Ongoing 3 5 Washington.2
8 Washington 2021-01-18 2021-01-17 2021-01-19 1 days 1 days 1 1 Ongoing 4 5 Washington.2
9 Washington 2021-01-19 2021-01-18 2021-01-28 1 days 9 days 1 0 Last day 5 5 Washington.2
10 Washington 2021-01-28 2021-01-19 NA 9 days NA days 0 NA NA NA NA NA
11 Idaho 2021-01-12 NA 2021-01-13 NA days 1 days NA 1 NA 1 3 Idaho.1
12 Idaho 2021-01-13 2021-01-12 2021-01-14 1 days 1 days 1 1 Ongoing 2 3 Idaho.1
13 Idaho 2021-01-14 2021-01-13 2021-02-01 1 days 18 days 1 0 Last day 3 3 Idaho.1
14 Idaho 2021-02-01 2021-01-14 2021-02-03 18 days 2 days 0 0 Not linked NA NA NA
15 Idaho 2021-02-03 2021-02-01 2021-02-04 2 days 1 days 0 1 First day 1 3 Idaho.2
16 Idaho 2021-02-04 2021-02-03 2021-02-05 1 days 1 days 1 1 Ongoing 2 3 Idaho.2
17 Idaho 2021-02-05 2021-02-04 2021-02-08 1 days 3 days 1 0 Last day 3 3 Idaho.2
18 Idaho 2021-02-08 2021-02-05 2021-02-10 3 days 2 days 0 0 Not linked NA NA NA
19 Idaho 2021-02-10 2021-02-08 2021-02-14 2 days 4 days 0 0 Not linked NA NA NA
20 Idaho 2021-02-14 2021-02-10 NA 4 days NA days 0 NA NA NA NA NA
附带问题:为什么 test$sequence[11] 是 NA 而不是“第一天”?
我认为创建特定函数来进行计数比尝试在单个管道中完成所有操作更容易。
我在输出中保留了所有中间步骤和中间列,因此您可以看到每个步骤在做什么。您很可能不需要保留所有这些列,一旦您理解了该方法,您可能可以简化这些步骤。
library(tidyverse) # Brings in dplyr
# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15",
"2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
"2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
"2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))
event_count <- function(v){
cnt <- 0
result <- integer(length(v))
for(idx in seq_along(v)) {
if(v[idx]) {
cnt <- 0
} else {
cnt <- cnt + 1
}
result[idx] <- cnt
}
result
}
need_name <- function(cnt) {
result <- logical(length(cnt))
for(idx in seq_along(cnt)){
if(cnt[idx] == 0){
if(idx == length(cnt)){
result[idx] <- FALSE
break
}
result[idx] <- (cnt[idx + 1] != 0)
} else{
result[idx] <- TRUE
}
}
result
}
running_count <- function(v) {
cnt <- 0
flag <- FALSE
result <- integer(length(v))
for(idx in seq_along(v)){
if(v[idx]) {
if(!flag) {
cnt <- cnt + 1
flag <- !flag
}
result[idx] <- cnt
} else{
result[idx] <- 0
flag <- FALSE
}
}
result
}
test %>%
group_by(state) %>%
arrange(date, .by_group = TRUE) %>%
mutate(
duration = date - lag(date), # --- Compute durations
is_first = duration != 1 # --- Check if it is the first day of a protest
) %>%
replace_na(list(is_first = TRUE)) %>% # --- No more NAs
ungroup %>%
mutate(
cnt = event_count(is_first), # --- How many days does this event have?
need_name = need_name(cnt) # --- Should we name this event?
) %>%
group_by(state) %>%
mutate(
name_number = running_count(need_name) # --- What's the event count?
) %>%
mutate(
name = paste0(state, ".", name_number), # ---- Create names
cumulative = cnt + 1 # --- Start counting from one instead of zero
) %>%
group_by(name) %>%
mutate(
duration = max(duration) # --- Calc total duration
) %>%
ungroup() %>%
mutate( # --- Adding the NAs back
name = if_else(name_number == 0, NA_character_, name),
duration = if_else(name_number == 0, NA_integer_, as.integer(duration)),
cumulative = if_else(name_number == 0, NA_integer_, as.integer(cumulative)),
)
我不确定这些是您正在寻找的具体数字,但这代表了在我看来更简单、更惯用的 tidyverse 方法:
test %>%
group_by(state) %>%
mutate(days_last = as.numeric(date - lag(date)),
new_section = 1*(is.na(days_last) | days_last > 1), # EDIT
section = cumsum(new_section),
name = paste(state,section, sep = ".")) %>%
group_by(name) %>%
mutate(duration = as.numeric(max(date) - min(date) + 1),
sequence = case_when(duration == 1 ~ "Unlinked",
row_number() == 1 ~ "First Day",
row_number() == n() ~ "Last Day",
TRUE ~ "Ongoing")) %>%
ungroup()
在这里,我将任何超过一天的间隔标记为新事件,计算累计总和,并用它来定义每个事件的持续时间。
# A tibble: 20 x 8
state date days_last new_section section name duration sequence
<chr> <date> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 Washington 2021-01-01 NA 1 1 Washington.1 1 Unlinked
2 Washington 2021-01-03 2 1 2 Washington.2 2 First Day
3 Washington 2021-01-04 1 0 2 Washington.2 2 Last Day
4 Washington 2021-01-10 6 1 3 Washington.3 1 Unlinked
5 Washington 2021-01-15 5 1 4 Washington.4 5 First Day
6 Washington 2021-01-16 1 0 4 Washington.4 5 Ongoing
7 Washington 2021-01-17 1 0 4 Washington.4 5 Ongoing
8 Washington 2021-01-18 1 0 4 Washington.4 5 Ongoing
9 Washington 2021-01-19 1 0 4 Washington.4 5 Last Day
10 Washington 2021-01-28 9 1 5 Washington.5 1 Unlinked
11 Idaho 2021-01-12 NA 1 1 Idaho.1 3 First Day
12 Idaho 2021-01-13 1 0 1 Idaho.1 3 Ongoing
13 Idaho 2021-01-14 1 0 1 Idaho.1 3 Last Day
14 Idaho 2021-02-01 18 1 2 Idaho.2 1 Unlinked
15 Idaho 2021-02-03 2 1 3 Idaho.3 3 First Day
16 Idaho 2021-02-04 1 0 3 Idaho.3 3 Ongoing
17 Idaho 2021-02-05 1 0 3 Idaho.3 3 Last Day
18 Idaho 2021-02-08 3 1 4 Idaho.4 1 Unlinked
19 Idaho 2021-02-10 2 1 5 Idaho.5 1 Unlinked
20 Idaho 2021-02-14 4 1 6 Idaho.6 1 Unlinked
data.table::rleid
在这里很有用,可以根据条件 if days_last == 1
或 days_next == 1
(即连续日期)创建 运行 长度。如果您想要不同的事件长度,您可以编辑该条件。
library(dplyr)
library(data.table)
test %>%
dplyr::group_by(state) %>%
dplyr::mutate(days_last = c(NA, diff(date)),
days_next = as.numeric(lead(date) - date),
name = paste0(state, ".", data.table::rleid(days_last == 1 | days_next == 1))) %>%
dplyr::group_by(name) %>%
dplyr::mutate(sequence = case_when(
n() == 1 ~ "Not Linked",
row_number() == 1 ~ "First day",
n() == row_number() ~ "Last day",
T ~ "Ongoing"),
duration = n(),
cumulative = seq_along(name)) %>%
dplyr::ungroup()
输出
state date days_last days_next name sequence duration cumulative
<chr> <date> <dbl> <dbl> <chr> <chr> <int> <int>
1 Washington 2021-01-01 NA 2 Washington.1 Not Linked 1 1
2 Washington 2021-01-03 2 1 Washington.2 First day 2 1
3 Washington 2021-01-04 1 6 Washington.2 Last day 2 2
4 Washington 2021-01-10 6 5 Washington.3 Not Linked 1 1
5 Washington 2021-01-15 5 1 Washington.4 First day 5 1
6 Washington 2021-01-16 1 1 Washington.4 Ongoing 5 2
7 Washington 2021-01-17 1 1 Washington.4 Ongoing 5 3
8 Washington 2021-01-18 1 1 Washington.4 Ongoing 5 4
9 Washington 2021-01-19 1 9 Washington.4 Last day 5 5
10 Washington 2021-01-28 9 NA Washington.5 Not Linked 1 1
11 Idaho 2021-01-12 NA 1 Idaho.1 First day 3 1
12 Idaho 2021-01-13 1 1 Idaho.1 Ongoing 3 2
13 Idaho 2021-01-14 1 18 Idaho.1 Last day 3 3
14 Idaho 2021-02-01 18 2 Idaho.2 Not Linked 1 1
15 Idaho 2021-02-03 2 1 Idaho.3 First day 3 1
16 Idaho 2021-02-04 1 1 Idaho.3 Ongoing 3 2
17 Idaho 2021-02-05 1 3 Idaho.3 Last day 3 3
18 Idaho 2021-02-08 3 2 Idaho.4 First day 2 1
19 Idaho 2021-02-10 2 4 Idaho.4 Last day 2 2
20 Idaho 2021-02-14 4 NA Idaho.5 Not Linked 1 1
如果需要,您可以将 days_last
列中的 NA
用于其他行中的 NA
值。
Side question: Why is test$sequence[11] an NA and not "First day"?
通常,在 R NA
中传播,这意味着如果 NA
是评估的一部分,那么通常会返回 NA
。当您定义 sequence
时,您的第一个 ifelse
条件是 link_last == 0 & link_next == 1
。在第 11 行,link_last = NA
和 link_next = 1
。所以你评估的是:
NA == 0 & 1 == 1
[1] NA
相反,您的嵌套条件应该放在第一位。你的 ifelse
目前是如何写的嵌套条件是 not 被评估:
is.na(NA) & 1 == 1
[1] TRUE
这是一个data.table
方法。
library(data.table)
# Convert from data.frame to data.table
setDT(test)
# Subset the variables.
test2 <- test[, .(state, date, days_last = as.numeric(days_last),
days_next = as.numeric(days_next), sequence)]
# Code
test2[, name := paste0(state, '.', rleid(days_last == 1 | days_next == 1)),
by = state][
, ':='(duration = .N,
cumulative = seq(1:.N)),
by = name
][, c('days_next', 'days_last'):=NULL] # Removing these variables. Feel free to add back!
# Reorder the variables
test2 <- setcolorder(test2, c('state', 'name', 'date',
'sequence', 'duration',
'cumulative'))
# Print first 15 rows
print(test2[1:15,])
#> state name date sequence duration cumulative
#> 1: Washington Washington.1 2021-01-01 <NA> 1 1
#> 2: Washington Washington.2 2021-01-03 First day 2 1
#> 3: Washington Washington.2 2021-01-04 Last day 2 2
#> 4: Washington Washington.3 2021-01-10 Not linked 1 1
#> 5: Washington Washington.4 2021-01-15 First day 5 1
#> 6: Washington Washington.4 2021-01-16 Ongoing 5 2
#> 7: Washington Washington.4 2021-01-17 Ongoing 5 3
#> 8: Washington Washington.4 2021-01-18 Ongoing 5 4
#> 9: Washington Washington.4 2021-01-19 Last day 5 5
#> 10: Washington Washington.5 2021-01-28 <NA> 1 1
#> 11: Idaho Idaho.1 2021-01-12 <NA> 3 1
#> 12: Idaho Idaho.1 2021-01-13 Ongoing 3 2
#> 13: Idaho Idaho.1 2021-01-14 Last day 3 3
#> 14: Idaho Idaho.2 2021-02-01 Not linked 1 1
#> 15: Idaho Idaho.3 2021-02-03 First day 3 1
由 reprex package (v0.3.0)
于 2021 年 3 月 16 日创建
我想做什么
我有一个美国抗议事件的数据集。有些事件是独立事件,而另一些事件则日复一日地持续存在(“多日事件”)。我的数据集是按日构建的,因此为期三天的多日活动分布在三行中。
我想完成以下任务:
- 在任何给定的多日活动中创建迄今为止天数的累计总和。具体来说,我想计算任何链接事件的“第一天”和“最后一天” 之间的天数 。
- 将每个多事件的总天数作为变量
- 通过将抗议发生的州和每个州从 1 开始并向上延伸的连续标识号连接起来,为每个多日事件“命名”。
数据
这是一个可重现的例子:
# Library
library(tidyverse) # Brings in dplyr
# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15",
"2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
"2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
"2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))
# Now create some lags and leads
test <- test %>%
group_by(state) %>%
mutate(date_lag = lag(date),
date_lead = lead(date),
days_last = date - date_lag,
days_next = date_lead - date,
link_last = if_else(days_last <= 1, 1, 0),
link_next = if_else(days_next <= 1, 1, 0),
sequence = if_else(link_last == 0 & link_next == 1, "First day",
if_else(is.na(link_last) == TRUE & link_next == 1, "First day",
if_else(link_last == 1 & link_next == 1, "Ongoing",
if_else(link_last == 1 & link_next == 0, "Last day",
if_else(link_last == 1 & is.na(link_next)==TRUE, "Last day", "Not linked"))))))
这会生成以下数据帧:
state date date_lag date_lead days_last days_next link_last link_next sequence
<chr> <date> <date> <date> <drtn> <drtn> <dbl> <dbl> <chr>
1 Washington 2021-01-01 NA 2021-01-03 NA days 2 days NA 0 NA
2 Washington 2021-01-03 2021-01-01 2021-01-04 2 days 1 days 0 1 First day
3 Washington 2021-01-04 2021-01-03 2021-01-10 1 days 6 days 1 0 Last day
4 Washington 2021-01-10 2021-01-04 2021-01-15 6 days 5 days 0 0 Not linked
5 Washington 2021-01-15 2021-01-10 2021-01-16 5 days 1 days 0 1 First day
6 Washington 2021-01-16 2021-01-15 2021-01-17 1 days 1 days 1 1 Ongoing
7 Washington 2021-01-17 2021-01-16 2021-01-18 1 days 1 days 1 1 Ongoing
8 Washington 2021-01-18 2021-01-17 2021-01-19 1 days 1 days 1 1 Ongoing
9 Washington 2021-01-19 2021-01-18 2021-01-28 1 days 9 days 1 0 Last day
10 Washington 2021-01-28 2021-01-19 NA 9 days NA days 0 NA NA
11 Idaho 2021-01-12 NA 2021-01-13 NA days 1 days NA 1 NA
12 Idaho 2021-01-13 2021-01-12 2021-01-14 1 days 1 days 1 1 Ongoing
13 Idaho 2021-01-14 2021-01-13 2021-02-01 1 days 18 days 1 0 Last day
14 Idaho 2021-02-01 2021-01-14 2021-02-03 18 days 2 days 0 0 Not linked
15 Idaho 2021-02-03 2021-02-01 2021-02-04 2 days 1 days 0 1 First day
16 Idaho 2021-02-04 2021-02-03 2021-02-05 1 days 1 days 1 1 Ongoing
17 Idaho 2021-02-05 2021-02-04 2021-02-08 1 days 3 days 1 0 Last day
18 Idaho 2021-02-08 2021-02-05 2021-02-10 3 days 2 days 0 0 Not linked
19 Idaho 2021-02-10 2021-02-08 2021-02-14 2 days 4 days 0 0 Not linked
20 Idaho 2021-02-14 2021-02-10 NA 4 days NA days 0 NA NA
我要创建的内容:
state date date_lag date_lead days_last days_next link_last link_next sequence cumulative duration name
<chr> <date> <date> <date> <drtn> <drtn> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 Washington 2021-01-01 NA 2021-01-03 NA days 2 days NA 0 NA NA 0 NA
2 Washington 2021-01-03 2021-01-01 2021-01-04 2 days 1 days 0 1 First day 1 2 Washington.1
3 Washington 2021-01-04 2021-01-03 2021-01-10 1 days 6 days 1 0 Last day 2 2 Washington.1
4 Washington 2021-01-10 2021-01-04 2021-01-15 6 days 5 days 0 0 Not linked NA 0 NA
5 Washington 2021-01-15 2021-01-10 2021-01-16 5 days 1 days 0 1 First day 1 5 Washington.2
6 Washington 2021-01-16 2021-01-15 2021-01-17 1 days 1 days 1 1 Ongoing 2 5 Washington.2
7 Washington 2021-01-17 2021-01-16 2021-01-18 1 days 1 days 1 1 Ongoing 3 5 Washington.2
8 Washington 2021-01-18 2021-01-17 2021-01-19 1 days 1 days 1 1 Ongoing 4 5 Washington.2
9 Washington 2021-01-19 2021-01-18 2021-01-28 1 days 9 days 1 0 Last day 5 5 Washington.2
10 Washington 2021-01-28 2021-01-19 NA 9 days NA days 0 NA NA NA NA NA
11 Idaho 2021-01-12 NA 2021-01-13 NA days 1 days NA 1 NA 1 3 Idaho.1
12 Idaho 2021-01-13 2021-01-12 2021-01-14 1 days 1 days 1 1 Ongoing 2 3 Idaho.1
13 Idaho 2021-01-14 2021-01-13 2021-02-01 1 days 18 days 1 0 Last day 3 3 Idaho.1
14 Idaho 2021-02-01 2021-01-14 2021-02-03 18 days 2 days 0 0 Not linked NA NA NA
15 Idaho 2021-02-03 2021-02-01 2021-02-04 2 days 1 days 0 1 First day 1 3 Idaho.2
16 Idaho 2021-02-04 2021-02-03 2021-02-05 1 days 1 days 1 1 Ongoing 2 3 Idaho.2
17 Idaho 2021-02-05 2021-02-04 2021-02-08 1 days 3 days 1 0 Last day 3 3 Idaho.2
18 Idaho 2021-02-08 2021-02-05 2021-02-10 3 days 2 days 0 0 Not linked NA NA NA
19 Idaho 2021-02-10 2021-02-08 2021-02-14 2 days 4 days 0 0 Not linked NA NA NA
20 Idaho 2021-02-14 2021-02-10 NA 4 days NA days 0 NA NA NA NA NA
附带问题:为什么 test$sequence[11] 是 NA 而不是“第一天”?
我认为创建特定函数来进行计数比尝试在单个管道中完成所有操作更容易。
我在输出中保留了所有中间步骤和中间列,因此您可以看到每个步骤在做什么。您很可能不需要保留所有这些列,一旦您理解了该方法,您可能可以简化这些步骤。
library(tidyverse) # Brings in dplyr
# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15",
"2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
"2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
"2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))
event_count <- function(v){
cnt <- 0
result <- integer(length(v))
for(idx in seq_along(v)) {
if(v[idx]) {
cnt <- 0
} else {
cnt <- cnt + 1
}
result[idx] <- cnt
}
result
}
need_name <- function(cnt) {
result <- logical(length(cnt))
for(idx in seq_along(cnt)){
if(cnt[idx] == 0){
if(idx == length(cnt)){
result[idx] <- FALSE
break
}
result[idx] <- (cnt[idx + 1] != 0)
} else{
result[idx] <- TRUE
}
}
result
}
running_count <- function(v) {
cnt <- 0
flag <- FALSE
result <- integer(length(v))
for(idx in seq_along(v)){
if(v[idx]) {
if(!flag) {
cnt <- cnt + 1
flag <- !flag
}
result[idx] <- cnt
} else{
result[idx] <- 0
flag <- FALSE
}
}
result
}
test %>%
group_by(state) %>%
arrange(date, .by_group = TRUE) %>%
mutate(
duration = date - lag(date), # --- Compute durations
is_first = duration != 1 # --- Check if it is the first day of a protest
) %>%
replace_na(list(is_first = TRUE)) %>% # --- No more NAs
ungroup %>%
mutate(
cnt = event_count(is_first), # --- How many days does this event have?
need_name = need_name(cnt) # --- Should we name this event?
) %>%
group_by(state) %>%
mutate(
name_number = running_count(need_name) # --- What's the event count?
) %>%
mutate(
name = paste0(state, ".", name_number), # ---- Create names
cumulative = cnt + 1 # --- Start counting from one instead of zero
) %>%
group_by(name) %>%
mutate(
duration = max(duration) # --- Calc total duration
) %>%
ungroup() %>%
mutate( # --- Adding the NAs back
name = if_else(name_number == 0, NA_character_, name),
duration = if_else(name_number == 0, NA_integer_, as.integer(duration)),
cumulative = if_else(name_number == 0, NA_integer_, as.integer(cumulative)),
)
我不确定这些是您正在寻找的具体数字,但这代表了在我看来更简单、更惯用的 tidyverse 方法:
test %>%
group_by(state) %>%
mutate(days_last = as.numeric(date - lag(date)),
new_section = 1*(is.na(days_last) | days_last > 1), # EDIT
section = cumsum(new_section),
name = paste(state,section, sep = ".")) %>%
group_by(name) %>%
mutate(duration = as.numeric(max(date) - min(date) + 1),
sequence = case_when(duration == 1 ~ "Unlinked",
row_number() == 1 ~ "First Day",
row_number() == n() ~ "Last Day",
TRUE ~ "Ongoing")) %>%
ungroup()
在这里,我将任何超过一天的间隔标记为新事件,计算累计总和,并用它来定义每个事件的持续时间。
# A tibble: 20 x 8
state date days_last new_section section name duration sequence
<chr> <date> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 Washington 2021-01-01 NA 1 1 Washington.1 1 Unlinked
2 Washington 2021-01-03 2 1 2 Washington.2 2 First Day
3 Washington 2021-01-04 1 0 2 Washington.2 2 Last Day
4 Washington 2021-01-10 6 1 3 Washington.3 1 Unlinked
5 Washington 2021-01-15 5 1 4 Washington.4 5 First Day
6 Washington 2021-01-16 1 0 4 Washington.4 5 Ongoing
7 Washington 2021-01-17 1 0 4 Washington.4 5 Ongoing
8 Washington 2021-01-18 1 0 4 Washington.4 5 Ongoing
9 Washington 2021-01-19 1 0 4 Washington.4 5 Last Day
10 Washington 2021-01-28 9 1 5 Washington.5 1 Unlinked
11 Idaho 2021-01-12 NA 1 1 Idaho.1 3 First Day
12 Idaho 2021-01-13 1 0 1 Idaho.1 3 Ongoing
13 Idaho 2021-01-14 1 0 1 Idaho.1 3 Last Day
14 Idaho 2021-02-01 18 1 2 Idaho.2 1 Unlinked
15 Idaho 2021-02-03 2 1 3 Idaho.3 3 First Day
16 Idaho 2021-02-04 1 0 3 Idaho.3 3 Ongoing
17 Idaho 2021-02-05 1 0 3 Idaho.3 3 Last Day
18 Idaho 2021-02-08 3 1 4 Idaho.4 1 Unlinked
19 Idaho 2021-02-10 2 1 5 Idaho.5 1 Unlinked
20 Idaho 2021-02-14 4 1 6 Idaho.6 1 Unlinked
data.table::rleid
在这里很有用,可以根据条件 if days_last == 1
或 days_next == 1
(即连续日期)创建 运行 长度。如果您想要不同的事件长度,您可以编辑该条件。
library(dplyr)
library(data.table)
test %>%
dplyr::group_by(state) %>%
dplyr::mutate(days_last = c(NA, diff(date)),
days_next = as.numeric(lead(date) - date),
name = paste0(state, ".", data.table::rleid(days_last == 1 | days_next == 1))) %>%
dplyr::group_by(name) %>%
dplyr::mutate(sequence = case_when(
n() == 1 ~ "Not Linked",
row_number() == 1 ~ "First day",
n() == row_number() ~ "Last day",
T ~ "Ongoing"),
duration = n(),
cumulative = seq_along(name)) %>%
dplyr::ungroup()
输出
state date days_last days_next name sequence duration cumulative
<chr> <date> <dbl> <dbl> <chr> <chr> <int> <int>
1 Washington 2021-01-01 NA 2 Washington.1 Not Linked 1 1
2 Washington 2021-01-03 2 1 Washington.2 First day 2 1
3 Washington 2021-01-04 1 6 Washington.2 Last day 2 2
4 Washington 2021-01-10 6 5 Washington.3 Not Linked 1 1
5 Washington 2021-01-15 5 1 Washington.4 First day 5 1
6 Washington 2021-01-16 1 1 Washington.4 Ongoing 5 2
7 Washington 2021-01-17 1 1 Washington.4 Ongoing 5 3
8 Washington 2021-01-18 1 1 Washington.4 Ongoing 5 4
9 Washington 2021-01-19 1 9 Washington.4 Last day 5 5
10 Washington 2021-01-28 9 NA Washington.5 Not Linked 1 1
11 Idaho 2021-01-12 NA 1 Idaho.1 First day 3 1
12 Idaho 2021-01-13 1 1 Idaho.1 Ongoing 3 2
13 Idaho 2021-01-14 1 18 Idaho.1 Last day 3 3
14 Idaho 2021-02-01 18 2 Idaho.2 Not Linked 1 1
15 Idaho 2021-02-03 2 1 Idaho.3 First day 3 1
16 Idaho 2021-02-04 1 1 Idaho.3 Ongoing 3 2
17 Idaho 2021-02-05 1 3 Idaho.3 Last day 3 3
18 Idaho 2021-02-08 3 2 Idaho.4 First day 2 1
19 Idaho 2021-02-10 2 4 Idaho.4 Last day 2 2
20 Idaho 2021-02-14 4 NA Idaho.5 Not Linked 1 1
如果需要,您可以将 days_last
列中的 NA
用于其他行中的 NA
值。
Side question: Why is test$sequence[11] an NA and not "First day"?
通常,在 R NA
中传播,这意味着如果 NA
是评估的一部分,那么通常会返回 NA
。当您定义 sequence
时,您的第一个 ifelse
条件是 link_last == 0 & link_next == 1
。在第 11 行,link_last = NA
和 link_next = 1
。所以你评估的是:
NA == 0 & 1 == 1
[1] NA
相反,您的嵌套条件应该放在第一位。你的 ifelse
目前是如何写的嵌套条件是 not 被评估:
is.na(NA) & 1 == 1
[1] TRUE
这是一个data.table
方法。
library(data.table)
# Convert from data.frame to data.table
setDT(test)
# Subset the variables.
test2 <- test[, .(state, date, days_last = as.numeric(days_last),
days_next = as.numeric(days_next), sequence)]
# Code
test2[, name := paste0(state, '.', rleid(days_last == 1 | days_next == 1)),
by = state][
, ':='(duration = .N,
cumulative = seq(1:.N)),
by = name
][, c('days_next', 'days_last'):=NULL] # Removing these variables. Feel free to add back!
# Reorder the variables
test2 <- setcolorder(test2, c('state', 'name', 'date',
'sequence', 'duration',
'cumulative'))
# Print first 15 rows
print(test2[1:15,])
#> state name date sequence duration cumulative
#> 1: Washington Washington.1 2021-01-01 <NA> 1 1
#> 2: Washington Washington.2 2021-01-03 First day 2 1
#> 3: Washington Washington.2 2021-01-04 Last day 2 2
#> 4: Washington Washington.3 2021-01-10 Not linked 1 1
#> 5: Washington Washington.4 2021-01-15 First day 5 1
#> 6: Washington Washington.4 2021-01-16 Ongoing 5 2
#> 7: Washington Washington.4 2021-01-17 Ongoing 5 3
#> 8: Washington Washington.4 2021-01-18 Ongoing 5 4
#> 9: Washington Washington.4 2021-01-19 Last day 5 5
#> 10: Washington Washington.5 2021-01-28 <NA> 1 1
#> 11: Idaho Idaho.1 2021-01-12 <NA> 3 1
#> 12: Idaho Idaho.1 2021-01-13 Ongoing 3 2
#> 13: Idaho Idaho.1 2021-01-14 Last day 3 3
#> 14: Idaho Idaho.2 2021-02-01 Not linked 1 1
#> 15: Idaho Idaho.3 2021-02-03 First day 3 1
由 reprex package (v0.3.0)
于 2021 年 3 月 16 日创建