r - Generating cumulative sum, total sum, and unique identifiers between start and end dates

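What I am trying to do

I have a data set of protest events in the United States. Some are one-off events, while others continue day after day ("multi-day events"). My data set is structured at the daily level, so a three-day multi-day event is spread across three rows.

I would like to accomplish the following:

  1. Create a cumulative count of the days so far within any given multi-day event. Specifically, I want to count the days between the "first day" and the "last day" of any linked event.
  2. Store the total number of days of each multi-day event as a variable.
  3. Give each multi-day event a "name" by concatenating the state in which the protest occurred with a sequential identifier that starts at 1 within each state and counts upward.

Data

Here is a reproducible example: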

# Library
library(tidyverse) # Brings in dplyr

# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
                   date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15", 
                            "2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
                            "2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
                            "2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))

# Now create some lags and leads
test <- test %>%
  group_by(state) %>%
  mutate(date_lag = lag(date),
         date_lead = lead(date),
         days_last = date - date_lag,
         days_next = date_lead - date,
         link_last = if_else(days_last <= 1, 1, 0),
         link_next = if_else(days_next <= 1, 1, 0),
         sequence = if_else(link_last == 0 & link_next == 1, "First day",
                            if_else(is.na(link_last) == TRUE & link_next == 1, "First day",
                                    if_else(link_last == 1 & link_next == 1, "Ongoing",
                                            if_else(link_last == 1 & link_next == 0, "Last day", 
                                                    if_else(link_last == 1 & is.na(link_next)==TRUE, "Last day", "Not linked"))))))

This produces the following data frame:

   state      date       date_lag   date_lead  days_last days_next link_last link_next sequence  
   <chr>      <date>     <date>     <date>     <drtn>    <drtn>        <dbl>     <dbl> <chr>     
 1 Washington 2021-01-01 NA         2021-01-03 NA days    2 days          NA         0 NA        
 2 Washington 2021-01-03 2021-01-01 2021-01-04  2 days    1 days           0         1 First day 
 3 Washington 2021-01-04 2021-01-03 2021-01-10  1 days    6 days           1         0 Last day  
 4 Washington 2021-01-10 2021-01-04 2021-01-15  6 days    5 days           0         0 Not linked
 5 Washington 2021-01-15 2021-01-10 2021-01-16  5 days    1 days           0         1 First day 
 6 Washington 2021-01-16 2021-01-15 2021-01-17  1 days    1 days           1         1 Ongoing   
 7 Washington 2021-01-17 2021-01-16 2021-01-18  1 days    1 days           1         1 Ongoing   
 8 Washington 2021-01-18 2021-01-17 2021-01-19  1 days    1 days           1         1 Ongoing   
 9 Washington 2021-01-19 2021-01-18 2021-01-28  1 days    9 days           1         0 Last day  
10 Washington 2021-01-28 2021-01-19 NA          9 days   NA days           0        NA NA        
11 Idaho      2021-01-12 NA         2021-01-13 NA days    1 days          NA         1 NA        
12 Idaho      2021-01-13 2021-01-12 2021-01-14  1 days    1 days           1         1 Ongoing   
13 Idaho      2021-01-14 2021-01-13 2021-02-01  1 days   18 days           1         0 Last day  
14 Idaho      2021-02-01 2021-01-14 2021-02-03 18 days    2 days           0         0 Not linked
15 Idaho      2021-02-03 2021-02-01 2021-02-04  2 days    1 days           0         1 First day 
16 Idaho      2021-02-04 2021-02-03 2021-02-05  1 days    1 days           1         1 Ongoing   
17 Idaho      2021-02-05 2021-02-04 2021-02-08  1 days    3 days           1         0 Last day  
18 Idaho      2021-02-08 2021-02-05 2021-02-10  3 days    2 days           0         0 Not linked
19 Idaho      2021-02-10 2021-02-08 2021-02-14  2 days    4 days           0         0 Not linked
20 Idaho      2021-02-14 2021-02-10 NA          4 days   NA days           0        NA NA    

What I would like to create:

   state      date       date_lag   date_lead  days_last days_next link_last link_next sequence   cumulative duration name        
   <chr>      <date>     <date>     <date>     <drtn>    <drtn>        <dbl>     <dbl> <chr>           <dbl>    <dbl> <chr>       
 1 Washington 2021-01-01 NA         2021-01-03 NA days    2 days          NA         0 NA                 NA        0 NA          
 2 Washington 2021-01-03 2021-01-01 2021-01-04  2 days    1 days           0         1 First day           1        2 Washington.1
 3 Washington 2021-01-04 2021-01-03 2021-01-10  1 days    6 days           1         0 Last day            2        2 Washington.1
 4 Washington 2021-01-10 2021-01-04 2021-01-15  6 days    5 days           0         0 Not linked         NA        0 NA          
 5 Washington 2021-01-15 2021-01-10 2021-01-16  5 days    1 days           0         1 First day           1        5 Washington.2
 6 Washington 2021-01-16 2021-01-15 2021-01-17  1 days    1 days           1         1 Ongoing             2        5 Washington.2
 7 Washington 2021-01-17 2021-01-16 2021-01-18  1 days    1 days           1         1 Ongoing             3        5 Washington.2
 8 Washington 2021-01-18 2021-01-17 2021-01-19  1 days    1 days           1         1 Ongoing             4        5 Washington.2
 9 Washington 2021-01-19 2021-01-18 2021-01-28  1 days    9 days           1         0 Last day            5        5 Washington.2
10 Washington 2021-01-28 2021-01-19 NA          9 days   NA days           0        NA NA                 NA       NA NA          
11 Idaho      2021-01-12 NA         2021-01-13 NA days    1 days          NA         1 NA                  1        3 Idaho.1     
12 Idaho      2021-01-13 2021-01-12 2021-01-14  1 days    1 days           1         1 Ongoing             2        3 Idaho.1     
13 Idaho      2021-01-14 2021-01-13 2021-02-01  1 days   18 days           1         0 Last day            3        3 Idaho.1     
14 Idaho      2021-02-01 2021-01-14 2021-02-03 18 days    2 days           0         0 Not linked         NA       NA NA          
15 Idaho      2021-02-03 2021-02-01 2021-02-04  2 days    1 days           0         1 First day           1        3 Idaho.2     
16 Idaho      2021-02-04 2021-02-03 2021-02-05  1 days    1 days           1         1 Ongoing             2        3 Idaho.2     
17 Idaho      2021-02-05 2021-02-04 2021-02-08  1 days    3 days           1         0 Last day            3        3 Idaho.2     
18 Idaho      2021-02-08 2021-02-05 2021-02-10  3 days    2 days           0         0 Not linked         NA       NA NA          
19 Idaho      2021-02-10 2021-02-08 2021-02-14  2 days    4 days           0         0 Not linked         NA       NA NA          
20 Idaho      2021-02-14 2021-02-10 NA          4 days   NA days           0        NA NA                 NA       NA NA  

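Side question: Why is test$sequence[11] an NA and not "First day"?

I think it is easier to write dedicated functions for the counting than to try to do everything in a single pipe.

I kept all the intermediate steps and intermediate columns in the output so you can see what each step does. You most likely do not need to keep all of these columns, and once you understand the approach you can probably simplify the steps.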

library(tidyverse) # Brings in dplyr

# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
                   date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15", 
                                           "2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
                                           "2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
                                           "2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))

event_count <- function(v){
  cnt <- 0
  result <- integer(length(v))
  for(idx in seq_along(v)) {
    if(v[idx]) {
      cnt <- 0
    } else {
      cnt <- cnt + 1
    }
    
    result[idx] <- cnt
  }
  
  result
}

need_name <- function(cnt) {
  result <- logical(length(cnt))
  for(idx in seq_along(cnt)){
     if(cnt[idx] == 0){
       if(idx == length(cnt)){
         result[idx] <- FALSE
         break
       }
       
       result[idx] <- (cnt[idx + 1] != 0)
     } else{
       result[idx] <- TRUE
     }
  }
  
  result
}

running_count <- function(v) {
  cnt <- 0
  flag <- FALSE
  result <- integer(length(v))
  for(idx in seq_along(v)){
    if(v[idx]) {
      
      if(!flag) {
        cnt <- cnt + 1
        flag <- !flag
      }
      
      result[idx] <- cnt
    } else{
      result[idx] <- 0
      flag <- FALSE
    }
  }
  
  result
}


test %>%
  group_by(state) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    duration = date - lag(date),   # --- Gap (in days) since the previous protest in this state
    is_first = duration != 1       # --- TRUE when the row starts a new event (gap is not exactly 1 day)
  ) %>%
  replace_na(list(is_first = TRUE)) %>%  # --- No more NAs
  ungroup %>%
  mutate(
    cnt = event_count(is_first),  # --- How many days does this event have?
    need_name = need_name(cnt)    # --- Should we name this event?
  ) %>%
  group_by(state) %>%
  mutate(
    name_number = running_count(need_name)  # --- What's the event count?
  ) %>%
  mutate(
    name = paste0(state, ".", name_number),  # ---- Create names
    cumulative = cnt + 1  # --- Start counting from one instead of zero
  ) %>% 
  group_by(name) %>% 
  mutate(
    duration = max(cumulative)  # --- Total duration = highest day count within the event
  ) %>% 
  ungroup() %>% 
  mutate(  # --- Adding the NAs back
    name = if_else(name_number == 0, NA_character_, name),
    duration = if_else(name_number == 0, NA_integer_, as.integer(duration)),
    cumulative = if_else(name_number == 0, NA_integer_, as.integer(cumulative)),
  )
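As a side note, the loop in event_count() can be written without a loop. The following is only a sketch of an equivalent, not part of the answer above: event_count_vec is a hypothetical name, and it assumes the input contains no NA (the pipeline above already replaces them with TRUE).

```r
# Vectorised stand-in for the loop in event_count(): for each position,
# count the rows since the most recent TRUE (0 on the TRUE row itself).
# Assumption: v is a logical vector without NA values.
event_count_vec <- function(v) {
  idx <- seq_along(v)
  # idx * v is the row index on TRUE rows and 0 elsewhere;
  # cummax() carries forward the index of the latest TRUE seen so far.
  idx - cummax(idx * v)
}

# is_first flags for the ten Washington rows of the example data
is_first <- c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
cnt <- event_count_vec(is_first)
print(cnt)
```

Adding 1 to this result gives the cumulative column for rows that belong to a multi-day event.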

I am not sure these are the exact numbers you are looking for, but this is what seems to me a simpler and more idiomatic tidyverse approach:

test %>%
  group_by(state) %>%
  mutate(days_last = as.numeric(date - lag(date)),
         new_section = 1*(is.na(days_last) | days_last > 1),   # 1 on the first row of a state or after a gap of more than one day
         section = cumsum(new_section),
         name = paste(state,section, sep = ".")) %>%
  group_by(name) %>%
  mutate(duration = as.numeric(max(date) - min(date) + 1),
         sequence = case_when(duration == 1 ~ "Unlinked",
                              row_number() == 1 ~ "First Day",
                              row_number() == n() ~ "Last Day",
                              TRUE ~ "Ongoing")) %>%
  ungroup()

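Here I flag any gap of more than one day as the start of a new event, take the cumulative sum of those flags to number the events, and use that numbering to define the duration of each event.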

# A tibble: 20 x 8
   state      date       days_last new_section section name         duration sequence 
   <chr>      <date>         <dbl>       <dbl>   <dbl> <chr>           <dbl> <chr>    
 1 Washington 2021-01-01        NA           1       1 Washington.1        1 Unlinked 
 2 Washington 2021-01-03         2           1       2 Washington.2        2 First Day
 3 Washington 2021-01-04         1           0       2 Washington.2        2 Last Day 
 4 Washington 2021-01-10         6           1       3 Washington.3        1 Unlinked 
 5 Washington 2021-01-15         5           1       4 Washington.4        5 First Day
 6 Washington 2021-01-16         1           0       4 Washington.4        5 Ongoing  
 7 Washington 2021-01-17         1           0       4 Washington.4        5 Ongoing  
 8 Washington 2021-01-18         1           0       4 Washington.4        5 Ongoing  
 9 Washington 2021-01-19         1           0       4 Washington.4        5 Last Day 
10 Washington 2021-01-28         9           1       5 Washington.5        1 Unlinked 
11 Idaho      2021-01-12        NA           1       1 Idaho.1             3 First Day
12 Idaho      2021-01-13         1           0       1 Idaho.1             3 Ongoing  
13 Idaho      2021-01-14         1           0       1 Idaho.1             3 Last Day 
14 Idaho      2021-02-01        18           1       2 Idaho.2             1 Unlinked 
15 Idaho      2021-02-03         2           1       3 Idaho.3             3 First Day
16 Idaho      2021-02-04         1           0       3 Idaho.3             3 Ongoing  
17 Idaho      2021-02-05         1           0       3 Idaho.3             3 Last Day 
18 Idaho      2021-02-08         3           1       4 Idaho.4             1 Unlinked 
19 Idaho      2021-02-10         2           1       5 Idaho.5             1 Unlinked 
20 Idaho      2021-02-14         4           1       6 Idaho.6             1 Unlinked 
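The flag-then-cumsum idea can be checked on a handful of dates in base R. This is a minimal sketch under my own assumptions: the dates are the first six Idaho rows of the example, and the cumulative column is my addition, since the pipeline above does not compute it.

```r
# Six dates: a three-day event, a single day, and a two-day event
dates <- as.Date(c("2021-01-12", "2021-01-13", "2021-01-14",
                   "2021-02-01", "2021-02-03", "2021-02-04"))

# Days since the previous row; the first row has no predecessor
gap <- c(NA, as.numeric(diff(dates)))

# A new event starts on the first row or after a gap of more than one day
new_section <- is.na(gap) | gap > 1

# The running total of the flags numbers the events 1, 2, 3, ...
section <- cumsum(new_section)

# Per-event duration (inclusive day span) and running day count
duration <- ave(as.numeric(dates), section,
                FUN = function(d) max(d) - min(d) + 1)
cumulative <- ave(section, section, FUN = seq_along)

print(data.frame(dates, gap, section, duration, cumulative))
```

In the dplyr pipeline, the same running count could be added as cumulative = row_number() inside the group_by(name) step.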

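data.table::rleid is useful here: it creates run-length ids based on the condition days_last == 1 | days_next == 1 (i.e., consecutive dates). If you want a different definition of a linked event, you can edit that condition.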

library(dplyr)
library(data.table)

test %>% 
  dplyr::group_by(state) %>% 
  dplyr::mutate(days_last = c(NA, diff(date)),
                days_next = as.numeric(lead(date) - date),
                name = paste0(state, ".", data.table::rleid(days_last == 1 | days_next == 1))) %>% 
  dplyr::group_by(name) %>% 
  dplyr::mutate(sequence = case_when(
    n() == 1 ~ "Not Linked",
    row_number() == 1 ~ "First day",
    n() == row_number() ~ "Last day",
    T ~ "Ongoing"),
  duration = n(),
  cumulative = seq_along(name)) %>% 
  dplyr::ungroup() 

Output

   state      date       days_last days_next name         sequence   duration cumulative
   <chr>      <date>         <dbl>     <dbl> <chr>        <chr>         <int>      <int>
 1 Washington 2021-01-01        NA         2 Washington.1 Not Linked        1          1
 2 Washington 2021-01-03         2         1 Washington.2 First day         2          1
 3 Washington 2021-01-04         1         6 Washington.2 Last day          2          2
 4 Washington 2021-01-10         6         5 Washington.3 Not Linked        1          1
 5 Washington 2021-01-15         5         1 Washington.4 First day         5          1
 6 Washington 2021-01-16         1         1 Washington.4 Ongoing           5          2
 7 Washington 2021-01-17         1         1 Washington.4 Ongoing           5          3
 8 Washington 2021-01-18         1         1 Washington.4 Ongoing           5          4
 9 Washington 2021-01-19         1         9 Washington.4 Last day          5          5
10 Washington 2021-01-28         9        NA Washington.5 Not Linked        1          1
11 Idaho      2021-01-12        NA         1 Idaho.1      First day         3          1
12 Idaho      2021-01-13         1         1 Idaho.1      Ongoing           3          2
13 Idaho      2021-01-14         1        18 Idaho.1      Last day          3          3
14 Idaho      2021-02-01        18         2 Idaho.2      Not Linked        1          1
15 Idaho      2021-02-03         2         1 Idaho.3      First day         3          1
16 Idaho      2021-02-04         1         1 Idaho.3      Ongoing           3          2
17 Idaho      2021-02-05         1         3 Idaho.3      Last day          3          3
18 Idaho      2021-02-08         3         2 Idaho.4      First day         2          1
19 Idaho      2021-02-10         2         4 Idaho.4      Last day          2          2
20 Idaho      2021-02-14         4        NA Idaho.5      Not Linked        1          1

If desired, you can use the NA values in the days_last column to set the corresponding values in the other columns to NA.
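One thing to watch in the output above: rows 18 and 19 (2021-02-08 and 2021-02-10) share the name Idaho.4 even though they are two days apart, because rleid puts adjacent rows with the same condition value (here FALSE) into one run. The sketch below reproduces this with the Idaho gaps and compares it with a gap-based cumsum id. The vectors are copied from the output above; gap_id is a hypothetical name for the alternative.

```r
library(data.table)

# days_last / days_next for the ten Idaho rows, copied from the output above
days_last <- c(NA, 1, 1, 18, 2, 1, 1, 3, 2, 4)
days_next <- c(1, 1, 18, 2, 1, 1, 3, 2, 4, NA)

# Condition used in the answer: TRUE when a row touches a neighbouring day
linked <- days_last == 1 | days_next == 1

# rleid() numbers runs of equal values, so the two consecutive FALSE rows
# (2021-02-08 and 2021-02-10) end up with the same id
run_id <- rleid(linked)
print(run_id)

# Alternative (assumption: a new event starts on the first row or after a
# gap of more than one day), which keeps unlinked days apart
gap_id <- cumsum(is.na(days_last) | days_last > 1)
print(gap_id)
```

Which id is appropriate depends on whether unlinked days need distinct names. In the desired output of the question they are all NA, so either id could be used once the single-day events are blanked out.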


Side question: Why is test$sequence[11] an NA and not "First day"?

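In general, NA propagates in R: if an NA is part of an evaluation, the result is usually NA. When you define sequence, the first if_else condition is link_last == 0 & link_next == 1. On row 11, link_last is NA and link_next is 1, so what gets evaluated is: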

NA == 0 & 1 == 1
[1] NA

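Instead, the nested is.na() condition should come first. As the if_else chain is currently written, the outer condition is already NA for that row, so the nested condition never takes effect. Evaluated on its own, it gives what you want: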

is.na(NA) & 1 == 1
[1] TRUE
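One way to keep NA out of the comparisons altogether is to replace missing link flags before labelling. This is a sketch of a different technique from reordering the if_else calls: label_sequence is a hypothetical helper, and it assumes that a missing neighbour (first or last row of a state) should count as "not linked", which differs from the NA labels in the desired output.

```r
library(dplyr)

# Hypothetical helper: label each row from its link flags.
# Assumption: a missing flag means there is no neighbouring protest day.
label_sequence <- function(link_last, link_next) {
  last <- coalesce(link_last, 0)  # NA -> 0, so comparisons never return NA
  nxt  <- coalesce(link_next, 0)
  case_when(
    last == 0 & nxt == 1 ~ "First day",
    last == 1 & nxt == 1 ~ "Ongoing",
    last == 1 & nxt == 0 ~ "Last day",
    TRUE                 ~ "Not linked"
  )
}

# Link flags for the ten Idaho rows of the example data
link_last <- c(NA, 1, 1, 0, 0, 1, 1, 0, 0, 0)
link_next <- c(1, 1, 0, 0, 1, 1, 0, 0, 0, NA)

labels <- label_sequence(link_last, link_next)
print(labels)
```

With this, row 11 (the first Idaho row) is labelled "First day" rather than NA.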

Here is a data.table approach.

library(data.table)

# Convert from data.frame to data.table
setDT(test)

# Subset the variables.
test2 <- test[, .(state, date, days_last = as.numeric(days_last), 
                  days_next = as.numeric(days_next), sequence)]

# Code
test2[, name := paste0(state, '.', rleid(days_last == 1 | days_next == 1)),
     by = state][
       , ':='(duration = .N,
              cumulative = seq_len(.N)),
       by = name
     ][, c('days_next', 'days_last'):=NULL] # Removing these variables. Feel free to add back!

# Reorder the variables
test2 <- setcolorder(test2, c('state', 'name', 'date', 
                              'sequence', 'duration', 
                              'cumulative'))
# Print first 15 rows
print(test2[1:15,])

#>          state         name       date   sequence duration cumulative
#>  1: Washington Washington.1 2021-01-01       <NA>        1          1
#>  2: Washington Washington.2 2021-01-03  First day        2          1
#>  3: Washington Washington.2 2021-01-04   Last day        2          2
#>  4: Washington Washington.3 2021-01-10 Not linked        1          1
#>  5: Washington Washington.4 2021-01-15  First day        5          1
#>  6: Washington Washington.4 2021-01-16    Ongoing        5          2
#>  7: Washington Washington.4 2021-01-17    Ongoing        5          3
#>  8: Washington Washington.4 2021-01-18    Ongoing        5          4
#>  9: Washington Washington.4 2021-01-19   Last day        5          5
#> 10: Washington Washington.5 2021-01-28       <NA>        1          1
#> 11:      Idaho      Idaho.1 2021-01-12       <NA>        3          1
#> 12:      Idaho      Idaho.1 2021-01-13    Ongoing        3          2
#> 13:      Idaho      Idaho.1 2021-01-14   Last day        3          3
#> 14:      Idaho      Idaho.2 2021-02-01 Not linked        1          1
#> 15:      Idaho      Idaho.3 2021-02-03  First day        3          1
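
The code above relies on the days_last, days_next and sequence columns already computed in the question. The sketch below starts from state and date only, using shift() for the lag and a gap-based cumsum instead of rleid. The toy data and the column names gap and event are illustrative assumptions.

```r
library(data.table)

# Toy data: a subset of the example dates for two states
dt <- data.table(
  state = c(rep("Idaho", 4), rep("Washington", 3)),
  date  = as.Date(c("2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01",
                    "2021-01-01", "2021-01-03", "2021-01-04"))
)
setorder(dt, state, date)

# Days since the previous protest in the same state (NA on the first row)
dt[, gap := as.numeric(date - shift(date)), by = state]

# A new event starts on the first row of a state or after a gap > 1 day
dt[, event := cumsum(is.na(gap) | gap > 1), by = state]

# Total length and running day count within each event
dt[, `:=`(duration = .N, cumulative = seq_len(.N)), by = .(state, event)]

# Name the events as state.number
dt[, name := paste0(state, ".", event)]
print(dt)
```

Single-day events get their own name here; they could be set to NA afterwards with a filter on duration == 1.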

Created on 2021-03-16 by the reprex package (v0.3.0)