Group, take duration and set condition within R (dplyr, r)

I have a dataset, df (the full dataset contains over 4000 rows):

  DATEB

  9/9/2019 7:51:58 PM
  9/9/2019 7:51:59 PM
  9/9/2019 7:51:59 PM
  9/9/2019 7:52:00 PM
  9/9/2019 7:52:01 PM
  9/9/2019 7:52:01 PM
  9/9/2019 7:52:02 PM
  9/9/2019 7:52:03 PM
  9/9/2019 7:54:00 PM
  9/9/2019 7:54:02 PM
  9/10/2019 8:00:00 PM

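If the time between consecutive datetimes exceeds 120 seconds, I would like to place them in different groups and calculate the duration of each group.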

Desired output:

Group   Duration

 a       5 sec
 b       2 sec
 c       0 sec




 dput:


  structure(list(DATEB = structure(c(2L, 3L, 3L, 4L, 5L, 5L, 6L, 
  7L, 8L, 9L, 1L), .Label = c("      9/10/2019 8:00:00 PM", "      9/9/2019 7:51:58 PM", 
  "      9/9/2019 7:51:59 PM", "      9/9/2019 7:52:00 PM", "      9/9/2019 7:52:01 PM", 
  "      9/9/2019 7:52:02 PM", "      9/9/2019 7:52:03 PM", "      9/9/2019 7:54:00 PM", 
  "      9/9/2019 7:54:02 PM"), class = "factor")), class = "data.frame", row.names = c(NA, 
  -11L))

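I have tried the code below, which works well, except that I would like 7:51:59 and 7:52:00 to be in the same group. The only time the duration should break and create a new group is when the time between datetimes exceeds 120 secs.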

   df %>%
     mutate(DATEB = lubridate::mdy_hms(DATEB), 
            temp = lubridate::floor_date(DATEB, "120 secs")) %>%
     group_by(temp) %>%
     summarise(duration = difftime(max(DATEB), min(DATEB), units = "secs"))

Any suggestions are welcome.

We can use cut here:

library(dplyr)
df %>%
  mutate(DATEB = lubridate::mdy_hms(DATEB), 
        temp = cut(DATEB, breaks = "2 mins")) %>%
  group_by(temp) %>%
  summarise(duration = difftime(max(DATEB), min(DATEB), units = "secs"))

# A tibble: 3 x 2
#  temp                duration
#  <fct>               <drtn>  
#1 2019-09-09 19:51:00 5 secs  
#2 2019-09-09 19:53:00 2 secs  
#3 2019-09-10 19:59:00 0 secs  
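
One caveat, offered as a hedged aside: cut() with breaks = "2 mins" assigns rows to fixed two-minute windows that start at the earliest timestamp floored to the minute, so it does not look at the gap between neighbouring rows. The sketch below uses three made-up timestamps (not taken from the question's data) and base R only; the UTC time zone is an assumption.

```r
# Hypothetical timestamps: the first anchors the bins at 19:51:00,
# the last two are only one second apart.
x <- as.POSIXct(c("2019-09-09 19:51:58",
                  "2019-09-09 19:52:59",
                  "2019-09-09 19:53:00"), tz = "UTC")

# Fixed two-minute windows: [19:51, 19:53), [19:53, 19:55)
bins <- cut(x, breaks = "2 mins")

as.integer(bins)  # 1 1 2
```

The last two values end up in different bins although they are one second apart, so on other data this approach can split rows that a gap-based rule would keep together.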

The OP has requested:

The only time the duration should break and create a new group, is when the time in between datetimes exceed 120 secs.

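The words "time in between datetimes" suggest that the OP is looking for gaps or pauses. (Well, that is what I would look for if I were given a vector of ordered datetimes and tasked with grouping the data.)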

Unfortunately, neither the expected result nor the accepted answer is in line with this interpretation.
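
To make the mismatch concrete, here is a minimal base R sketch that applies the 120-second rule literally to the question's eleven timestamps. The ISO-formatted strings and the UTC time zone are assumptions made to keep the example self-contained, and the variable names are illustrative.

```r
# Timestamps from the question, rewritten in ISO format and parsed as UTC
# (the time zone is an assumption; the question does not state one).
x <- as.POSIXct(c(
  "2019-09-09 19:51:58", "2019-09-09 19:51:59", "2019-09-09 19:51:59",
  "2019-09-09 19:52:00", "2019-09-09 19:52:01", "2019-09-09 19:52:01",
  "2019-09-09 19:52:02", "2019-09-09 19:52:03", "2019-09-09 19:54:00",
  "2019-09-09 19:54:02", "2019-09-10 20:00:00"
), tz = "UTC")

# Gap to the previous timestamp, forced to seconds
gap <- c(0, as.numeric(diff(x), units = "secs"))

# A new group starts only where the gap exceeds 120 seconds
grp <- cumsum(gap > 120)

# Duration of each group in seconds
duration <- tapply(as.numeric(x), grp, function(s) max(s) - min(s))

max(gap[-length(gap)])  # largest gap on 9/9: 117
duration                # group 0: 124, group 1: 0
```

The largest gap on 9/9 is 117 seconds, which does not exceed 120, so a literal reading gives two groups (124 and 0 seconds) rather than the three groups in the expected output.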

However, this is what I would do:

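library(dplyr)

gap_threshold <- 10  # in seconds
df %>%
  mutate(DATEB = lubridate::mdy_hms(DATEB), 
         # gap to the previous row, with the unit fixed to seconds
         gap = c(0, as.numeric(diff(DATEB), units = "secs"))) %>% 
  # start a new group whenever the gap exceeds the threshold
  group_by(grp = cumsum(gap > gap_threshold)) %>% 
  summarise(begin = min(DATEB), end = max(DATEB), duration = difftime(end, begin, units = "secs"))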
# A tibble: 3 x 4
    grp begin               end                 duration
  <int> <dttm>              <dttm>              <drtn>  
1     0 2019-09-09 19:51:58 2019-09-09 19:52:03 5 secs  
2     1 2019-09-09 19:54:00 2019-09-09 19:54:02 2 secs  
3     2 2019-09-10 20:00:00 2019-09-10 20:00:00 0 secs
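
If the letter labels from the desired output are wanted, the same idea can be wrapped in a small helper. This is only a sketch: summarise_by_gap and its threshold_secs argument are hypothetical names (not part of dplyr or lubridate), the input is assumed to be sorted by DATEB already, and letters only covers up to 26 groups.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: group rows by gaps larger than threshold_secs
# and label the groups a, b, c, ... (at most 26 groups).
summarise_by_gap <- function(data, threshold_secs) {
  data %>%
    mutate(
      DATEB = mdy_hms(DATEB),
      # gap to the previous row in seconds; the first row gets 0
      gap = c(0, as.numeric(diff(DATEB), units = "secs"))
    ) %>%
    group_by(grp = cumsum(gap > threshold_secs)) %>%
    summarise(duration = difftime(max(DATEB), min(DATEB), units = "secs")) %>%
    mutate(Group = letters[grp + 1]) %>%
    select(Group, Duration = duration)
}

# Small sample built from timestamps in the question
sample_df <- data.frame(
  DATEB = c("9/9/2019 7:51:58 PM", "9/9/2019 7:52:03 PM",
            "9/9/2019 7:54:00 PM", "9/9/2019 7:54:02 PM",
            "9/10/2019 8:00:00 PM"),
  stringsAsFactors = FALSE
)

res <- summarise_by_gap(sample_df, threshold_secs = 10)
res
```

With a threshold of 10 seconds this sample yields groups a, b and c with durations of 5, 2 and 0 seconds.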