Remove complex pattern from datetime sequence using loops

Question

Background:

I have a dataset, df, in which I want to follow a specific timestamp pattern. I would like to first:

1. Identify the timestamp of a 'Connect' value.
2. Check the action that follows, and see whether the next action
   is an 'Ended' or 'Attempt' with a gap of 60 seconds or less.
3. If a gap of 60 seconds or less is present, skip these timestamps
   and keep iterating until the next 'Ended' value, then record that value.

The output pattern should always follow 'Connect' and then 'Ended'.

We start with:

Connect            4/6/2020 1:11:41 PM

Then look to the next line:

Ended              4/6/2020 1:14:20 PM

Now look to the line that follows:

Attempt            4/6/2020 1:15:20 PM

These two timestamps are 60 seconds or less apart, so we keep going until we come across an 'Ended' value where this condition does not apply. So the 'Ended' value below gets recorded:

Ended              4/6/2020 2:05:18 PM
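
The gap check in this walkthrough can be reproduced directly. This is only a small sketch of the arithmetic, assuming lubridate's mdy_hms() for parsing (the same function the answers use); the variable names are made up for illustration.

```r
library(lubridate)

# The 'Ended' row and the 'Attempt' row that follows it
ended   <- mdy_hms("4/6/2020 1:14:20 PM")
attempt <- mdy_hms("4/6/2020 1:15:20 PM")

# Gap between the two rows, in seconds
gap <- as.numeric(difftime(attempt, ended, units = "secs"))
gap        # 60
gap <= 60  # TRUE, so this 'Ended' is skipped
```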

Action             Time
Connect            4/6/2020 1:11:41 PM
Ended              4/6/2020 1:14:20 PM
Attempt            4/6/2020 1:15:20 PM
Connect            4/6/2020 1:15:21 PM
Ended              4/6/2020 2:05:18 PM
Connect            3/31/2020 11:00:08 AM
Ended              3/31/2020 11:14:54 AM
Ended              3/31/2020 4:17:43 PM

As we see below, these rows were deleted because 1:14:20 PM and 1:15:20 PM are within 60 seconds of each other, and because 3/31/2020 4:17:43 PM is not the next immediate 'Ended' value we come across.

Ended              4/6/2020 1:14:20 PM
Attempt            4/6/2020 1:15:20 PM
Connect            4/6/2020 1:15:21 PM
Ended              3/31/2020 4:17:43 PM

Desired output:

Action              Time
Connect             4/6/2020 1:11:41 PM
Ended               4/6/2020 2:05:18 PM
Connect             3/31/2020 11:00:08 AM
Ended               3/31/2020 11:14:54 AM

The output pattern should always follow 'Connect' and then 'Ended'.
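
For reference, the three steps can be sketched as an explicit loop. This is only one reading of the rule, not a definitive implementation: a 'Connect' opens a pair, and an 'Ended' closes it unless the very next row falls within 60 seconds of it. The helper name keep_connect_ended, its max_gap argument, and the use of an absolute gap (the rows are not in chronological order) are assumptions made for illustration. The sample data is retyped from the question.

```r
library(lubridate)

# Sample data retyped from the question, in its original row order
df <- data.frame(
  Action = c("Connect", "Ended", "Attempt", "Connect", "Ended",
             "Connect", "Ended", "Ended"),
  Time = c("4/6/2020 1:11:41 PM", "4/6/2020 1:14:20 PM",
           "4/6/2020 1:15:20 PM", "4/6/2020 1:15:21 PM",
           "4/6/2020 2:05:18 PM", "3/31/2020 11:00:08 AM",
           "3/31/2020 11:14:54 AM", "3/31/2020 4:17:43 PM"),
  stringsAsFactors = FALSE
)

# keep_connect_ended() is a hypothetical helper, not part of any package
keep_connect_ended <- function(df, max_gap = 60) {
  time   <- mdy_hms(df$Time)
  action <- as.character(df$Action)
  n      <- nrow(df)
  keep   <- logical(n)
  open   <- FALSE  # TRUE while a recorded 'Connect' still awaits its 'Ended'

  for (i in seq_len(n)) {
    if (!open && action[i] == "Connect") {
      # Step 1: record the 'Connect' that starts a pair
      keep[i] <- TRUE
      open <- TRUE
    } else if (open && action[i] == "Ended") {
      # Step 2: gap (in seconds) to the row that follows; Inf on the last row
      gap <- if (i < n) {
        abs(as.numeric(difftime(time[i + 1], time[i], units = "secs")))
      } else {
        Inf
      }
      # Step 3: record this 'Ended' only if nothing follows within max_gap
      if (gap > max_gap) {
        keep[i] <- TRUE
        open <- FALSE
      }
    }
    # Every other row ('Attempt', repeated 'Connect' or 'Ended') is skipped
  }
  df[keep, , drop = FALSE]
}

result <- keep_connect_ended(df)
print(result)
```

On the sample data this keeps rows 1, 5, 6 and 7, which are the four rows listed in the desired output.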

Data (dput output):

structure(list(Action = structure(c(2L, 3L, 1L, 2L, 3L, 2L, 3L, 
3L), .Label = c("Attempt", "Connect", "Ended"), class = "factor"), 
 Time = structure(c(4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L), .Label = c("3/31/2020 11:00:08 AM", 
 "3/31/2020 11:14:54 AM", "3/31/2020 4:17:43 PM", "4/6/2020      1:11:41 PM", 
 "4/6/2020 1:14:20 PM", "4/6/2020 1:15:20 PM", "4/6/2020  1:15:21   PM", 
 "4/6/2020 2:05:18 PM"), class = "factor")), class = "data.frame", row.names = c(NA, 
-8L))

This is what I have tried:

I am thinking I should use a loop, but I am unsure how to construct it. Any help is appreciated.

  library(lubridate)

  if (value <= 60) {
    print("")
  } else {
    Expr2
  }

Answer 1

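Here is an approach with dplyr, data.table, and lubridate.

First, we calculate the cumulative time elapsed in the dataset. Next, we use cumsum to break the dataset into connection attempts that are more than 60 seconds apart. Then we group by connection attempt and keep only the non-connect events that occur more than 60 seconds after the first connection attempt. Finally, borrowing from @akrun's approach, we filter out repeated consecutive actions.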

library(lubridate)
library(dplyr)
library(data.table)
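df %>% 
  # parse the timestamps and sort chronologically
  mutate(Time = mdy_hms(Time)) %>%
  dplyr::arrange(Time) %>%
  # seconds elapsed since the earliest event
  mutate(CumTime = cumsum(time_length(Time - dplyr::lag(Time, 1L,default = as.integer(min(mdy_hms(df$Time))))))) %>%
  # for 'Connect' rows: seconds since the previous 'Connect'
  group_by(Action) %>%
  mutate(LastConnect = if_else(Action == "Connect", time_length(CumTime - dplyr::lag(CumTime, 1L, 0)), 0)) %>%
  ungroup %>%
  # start a new interval at each 'Connect' more than 60 seconds after the last one
  mutate(ConnectionInterval = cumsum(Action == "Connect" & LastConnect > 60)) %>%
  dplyr::select(-LastConnect) %>%
  group_by(ConnectionInterval) %>%
  # seconds since the previous event within the interval
  mutate(ConnectCumTime = time_length(Time - dplyr::lag(Time, 1L))) %>% 
  # keep 'Connect' rows, plus the first row of each run of actions that has a gap over 60 seconds
  filter(Action == "Connect" | ConnectCumTime > 60 & !duplicated(rleid(Action)))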
## A tibble: 6 x 5
## Groups:   ConnectionInterval [3]
#  Action  Time                CumTime ConnectionInterval ConnectCumTime
#  <fct>   <dttm>                <dbl>              <int>          <dbl>
#1 Connect 2020-03-31 11:00:08       0                  0             NA
#2 Ended   2020-03-31 11:14:54     886                  0            886
#3 Connect 2020-04-06 13:11:41  526293                  1             NA
#4 Ended   2020-04-06 13:14:20  526452                  1            159
#5 Connect 2020-04-06 13:15:21  526513                  2             NA
#6 Ended   2020-04-06 14:05:18  529510                  2           2997

Answer 2

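We can convert 'Time' to a Datetime class with mdy_hms from lubridate, create a grouping variable based on the occurrence of 'Connect' in 'Action', get the difference between successive 'Time' elements ('Diff'), filter out the rows where the difference is less than or equal to 60, and then filter out the rows where adjacent elements of 'Action' are duplicated.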

library(dplyr)
library(lubridate)
library(data.table)
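df1 %>%
   # parse the timestamps into a datetime column
   mutate(Time1 = mdy_hms(Time)) %>%
   # start a new group at every 'Connect'
   group_by(grp = cumsum(Action == 'Connect')) %>% 
   # seconds since the previous row in the group; if any gap in the
   # group is <= 60, set Diff to 60 for the whole group
   mutate(Diff = difftime(Time1, lag(Time1), units = 'secs'),
     Diff = case_when(any(Diff <=60) ~ 60, TRUE ~ as.numeric(Diff))) %>%
   # keep every 'Connect' and any row with a gap above 60 seconds
   filter(Action == 'Connect'|Diff >60) %>%
   ungroup %>% 
   # keep only the first row of each run of identical adjacent actions
   filter(!duplicated(rleid(Action))) %>% 
   select(Action, Time)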
# A tibble: 4 x 2
#  Action  Time                    
#  <fct>   <fct>                   
#1 Connect 4/6/2020      1:11:41 PM
#2 Ended   4/6/2020 2:05:18 PM     
#3 Connect 3/31/2020 11:00:08 AM   
#4 Ended   3/31/2020 11:14:54 AM
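
The two grouping tricks used above can be tried on plain vectors. This is a small sketch, not part of the original answer: `action` is the Action column from the question, and `filtered` is a hand-typed vector standing in for the Action column as it should look after the `Diff > 60` filter.

```r
library(data.table)  # provides rleid()

action <- c("Connect", "Ended", "Attempt", "Connect", "Ended",
            "Connect", "Ended", "Ended")

# cumsum() over a logical vector: the counter goes up at every 'Connect',
# so each 'Connect' and the rows after it share one group id
grp <- cumsum(action == "Connect")
grp
# 1 1 1 2 2 3 3 3

# rleid() gives each run of identical adjacent values its own id
filtered <- c("Connect", "Connect", "Ended", "Connect", "Ended", "Ended")
run_id <- rleid(filtered)
run_id
# 1 1 2 3 4 4

# Keeping the first element of each run drops the adjacent repeats
deduped <- filtered[!duplicated(run_id)]
deduped
# "Connect" "Ended" "Connect" "Ended"
```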
