使用循环从日期时间序列中删除复杂模式
Remove complex pattern from datetime sequence using loops
背景:
我有一个数据集 df,我想在其中遵循特定的时间戳模式。
我想先
1. Identify the 'Connect' value timestamp
2. Check the action that follows, and check to see if the next action
is an 'Ended' or 'Attempt' with a less than or equal to 60 second gap
3. If this <= gap of 60 second is present, I wish for the code to Skip these timestamps
and keep iterating until it comes to the next 'Ended' value, and to record this value.
输出模式应始终遵循 'Connect' 和 'Ended'
We start with:
Connect 4/6/2020 1:11:41 PM
Then look to the next line:
Ended 4/6/2020 1:14:20 PM
Now look to the line that follows:
Attempt 4/6/2020 1:15:20 PM
These two timestamps are less than or equal to 60 seconds, so we keep going
until we come across an Ended value where these conditions do not apply.
So the Ended value of
Ended 4/6/2020 2:05:18 PM gets recorded.
Action Time
Connect 4/6/2020 1:11:41 PM
Ended 4/6/2020 1:14:20 PM
Attempt 4/6/2020 1:15:20 PM
Connect 4/6/2020 1:15:21 PM
Ended 4/6/2020 2:05:18 PM
Connect 3/31/2020 11:00:08 AM
Ended 3/31/2020 11:14:54 AM
Ended 3/31/2020 4:17:43 PM
正如我们在下面看到的,这些行已被删除,因为 1:14:20PM 和 1:15:20PM 彼此相差不超过 60 秒
3/31/2020 4:17:43 PM 不是我们遇到的下一个即时 'Ended' 值。
Ended 4/6/2020 1:14:20 PM
Attempt 4/6/2020 1:15:20 PM
Connect 4/6/2020 1:15:21 PM
Ended 3/31/2020 4:17:43 PM
期望输出:
Action Time
Connect 4/6/2020 1:11:41 PM
Ended 4/6/2020 2:05:18 PM
Connect 3/31/2020 11:00:08 AM
Ended 3/31/2020 11:14:54 AM
输出模式应始终遵循 'Connect' 和 'Ended'
输出:
structure(list(Action = structure(c(2L, 3L, 1L, 2L, 3L, 2L, 3L,
3L), .Label = c("Attempt", "Connect", "Ended"), class = "factor"),
Time = structure(c(4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L), .Label = c("3/31/2020 11:00:08 AM",
"3/31/2020 11:14:54 AM", "3/31/2020 4:17:43 PM", "4/6/2020 1:11:41 PM",
"4/6/2020 1:14:20 PM", "4/6/2020 1:15:20 PM", "4/6/2020 1:15:21 PM",
"4/6/2020 2:05:18 PM"), class = "factor")), class = "data.frame", row.names = c(NA,
-8L))
这是我试过的:
我在想我应该使用循环,但不确定如何构造它。感谢任何帮助。
library(lubridate)
if (value <= 60) {
print("")
} else {
Expr2
}
这是 dplyr
、data.table
和 lubridate
的方法。
首先,我们计算数据集中已经过去的累计时间。接下来,我们使用 cumsum
将数据集分解为间隔大于 60 秒的连接尝试。然后,我们按连接尝试分组,并且仅保留在第一次连接尝试后超过 60 秒后发生的非连接事件。然后借用@akrun 的方法,过滤重复的连续动作。
library(lubridate)
library(dplyr)
library(data.table)
df %>%
mutate(Time = mdy_hms(Time)) %>%
dplyr::arrange(Time) %>%
mutate(CumTime = cumsum(time_length(Time - dplyr::lag(Time, 1L,default = as.integer(min(mdy_hms(df$Time))))))) %>%
group_by(Action) %>%
mutate(LastConnect = if_else(Action == "Connect", time_length(CumTime - dplyr::lag(CumTime, 1L, 0)), 0)) %>%
ungroup %>%
mutate(ConnectionInterval = cumsum(Action == "Connect" & LastConnect > 60)) %>%
dplyr::select(-LastConnect) %>%
group_by(ConnectionInterval) %>%
mutate(ConnectCumTime = time_length(Time - dplyr::lag(Time, 1L))) %>%
filter(Action == "Connect" | ConnectCumTime > 60 & !duplicated(rleid(Action)))
## A tibble: 6 x 5
## Groups: ConnectionInterval [3]
# Action Time CumTime ConnectionInterval ConnectCumTime
# <fct> <dttm> <dbl> <int> <dbl>
#1 Connect 2020-03-31 11:00:08 0 0 NA
#2 Ended 2020-03-31 11:14:54 886 0 886
#3 Connect 2020-04-06 13:11:41 526293 1 NA
#4 Ended 2020-04-06 13:14:20 526452 1 159
#5 Connect 2020-04-06 13:15:21 526513 2 NA
#6 Ended 2020-04-06 14:05:18 529510 2 2997
我们可以将 'Time' 转换为 Datetime
class,从 lubridate
转换为 mdy_hms
,根据 [=20] 的出现创建分组变量=] in 'Action',得到'Time'个元素的差('Diff'),filter
出差小于等于60的行,然后filter
出 duplicated
相似元素行 'Action'
library(dplyr)
library(lubridate)
library(data.table)
df1 %>%
mutate(Time1 = mdy_hms(Time)) %>%
group_by(grp = cumsum(Action == 'Connect')) %>%
mutate(Diff = difftime(Time1, lag(Time1), unit = 'sec'),
Diff = case_when(any(Diff <=60) ~ 60, TRUE ~ as.numeric(Diff))) %>%
filter(Action == 'Connect'|Diff >60) %>%
ungroup %>%
filter(!duplicated(rleid(Action))) %>%
select(Action, Time)
# A tibble: 4 x 2
# Action Time
# <fct> <fct>
#1 Connect 4/6/2020 1:11:41 PM
#2 Ended 4/6/2020 2:05:18 PM
#3 Connect 3/31/2020 11:00:08 AM
#4 Ended 3/31/2020 11:14:54 AM
背景:
我有一个数据集 df,我想在其中遵循特定的时间戳模式。 我想先
1. Identify the 'Connect' value timestamp
2. Check the action that follows, and check to see if the next action
is an 'Ended' or 'Attempt' with a less than or equal to 60 second gap
3. If this <= gap of 60 second is present, I wish for the code to Skip these timestamps
and keep iterating until it comes to the next 'Ended' value, and to record this value.
输出模式应始终遵循 'Connect' 和 'Ended'
We start with:
Connect 4/6/2020 1:11:41 PM
Then look to the next line:
Ended 4/6/2020 1:14:20 PM
Now look to the line that follows:
Attempt 4/6/2020 1:15:20 PM
These two timestamps are less than or equal to 60 seconds, so we keep going
until we come across an Ended value where these conditions do not apply.
So the Ended value of
Ended 4/6/2020 2:05:18 PM gets recorded.
Action Time
Connect 4/6/2020 1:11:41 PM
Ended 4/6/2020 1:14:20 PM
Attempt 4/6/2020 1:15:20 PM
Connect 4/6/2020 1:15:21 PM
Ended 4/6/2020 2:05:18 PM
Connect 3/31/2020 11:00:08 AM
Ended 3/31/2020 11:14:54 AM
Ended 3/31/2020 4:17:43 PM
正如我们在下面看到的,这些行已被删除,因为 1:14:20PM 和 1:15:20PM 彼此相差不超过 60 秒 3/31/2020 4:17:43 PM 不是我们遇到的下一个即时 'Ended' 值。
Ended 4/6/2020 1:14:20 PM
Attempt 4/6/2020 1:15:20 PM
Connect 4/6/2020 1:15:21 PM
Ended 3/31/2020 4:17:43 PM
期望输出:
Action Time
Connect 4/6/2020 1:11:41 PM
Ended 4/6/2020 2:05:18 PM
Connect 3/31/2020 11:00:08 AM
Ended 3/31/2020 11:14:54 AM
输出模式应始终遵循 'Connect' 和 'Ended'
输出:
structure(list(Action = structure(c(2L, 3L, 1L, 2L, 3L, 2L, 3L,
3L), .Label = c("Attempt", "Connect", "Ended"), class = "factor"),
Time = structure(c(4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L), .Label = c("3/31/2020 11:00:08 AM",
"3/31/2020 11:14:54 AM", "3/31/2020 4:17:43 PM", "4/6/2020 1:11:41 PM",
"4/6/2020 1:14:20 PM", "4/6/2020 1:15:20 PM", "4/6/2020 1:15:21 PM",
"4/6/2020 2:05:18 PM"), class = "factor")), class = "data.frame", row.names = c(NA,
-8L))
这是我试过的:
我在想我应该使用循环,但不确定如何构造它。感谢任何帮助。
library(lubridate)
if (value <= 60) {
print("")
} else {
Expr2
}
这是 dplyr
、data.table
和 lubridate
的方法。
首先,我们计算数据集中已经过去的累计时间。接下来,我们使用 cumsum
将数据集分解为间隔大于 60 秒的连接尝试。然后,我们按连接尝试分组,并且仅保留在第一次连接尝试后超过 60 秒后发生的非连接事件。然后借用@akrun 的方法,过滤重复的连续动作。
library(lubridate)
library(dplyr)
library(data.table)
df %>%
mutate(Time = mdy_hms(Time)) %>%
dplyr::arrange(Time) %>%
mutate(CumTime = cumsum(time_length(Time - dplyr::lag(Time, 1L,default = as.integer(min(mdy_hms(df$Time))))))) %>%
group_by(Action) %>%
mutate(LastConnect = if_else(Action == "Connect", time_length(CumTime - dplyr::lag(CumTime, 1L, 0)), 0)) %>%
ungroup %>%
mutate(ConnectionInterval = cumsum(Action == "Connect" & LastConnect > 60)) %>%
dplyr::select(-LastConnect) %>%
group_by(ConnectionInterval) %>%
mutate(ConnectCumTime = time_length(Time - dplyr::lag(Time, 1L))) %>%
filter(Action == "Connect" | ConnectCumTime > 60 & !duplicated(rleid(Action)))
## A tibble: 6 x 5
## Groups: ConnectionInterval [3]
# Action Time CumTime ConnectionInterval ConnectCumTime
# <fct> <dttm> <dbl> <int> <dbl>
#1 Connect 2020-03-31 11:00:08 0 0 NA
#2 Ended 2020-03-31 11:14:54 886 0 886
#3 Connect 2020-04-06 13:11:41 526293 1 NA
#4 Ended 2020-04-06 13:14:20 526452 1 159
#5 Connect 2020-04-06 13:15:21 526513 2 NA
#6 Ended 2020-04-06 14:05:18 529510 2 2997
我们可以将 'Time' 转换为 Datetime
class,从 lubridate
转换为 mdy_hms
,根据 [=20] 的出现创建分组变量=] in 'Action',得到'Time'个元素的差('Diff'),filter
出差小于等于60的行,然后filter
出 duplicated
相似元素行 'Action'
library(dplyr)
library(lubridate)
library(data.table)
df1 %>%
mutate(Time1 = mdy_hms(Time)) %>%
group_by(grp = cumsum(Action == 'Connect')) %>%
mutate(Diff = difftime(Time1, lag(Time1), unit = 'sec'),
Diff = case_when(any(Diff <=60) ~ 60, TRUE ~ as.numeric(Diff))) %>%
filter(Action == 'Connect'|Diff >60) %>%
ungroup %>%
filter(!duplicated(rleid(Action))) %>%
select(Action, Time)
# A tibble: 4 x 2
# Action Time
# <fct> <fct>
#1 Connect 4/6/2020 1:11:41 PM
#2 Ended 4/6/2020 2:05:18 PM
#3 Connect 3/31/2020 11:00:08 AM
#4 Ended 3/31/2020 11:14:54 AM