Calculate time difference between two events while disregarding unmatched events
I have a dataset with the following structure:
structure(list(id = c(43956L, 46640L, 71548L, 71548L, 71548L,
72029L, 72029L, 74558L, 74558L, 100596L, 100596L, 100596L, 104630L,
104630L, 104630L, 104630L, 104630L, 104630L, 104630L, 104630L
), event = c("LOGIN", "LOGIN", "LOGIN", "LOGIN", "LOGOUT", "LOGIN",
"LOGOUT", "LOGIN", "LOGOUT", "LOGIN", "LOGOUT", "LOGIN", "LOGIN",
"LOGIN", "LOGIN", "LOGIN", "LOGIN", "LOGOUT", "LOGIN", "LOGOUT"
), timestamp = c("2017-03-27 09:19:29", "2016-06-10 00:09:08",
"2016-01-27 12:00:25", "2016-06-20 11:34:29", "2016-06-20 11:35:44",
"2016-12-28 10:43:25", "2016-12-28 10:56:30", "2016-10-15 15:08:39",
"2016-10-15 15:10:06", "2016-03-09 14:30:48", "2016-03-09 14:31:10",
"2017-04-03 10:36:54", "2016-01-11 16:52:08", "2016-02-03 14:40:32",
"2016-03-30 12:34:56", "2016-05-26 13:14:25", "2016-08-22 15:20:02",
"2016-08-22 15:21:53", "2016-08-22 15:22:23", "2016-08-22 15:23:08"
)), .Names = c("id", "event", "timestamp"), row.names = c(5447L,
5446L, 5443L, 5444L, 5445L, 5441L, 5442L, 5439L, 5440L, 5436L,
5437L, 5438L, 5425L, 5426L, 5427L, 5428L, 5429L, 5430L, 5431L,
5432L), class = "data.frame")
id event timestamp
5447 43956 LOGIN 2017-03-27 09:19:29
5446 46640 LOGIN 2016-06-10 00:09:08
5443 71548 LOGIN 2016-01-27 12:00:25
5444 71548 LOGIN 2016-06-20 11:34:29
5445 71548 LOGOUT 2016-06-20 11:35:44
5441 72029 LOGIN 2016-12-28 10:43:25
5442 72029 LOGOUT 2016-12-28 10:56:30
5439 74558 LOGIN 2016-10-15 15:08:39
5440 74558 LOGOUT 2016-10-15 15:10:06
5436 100596 LOGIN 2016-03-09 14:30:48
5437 100596 LOGOUT 2016-03-09 14:31:10
5438 100596 LOGIN 2017-04-03 10:36:54
5425 104630 LOGIN 2016-01-11 16:52:08
5426 104630 LOGIN 2016-02-03 14:40:32
5427 104630 LOGIN 2016-03-30 12:34:56
5428 104630 LOGIN 2016-05-26 13:14:25
5429 104630 LOGIN 2016-08-22 15:20:02
5430 104630 LOGOUT 2016-08-22 15:21:53
5431 104630 LOGIN 2016-08-22 15:22:23
5432 104630 LOGOUT 2016-08-22 15:23:08
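I want to calculate the time difference between LOGIN and LOGOUT (session duration) as well as between LOGOUT and LOGIN (session interval). Unfortunately, I have LOGIN events that have no matching LOGOUT event.
The correct LOGOUT event always follows its corresponding LOGIN event (because I ordered the data frame by id and timestamp). I tried adapting an existing solution, but had no luck. I also tried creating an event identifier, but since I could not find a way to make the numbering of the LOGOUT events match the numbering of the LOGIN events, I am not sure how useful such an identifier would be: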
df$eventNum <- as.numeric(ave(as.character(df$id), df$id, as.character(df$event), FUN = seq_along))
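To illustrate the numbering problem, here is a minimal sketch on the events of id 104630 only (the object name one_user is made up for this example). The LOGIN events are numbered 1 to 6 and the LOGOUT events 1 and 2, so the first LOGOUT carries number 1 although it follows LOGIN number 5:

```r
# Events for id 104630 from the data above, already ordered by timestamp
one_user <- data.frame(
  id = rep(104630L, 8),
  event = c("LOGIN", "LOGIN", "LOGIN", "LOGIN", "LOGIN", "LOGOUT", "LOGIN", "LOGOUT"),
  stringsAsFactors = FALSE
)

# Same numbering as above: a running counter per id and event type
one_user$eventNum <- as.numeric(
  ave(as.character(one_user$id), one_user$id, one_user$event, FUN = seq_along)
)

print(one_user$eventNum)  # 1 2 3 4 5 1 6 2
```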
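Assuming that any user stays logged in indefinitely until they log out, it looks like the data can be ordered in such a way that a simple "lag" function does the job.
Using the dplyr library, and assuming your data frame is called "df" and you have already converted timestamp to a date-time format such as POSIXct: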
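library(dplyr)

df %>%
  arrange(id, timestamp) %>%
  group_by(id, event) %>%
  mutate(rank = dense_rank(timestamp)) %>%  # n-th LOGIN / n-th LOGOUT per user
  ungroup() %>%
  arrange(id, rank, timestamp) %>%
  group_by(id) %>%
  # ifelse() drops the difftime class, so duration is a plain number whose
  # unit (secs, mins, ...) was chosen automatically within each id group
  mutate(duration = ifelse(event == "LOGOUT", timestamp - lag(timestamp), NA))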
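The pipeline assumes that timestamp is already POSIXct. A minimal sketch of that conversion in base R, on a two-row sample taken from the question; the UTC time zone and the names df_sample and session_secs are assumptions made for illustration:

```r
# Two rows in the same shape as the question's data (character timestamps)
df_sample <- data.frame(
  id = c(72029L, 72029L),
  event = c("LOGIN", "LOGOUT"),
  timestamp = c("2016-12-28 10:43:25", "2016-12-28 10:56:30"),
  stringsAsFactors = FALSE
)

# Convert to POSIXct; tz = "UTC" is an assumption, use the zone the log was recorded in
df_sample$timestamp <- as.POSIXct(df_sample$timestamp,
                                  format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Asking difftime() for explicit units avoids automatically chosen units
session_secs <- as.numeric(
  difftime(df_sample$timestamp[2], df_sample$timestamp[1], units = "secs")
)
print(session_secs)  # 785
```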
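Line by line.
First, we sort the data by "id" and "timestamp", then group by "id" and "event" to assign a rank to the login and logout events. A user's first login gets "rank" 1, and that user's first logout also gets "rank" 1.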
df %>%
  arrange(id, timestamp) %>%
  group_by(id, event) %>%
  mutate(rank = dense_rank(timestamp))
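As a small illustration of the ranking step, here is a sketch on the three rows of id 71548 (the object names user and ranked, and the UTC time zone, are assumptions for this example). Note that the only LOGOUT gets rank 1 and is therefore paired with the first LOGIN from January, not with the LOGIN that immediately precedes it:

```r
library(dplyr)

# The three events of id 71548 from the question
user <- data.frame(
  id = 71548L,
  event = c("LOGIN", "LOGIN", "LOGOUT"),
  timestamp = as.POSIXct(c("2016-01-27 12:00:25",
                           "2016-06-20 11:34:29",
                           "2016-06-20 11:35:44"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Rank events separately within each (id, event) group
ranked <- user %>%
  arrange(id, timestamp) %>%
  group_by(id, event) %>%
  mutate(rank = dense_rank(timestamp)) %>%
  ungroup()

print(ranked$rank)  # 1 2 1
```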
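Then we drop the grouping and sort again by id, rank and timestamp. This yields a data frame ordered so that each user's login event is followed by the logout event with the same rank, which lets us apply the lag calculation.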
ungroup() %>%
arrange(id, rank,timestamp) %>%
Finally, we group by "id" again and use mutate to compute the lagged timestamp difference for LOGOUT events only.
group_by(id)%>%
mutate(duration = ifelse(event == "LOGOUT", timestamp- lag(timestamp),NA))
This should produce a data frame such as:
id event timestamp rank duration
<int> <chr> <dttm> <int> <dbl>
1 43956 LOGIN 2017-03-27 09:19:29 1 NA
2 46640 LOGIN 2016-06-10 00:09:08 1 NA
3 71548 LOGIN 2016-01-27 12:00:25 1 NA
4 71548 LOGOUT 2016-06-20 11:35:44 1 208715.31667
5 71548 LOGIN 2016-06-20 11:34:29 2 NA
6 72029 LOGIN 2016-12-28 10:43:25 1 NA
7 72029 LOGOUT 2016-12-28 10:56:30 1 13.08333
8 74558 LOGIN 2016-10-15 15:08:39 1 NA
9 74558 LOGOUT 2016-10-15 15:10:06 1 1.45000
10 100596 LOGIN 2016-03-09 14:30:48 1 NA
11 100596 LOGOUT 2016-03-09 14:31:10 1 22.00000
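Here is the approach I would take:
First, I would convert the event variable to an ordered factor, since it makes sense to think of its values that way (i.e. LOGIN < LOGOUT in terms of order), and because it makes comparisons between rows easier:
df$event <- factor(df$event, levels = c("LOGIN", "LOGOUT"), ordered = TRUE)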
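A quick sketch of what the ordered factor buys you, using a made-up three-event sequence (the names ev, current and previous are only for illustration): comparing each event with the one before it tells a LOGIN-to-LOGOUT transition apart from a LOGOUT-to-LOGIN transition.

```r
# Ordered factor with LOGIN < LOGOUT
ev <- factor(c("LOGIN", "LOGOUT", "LOGIN"),
             levels = c("LOGIN", "LOGOUT"), ordered = TRUE)

# Pair each event with the event before it
current  <- ev[-1]           # LOGOUT, LOGIN
previous <- ev[-length(ev)]  # LOGIN,  LOGOUT

print(current > previous)  # TRUE FALSE  (LOGOUT after LOGIN: a session)
print(current < previous)  # FALSE TRUE  (LOGIN after LOGOUT: a gap between sessions)
```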
Then make sure timestamp is in a workable format, which this will provide:
df$timestamp <- lubridate::parse_date_time(df$timestamp, "%Y-%m-%d %H:%M:%S")
You can then conditionally mutate the data.frame by grouping by id and calling mutate with the ifelse function:
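library(dplyr)

df %>% group_by(id) %>% mutate(
  # elapsed time only where the event type differs from the previous row
  timeElapsed = ifelse(event != lag(event), lubridate::seconds_to_period(timestamp - lag(timestamp)), NA),
  # LOGOUT after LOGIN is a session (Duration); LOGIN after LOGOUT is a gap (Interval)
  eventType = ifelse(event > lag(event), 'Duration', ifelse(event < lag(event), 'Interval', NA))
)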
# id event timestamp timeElapsed eventType
# <int> <ord> <dttm> <dbl> <chr>
# 1 43956 LOGIN 2017-03-27 09:19:29 NA <NA>
# 2 46640 LOGIN 2016-06-10 00:09:08 NA <NA>
# 3 71548 LOGIN 2016-01-27 12:00:25 NA <NA>
# 4 71548 LOGIN 2016-06-20 11:34:29 NA <NA>
# 5 71548 LOGOUT 2016-06-20 11:35:44 1.25000 Duration
# 6 72029 LOGIN 2016-12-28 10:43:25 NA <NA>
# 7 72029 LOGOUT 2016-12-28 10:56:30 13.08333 Duration
# 8 74558 LOGIN 2016-10-15 15:08:39 NA <NA>
# 9 74558 LOGOUT 2016-10-15 15:10:06 1.45000 Duration
# 10 100596 LOGIN 2016-03-09 14:30:48 NA <NA>
# 11 100596 LOGOUT 2016-03-09 14:31:10 22.00000 Duration
# 12 100596 LOGIN 2017-04-03 10:36:54 44.00000 Interval
# 13 104630 LOGIN 2016-01-11 16:52:08 NA <NA>
# 14 104630 LOGIN 2016-02-03 14:40:32 NA <NA>
# 15 104630 LOGIN 2016-03-30 12:34:56 NA <NA>
# 16 104630 LOGIN 2016-05-26 13:14:25 NA <NA>
# 17 104630 LOGIN 2016-08-22 15:20:02 NA <NA>
# 18 104630 LOGOUT 2016-08-22 15:21:53 51.00000 Duration
# 19 104630 LOGIN 2016-08-22 15:22:23 30.00000 Interval
# 20 104630 LOGOUT 2016-08-22 15:23:08 45.00000 Duration
Using lubridate::seconds_to_period gives you the time difference in "%d %H %M %S" format.
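Note that in the printed output above timeElapsed is a plain <dbl> column, and some values look like only the seconds part of the difference (for example 51 for the session from 15:20:02 to 15:21:53, which lasts 111 seconds). A minimal sketch that keeps the elapsed time as an explicit number of seconds instead, assuming dplyr is available; the sample rows are taken from id 104630, and df_sample, prev_event, elapsed_secs and the UTC time zone are assumptions made for this example:

```r
library(dplyr)

# Five events of id 104630; the first LOGIN has no matching LOGOUT
df_sample <- data.frame(
  id = rep(104630L, 5),
  event = c("LOGIN", "LOGIN", "LOGOUT", "LOGIN", "LOGOUT"),
  timestamp = as.POSIXct(c("2016-05-26 13:14:25", "2016-08-22 15:20:02",
                           "2016-08-22 15:21:53", "2016-08-22 15:22:23",
                           "2016-08-22 15:23:08"), tz = "UTC"),
  stringsAsFactors = FALSE
)

result <- df_sample %>%
  arrange(id, timestamp) %>%
  group_by(id) %>%
  mutate(
    prev_event = lag(event),
    # difference to the previous row, always in seconds
    elapsed_secs = as.numeric(difftime(timestamp, lag(timestamp), units = "secs")),
    # classify the transition; anything else (e.g. LOGIN after LOGIN) stays NA
    eventType = case_when(
      event == "LOGOUT" & prev_event == "LOGIN"  ~ "Duration",
      event == "LOGIN"  & prev_event == "LOGOUT" ~ "Interval",
      TRUE ~ NA_character_
    ),
    # keep the elapsed time only for classified transitions
    timeElapsed = if_else(is.na(eventType), NA_real_, elapsed_secs)
  ) %>%
  ungroup() %>%
  select(id, event, timestamp, timeElapsed, eventType)

print(result$timeElapsed)  # NA NA 111 30 45
print(result$eventType)    # NA NA "Duration" "Interval" "Duration"
```

A LOGIN that is followed by another LOGIN simply gets NA, so unmatched events are disregarded. If a human-readable form is wanted, lubridate::seconds_to_period can be applied to the numeric column afterwards.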