如何根据多个条件延迟日期并在发生后重置延迟
How to lag dates based on multiple conditions and reseting the lag after it occurs
我有一个车站维修的数据框。
工作流程是这样的:机械师去一个站点,他们按下一个按钮,记录一个名为 release
的 action
。在他们修好电台后,他们再次按下按钮,现在的动作是 return
.
您可以在下面看到 row 1
和 row 2
是一个已完成的任务,需要 Jane Jetson
10 秒完成。
dt name foo_id foo_role bikeId station_name station_id action
1 2019-12-12 13:05:47 Jane Jetson 106337 Mechanic 12345 FooStation 1234.89 Release
2 2019-12-12 13:05:57 Jane Jetson 106337 Mechanic 12345 FooStation 1234.89 Return
3 2019-12-12 13:06:16 John Doe 106338 Mechanic 12345 FooStation 1234.89 Release
4 2019-12-12 13:06:19 John Doe 106338 Mechanic 12345 FooStation 1234.89 Return
5 2019-12-12 13:07:16 John Doe 106338 Mechanic 12345 FooStation 1234.89 Release
6 2019-12-12 14:07:16 John Doe 106338 Mechanic 56789 Some Station 4567.12 Release
我想要发生的事情:
- 我想知道每个
mechanic
使用action
Release
然后跟随Return
修复站需要多长时间。
- 如果
Release
没有 Return
,我想用 Sys.time()
从 dt
中减去它。你会看到 row 5
和 row 6
我这样做了:(我不是 100% 确定我需要之前的操作,但我包括以防万一。)
library(dplyr)
library(tidyr)
foo = arrange(foo, foo_id, name, foo_role, bikeId, station_id) %>%
group_by(foo_id,name, foo_role, bikeId, station_name,station_id) %>%
mutate(prev_dt = lag(dt, order_by = foo_id),
prev_action = lag(action, order_by=foo_id, default = 'NaN'))
foo$timediffsecs = as.numeric(difftime(foo$dt,foo$prev_dt,units='secs'))
> foo
# A tibble: 6 x 11
# Groups: foo_id, name, foo_role, bikeId, station_name, station_id [3]
dt name foo_id foo_role bikeId station_name station_id action prev_dt prev_action timediffsecs
<dttm> <fct> <int> <fct> <int> <fct> <dbl> <chr> <dttm> <chr> <dbl>
1 2019-12-12 13:05:47 Jane Jetson 106337 Mechanic 12345 FooStation 1235. Release NA NaN NA
2 2019-12-12 13:05:57 Jane Jetson 106337 Mechanic 12345 FooStation 1235. Return 2019-12-12 13:05:47 Release 10
3 2019-12-12 13:06:16 John Doe 106338 Mechanic 12345 FooStation 1235. Release NA NaN NA
4 2019-12-12 13:06:19 John Doe 106338 Mechanic 12345 FooStation 1235. Return 2019-12-12 13:06:16 Release 3
5 2019-12-12 13:07:16 John Doe 106338 Mechanic 12345 FooStation 1235. Release 2019-12-12 13:06:19 Return 57
6 2019-12-12 14:07:16 John Doe 106338 Mechanic 56789 Some Station 4567. Release NA NaN NA
问题:
row 5
是一个新的循环,因为 action
Release
和 Return
之前发生过,但 timediffsecs
记录了 57 秒。 row 5
Prev_dt
和 prev_action
应该是 NA
和 timediffsecs
= Sys.time() - dt
.
row 6
应该有 timediffsecs
= Sys.time() - dt
我认为可行的方法:
我将 prev_action
NA 更改为 NaN 这样我就可以做一些 if else 语句,但我不太确定如何为此构造一个。我想将 prev_dt
中的 NA 更改为默认 dt
,但这样做有问题。我想尝试这个的原因是我可以使用条件语句,但如果不需要,则无需更改 NA。
tl;dr: 我希望 timediffsecs
记录正确的秒数。 row 5
和 row 6
有问题。 row 5
应该是 Sys.time() - dt
。 row 6
我要returnSys.time() - dt
数据:
structure(list(dt = structure(c(1576173947, 1576173957, 1576173976,
1576173979, 1576174036, 1576177636), class = c("POSIXct", "POSIXt"
), tzone = ""), name = structure(c(1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Jane Jetson",
"John Doe"), class = "factor"), foo_id = c(106337L, 106337L,
106338L, 106338L, 106338L, 106338L), foo_role = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "Mechanic", class = "factor"),
bikeId = c(12345L, 12345L, 12345L, 12345L, 12345L, 56789L
), station_name = structure(c(1L, 1L, 1L, 1L, 1L, 2L), .Label = c("FooStation",
"Some Station"), class = "factor"), station_id = c(1234.89,
1234.89, 1234.89, 1234.89, 1234.89, 4567.12), action = c("Release",
"Return", "Release", "Return", "Release", "Release")), row.names = c(NA,
-6L), class = "data.frame")
使用 dplyr
的一种方法可能是为每个 name
(或 foo_id
)创建组,并且每次 "Release"
发生。在这个组中,如果存在 'Return'
,我们计算 'Return'
和 'Release'
的时间差,或者计算 'Release'
和当前时间之间的差值。
library(dplyr)
df %>%
group_by(name, group = cumsum(action == "Release")) %>%
mutate(timediffsecs = if (any(action == 'Return'))
dt[action == 'Return'] - dt[action == 'Release'] else Sys.time() - dt,
#If we want to replace Release values with NA
timediffsecs = replace(timediffsecs, n() > 1 & action == 'Release', NA))
# dt name foo_id foo_role bikeId station_name station_id action group timediffsecs
# <dttm> <fct> <int> <fct> <int> <fct> <dbl> <chr> <int> <drtn>
#1 2019-12-13 02:05:47 Jane Jetson 106337 Mechanic 12345 FooStation 1235. Release 1 NA secs
#2 2019-12-13 02:05:57 Jane Jetson 106337 Mechanic 12345 FooStation 1235. Return 1 10 secs
#3 2019-12-13 02:06:16 John Doe 106338 Mechanic 12345 FooStation 1235. Release 2 NA secs
#4 2019-12-13 02:06:19 John Doe 106338 Mechanic 12345 FooStation 1235. Return 2 3 secs
#5 2019-12-13 02:07:16 John Doe 106338 Mechanic 12345 FooStation 1235. Release 3 472603 secs
#6 2019-12-13 03:07:16 John Doe 106338 Mechanic 56789 Some Station 4567. Release 4 469003 secs
虽然这适用于给定的示例,但可能需要根据数据进行一些调整。
我有一个车站维修的数据框。
工作流程是这样的:机械师去一个站点,他们按下一个按钮,记录一个名为 release
的 action
。在他们修好电台后,他们再次按下按钮,现在的动作是 return
.
您可以在下面看到 row 1
和 row 2
是一个已完成的任务,需要 Jane Jetson
10 秒完成。
dt name foo_id foo_role bikeId station_name station_id action
1 2019-12-12 13:05:47 Jane Jetson 106337 Mechanic 12345 FooStation 1234.89 Release
2 2019-12-12 13:05:57 Jane Jetson 106337 Mechanic 12345 FooStation 1234.89 Return
3 2019-12-12 13:06:16 John Doe 106338 Mechanic 12345 FooStation 1234.89 Release
4 2019-12-12 13:06:19 John Doe 106338 Mechanic 12345 FooStation 1234.89 Return
5 2019-12-12 13:07:16 John Doe 106338 Mechanic 12345 FooStation 1234.89 Release
6 2019-12-12 14:07:16 John Doe 106338 Mechanic 56789 Some Station 4567.12 Release
我想要发生的事情:
- 我想知道每个
mechanic
使用action
Release
然后跟随Return
修复站需要多长时间。 - 如果
Release
没有Return
,我想用Sys.time()
从dt
中减去它。你会看到row 5
和row 6
我这样做了:(我不是 100% 确定我需要之前的操作,但我包括以防万一。)
library(dplyr)
library(tidyr)
foo = arrange(foo, foo_id, name, foo_role, bikeId, station_id) %>%
group_by(foo_id,name, foo_role, bikeId, station_name,station_id) %>%
mutate(prev_dt = lag(dt, order_by = foo_id),
prev_action = lag(action, order_by=foo_id, default = 'NaN'))
foo$timediffsecs = as.numeric(difftime(foo$dt,foo$prev_dt,units='secs'))
> foo
# A tibble: 6 x 11
# Groups: foo_id, name, foo_role, bikeId, station_name, station_id [3]
dt name foo_id foo_role bikeId station_name station_id action prev_dt prev_action timediffsecs
<dttm> <fct> <int> <fct> <int> <fct> <dbl> <chr> <dttm> <chr> <dbl>
1 2019-12-12 13:05:47 Jane Jetson 106337 Mechanic 12345 FooStation 1235. Release NA NaN NA
2 2019-12-12 13:05:57 Jane Jetson 106337 Mechanic 12345 FooStation 1235. Return 2019-12-12 13:05:47 Release 10
3 2019-12-12 13:06:16 John Doe 106338 Mechanic 12345 FooStation 1235. Release NA NaN NA
4 2019-12-12 13:06:19 John Doe 106338 Mechanic 12345 FooStation 1235. Return 2019-12-12 13:06:16 Release 3
5 2019-12-12 13:07:16 John Doe 106338 Mechanic 12345 FooStation 1235. Release 2019-12-12 13:06:19 Return 57
6 2019-12-12 14:07:16 John Doe 106338 Mechanic 56789 Some Station 4567. Release NA NaN NA
问题:
row 5
是一个新的循环,因为action
Release
和Return
之前发生过,但timediffsecs
记录了 57 秒。row 5
Prev_dt
和prev_action
应该是NA
和timediffsecs
=Sys.time() - dt
.row 6
应该有timediffsecs
=Sys.time() - dt
我认为可行的方法:
我将 prev_action
NA 更改为 NaN 这样我就可以做一些 if else 语句,但我不太确定如何为此构造一个。我想将 prev_dt
中的 NA 更改为默认 dt
,但这样做有问题。我想尝试这个的原因是我可以使用条件语句,但如果不需要,则无需更改 NA。
tl;dr: 我希望 timediffsecs
记录正确的秒数。 row 5
和 row 6
有问题。 row 5
应该是 Sys.time() - dt
。 row 6
我要returnSys.time() - dt
数据:
structure(list(dt = structure(c(1576173947, 1576173957, 1576173976,
1576173979, 1576174036, 1576177636), class = c("POSIXct", "POSIXt"
), tzone = ""), name = structure(c(1L, 1L, 2L, 2L, 2L, 2L), .Label = c("Jane Jetson",
"John Doe"), class = "factor"), foo_id = c(106337L, 106337L,
106338L, 106338L, 106338L, 106338L), foo_role = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "Mechanic", class = "factor"),
bikeId = c(12345L, 12345L, 12345L, 12345L, 12345L, 56789L
), station_name = structure(c(1L, 1L, 1L, 1L, 1L, 2L), .Label = c("FooStation",
"Some Station"), class = "factor"), station_id = c(1234.89,
1234.89, 1234.89, 1234.89, 1234.89, 4567.12), action = c("Release",
"Return", "Release", "Return", "Release", "Release")), row.names = c(NA,
-6L), class = "data.frame")
使用 dplyr
的一种方法可能是为每个 name
(或 foo_id
)创建组,并且每次 "Release"
发生。在这个组中,如果存在 'Return'
,我们计算 'Return'
和 'Release'
的时间差,或者计算 'Release'
和当前时间之间的差值。
library(dplyr)
df %>%
group_by(name, group = cumsum(action == "Release")) %>%
mutate(timediffsecs = if (any(action == 'Return'))
dt[action == 'Return'] - dt[action == 'Release'] else Sys.time() - dt,
#If we want to replace Release values with NA
timediffsecs = replace(timediffsecs, n() > 1 & action == 'Release', NA))
# dt name foo_id foo_role bikeId station_name station_id action group timediffsecs
# <dttm> <fct> <int> <fct> <int> <fct> <dbl> <chr> <int> <drtn>
#1 2019-12-13 02:05:47 Jane Jetson 106337 Mechanic 12345 FooStation 1235. Release 1 NA secs
#2 2019-12-13 02:05:57 Jane Jetson 106337 Mechanic 12345 FooStation 1235. Return 1 10 secs
#3 2019-12-13 02:06:16 John Doe 106338 Mechanic 12345 FooStation 1235. Release 2 NA secs
#4 2019-12-13 02:06:19 John Doe 106338 Mechanic 12345 FooStation 1235. Return 2 3 secs
#5 2019-12-13 02:07:16 John Doe 106338 Mechanic 12345 FooStation 1235. Release 3 472603 secs
#6 2019-12-13 03:07:16 John Doe 106338 Mechanic 56789 Some Station 4567. Release 4 469003 secs
虽然这适用于给定的示例,但可能需要根据数据进行一些调整。