Check previous row in datetime, if time is greater than a certain value, place in a group and take its duration in seconds (R, dplyr, lubridate)
I have a dataset, df (the full dataset contains more than 4,000 rows):
DATEB
9/9/2019 7:51:58 PM
9/9/2019 7:51:59 PM
9/9/2019 7:51:59 PM
9/9/2019 7:52:00 PM
9/9/2019 7:52:01 PM
9/9/2019 7:52:01 PM
9/9/2019 7:52:02 PM
9/9/2019 7:52:03 PM
9/9/2019 7:54:00 PM
9/9/2019 7:54:02 PM
9/10/2019 8:00:00 PM
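I would like to start a new group whenever a row's time is not within 10 seconds of the previous row, and then get the duration of each group.
Desired output: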
Group Duration
a 5 sec
b 2 sec
c 0 sec
dput:
structure(list(DATEB = structure(c(2L, 3L, 3L, 4L, 5L, 5L, 6L,
7L, 8L, 9L, 1L), .Label = c(" 9/10/2019 8:00:00 PM", " 9/9/2019 7:51:58 PM",
" 9/9/2019 7:51:59 PM", " 9/9/2019 7:52:00 PM", " 9/9/2019 7:52:01 PM",
" 9/9/2019 7:52:02 PM", " 9/9/2019 7:52:03 PM", " 9/9/2019 7:54:00 PM",
" 9/9/2019 7:54:02 PM"), class = "factor")), class = "data.frame", row.names = c(NA,
-11L))
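I have tried the code below, which works well, except that I want the duration in seconds only. The code below reports the duration in a mix of minutes and seconds.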
library(dplyr)
library(lubridate)

# Parse the text timestamps into date-times
df2 <- mutate(df,
              DATEB = lubridate::mdy_hms(DATEB))
df2$time_since_last_row <- df2$DATEB - lag(df2$DATEB)
df2$time_since_last_row[[1]] <- 0 # replace the first NA
df2$group_10s <- 0
# Start a new group whenever the gap to the previous row exceeds 10 seconds
for (i in 2:nrow(df2))
{
  if (df2$time_since_last_row[[i]] > seconds(10))
    df2$group_10s[[i]] <- df2$group_10s[[i-1]] + 1
  else
    df2$group_10s[[i]] <- df2$group_10s[[i-1]]
}
df3 <- group_by(df2,
                group_10s) %>%
  summarise(volume_in_group = n(),
            min_DATEB = min(DATEB),
            max_DATEB = max(DATEB),
            group_duration = max_DATEB - min_DATEB)
#nirgrahamuk-R community
Any suggestions are welcome.
I have actually done something similar before. You can modify the last block:
df3 <- group_by(df2, group_10s) %>%
  summarise(
    volume_in_group = n(),
    min_DATEB = min(DATEB),
    max_DATEB = max(DATEB),
    # convert the difftime to a plain number of seconds
    group_duration = as.numeric(max_DATEB - min_DATEB, units = "secs")
  )
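As a side note on why the original summary mixes units: subtracting two date-times gives a difftime whose unit R picks automatically, so longer groups come out in minutes. The following is a minimal base R sketch of that behaviour; the two timestamps and the names start, end, auto and secs are made up for illustration.

```r
# Two illustrative timestamps 122 seconds apart (hypothetical values).
start <- as.POSIXct("2019-09-09 19:52:00", tz = "UTC")
end   <- as.POSIXct("2019-09-09 19:54:02", tz = "UTC")

# Plain subtraction lets R choose the unit; for a gap of about two
# minutes it chooses minutes.
auto <- end - start
print(units(auto))

# Asking for seconds explicitly returns a plain number of seconds.
secs <- as.numeric(auto, units = "secs")
print(secs)
```

Converting with units = "secs" also means every group ends up as an ordinary numeric column, which is easier to sort or sum later.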
Here is what I would do:
gap_threshold <- 10
df %>%
  mutate(DATEB = lubridate::mdy_hms(DATEB),
         gap = c(0, diff(DATEB))) %>%
  group_by(grp = cumsum(gap > gap_threshold)) %>%
  summarise(begin = min(DATEB), end = max(DATEB),
            duration = difftime(end, begin, units = "secs"))
# A tibble: 3 x 4
    grp begin               end                 duration
  <int> <dttm>              <dttm>              <drtn>
1     0 2019-09-09 19:51:58 2019-09-09 19:52:03 5 secs
2     1 2019-09-09 19:54:00 2019-09-09 19:54:02 2 secs
3     2 2019-09-10 20:00:00 2019-09-10 20:00:00 0 secs
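Note that the output has more columns than requested, purely for demonstration.
Whenever the gap between two successive rows exceeds the given gap_threshold, the group counter grp advances by one. Finally, min() and max() are taken for each group, and the duration is computed from them.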
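The grouping step can be seen in isolation with a small base R sketch. It uses a hand-picked subset of the question's timestamps, and the names x and duration are illustrative only. As an assumption beyond the answer, it also pins the gap to seconds with as.numeric(..., units = "secs"), because diff() on date-times chooses its unit automatically and the comparison against 10 only makes sense in seconds.

```r
# Illustrative timestamps (a subset of the question's data), parsed as UTC.
x <- as.POSIXct(c("2019-09-09 19:51:58", "2019-09-09 19:52:03",
                  "2019-09-09 19:54:00", "2019-09-09 19:54:02",
                  "2019-09-10 20:00:00"), tz = "UTC")
gap_threshold <- 10

# Gap to the previous row in seconds; the first row has no predecessor,
# so its gap is set to 0.
gap <- c(0, as.numeric(diff(x), units = "secs"))

# gap > gap_threshold is TRUE at the first row of each new group;
# cumsum() turns those marks into a running group id: 0 0 1 1 2.
grp <- cumsum(gap > gap_threshold)
print(grp)

# Duration of each group in seconds, from its earliest to latest timestamp.
duration <- tapply(as.numeric(x), grp, function(t) max(t) - min(t))
print(duration)
```

Because cumsum() is vectorised, this replaces the explicit for loop from the question without changing which rows land in which group.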