Fill Dates Between Two Event Timestamps Formatted as POSIXct
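I have a data frame containing start and end timestamps for events. I would like to create a sequence of timestamps that fills in the time between these events. Here is my data structure: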
dat <- structure(list(event_id = 1:2,
start_time = structure(c(1617346800,1617348000),
class = c("POSIXct", "POSIXt"), tzone = "UTC"),
end_time = structure(c(1617347400, 1617348300),
class = c("POSIXct", "POSIXt"), tzone = "UTC")),
class = "data.frame", row.names = c(NA, -2L))
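What I am hoping to do is lengthen the data frame so that it has as many rows as there are minutes between the two events, that is, one row per minute from the first start time to the last end time. So you would end up with something like this: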
final <- structure(list(event_id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2L, 2L, 2L, 2L, 2L,
2L), start_time = structure(c(1617346800, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1617348000,
NA, NA, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
end_time = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 1617347400, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, 1617348300), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
record_time = c("2/4/2021 7:00", "2/4/2021 7:01", "2/4/2021 7:02",
"2/4/2021 7:03", "2/4/2021 7:04", "2/4/2021 7:05", "2/4/2021 7:06",
"2/4/2021 7:07", "2/4/2021 7:08", "2/4/2021 7:09", "2/4/2021 7:10",
"2/4/2021 7:11", "2/4/2021 7:12", "2/4/2021 7:13", "2/4/2021 7:14",
"2/4/2021 7:15", "2/4/2021 7:16", "2/4/2021 7:17", "2/4/2021 7:18",
"2/4/2021 7:19", "2/4/2021 7:20", "2/4/2021 7:21", "2/4/2021 7:22",
"2/4/2021 7:23", "2/4/2021 7:24", "2/4/2021 7:25")), class = "data.frame", row.names = c(NA,
-26L))
What I have come up with so far is to create the full sequence of timestamps:
timeline <- as.POSIXct(
seq.POSIXt(from = min(dat$start_time),
to = max(dat$end_time), by = "min"))
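From here, I am stuck on how to join the two. Using a for loop I can make a list of sequences, but at that point I am still not sure how to put the two things together (also, I am sure this could be done better, but I am not good at purrr yet).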
event_timelines <- list()
for (row in 1:nrow(dat)) {
event_timelines[[row]] <- seq.POSIXt(from = dat[row,]$start_time,
to = dat[row,]$end_time,
by = "min")
}
Thanks!
Try this dplyr approach (using fuzzyjoin).
library(dplyr)
timerange <- range(unlist(dat[,c("start_time", "end_time")]))
attributes(timerange) <- attributes(dat$start_time)
tibble(record_time = seq(timerange[1], timerange[2], by = "min")) %>%
fuzzyjoin::fuzzy_full_join(
dat,
by = c("record_time" = "start_time", "record_time" = "end_time"),
match_fun = list(`>=`, `<=`)
) %>%
group_by(event_id) %>%
mutate(
start_time = start_time[c(1, rep(NA, n()-1))],
end_time = end_time[c(rep(NA, n()-1), 1)]
) %>%
arrange(record_time) %>%
ungroup() %>%
print(n=99)
# # A tibble: 26 x 4
# record_time event_id start_time end_time
# <dttm> <int> <dttm> <dttm>
# 1 2021-04-02 07:00:00 1 2021-04-02 07:00:00 NA
# 2 2021-04-02 07:01:00 1 NA NA
# 3 2021-04-02 07:02:00 1 NA NA
# 4 2021-04-02 07:03:00 1 NA NA
# 5 2021-04-02 07:04:00 1 NA NA
# 6 2021-04-02 07:05:00 1 NA NA
# 7 2021-04-02 07:06:00 1 NA NA
# 8 2021-04-02 07:07:00 1 NA NA
# 9 2021-04-02 07:08:00 1 NA NA
# 10 2021-04-02 07:09:00 1 NA NA
# 11 2021-04-02 07:10:00 1 NA 2021-04-02 07:10:00
# 12 2021-04-02 07:11:00 NA NA NA
# 13 2021-04-02 07:12:00 NA NA NA
# 14 2021-04-02 07:13:00 NA NA NA
# 15 2021-04-02 07:14:00 NA NA NA
# 16 2021-04-02 07:15:00 NA NA NA
# 17 2021-04-02 07:16:00 NA NA NA
# 18 2021-04-02 07:17:00 NA NA NA
# 19 2021-04-02 07:18:00 NA NA NA
# 20 2021-04-02 07:19:00 NA NA NA
# 21 2021-04-02 07:20:00 2 2021-04-02 07:20:00 NA
# 22 2021-04-02 07:21:00 2 NA NA
# 23 2021-04-02 07:22:00 2 NA NA
# 24 2021-04-02 07:23:00 2 NA NA
# 25 2021-04-02 07:24:00 2 NA NA
# 26 2021-04-02 07:25:00 2 NA 2021-04-02 07:25:00
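The reason I chose fuzzyjoin is that POSIXct values are floating-point, numeric-like values with some attributes, and (due to general computing challenges, see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754) floating-point equality is not always guaranteed. In this case, fuzzyjoin is doing a range-based join: it effectively tests whether record_time lies between start_time and end_time.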
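As a small illustration of the point above, here is a sketch with made-up values (not part of the answer's pipeline). Two POSIXct values that should denote the same instant fail an exact comparison, while a tolerance test and a range test behave as expected. The variable names and the 1e-6 second tolerance are assumptions chosen for the example.

```r
# Two POSIXct values that should denote the same instant, built two ways.
a <- structure(0.3, class = c("POSIXct", "POSIXt"), tzone = "UTC")
b <- structure(0.1 + 0.2, class = c("POSIXct", "POSIXt"), tzone = "UTC")

# Exact comparison fails: 0.1 + 0.2 is not exactly 0.3 in binary floating point.
exact_equal <- a == b

# A tolerance-based comparison succeeds (1e-6 seconds is an arbitrary choice).
close_enough <- abs(as.numeric(a) - as.numeric(b)) < 1e-6

# The kind of range test the join performs, using the first event from dat.
start_time <- structure(1617346800, class = c("POSIXct", "POSIXt"), tzone = "UTC")
end_time <- structure(1617347400, class = c("POSIXct", "POSIXt"), tzone = "UTC")
record_time <- seq(start_time, end_time, by = "min")
in_window <- record_time >= start_time & record_time <= end_time
```

Here exact_equal is FALSE and close_enough is TRUE, and every minute of the event falls inside the window without relying on exact equality of the endpoints.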
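An approach using dplyr and tidyr:
Use complete to expand the sequence, fill to carry the previous non-NA values forward into the new NA rows, and replace to adjust the output so that it resembles the expected output.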
library(dplyr)
library(tidyr)
dat %>%
mutate(record_time = start_time) %>%
complete(record_time = seq(min(start_time), max(end_time), by = '1 min')) %>%
fill(everything()) %>%
mutate(event_id = replace(event_id, record_time > end_time, NA),
across(c(start_time, end_time), ~replace(., . != record_time, NA)))
# record_time event_id start_time end_time
#1 2021-04-02 07:00:00 1 2021-04-02 07:00:00 <NA>
#2 2021-04-02 07:01:00 1 <NA> <NA>
#3 2021-04-02 07:02:00 1 <NA> <NA>
#4 2021-04-02 07:03:00 1 <NA> <NA>
#5 2021-04-02 07:04:00 1 <NA> <NA>
#6 2021-04-02 07:05:00 1 <NA> <NA>
#7 2021-04-02 07:06:00 1 <NA> <NA>
#8 2021-04-02 07:07:00 1 <NA> <NA>
#9 2021-04-02 07:08:00 1 <NA> <NA>
#10 2021-04-02 07:09:00 1 <NA> <NA>
#11 2021-04-02 07:10:00 1 <NA> 2021-04-02 07:10:00
#12 2021-04-02 07:11:00 NA <NA> <NA>
#13 2021-04-02 07:12:00 NA <NA> <NA>
#14 2021-04-02 07:13:00 NA <NA> <NA>
#15 2021-04-02 07:14:00 NA <NA> <NA>
#16 2021-04-02 07:15:00 NA <NA> <NA>
#17 2021-04-02 07:16:00 NA <NA> <NA>
#18 2021-04-02 07:17:00 NA <NA> <NA>
#19 2021-04-02 07:18:00 NA <NA> <NA>
#20 2021-04-02 07:19:00 NA <NA> <NA>
#21 2021-04-02 07:20:00 2 2021-04-02 07:20:00 <NA>
#22 2021-04-02 07:21:00 2 <NA> <NA>
#23 2021-04-02 07:22:00 2 <NA> <NA>
#24 2021-04-02 07:23:00 2 <NA> <NA>
#25 2021-04-02 07:24:00 2 <NA> <NA>
#26 2021-04-02 07:25:00 2 <NA> 2021-04-02 07:25:00
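To see what each step contributes, here is a toy sketch on a tiny data frame. The names toy, t and id are hypothetical and stand in for dat, record_time and event_id, and the cutoff of 4 is just an example threshold.

```r
library(tidyr)

# Two observed rows at t = 1 and t = 4; the other time points are missing.
toy <- data.frame(t = c(1L, 4L), id = c("a", "b"), stringsAsFactors = FALSE)

# complete() adds a row for every value of t in 1:5, with NA in id for new rows.
expanded <- complete(toy, t = 1:5)

# fill() carries the last non-NA id downward into those NA rows.
filled <- fill(expanded, id)

# replace() then blanks out values that should not have been carried forward,
# here everything after t = 4.
adjusted <- replace(filled$id, filled$t > 4L, NA)
```

After fill, id reads a, a, a, b, b, and after replace the last entry is NA again, which mirrors how the answer first fills everything and then removes what falls outside an event.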
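You can stick with your nice seq.POSIXt approach. Use by to follow a split-apply-combine approach: compute the sequence for each event ID, triplicate the result, and use match to set the additional values in the old columns to NA.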
res <- do.call(rbind, by(dat, dat$event_id, function(x) {
s1 <- s2 <- s3 <- do.call(seq, c(as.list(unname(x[-1])), "min"))
s1[-match(x[2], s1)] <- NA
s2[-match(x[3], s2)] <- NA
data.frame(event_id=as.integer(x[1]), start_time=s1, end_time=s2, record_time=s3)
}))
res
# event_id start_time end_time record_time
# 1.1 1 2021-04-02 07:00:00 <NA> 2021-04-02 07:00:00
# 1.2 1 <NA> <NA> 2021-04-02 07:01:00
# 1.3 1 <NA> <NA> 2021-04-02 07:02:00
# 1.4 1 <NA> <NA> 2021-04-02 07:03:00
# 1.5 1 <NA> <NA> 2021-04-02 07:04:00
# 1.6 1 <NA> <NA> 2021-04-02 07:05:00
# 1.7 1 <NA> <NA> 2021-04-02 07:06:00
# 1.8 1 <NA> <NA> 2021-04-02 07:07:00
# 1.9 1 <NA> <NA> 2021-04-02 07:08:00
# 1.10 1 <NA> <NA> 2021-04-02 07:09:00
# 1.11 1 <NA> 2021-04-02 07:10:00 2021-04-02 07:10:00
# 2.1 2 2021-04-02 07:20:00 <NA> 2021-04-02 07:20:00
# 2.2 2 <NA> <NA> 2021-04-02 07:21:00
# 2.3 2 <NA> <NA> 2021-04-02 07:22:00
# 2.4 2 <NA> <NA> 2021-04-02 07:23:00
# 2.5 2 <NA> <NA> 2021-04-02 07:24:00
# 2.6 2 <NA> 2021-04-02 07:25:00 2021-04-02 07:25:00
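The result above contains only the minutes that fall inside an event. If the gap rows from the expected output are wanted as well, one option is to merge the per-event sequences with the full timeline. This is a minimal base R sketch, not the answerer's code. It assumes all timestamps are whole seconds, so that exact matching on record_time is safe, and the names per_event and full are made up for the example.

```r
dat <- structure(list(event_id = 1:2,
  start_time = structure(c(1617346800, 1617348000),
    class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  end_time = structure(c(1617347400, 1617348300),
    class = c("POSIXct", "POSIXt"), tzone = "UTC")),
  class = "data.frame", row.names = c(NA, -2L))

# One row per minute inside each event, tagged with its event_id.
per_event <- do.call(rbind, lapply(seq_len(nrow(dat)), function(i) {
  data.frame(event_id = dat$event_id[i],
             record_time = seq(dat$start_time[i], dat$end_time[i], by = "min"))
}))

# Full minute-by-minute timeline from the first start to the last end.
timeline <- data.frame(
  record_time = seq(min(dat$start_time), max(dat$end_time), by = "min"))

# Left join: minutes that belong to no event get event_id = NA.
full <- merge(timeline, per_event, by = "record_time", all.x = TRUE)
```

With the sample data, full has one row per minute of the timeline, and the minutes between the two events carry NA in event_id.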