Using Datetime Vectors to Filter a Dataframe
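I am trying to filter a data frame using vectors of datetime objects (POSIXct). The first vector contains start times and the second contains end times. I only want to return the values that fall within one of the datetime pairs, i.e. between a start time and its corresponding end time.
Here is some example data: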
dat <- structure(list(timestamp = structure(c(1604386800, 1604386801,
1604386802, 1604386803,
1604386804, 1604386805,
1604473200, 1604473201,
1604473202, 1604473203,
1604473204, 1604473205,
1604386800, 1604386801,
1604386802, 1604386803,
1604386804, 1604386805,
1604473200, 1604473201,
1604473202, 1604473203,
1604473204, 1604473205,
1604586800, 1604586801,
1604586802, 1604586803,
1604586804, 1604586805,
1604586800, 1604586801,
1604586802, 1604586803,
1604586804, 1604586805),
class = c("POSIXct", "POSIXt"),
tzone = "UTC"),
process_time = c(0L, 1L, 2L, 3L, 4L,5L, 0L, 1L, 2L,
3L, 4L, 5L, 0L, 1L, 2L, 3L, 4L, 5L,
0L, 1L, 2L,3L, 4L, 5L, 0L, 1L, 2L,
3L, 4L,5L, 0L, 1L, 2L, 3L, 4L,5L),
process_name = c("A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "C", "C", "C", "C",
"C", "C", "A", "A", "A", "A", "A",
"A")),
class = "data.frame", row.names = c(NA, -36L))
Here is what I have so far:
library(lubridate)
library(dplyr)
start_time <- ymd_hms(c("2020-11-03 07:00:01", "2020-11-04 07:00:01",
"2020-11-05 14:33:21"))
end_time <- ymd_hms(c("2020-11-03 07:00:04", "2020-11-04 07:00:04",
"2020-11-05 14:33:24"))
filtered_dat <- dat %>%
group_by(date(timestamp), process_name) %>%
filter(timestamp >= start_time & timestamp <= end_time)
What I'm looking for is to have the first and last values filtered out for each date and process type. The result would look like:
timestamp | process_time | process_name
------------------------------------------------------------
2020-11-03 07:00:01 | 1 | A
2020-11-03 07:00:02 | 2 | A
2020-11-03 07:00:03 | 3 | A
2020-11-03 07:00:04 | 4 | A
2020-11-04 07:00:01 | 1 | A
2020-11-04 07:00:02 | 2 | A
2020-11-04 07:00:03 | 3 | A
2020-11-04 07:00:04 | 4 | A
2020-11-03 07:00:01 | 1 | B
2020-11-03 07:00:02 | 2 | B
2020-11-03 07:00:03 | 3 | B
2020-11-03 07:00:04 | 4 | B
... ... ...
#spacing added for clarity
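What seems to be happening is that the start and end time vectors are being recycled. The first row of the data frame is compared with the first element of the start and end time vectors, the second row with the second element, and so on, so a match occurs only every few rows.
What I need is to compare each row of the data frame against all three datetime pairs before moving on to the next row. purrr seems to be the answer here.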
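The recycling described above can be reproduced in a small sketch. It uses base R only and a six-row stand-in for a single group; the names `stamps`, `recycled`, and `all_pairs` are illustrative and not part of the original code.

```r
# Six consecutive seconds, standing in for one date/process group.
stamps <- as.POSIXct("2020-11-03 07:00:00", tz = "UTC") + 0:5

start_time <- as.POSIXct(c("2020-11-03 07:00:01", "2020-11-04 07:00:01",
                           "2020-11-05 14:33:21"), tz = "UTC")
end_time   <- as.POSIXct(c("2020-11-03 07:00:04", "2020-11-04 07:00:04",
                           "2020-11-05 14:33:24"), tz = "UTC")

# Element-wise comparison: the three bounds are recycled as 1, 2, 3, 1, 2, 3,
# so each row is tested against only one of the three pairs.
recycled <- stamps >= start_time & stamps <= end_time
print(recycled)
# FALSE FALSE FALSE  TRUE FALSE FALSE

# Testing every row against all three pairs instead.
all_pairs <- vapply(
  seq_along(stamps),
  function(i) any(stamps[i] >= start_time & stamps[i] <= end_time),
  logical(1)
)
print(all_pairs)
# FALSE  TRUE  TRUE  TRUE  TRUE FALSE
```

Only the fourth row survives the recycled comparison, because it is the only one that happens to be paired with the 2020-11-03 window.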
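From reading the lubridate documentation, we can use intervals for this purpose.
If there are multiple intervals, the documentation suggests putting them in a list: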
intervals <- list(interval(start_time, end_time))
Suppose we want to check whether a given datetime falls within these intervals. We can use the %within% operator:
ts <- ymd_hms("2020-11-03 07:00:02")
ts %within% intervals
Result:
> ts %within% intervals
[1] TRUE FALSE FALSE
Or, if we want to check against all intervals at once:
> any(ts %within% intervals)
[1] TRUE
Applying this to your data frame:
dat <- dat %>%
rowwise() %>%
mutate(within = any(timestamp %within% intervals))
After that, a simple filter does the rest.
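As a rough sketch of that last step, the flag column can be computed and then used in `filter()`. This assumes a six-row stand-in called `dat_small` (hypothetical, in place of the full `dat`) and builds `intervals` as a plain Interval vector rather than a list, so that `%within%` returns one value per interval for each row.

```r
library(dplyr)
library(lubridate)

# Assumption: a tiny stand-in for `dat`, covering one date and one process.
dat_small <- data.frame(
  timestamp    = ymd_hms("2020-11-03 07:00:00") + 0:5,
  process_time = 0:5,
  process_name = "A"
)

start_time <- ymd_hms(c("2020-11-03 07:00:01", "2020-11-04 07:00:01",
                        "2020-11-05 14:33:21"))
end_time   <- ymd_hms(c("2020-11-03 07:00:04", "2020-11-04 07:00:04",
                        "2020-11-05 14:33:24"))

# One Interval per start/end pair.
intervals <- interval(start_time, end_time)

# Flag each row that falls inside at least one interval.
flagged <- dat_small %>%
  rowwise() %>%
  mutate(within = any(timestamp %within% intervals)) %>%
  ungroup()

# Keep the flagged rows and drop the helper column.
filtered <- flagged %>%
  filter(within) %>%
  select(-within)

print(filtered)
```

On this stand-in, the rows at 07:00:00 and 07:00:05 are dropped and the four rows in between are kept. `rowwise()` evaluates one row at a time, so it may be slow on large data frames.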
You can use map2_lgl from purrr to check whether any timestamp in each date is between start_time and end_time.
library(dplyr)
library(lubridate)
library(purrr)
dat %>%
group_by(date = date(timestamp), process_name) %>%
filter(any(map2_lgl(start_time, end_time, ~any(between(timestamp, .x, .y)))))
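Because the condition above is wrapped in `any()`, it yields a single TRUE or FALSE per group, so whole groups are kept or dropped. If row-level filtering is what is wanted, as the expected output in the question suggests, one possible variant is sketched below. The helper `in_any_window` and the stand-in `dat_small` are hypothetical names introduced for illustration only.

```r
library(dplyr)
library(lubridate)
library(purrr)

# Assumption: a tiny stand-in for `dat`, covering one date and one process.
dat_small <- data.frame(
  timestamp    = ymd_hms("2020-11-03 07:00:00") + 0:5,
  process_time = 0:5,
  process_name = "A"
)

start_time <- ymd_hms(c("2020-11-03 07:00:01", "2020-11-04 07:00:01",
                        "2020-11-05 14:33:21"))
end_time   <- ymd_hms(c("2020-11-03 07:00:04", "2020-11-04 07:00:04",
                        "2020-11-05 14:33:24"))

# Hypothetical helper: TRUE for each element of `x` that lies inside at
# least one [start, end] window (bounds inclusive).
in_any_window <- function(x, starts, ends) {
  # Iterate over indices so each bound keeps its POSIXct class.
  hits <- map(seq_along(starts), function(i) x >= starts[i] & x <= ends[i])
  # Combine the per-window logical vectors with an element-wise OR.
  reduce(hits, `|`)
}

filtered <- dat_small %>%
  filter(in_any_window(timestamp, start_time, end_time))

print(filtered)
```

The helper returns one logical value per row, so no grouping is needed for the filter itself; a `group_by()` could still be added beforehand if later steps depend on it.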