Using Datetime Vectors to Filter a Dataframe

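I am trying to filter a data frame using vectors of datetime objects (POSIXct). The first vector contains start times and the second contains end times. I only want to return the rows that fall within one of the datetime pairs, i.e. between a start time and its matching end time.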

Here is some sample data:

dat <- structure(list(timestamp = structure(c(1604386800, 1604386801, 
                                              1604386802, 1604386803,
                                              1604386804, 1604386805,
                                              1604473200, 1604473201,
                                              1604473202, 1604473203,
                                              1604473204, 1604473205,
                                              1604386800, 1604386801,
                                              1604386802, 1604386803,
                                              1604386804, 1604386805,
                                              1604473200, 1604473201,
                                              1604473202, 1604473203,
                                              1604473204, 1604473205,
                                              1604586800, 1604586801, 
                                              1604586802, 1604586803,
                                              1604586804, 1604586805,
                                              1604586800, 1604586801, 
                                              1604586802, 1604586803,
                                              1604586804, 1604586805),
                                            class = c("POSIXct", "POSIXt"),
                                            tzone = "UTC"),
                      process_time = c(0L, 1L, 2L, 3L, 4L,5L, 0L, 1L, 2L,
                                       3L, 4L, 5L, 0L, 1L, 2L, 3L, 4L, 5L,
                                       0L, 1L, 2L,3L, 4L, 5L, 0L, 1L, 2L,
                                       3L, 4L,5L, 0L, 1L, 2L, 3L, 4L,5L),
                      process_name = c("A", "A", "A", "A", "A", "A", "A",
                                       "A", "A", "A", "A", "A", "B", "B", 
                                       "B", "B", "B", "B", "B", "B", "B",
                                       "B", "B", "B", "C", "C", "C", "C", 
                                       "C", "C", "A", "A", "A", "A", "A",
                                       "A")),
                 class = "data.frame", row.names = c(NA, -36L))

This is what I have so far:

library(lubridate)
library(dplyr)

start_time <- ymd_hms(c("2020-11-03 07:00:01", "2020-11-04 07:00:01",
                        "2020-11-05 14:33:21"))
end_time <- ymd_hms(c("2020-11-03 07:00:04", "2020-11-04 07:00:04",
                      "2020-11-05 14:33:24"))

filtered_dat <- dat %>%
  group_by(date(timestamp), process_name) %>%
  filter(timestamp >= start_time & timestamp <= end_time)

What I am looking for is a result with the first and last values dropped for each date and process type. It would look like:

timestamp             |   process_time    |   process_name
------------------------------------------------------------
2020-11-03 07:00:01   |         1         |        A
2020-11-03 07:00:02   |         2         |        A
2020-11-03 07:00:03   |         3         |        A
2020-11-03 07:00:04   |         4         |        A

2020-11-04 07:00:01   |         1         |        A
2020-11-04 07:00:02   |         2         |        A
2020-11-04 07:00:03   |         3         |        A
2020-11-04 07:00:04   |         4         |        A

2020-11-03 07:00:01   |         1         |        B
2020-11-03 07:00:02   |         2         |        B
2020-11-03 07:00:03   |         3         |        B
2020-11-03 07:00:04   |         4         |        B
       ...                     ...                ...

#spacing added for clarity

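What seems to be happening is that the start and end time vectors are being recycled. The first row of the data frame is compared with the first item in the start and end time vectors, the second row with the second item, and so on, so a match only turns up every few rows.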

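What I need is for each row of the data frame to be compared against all three datetime pairs before moving on to the next row. purrr seems like it might be the answer here.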
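The recycling described above can be reproduced in isolation. The following is a minimal base R sketch, not part of the original post; the names `stamps`, `starts`, `ends`, `recycled`, and `intended` are made up for illustration, and the values mirror the first group of the sample data.

```r
# Six timestamps, one second apart, starting at 2020-11-03 07:00:00 UTC
stamps <- as.POSIXct("2020-11-03 07:00:00", tz = "UTC") + 0:5

starts <- as.POSIXct(c("2020-11-03 07:00:01", "2020-11-04 07:00:01",
                       "2020-11-05 14:33:21"), tz = "UTC")
ends   <- as.POSIXct(c("2020-11-03 07:00:04", "2020-11-04 07:00:04",
                       "2020-11-05 14:33:24"), tz = "UTC")

# Element-wise comparison: the length-3 vectors are recycled, so row i
# is only tested against pair ((i - 1) %% 3) + 1
recycled <- stamps >= starts & stamps <= ends
recycled
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE

# Intended behaviour: test every timestamp against all three pairs
intended <- vapply(seq_along(stamps),
                   function(i) any(stamps[i] >= starts & stamps[i] <= ends),
                   logical(1))
intended
# [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE
```

Only the fourth timestamp happens to line up with the pair it is compared against, which is why the original filter returns a match only every few rows.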

Reading the lubridate documentation, we can use intervals for this. If there are multiple intervals, the documentation suggests putting them in a list:

intervals <- list(interval(start_time, end_time))

Suppose we want to check whether a datetime falls within these intervals. We can use the %within% operator:

ts <- ymd_hms("2020-11-03 07:00:02")
ts %within% intervals

Result:

> ts %within% intervals
[1]  TRUE FALSE FALSE

Or, if we want to check against all intervals at once:

> any(ts %within% intervals)
[1] TRUE

Applying this to your data frame:

dat <- dat %>%
  rowwise() %>%
  mutate(within = any(timestamp %within% intervals))

After that you can filter on the new within column.
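As a sketch of that last step, the pipeline below flags each row and then filters on the flag. The small `toy` data frame and the `result` name are stand-ins invented for this example, and it assumes a lubridate version in which %within% accepts a list of intervals, as used above.

```r
library(dplyr)
library(lubridate)

start_time <- ymd_hms(c("2020-11-03 07:00:01", "2020-11-04 07:00:01",
                        "2020-11-05 14:33:21"))
end_time <- ymd_hms(c("2020-11-03 07:00:04", "2020-11-04 07:00:04",
                      "2020-11-05 14:33:24"))
intervals <- list(interval(start_time, end_time))

# Hypothetical stand-in for one date/process group of `dat`
toy <- data.frame(timestamp = ymd_hms("2020-11-03 07:00:00") + 0:5,
                  process_time = 0:5,
                  process_name = "A")

result <- toy %>%
  rowwise() %>%                                        # one timestamp at a time
  mutate(within = any(timestamp %within% intervals)) %>%
  ungroup() %>%                                        # drop the rowwise grouping
  filter(within) %>%                                   # keep only flagged rows
  select(-within)                                      # remove the helper column
```

Calling ungroup() before filter() is a design choice rather than a requirement; it simply returns an ordinary tibble instead of a rowwise one.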

You can use map2_lgl from purrr to check whether any timestamp in each date is between start_time and end_time.

library(dplyr)
library(lubridate)
library(purrr)

dat %>%
  group_by(date = date(timestamp), process_name) %>%
  filter(any(map2_lgl(start_time, end_time, ~any(between(timestamp, .x, .y)))))
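Note that the condition above reduces to a single TRUE or FALSE per group, so it keeps or drops whole groups. If row-level filtering is wanted, as in the expected output in the question, one possible sketch is shown below. It is not the answerer's method; the helper `in_any_pair` and the `toy` data frame are hypothetical names introduced for illustration.

```r
library(dplyr)
library(lubridate)

start_time <- ymd_hms(c("2020-11-03 07:00:01", "2020-11-04 07:00:01",
                        "2020-11-05 14:33:21"))
end_time <- ymd_hms(c("2020-11-03 07:00:04", "2020-11-04 07:00:04",
                      "2020-11-05 14:33:24"))

# Hypothetical helper: TRUE for each element of x that lies inside
# at least one [starts[i], ends[i]] pair
in_any_pair <- function(x, starts, ends) {
  keep <- rep(FALSE, length(x))
  for (i in seq_along(starts)) {
    keep <- keep | (x >= starts[i] & x <= ends[i])
  }
  keep
}

# Hypothetical stand-in for `dat`: process A on two of the dates
toy <- data.frame(
  timestamp = c(ymd_hms("2020-11-03 07:00:00") + 0:5,
                ymd_hms("2020-11-04 07:00:00") + 0:5),
  process_time = rep(0:5, 2),
  process_name = "A"
)

# Row-level filter: each timestamp is tested against all three pairs
result <- toy %>%
  filter(in_any_pair(timestamp, start_time, end_time))
```

Because every row is compared with every pair, no grouping is needed for the filter itself; on the toy data this keeps the rows with process_time 1 to 4 on each date.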