Use time intervals associated with groups to subset data using dplyr and purrr functions

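I'm new to the purrr package, but I'd like to use it for the example outlined below instead of the apply functions. I have a long-format data frame containing temperature data for multiple groups: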

df <- data.frame(stringsAsFactors=FALSE,
       Date.Time = c("5/29/2016 15:00", "7/20/2016 17:10", "6/2/2016 17:20",
                     "6/10/2016 17:30", "6/28/2016 17:40", "5/29/2016 17:50"),
           TempC = c(22.61, 22.235, 22.11, 22.36, 21.67, 21.54),
            Site = c("DH1", "DL1", "EH1", "EL2", "DH2", "DL2"))

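Currently, this dataset contains records that fall outside the time periods of interest. Using the intervals I've generated below, I need to extract the records for each group that fall within the interval provided for that group.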

intervals <- data.frame(stringsAsFactors=FALSE,
            Site = c("DL1", "DH1", "DH2", "DL2", "EL2", "EH1", "EH3", "EH2",
                     "DL3", "DH3"),
   full.interval = c("2016-05-29 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-05-29 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-05-30 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-05-30 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-05-31 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-05-31 17:00:00 UTC--2016-06-28 16:40:00 UTC",
                     "2016-06-01 17:00:00 UTC--2016-06-28 15:20:00 UTC",
                     "2016-06-01 17:00:00 UTC--2016-06-28 14:00:00 UTC", "2016-06-04 17:00:00 UTC--2016-06-28 14:00:00 UTC",
                     "2016-06-02 17:00:00 UTC--2016-06-28 14:00:00 UTC")
)

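I know I need some combination of purrr's map() and keep() functions along with dplyr's group_by(), but I'm not sure how to structure the code to map across two data frames and multiple groups.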

The desired output would be a new data frame containing only the records that fall inside their site's interval:

new.df <- data.frame(stringsAsFactors=FALSE,
Date.Time = c("6/2/2016 17:20","6/10/2016 17:30"),
               TempC = c(22.11, 22.36),
                Site = c("EH1", "EL2"))

Thanks in advance!

This doesn't use purrr, but here's one approach:

library(dplyr)
library(lubridate)

# add discrete start/stop columns to intervals
intervals <-
  intervals %>%
  mutate(starts = gsub('--.*$', '', full.interval) %>% ymd_hms,
         stops =  gsub('^.*--', '', full.interval) %>% ymd_hms)

# associate each row in DF with the interval for that site, and filter
df %>%
  merge(intervals, by='Site') %>%
  mutate(in_range = 
           mdy_hm(Date.Time) >= starts &
           mdy_hm(Date.Time) <= stops) %>%
  filter(in_range == TRUE)

Update: this also runs fine when df is much larger:

# make a big version of df (3.7 million rows)
df_long <- df[rep(1:6, length.out=3.7e6),]

# associate each row in DF with the interval for that site, and filter
beg_time <- Sys.time()
results <- df_long %>%
  merge(intervals, by='Site') %>%
  mutate(in_range = 
           mdy_hm(Date.Time) >= starts &
           mdy_hm(Date.Time) <= stops) %>%
  filter(in_range == TRUE)
print(Sys.time() - beg_time)

On my MacBook Pro laptop with 16 GB of memory, this takes:

Time difference of 20.35184 secs
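If you'd still like a purrr flavour, here is a minimal sketch of the same filter, not a drop-in replacement. It uses a cut-down toy copy of the data with the start/stop times already parsed, and it assumes every site has exactly one interval. The names `kept` and `win` are just illustrative.

```r
library(dplyr)
library(purrr)
library(lubridate)

# Toy data mirroring the question's layout (three sites only, for brevity)
df <- data.frame(stringsAsFactors = FALSE,
  Date.Time = c("5/29/2016 15:00", "6/2/2016 17:20", "6/10/2016 17:30"),
  TempC = c(22.61, 22.11, 22.36),
  Site = c("DH1", "EH1", "EL2"))

intervals <- data.frame(stringsAsFactors = FALSE,
  Site = c("DH1", "EH1", "EL2"),
  starts = ymd_hms(c("2016-05-29 17:00:00", "2016-05-31 17:00:00",
                     "2016-05-31 17:00:00")),
  stops = ymd_hms(c("2016-06-28 14:00:00", "2016-06-28 16:40:00",
                    "2016-06-28 14:00:00")))

# Split the readings by site, then filter each piece against that site's
# window. Assumption: exactly one row in `intervals` per site.
kept <- df %>%
  mutate(Date.Time = mdy_hm(Date.Time)) %>%
  split(.$Site) %>%
  imap(function(site_df, site) {
    win <- intervals[intervals$Site == site, ]
    filter(site_df, Date.Time >= win$starts, Date.Time <= win$stops)
  }) %>%
  bind_rows()

kept
```

Here imap() hands each per-site piece and its name to the function, so the lookup in `intervals` happens once per site rather than once per row. A site with no interval, or with several, would need extra handling that this sketch leaves out.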

Based on your comment above, this is how I would handle it.

library(dplyr)
library(tidyr)
df <- df %>% mutate(Date.Time=as.POSIXct(Date.Time,format="%m/%d/%Y %H:%M",tz = "UTC"))
intervals <- intervals %>% 
  separate(full.interval, into=c('Start','End'),sep="--") %>%
  mutate(Start=as.POSIXct(Start,format="%Y-%m-%d %H:%M:%S",tz = "UTC"),
         End=as.POSIXct(End,format="%Y-%m-%d %H:%M:%S",tz = "UTC"))


# join each reading to its site's interval, then keep rows strictly inside it
output <- df %>% inner_join(intervals, by="Site") %>% filter(Date.Time>Start & Date.Time<End)

> output
            Date.Time TempC Site               Start                 End
1 2016-06-02 17:20:00 22.11  EH1 2016-05-31 17:00:00 2016-06-28 16:40:00
2 2016-06-10 17:30:00 22.36  EL2 2016-05-31 17:00:00 2016-06-28 14:00:00
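As a possible variation, recent dplyr versions (1.1.0 or later, which is an assumption about your setup) can express the range condition inside the join itself via join_by() and between(). The sketch below uses a small toy copy of the data with already-parsed timestamps; `readings` and `windows` are illustrative names. Note that between() is inclusive at both ends by default, whereas the filter above uses strict inequalities.

```r
library(dplyr)  # join_by() requires dplyr >= 1.1.0

# Toy readings with timestamps already parsed as POSIXct
readings <- data.frame(stringsAsFactors = FALSE,
  Date.Time = as.POSIXct(c("2016-05-29 15:00:00", "2016-06-02 17:20:00",
                           "2016-06-10 17:30:00"), tz = "UTC"),
  TempC = c(22.61, 22.11, 22.36),
  Site = c("DH1", "EH1", "EL2"))

# One window per site, split into Start/End columns
windows <- data.frame(stringsAsFactors = FALSE,
  Site = c("DH1", "EH1", "EL2"),
  Start = as.POSIXct(c("2016-05-29 17:00:00", "2016-05-31 17:00:00",
                       "2016-05-31 17:00:00"), tz = "UTC"),
  End = as.POSIXct(c("2016-06-28 14:00:00", "2016-06-28 16:40:00",
                     "2016-06-28 14:00:00"), tz = "UTC"))

# Match on Site and keep only readings whose timestamp lies in [Start, End]
output <- inner_join(readings, windows,
                     by = join_by(Site, between(Date.Time, Start, End)))

output
```

Because the range check is part of the join, rows outside their site's window are never materialised, so no separate filter() step is needed.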