Match columns in R based on date within one day

I have two tables. The first table contains my reference measurements:

> test
        date site value product
A1 2017-06-10    A   0.6  meter1
A2 2017-06-10    B   0.5  meter1
A3 2017-06-11    C   0.5  meter1
A4 2017-06-18    A   0.1  meter1
A5 2017-06-19    B   0.6  meter1
A6 2017-06-19    C   0.6  meter1

The second table contains a second set of measurements from different instruments, taken on other dates that may or may not match.

> test2
         date site value product
B1 2017-06-07    A   0.4  meter2
B2 2017-06-09    B   0.5  meter2
B3 2017-06-09    C   0.6  meter3
B4 2017-06-09    A   0.2  meter2
B5 2017-06-20    B   0.7  meter3
B6 2017-06-23    B   0.5  meter2

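I want to identify the measurements that match the first table within a given time interval (for example, within 1 day), which should give something like this: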

>   test3
        date site value product match
1 2017-06-07    A   0.4  meter2    NA
2 2017-06-09    B   0.5  meter2    A2
3 2017-06-09    C   0.6  meter3    NA
4 2017-06-09    A   0.2  meter2    A1
5 2017-06-20    B   0.7  meter3    A5
6 2017-06-23    B   0.5  meter2    NA

On top of that, I would like to plot each of these measurements against the reference measurements in ggplot.

I tried different approaches with lubridate but couldn't get it to work. Any help is appreciated.


  test <- structure(list(date = structure(c(17327, 17327, 17328, 17335,17336, 17336),
                                  class = "Date"),
                 site = c("A", "B", "C", "A","B", "C"),
                 value = c(0.6, 0.5,0.5, 0.1, 0.6, 0.6),
                 product = c("meter1", "meter1", "meter1", "meter1", "meter1", "meter1"))
            , row.names = c("A1", "A2", "A3", "A4", "A5", "A6"),
            class = "data.frame")



  test2 <- structure(list(date = structure(c(17324, 17326, 17326, 17326,17337, 17340),
                                          class = "Date"),
                         site = c("A", "B", "C", "A","B", "B"),
                         value = c(0.4, 0.5,0.6, 0.2, 0.7, 0.5),
                         product = c("meter2", "meter2", "meter3", "meter2", "meter3", "meter2"))
                    , row.names = c("B1", "B2", "B3", "B4", "B5", "B6"),
                    class = "data.frame")

  test3 <- structure(list(date = structure(c(17324, 17326, 17326, 17326,17337, 17340),
                                           class = "Date"),
                          site = c("A", "B", "C", "A","B", "B"),
                          value = c(0.4, 0.5,0.6, 0.2, 0.7, 0.5),
                          product = c("meter2", "meter2", "meter3", "meter2", "meter3", "meter2"),
                          match = c("NA", "A2", "NA", "A1", "A5", "NA")),
                     row.names = c("1", "2", "3", "4", "5", "6"),
                     class = "data.frame")


You may want to look at this SO question, since yours may be a duplicate: Joining data frames by lubridate date %within% intervals

It seems to me that the {fuzzyjoin} package or the %within% operator from {lubridate} could help.

There is also a more detailed example here: https://community.rstudio.com/t/tidy-way-to-range-join-tables-on-an-interval-of-dates/7881
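As a rough illustration of the %within% idea, here is a minimal sketch for a single pair of dates. The variable names `ref_date` and `window` are made up for this example; the dates are A1, B4 and B1 from the question.

```r
library(lubridate)

# Reference date A1 and a +/- 1 day window around it
# ('ref_date' and 'window' are illustrative names, not from the question)
ref_date <- as.Date("2017-06-10")
window   <- interval(ref_date - days(1), ref_date + days(1))

# B4 (2017-06-09) lies on the window boundary, which is inclusive
in_b4 <- as.Date("2017-06-09") %within% window
# B1 (2017-06-07) lies outside the window
in_b1 <- as.Date("2017-06-07") %within% window

c(B4 = in_b4, B1 = in_b1)
```

To apply this to whole tables you would still need to pair up the rows per site first, which is what the join-based approaches do.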

On top of that, I would like to plot each of these measurements against the reference measurements in ggplot.

That should be easy once your data is in long format and you use groups in {ggplot2}.
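A minimal sketch of what "long format with groups" could look like, assuming the three matched pairs from the question's expected output have already been found. The `pair` labels and the name `pairs_long` are invented for this illustration.

```r
library(ggplot2)

# One row per measurement; 'pair' ties a reference row to its match
# (values copied from the question's example data)
pairs_long <- data.frame(
  pair    = rep(c("A1-B4", "A2-B2", "A5-B5"), each = 2),
  date    = as.Date(c("2017-06-10", "2017-06-09",
                      "2017-06-10", "2017-06-09",
                      "2017-06-19", "2017-06-20")),
  value   = c(0.6, 0.2, 0.5, 0.5, 0.6, 0.7),
  product = c("meter1", "meter2", "meter1", "meter2", "meter1", "meter3")
)

# group = pair connects each reference point to its matched measurement
p <- ggplot(pairs_long, aes(date, value, colour = product, group = pair)) +
  geom_line(colour = "grey60") +
  geom_point(size = 3)
p
```

Colouring by `product` keeps the instruments distinguishable, while the grey segments show which points belong together.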

One approach is a rolling join in data.table with roll = "nearest". Note that only the last column listed in on = is rolled; the other columns are matched exactly.

A common stumbling block is that data.table merges the join columns into one, so you need to copy the date columns first.

library(data.table)
setDT(test); setDT(test2)
test[,date1 := date]
test2[,date2 := date]
test2[test,on = c("site","date"), roll = "nearest"][,diff := abs(date2-date1)][diff <= 1,]
         date site value product      date2 i.value i.product      date1   diff
1: 2017-06-10    A   0.2  meter2 2017-06-09     0.6    meter1 2017-06-10 1 days
2: 2017-06-10    B   0.5  meter2 2017-06-09     0.5    meter1 2017-06-10 1 days
3: 2017-06-19    B   0.7  meter3 2017-06-20     0.6    meter1 2017-06-19 1 days

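This gives you each test row paired with its nearest test2 row at the same site, keeping only the pairs that are within 1 day of each other. From there you can merge back onto test or do whatever other steps you want.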

matches <- test2[test,on = c("site","date"), roll = "nearest"][,diff := abs(date2-date1)][diff <= 1,]
merge(test,matches[,.(date,site,product,value,date2)],by = c("date", "site"),all.x = TRUE)
         date site value.x product.x      date1 product.y value.y      date2
1: 2017-06-10    A     0.6    meter1 2017-06-10    meter2     0.2 2017-06-09
2: 2017-06-10    B     0.5    meter1 2017-06-10    meter2     0.5 2017-06-09
3: 2017-06-11    C     0.5    meter1 2017-06-11      <NA>      NA       <NA>
4: 2017-06-18    A     0.1    meter1 2017-06-18      <NA>      NA       <NA>
5: 2017-06-19    B     0.6    meter1 2017-06-19    meter3     0.7 2017-06-20
6: 2017-06-19    C     0.6    meter1 2017-06-19      <NA>      NA       <NA>
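If you want the layout from the question instead (one row per test2 measurement with a match column), the join can be turned around. The following is only a sketch: the tables are rebuilt inline with an `id` column standing in for the row names, and `ref`, `obs`, `ref_date` and `gap` are names chosen for this example.

```r
library(data.table)

# Example data from the question; 'id' holds the original row names
ref <- data.table(
  id    = paste0("A", 1:6),
  date  = as.Date(c("2017-06-10", "2017-06-10", "2017-06-11",
                    "2017-06-18", "2017-06-19", "2017-06-19")),
  site  = c("A", "B", "C", "A", "B", "C"),
  value = c(0.6, 0.5, 0.5, 0.1, 0.6, 0.6)
)
obs <- data.table(
  id    = paste0("B", 1:6),
  date  = as.Date(c("2017-06-07", "2017-06-09", "2017-06-09",
                    "2017-06-09", "2017-06-20", "2017-06-23")),
  site  = c("A", "B", "C", "A", "B", "B"),
  value = c(0.4, 0.5, 0.6, 0.2, 0.7, 0.5)
)

# Copy the reference date: after the join, 'date' holds the obs date
ref[, ref_date := date]

# For every obs row, find the nearest ref row at the same site
res <- ref[obs, on = .(site, date), roll = "nearest"]

# Blank out matches that are more than 1 day away (or missing)
res[, gap := abs(as.numeric(ref_date - date))]
res[is.na(gap) | gap > 1, id := NA_character_]

out <- res[, .(date, site, value = i.value, match = id)]
out
```

With real data.frames that still carry row names, `setDT(test, keep.rownames = "id")` would create the `id` column without rebuilding the table.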

I used the following solution, inspired by Benedict's hint about fuzzyjoin:

library(dplyr)
library(lubridate)
library(fuzzyjoin)

# Build a +/- 1 day window around each reference date
temp <- test %>%
  mutate(dateStart = date - days(1),
         dateEnd   = date + days(1))
temp
temp2 <- fuzzy_inner_join(
  test2, temp,
  by = c(
    "site"="site",
    "date" = "dateStart",
    "date" = "dateEnd"),
  match_fun = list(`==`, `>=`, `<=`))
temp2
> temp2
      date.x site.x value.x product.x     date.y site.y value.y product.y  dateStart    dateEnd
1 2017-06-09      B     0.5    meter2 2017-06-10      B     0.5    meter1 2017-06-09 2017-06-11
2 2017-06-09      A     0.2    meter2 2017-06-10      A     0.6    meter1 2017-06-09 2017-06-11
3 2017-06-20      B     0.7    meter3 2017-06-19      B     0.6    meter1 2017-06-18 2017-06-20

This can then be plotted easily with:

library(ggplot2)

# value.x is the test2 measurement, value.y the matched reference value
ggplot(temp2, aes(value.x, value.y)) +
  geom_point()
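For anyone who wants to check the fuzzy join without extra packages, the same condition (equal site, dates at most 1 day apart) can be sketched in base R. This is only an illustration on the example data. The `id` column stands in for the row names, and `ref`, `obs`, `cand` and `hits` are names made up here.

```r
# Example data from the question; 'id' holds the original row names
ref <- data.frame(
  id    = paste0("A", 1:6),
  date  = as.Date(c("2017-06-10", "2017-06-10", "2017-06-11",
                    "2017-06-18", "2017-06-19", "2017-06-19")),
  site  = c("A", "B", "C", "A", "B", "C"),
  value = c(0.6, 0.5, 0.5, 0.1, 0.6, 0.6)
)
obs <- data.frame(
  id    = paste0("B", 1:6),
  date  = as.Date(c("2017-06-07", "2017-06-09", "2017-06-09",
                    "2017-06-09", "2017-06-20", "2017-06-23")),
  site  = c("A", "B", "C", "A", "B", "B"),
  value = c(0.4, 0.5, 0.6, 0.2, 0.7, 0.5)
)

# All obs/ref combinations that share a site
cand <- merge(obs, ref, by = "site", suffixes = c(".obs", ".ref"))

# Keep only the combinations at most 1 day apart
hits <- cand[abs(as.numeric(cand$date.obs - cand$date.ref)) <= 1, ]
hits[, c("id.obs", "id.ref", "site", "date.obs", "date.ref")]
```

Unlike the fuzzy join output above, this keeps the IDs of both rows, so it is easy to see which reference measurement each point was paired with. The full per-site cross product can get large on big tables, so treat it as a cross-check rather than a replacement.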