Match columns in R based on date within one day
I have two tables. The first table contains my reference measurements:
> test
date site value product
A1 2017-06-10 A 0.6 meter1
A2 2017-06-10 B 0.5 meter1
A3 2017-06-11 C 0.5 meter1
A4 2017-06-18 A 0.1 meter1
A5 2017-06-19 B 0.6 meter1
A6 2017-06-19 C 0.6 meter1
The second table contains a second set of measurements from different instruments, taken on other dates that may or may not match.
> test2
date site value product
B1 2017-06-07 A 0.4 meter2
B2 2017-06-09 B 0.5 meter2
B3 2017-06-09 C 0.6 meter3
B4 2017-06-09 A 0.2 meter2
B5 2017-06-20 B 0.7 meter3
B6 2017-06-23 B 0.5 meter2
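I want to identify the measurements that match the first table within a given time interval (e.g. within 1 day), which should give something like this: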
> test3
date site value product match
1 2017-06-07 A 0.4 meter2 NA
2 2017-06-09 B 0.5 meter2 A2
3 2017-06-09 C 0.6 meter3 NA
4 2017-06-09 A 0.2 meter2 A1
5 2017-06-20 B 0.7 meter3 A5
6 2017-06-23 B 0.5 meter2 NA
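On top of that, I would like to plot each of these measurements against the reference measurements in ggplot.
I tried different approaches with lubridate but could not get any of them to work. Any help is appreciated.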
test <- structure(list(date = structure(c(17327, 17327, 17328, 17335,17336, 17336),
class = "Date"),
site = c("A", "B", "C", "A","B", "C"),
value = c(0.6, 0.5,0.5, 0.1, 0.6, 0.6),
product = c("meter1", "meter1", "meter1", "meter1", "meter1", "meter1"))
, row.names = c("A1", "A2", "A3", "A4", "A5", "A6"),
class = "data.frame")
test2 <- structure(list(date = structure(c(17324, 17326, 17326, 17326,17337, 17340),
class = "Date"),
site = c("A", "B", "C", "A","B", "B"),
value = c(0.4, 0.5,0.6, 0.2, 0.7, 0.5),
product = c("meter2", "meter2", "meter3", "meter2", "meter3", "meter2"))
, row.names = c("B1", "B2", "B3", "B4", "B5", "B6"),
class = "data.frame")
test3 <- structure(list(date = structure(c(17324, 17326, 17326, 17326,17337, 17340),
class = "Date"),
site = c("A", "B", "C", "A","B", "B"),
value = c(0.4, 0.5,0.6, 0.2, 0.7, 0.5),
product = c("meter2", "meter2", "meter3", "meter2", "meter3", "meter2"),
match = c("NA", "A2", "NA", "A1", "A5", "NA")),
row.names = c("1", "2", "3", "4", "5", "6"),
class = "data.frame")
You may want to look at this SO question; yours may be a duplicate: Joining data frames by lubridate date %within% intervals.
In my opinion, the {fuzzyjoin} package or %within% from {lubridate} could help.
There is also a more detailed example here: https://community.rstudio.com/t/tidy-way-to-range-join-tables-on-an-interval-of-dates/7881.
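As a rough sketch of the %within% idea, the following pairs rows by site with base merge() and then keeps the pairs whose dates fall inside a one-day window around the reference date. The names ref, obs, candidates, win and hits are made up for illustration, and the id columns stand in for the row names in the question. This is only one possible reading of the hint, not necessarily the approach taken in the linked question.

```r
library(lubridate)

# Stand-ins for the question's tables (dates, sites and row names copied from the question)
ref <- data.frame(
  id   = c("A1", "A2", "A3", "A4", "A5", "A6"),
  date = as.Date(c("2017-06-10", "2017-06-10", "2017-06-11",
                   "2017-06-18", "2017-06-19", "2017-06-19")),
  site = c("A", "B", "C", "A", "B", "C")
)
obs <- data.frame(
  id   = c("B1", "B2", "B3", "B4", "B5", "B6"),
  date = as.Date(c("2017-06-07", "2017-06-09", "2017-06-09",
                   "2017-06-09", "2017-06-20", "2017-06-23")),
  site = c("A", "B", "C", "A", "B", "B")
)

# Pair every observation with every reference row at the same site
candidates <- merge(obs, ref, by = "site", suffixes = c(".obs", ".ref"))

# One interval per pair: reference date minus 1 day to reference date plus 1 day
win <- interval(candidates$date.ref - 1, candidates$date.ref + 1)

# Keep the pairs whose observation date lies inside that interval
hits <- candidates[candidates$date.obs %within% win, ]
hits
```

Because merge() builds every same-site pair before filtering, this is fine for small tables but may get slow on large ones.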
On top of that, I would like to plot each of these measurements
against the reference measurements in ggplot.
This should be easy once you work with the data in long format and use groups in {ggplot}.
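A minimal sketch of what "long format with groups" could look like here: both tables are stacked with rbind() and the product column is used as the grouping aesthetic. The names ref, obs, long and p are illustrative, and the choice of a date axis with one panel per site is an assumption about what is useful to see, not something stated in the question.

```r
library(ggplot2)

# Stand-ins for the question's tables (values copied from the question)
ref <- data.frame(
  date    = as.Date(c("2017-06-10", "2017-06-10", "2017-06-11",
                      "2017-06-18", "2017-06-19", "2017-06-19")),
  site    = c("A", "B", "C", "A", "B", "C"),
  value   = c(0.6, 0.5, 0.5, 0.1, 0.6, 0.6),
  product = "meter1"
)
obs <- data.frame(
  date    = as.Date(c("2017-06-07", "2017-06-09", "2017-06-09",
                      "2017-06-09", "2017-06-20", "2017-06-23")),
  site    = c("A", "B", "C", "A", "B", "B"),
  value   = c(0.4, 0.5, 0.6, 0.2, 0.7, 0.5),
  product = c("meter2", "meter2", "meter3", "meter2", "meter3", "meter2")
)

# Long format: one row per measurement, with product identifying the instrument
long <- rbind(ref, obs)

# One panel per site, instruments distinguished by colour
p <- ggplot(long, aes(date, value, colour = product)) +
  geom_point() +
  facet_wrap(~ site)
p
```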
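One way is a rolling join in data.table with roll = "nearest". Note that only the last column listed in on = is rolled; the other columns are matched exactly.
A common stumbling block is that data.table merges the join columns into one, so you need to copy the date column first.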
library(data.table)
setDT(test); setDT(test2)
test[,date1 := date]
test2[,date2 := date]
test2[test,on = c("site","date"), roll = "nearest"][,diff := abs(date2-date1)][diff <= 1,]
date site value product date2 i.value i.product date1 diff
1: 2017-06-10 A 0.2 meter2 2017-06-09 0.6 meter1 2017-06-10 1 days
2: 2017-06-10 B 0.5 meter2 2017-06-09 0.5 meter1 2017-06-10 1 days
3: 2017-06-19 B 0.7 meter3 2017-06-20 0.6 meter1 2017-06-19 1 days
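This gives you, for each row of test, the nearest row of test2 at the same site, kept only if the two dates are within 1 day of each other. From there you can merge back to test or do whatever other steps you want.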
matches <- test2[test,on = c("site","date"), roll = "nearest"][,diff := abs(date2-date1)][diff <= 1,]
merge(test,matches[,.(date,site,product,value,date2)],by = c("date", "site"),all.x = TRUE)
date site value.x product.x date1 product.y value.y date2
1: 2017-06-10 A 0.6 meter1 2017-06-10 meter2 0.2 2017-06-09
2: 2017-06-10 B 0.5 meter1 2017-06-10 meter2 0.5 2017-06-09
3: 2017-06-11 C 0.5 meter1 2017-06-11 <NA> NA <NA>
4: 2017-06-18 A 0.1 meter1 2017-06-18 <NA> NA <NA>
5: 2017-06-19 B 0.6 meter1 2017-06-19 meter3 0.7 2017-06-20
6: 2017-06-19 C 0.6 meter1 2017-06-19 <NA> NA <NA>
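The join above is driven by test, so the result has one row per reference measurement. If the goal is the test3 layout from the question (one row per test2 measurement plus a match column), the roles can be swapped. The following is a sketch under that assumption; ref, obs, ref_date and res are illustrative names, and the match and id columns stand in for the row names in the question.

```r
library(data.table)

# Stand-ins for the question's tables (values copied from the question)
ref <- data.table(
  match = c("A1", "A2", "A3", "A4", "A5", "A6"),
  date  = as.Date(c("2017-06-10", "2017-06-10", "2017-06-11",
                    "2017-06-18", "2017-06-19", "2017-06-19")),
  site  = c("A", "B", "C", "A", "B", "C")
)
obs <- data.table(
  id    = c("B1", "B2", "B3", "B4", "B5", "B6"),
  date  = as.Date(c("2017-06-07", "2017-06-09", "2017-06-09",
                    "2017-06-09", "2017-06-20", "2017-06-23")),
  site  = c("A", "B", "C", "A", "B", "B"),
  value = c(0.4, 0.5, 0.6, 0.2, 0.7, 0.5)
)

# Copy the reference date: after the join, the date column holds the obs dates
ref[, ref_date := date]

# For each row of obs, find the nearest reference row at the same site
res <- ref[obs, on = c("site", "date"), roll = "nearest"]

# Blank out matches that are more than 1 day apart
res[abs(as.numeric(date - ref_date)) > 1, match := NA_character_]
res[, .(id, date, site, value, match)]
```

With the question's data the match column comes out as NA, A2, NA, A1, A5, NA, which agrees with the desired test3.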
I used the following solution, inspired by Benedict's hint about fuzzyjoin:
library(dplyr)
library(lubridate)
# date is already of class Date, so it can be shifted directly
temp <- test %>%
  mutate(dateStart = date - days(1),
         dateEnd = date + days(1))
temp
library(fuzzyjoin)
temp2 <- fuzzy_inner_join(
test2, temp,
by = c(
"site"="site",
"date" = "dateStart",
"date" = "dateEnd"),
match_fun = list(`==`, `>=`, `<=`))
temp2
> temp2
date.x site.x value.x product.x date.y site.y value.y product.y dateStart dateEnd
1 2017-06-09 B 0.5 meter2 2017-06-10 B 0.5 meter1 2017-06-09 2017-06-11
2 2017-06-09 A 0.2 meter2 2017-06-10 A 0.6 meter1 2017-06-09 2017-06-11
3 2017-06-20 B 0.7 meter3 2017-06-19 B 0.6 meter1 2017-06-18 2017-06-20
This can then be plotted easily with:
library(ggplot2)
# value.x comes from test2, value.y is the reference measurement from test
ggplot(temp2, aes(value.x, value.y)) +
  geom_point()
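For comparison, the same window match can probably be written without fuzzyjoin, assuming dplyr 1.1.0 or later, where join_by() accepts a between() helper. This swaps in a different technique from the one used above. The names ref, obs, ref_win and res are illustrative, and the match and id columns stand in for the row names in the question.

```r
library(dplyr)

# Stand-ins for the question's tables (values copied from the question)
ref <- data.frame(
  match     = c("A1", "A2", "A3", "A4", "A5", "A6"),
  date      = as.Date(c("2017-06-10", "2017-06-10", "2017-06-11",
                        "2017-06-18", "2017-06-19", "2017-06-19")),
  site      = c("A", "B", "C", "A", "B", "C"),
  ref_value = c(0.6, 0.5, 0.5, 0.1, 0.6, 0.6)
)
obs <- data.frame(
  id    = c("B1", "B2", "B3", "B4", "B5", "B6"),
  date  = as.Date(c("2017-06-07", "2017-06-09", "2017-06-09",
                    "2017-06-09", "2017-06-20", "2017-06-23")),
  site  = c("A", "B", "C", "A", "B", "B"),
  value = c(0.4, 0.5, 0.6, 0.2, 0.7, 0.5)
)

# Add a +/- 1 day window around each reference date
ref_win <- ref %>%
  mutate(date_start = date - 1, date_end = date + 1) %>%
  rename(ref_date = date)

# Same site, and obs date between date_start and date_end (inclusive);
# left_join keeps unmatched obs rows with NA in the reference columns
res <- left_join(
  obs, ref_win,
  by = join_by(site, between(date, date_start, date_end))
)
res
```

Using left_join() rather than an inner join keeps the unmatched rows, so the result has the same shape as the test3 table in the question.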