在R中最近的日期连接两个数据框
Joining two data frames on the closest date in R
我想根据日期加入两个数据集。一个是调查数据集,另一个是不同日期的价格列表。日期不完全匹配,所以我想加入调查数据集中最近的日期(价格数据是每周的)。
以下是调查数据集的简短片段(还有许多其他变量,但这里有两个最相关):
ID
actual.date
20120377
2012-09-26
2020455822
2020-11-23
20126758
2012-10-26
20124241
2012-10-25
2020426572
2020-11-28
这是价格数据集(也更大,但你明白了):
date
price.var1
price.var2
2017-10-30
2.74733926399869
2.73994826674735
2015-03-16
2.77028200438506
2.74079930272231
2010-10-18
3.4265947805337
3.41591263539176
2012-10-29
4.10095806545397
4.14717556976502
2012-01-09
3.87888859352037
3.93074237884497
我想做的是将价格数据集加入调查数据集,在最近的日期加入。
我尝试了很多不同的方法,none 其中的方法令我满意。
#reading in sample data
library(data.table)
library(dplyr)
survey <- fread(" ID actual.date
1: 20120377 2012-09-26
2: 2020455822 2020-11-23
3: 20126758 2012-10-26
4: 20124241 2012-10-25
5: 2020426572 2020-11-28
> ") %>% select(-V1)
price <- fread("date price.var1 price.var2
1: 2017-10-30 2.747339 2.739948
2: 2015-03-16 2.770282 2.740799
3: 2010-10-18 3.426595 3.415913
4: 2012-10-29 4.100958 4.147176
5: 2012-01-09 3.878889 3.930742") %>% select(-V1)
#using data.table
setDT(survey)[,DT_DATE := actual.date]
setDT(price)[,DT_DATE := date]
survey_price <- survey[price,on=.(DT_DATE),roll="nearest"]
#This works, and they join, but it drops a ton of observations, which won't work
#using dplyr
library(dplyr)
survey_price <- left_join(survey,price,by=c("actual.date"="date"))
#this joins them without dropping observations, but all of the price variables become NAs
非常感谢任何人能在这里给我的帮助。
将日期转换为数字,并从 DescTools
的 Closest()
价格调查中找到最接近的日期,然后取该值。
示例数据集
survey <- tibble(
ID = sample(20000:40000, 9, replace = TRUE),
actual.date = seq(today() %m+% days(5), today() %m+% days(5) %m+% months(2),
"week")
)
price <- tibble(
date = seq(today(), today() %m+% months(2), by = "week"),
price_1 = sample(2:6, 9, replace = TRUE),
price_2 = sample(2:6, 9, replace = TRUE)
)
调查
# A tibble: 9 x 2
ID actual.date
<int> <date>
1 34592 2022-05-07
2 37846 2022-05-14
3 22715 2022-05-21
4 22510 2022-05-28
5 30143 2022-06-04
6 34348 2022-06-11
7 21538 2022-06-18
8 39802 2022-06-25
9 36493 2022-07-02
价格
# A tibble: 9 x 3
date price_1 price_2
<date> <int> <int>
1 2022-05-02 6 6
2 2022-05-09 3 2
3 2022-05-16 6 4
4 2022-05-23 6 2
5 2022-05-30 2 6
6 2022-06-06 2 4
7 2022-06-13 2 2
8 2022-06-20 3 5
9 2022-06-27 5 6
library(tidyverse)
library(lubridate)
library(DescTools)
price <- price %>%
mutate(date = Closest(survey$actual.date %>%
as.numeric, date %>%
as.numeric) %>%
as_date())
# A tibble: 9 x 3
date price_1 price_2
<date> <int> <int>
1 2022-05-07 6 6
2 2022-05-14 3 2
3 2022-05-21 6 4
4 2022-05-28 6 2
5 2022-06-04 2 6
6 2022-06-11 2 4
7 2022-06-18 2 2
8 2022-06-25 3 5
9 2022-07-02 5 6
merge(survey, price, by.x = "actual.date", by.y = "date")
actual.date ID price_1 price_2
1 2022-05-07 34592 6 6
2 2022-05-14 37846 3 2
3 2022-05-21 22715 6 4
4 2022-05-28 22510 6 2
5 2022-06-04 30143 2 6
6 2022-06-11 34348 2 4
7 2022-06-18 21538 2 2
8 2022-06-25 39802 3 5
9 2022-07-02 36493 5 6
你快到了。
在 DT[i,on]
语法中,i
应该是 survey
以加入其所有行
setDT(survey)
setDT(price)
survey_price <- price[survey,on=.(date=actual.date),roll="nearest"]
survey_price
date price.var1 price.var2 ID
<IDat> <num> <num> <int>
1: 2012-09-26 4.100958 4.147176 20120377
2: 2020-11-23 2.747339 2.739948 2020455822
3: 2012-10-26 4.100958 4.147176 20126758
4: 2012-10-25 4.100958 4.147176 20124241
5: 2020-11-28 2.747339 2.739948 2020426572
我想根据日期加入两个数据集。一个是调查数据集,另一个是不同日期的价格列表。日期不完全匹配,所以我想加入调查数据集中最近的日期(价格数据是每周的)。
以下是调查数据集的简短片段(还有许多其他变量,但这里有两个最相关):
ID | actual.date |
---|---|
20120377 | 2012-09-26 |
2020455822 | 2020-11-23 |
20126758 | 2012-10-26 |
20124241 | 2012-10-25 |
2020426572 | 2020-11-28 |
这是价格数据集(也更大,但你明白了):
date | price.var1 | price.var2 |
---|---|---|
2017-10-30 | 2.74733926399869 | 2.73994826674735 |
2015-03-16 | 2.77028200438506 | 2.74079930272231 |
2010-10-18 | 3.4265947805337 | 3.41591263539176 |
2012-10-29 | 4.10095806545397 | 4.14717556976502 |
2012-01-09 | 3.87888859352037 | 3.93074237884497 |
我想做的是将价格数据集加入调查数据集,在最近的日期加入。
我尝试了很多不同的方法,none 其中的方法令我满意。
#reading in sample data
library(data.table)
library(dplyr)
survey <- fread(" ID actual.date
1: 20120377 2012-09-26
2: 2020455822 2020-11-23
3: 20126758 2012-10-26
4: 20124241 2012-10-25
5: 2020426572 2020-11-28
> ") %>% select(-V1)
price <- fread("date price.var1 price.var2
1: 2017-10-30 2.747339 2.739948
2: 2015-03-16 2.770282 2.740799
3: 2010-10-18 3.426595 3.415913
4: 2012-10-29 4.100958 4.147176
5: 2012-01-09 3.878889 3.930742") %>% select(-V1)
#using data.table
setDT(survey)[,DT_DATE := actual.date]
setDT(price)[,DT_DATE := date]
survey_price <- survey[price,on=.(DT_DATE),roll="nearest"]
#This works, and they join, but it drops a ton of observations, which won't work
#using dplyr
library(dplyr)
survey_price <- left_join(survey,price,by=c("actual.date"="date"))
#this joins them without dropping observations, but all of the price variables become NAs
非常感谢任何人能在这里给我的帮助。
将日期转换为数字,并从 DescTools
的 Closest()
价格调查中找到最接近的日期,然后取该值。
示例数据集
survey <- tibble(
ID = sample(20000:40000, 9, replace = TRUE),
actual.date = seq(today() %m+% days(5), today() %m+% days(5) %m+% months(2),
"week")
)
price <- tibble(
date = seq(today(), today() %m+% months(2), by = "week"),
price_1 = sample(2:6, 9, replace = TRUE),
price_2 = sample(2:6, 9, replace = TRUE)
)
调查
# A tibble: 9 x 2
ID actual.date
<int> <date>
1 34592 2022-05-07
2 37846 2022-05-14
3 22715 2022-05-21
4 22510 2022-05-28
5 30143 2022-06-04
6 34348 2022-06-11
7 21538 2022-06-18
8 39802 2022-06-25
9 36493 2022-07-02
价格
# A tibble: 9 x 3
date price_1 price_2
<date> <int> <int>
1 2022-05-02 6 6
2 2022-05-09 3 2
3 2022-05-16 6 4
4 2022-05-23 6 2
5 2022-05-30 2 6
6 2022-06-06 2 4
7 2022-06-13 2 2
8 2022-06-20 3 5
9 2022-06-27 5 6
library(tidyverse)
library(lubridate)
library(DescTools)
price <- price %>%
mutate(date = Closest(survey$actual.date %>%
as.numeric, date %>%
as.numeric) %>%
as_date())
# A tibble: 9 x 3
date price_1 price_2
<date> <int> <int>
1 2022-05-07 6 6
2 2022-05-14 3 2
3 2022-05-21 6 4
4 2022-05-28 6 2
5 2022-06-04 2 6
6 2022-06-11 2 4
7 2022-06-18 2 2
8 2022-06-25 3 5
9 2022-07-02 5 6
merge(survey, price, by.x = "actual.date", by.y = "date")
actual.date ID price_1 price_2
1 2022-05-07 34592 6 6
2 2022-05-14 37846 3 2
3 2022-05-21 22715 6 4
4 2022-05-28 22510 6 2
5 2022-06-04 30143 2 6
6 2022-06-11 34348 2 4
7 2022-06-18 21538 2 2
8 2022-06-25 39802 3 5
9 2022-07-02 36493 5 6
你快到了。
在 DT[i,on]
语法中,i
应该是 survey
以加入其所有行
setDT(survey)
setDT(price)
survey_price <- price[survey,on=.(date=actual.date),roll="nearest"]
survey_price
date price.var1 price.var2 ID
<IDat> <num> <num> <int>
1: 2012-09-26 4.100958 4.147176 20120377
2: 2020-11-23 2.747339 2.739948 2020455822
3: 2012-10-26 4.100958 4.147176 20126758
4: 2012-10-25 4.100958 4.147176 20124241
5: 2020-11-28 2.747339 2.739948 2020426572