Join big dataframes in R and filter at the same time
library(dplyr)
df1 = data.frame(id = 1, start = as.Date("2012-07-05"), end = as.Date("2012-07-15"))
df2 = data.frame(id = rep(1, 1371), date = seq(as.Date("2012-05-06"), as.Date("2016-02-05"), by = "day"))
# join on id first, then keep only rows whose date lies within [start, end]
output = inner_join(x = df1, y = df2, by = "id") %>% filter(date >= start & date <= end)
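I have two data frames, each with roughly one million rows. I want to join them by id and then filter so that, for each row, the value of the date column lies between the values of start and end.
dplyr::inner_join cannot run because it uses too much memory.
For each id, the date range in df2 is much wider than the date range in df1, which is why inner_join %>% filter is inefficient. Is it possible to do the join and the filter in one step?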
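To make the memory problem concrete, here is a small base R sketch (not part of the original post) that counts how many rows the join on id alone would create, compared with how many rows survive the date filter. The variable names `n1`, `n2`, `common`, `intermediate_rows` and `kept_rows` are made up for illustration, and the filter shortcut relies on df1 having a single row, as in the toy data.

```r
# Toy data from the question (one id, 1371 daily dates).
df1 <- data.frame(id = 1, start = as.Date("2012-07-05"), end = as.Date("2012-07-15"))
df2 <- data.frame(id = rep(1, 1371),
                  date = seq(as.Date("2012-05-06"), as.Date("2016-02-05"), by = "day"))

# Rows per id on each side.
n1 <- table(df1$id)
n2 <- table(df2$id)
common <- intersect(names(n1), names(n2))

# An equi-join on id alone yields n1 * n2 rows for every shared id.
intermediate_rows <- sum(as.numeric(n1[common]) * as.numeric(n2[common]))

# Rows that survive the date filter. df1 has a single row here, so its
# start/end are recycled across df2; this shortcut only works for the toy data.
kept_rows <- sum(df2$date >= df1$start & df2$date <= df1$end)

c(intermediate = intermediate_rows, kept = kept_rows)
#> intermediate         kept
#>         1371           11
```

With many rows per id on both sides, the intermediate count grows as the product of the per-id row counts, while the filtered result stays small. That gap is what a join that applies the date condition during matching is meant to avoid.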
Non-equi joins from the data.table or sqldf packages are probably much faster than dplyr, so try them out.
df1 = data.frame(id = 1, start = as.Date("2012-07-05"),
end = as.Date("2012-07-15"))
df1
#> id start end
#> 1 1 2012-07-05 2012-07-15
df2 = data.frame(id = rep(1, 1371),
date = seq(as.Date("2012-05-06"), as.Date("2016-02-05"), by = "1 day"))
head(df2)
#> id date
#> 1 1 2012-05-06
#> 2 1 2012-05-07
#> 3 1 2012-05-08
#> 4 1 2012-05-09
#> 5 1 2012-05-10
#> 6 1 2012-05-11
Using the sqldf package:
library(sqldf)
sqldf("SELECT f1.id, start, end, date
FROM df1 f1, df2 f2
WHERE f1.id = f2.id AND
f2.date >= f1.start AND
f2.date <= f1.end")
#> id start end date
#> 1 1 2012-07-05 2012-07-15 2012-07-05
#> 2 1 2012-07-05 2012-07-15 2012-07-06
#> 3 1 2012-07-05 2012-07-15 2012-07-07
#> 4 1 2012-07-05 2012-07-15 2012-07-08
#> 5 1 2012-07-05 2012-07-15 2012-07-09
#> 6 1 2012-07-05 2012-07-15 2012-07-10
#> 7 1 2012-07-05 2012-07-15 2012-07-11
#> 8 1 2012-07-05 2012-07-15 2012-07-12
#> 9 1 2012-07-05 2012-07-15 2012-07-13
#> 10 1 2012-07-05 2012-07-15 2012-07-14
#> 11 1 2012-07-05 2012-07-15 2012-07-15
Using a non-equi join from the data.table package (Benchmark | Video):
library(data.table)
## convert both data.frames to data.tables by reference
setDT(df1)
setDT(df2)
# join by id and date within start & end limits
# "x." refers explicitly to the column of df2 (the x table in x[i]);
# a plain `date` in j would return the join bound from df1 instead
df2[df1, .(id, date = x.date, start, end),
on = .(id, date >= start, date <= end)]
#> id date start end
#> 1: 1 2012-07-05 2012-07-05 2012-07-15
#> 2: 1 2012-07-06 2012-07-05 2012-07-15
#> 3: 1 2012-07-07 2012-07-05 2012-07-15
#> 4: 1 2012-07-08 2012-07-05 2012-07-15
#> 5: 1 2012-07-09 2012-07-05 2012-07-15
#> 6: 1 2012-07-10 2012-07-05 2012-07-15
#> 7: 1 2012-07-11 2012-07-05 2012-07-15
#> 8: 1 2012-07-12 2012-07-05 2012-07-15
#> 9: 1 2012-07-13 2012-07-05 2012-07-15
#> 10: 1 2012-07-14 2012-07-05 2012-07-15
#> 11: 1 2012-07-15 2012-07-05 2012-07-15
Created on 2018-03-28 by the reprex package (v0.2.0).
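For readers who want to stay in dplyr: versions 1.1.0 and later accept inequality conditions through `join_by()`, which postdates this answer. The following is a minimal sketch under that assumption, using the same toy data. I have not benchmarked it against data.table or sqldf, so treat any speed or memory benefit as something to verify on your own data.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, which introduced join_by()

df1 <- data.frame(id = 1, start = as.Date("2012-07-05"), end = as.Date("2012-07-15"))
df2 <- data.frame(id = rep(1, 1371),
                  date = seq(as.Date("2012-05-06"), as.Date("2016-02-05"), by = "day"))

# Match on id and require start <= date <= end in the join itself.
# In each condition the left-hand name comes from df1 and the right-hand name from df2.
out <- inner_join(df1, df2, by = join_by(id, start <= date, end >= date))

nrow(out)
#> [1] 11
```

Because the date condition is part of the join, there is no separate filter step over the full id-only join result.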