Join two datasets and fill in information for time intervals in R
I have two datasets that look like this:
country <- c("Albania","Albania","Albania","Albania","Albania",
"Belgium","Belgium","Belgium","Belgium","Belgium",
"Canada","Canada","Canada","Canada","Canada",
"Denmark","Denmark","Denmark","Denmark","Denmark")
year <- c(1992, 1993, 1994, 1995, 1996, 1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996)
country.year <- data.frame(country, year)
country.year
country year
1 Albania 1992
2 Albania 1993
3 Albania 1994
4 Albania 1995
5 Albania 1996
6 Belgium 1992
7 Belgium 1993
8 Belgium 1994
9 Belgium 1995
10 Belgium 1996
11 Canada 1992
12 Canada 1993
13 Canada 1994
14 Canada 1995
15 Canada 1996
16 Denmark 1992
17 Denmark 1993
18 Denmark 1994
19 Denmark 1995
20 Denmark 1996
country <- c("Albania","Albania",
"Belgium","Belgium",
"Canada","Canada",
"Denmark","Denmark","Denmark")
cabinet <- c(1200, 1201,
1560, 1566,
220, 440,
880, 819, 870)
cabinet.position2 <- c(12,10,
0, 5,
-9, 2,
NA, 1,-15) # NA: no position is given for Denmark's cabinet 880
begining.date <- c("1991-12-01", "1996-01-10",
"1991-05-07", "1995-04-23",
"1992-01-01", "1996-01-01",
"1991-08-03", "1992-07-01", "1996-06-01")
end.date <- c("1996-01-09", "2000-02-01",
"1995-04-01", "1999-04-23",
"1995-09-01", "1999-11-30",
"1992-02-03", "1996-05-20", "2000-04-01")
cabinets <- data.frame(country, cabinet, cabinet.position2, begining.date, end.date)
> cabinets
country cabinet cabinet.position2 begining.date end.date
1 Albania 1200 12 1991-12-01 1996-01-09
2 Albania 1201 10 1996-01-10 2000-02-01
3 Belgium 1560 0 1991-05-07 1995-04-01
4 Belgium 1566 5 1995-04-23 1999-04-23
5 Canada 220 -9 1992-01-01 1995-09-01
6 Canada 440 2 1996-01-01 1999-11-30
7 Denmark 880 NA 1991-08-03 1992-02-03
8 Denmark 819 1 1992-07-01 1996-05-20
9 Denmark 870 -15 1996-06-01 2000-04-01
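What I want is a dataset in which the unit of analysis is the country-year from the data frame "country.year", but which also includes the position variable of each cabinet from the data frame "cabinets". This position variable refers to the cabinet's policy position, so it is not really relevant to the data transformation task, but it matters later. So something like this: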
country <- c("Albania","Albania","Albania","Albania","Albania",
"Belgium","Belgium","Belgium","Belgium","Belgium",
"Canada","Canada","Canada","Canada","Canada",
"Denmark","Denmark","Denmark","Denmark","Denmark")
year2 <- c(1992, 1993, 1994, 1995, 1996,
1992, 1993, 1994, 1995, 1996,
1992, 1993, 1994, 1995, 1996,
1992, 1993, 1994, 1995, 1996)
cabinet2 <- c(1200,1200,1200,1200, 1201,
1560,1560,1560, 1566, 1566,
220,220,220,220, 440,
819, 819, 819, 819, 870)
cabinet.position2 <- c(12,12,12,12, 10,
0,0,0, 5, 5,
-9,-9,-9,-9, 2,
1, 1, 1, 1, -15)
desired.df <- data.frame(country, year2, cabinet2,cabinet.position2)
desired.df
country year2 cabinet2 cabinet.position2
1 Albania 1992 1200 12
2 Albania 1993 1200 12
3 Albania 1994 1200 12
4 Albania 1995 1200 12
5 Albania 1996 1201 10
6 Belgium 1992 1560 0
7 Belgium 1993 1560 0
8 Belgium 1994 1560 0
9 Belgium 1995 1566 5
10 Belgium 1996 1566 5
11 Canada 1992 220 -9
12 Canada 1993 220 -9
13 Canada 1994 220 -9
14 Canada 1995 220 -9
15 Canada 1996 440 2
16 Denmark 1992 819 1
17 Denmark 1993 819 1
18 Denmark 1994 819 1
19 Denmark 1995 819 1
20 Denmark 1996 870 -15
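My main problem here is assigning cabinets to the different years. As you can see above, each year needs to be assigned one cabinet and its position. What is really difficult for me is that some years have more than one cabinet, so I need each year's cabinet to be the one that spent the most time in office during that year (for example, if in 1995 cabinet A was in office from January to May and cabinet B from June to December, 1995 should be assigned cabinet B).
Any ideas?
Thanks a lot!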
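Edit: new version that includes the merge and creates a new variable for the time spent in office, after I re-read the question (my mistake) and the OP clarified what cabinet position means.
A tidyverse solution involving a non-equi join. Each year is converted to its January 1 date before the join, so a year is matched to the cabinet in office on that day.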
library(dplyr)
library(fuzzyjoin)
library(lubridate)
library(stringr) # provides str_sub(), used below
# convert the year and the date columns to Date
country.year <- country.year %>%
mutate(year = paste0(year,"/01","/01"),
year = as.Date(year, format = "%Y/%m/%d"))
cabinets <- cabinets %>%
mutate(begining.date = as.Date(begining.date),
end.date = as.Date(end.date))
desired.df <- fuzzy_inner_join(country.year,cabinets,
by=c("country"="country",
"year"="begining.date",
"year"="end.date"),
match_fun = list(`==`, `>=`, `<=`))%>%
select(country=country.x,everything())%>%
mutate(year=str_sub(year,1,4),
time.as.cabinet = end.date - begining.date)%>%
group_by(country,year)%>%
filter(time.as.cabinet==max(time.as.cabinet)) %>%
select(country,year,cabinet,cabinet.position2, -country.y)
desired.df %>%
head(10)
country year cabinet cabinet.position2
<fct> <chr> <dbl> <dbl>
1 Albania 1992 1200 12
2 Albania 1993 1200 12
3 Albania 1994 1200 12
4 Albania 1995 1200 12
5 Albania 1996 1200 12
6 Belgium 1992 1560 0
7 Belgium 1993 1560 0
8 Belgium 1994 1560 0
9 Belgium 1995 1560 0
10 Belgium 1996 1566 5
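Because the join above matches on January 1, rows 5 and 9 differ from the desired output (Albania 1996 and Belgium 1995). As a rough base R sketch of the rule the question actually asks for, the days a cabinet spends in office within a given year can be computed directly. The helper name `overlap_days` is hypothetical (it is not part of fuzzyjoin or dplyr), and counting both end points as days in office is an assumption.

```r
# Hypothetical helper: days a cabinet is in office within one calendar year,
# counting both the first and the last day (an assumption).
overlap_days <- function(begin, end, year) {
  yr_start <- as.Date(paste0(year, "-01-01"))
  yr_end   <- as.Date(paste0(year, "-12-31"))
  # Clip the cabinet's term to the calendar year, then measure its length
  days <- as.numeric(pmin(end, yr_end) - pmax(begin, yr_start)) + 1
  pmax(days, 0)  # a term that misses the year entirely counts as zero days
}

# Belgium's two cabinets (1560 and 1566), evaluated for 1995
begin <- as.Date(c("1991-05-07", "1995-04-23"))
end   <- as.Date(c("1995-04-01", "1999-04-23"))
belgium_1995 <- overlap_days(begin, end, 1995)
belgium_1995
which.max(belgium_1995)  # index of the cabinet with the most days in 1995
```

Here the second cabinet (1566) has more days in 1995, which is what the desired output shows for Belgium in that year.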
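With data.table, you can do the non-equi join, compute the new variable, and update the data in place all in one step, and very quickly. Here is one option: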
### Load data.table and convert the data.frames
library(data.table)
setDT(country.year) ; setDT(cabinets)
### Convert date columns to proper dates and create join columns
date_cols <- grep("date", names(cabinets), value = TRUE)
cabinets[, (date_cols) := lapply(.SD, as.IDate), .SDcols = date_cols]
cabinets[, paste0(c("start", "end"), "_year") := lapply(.SD, year), .SDcols = date_cols]
### Join by year intervals, while calculating the largest time period and updating the data in place
country.year[
, cabinet.position2 :=
cabinets[.SD,
cabinet.position2[which.max(end.date - as.IDate(paste0(year, "-01-01")))]
, on = .(country, start_year <= year, end_year >= year)
, by = .EACHI]$V1
]
country.year
# country year cabinet.position2
# 1: Albania 1992 12
# 2: Albania 1993 12
# 3: Albania 1994 12
# 4: Albania 1995 12
# 5: Albania 1996 10
# 6: Belgium 1992 0
# 7: Belgium 1993 0
# 8: Belgium 1994 0
# 9: Belgium 1995 5
# 10: Belgium 1996 5
# 11: Canada 1992 -9
# 12: Canada 1993 -9
# 13: Canada 1994 -9
# 14: Canada 1995 -9
# 15: Canada 1996 2
# 16: Denmark 1992 1
# 17: Denmark 1993 1
# 18: Denmark 1994 1
# 19: Denmark 1995 1
# 20: Denmark 1996 -15
Here is another option using data.table::foverlaps:
library(data.table)
setDT(country.year)
setDT(cabinets)
#create start date and end date of the year
country.year[, paste0("yr.", c("start", "end")) := lapply(c("-01-01", "-12-31"),
function(x) as.Date(paste0(year, x), format="%Y-%m-%d"))]
setkey(country.year, country, yr.start, yr.end)
setkey(cabinets, country, beginning.date, end.date)
foverlaps(country.year, cabinets)[, {
  # time in office within the year: clip the term to the year at both ends
  k <- which.max(pmin(end.date, yr.end) - pmax(beginning.date, yr.start))
  .(cabinet2=cabinet[k], cabinet.position2=cabinet.position[k])
}, .(country, year)]
Output:
country year cabinet2 cabinet.position2
1: Albania 1992 1200 12
2: Albania 1993 1200 12
3: Albania 1994 1200 12
4: Albania 1995 1200 12
5: Albania 1996 1201 10
6: Belgium 1992 1560 0
7: Belgium 1993 1560 0
8: Belgium 1994 1560 0
9: Belgium 1995 1566 5
10: Belgium 1996 1566 5
11: Canada 1992 220 -9
12: Canada 1993 220 -9
13: Canada 1994 220 -9
14: Canada 1995 220 -9
15: Canada 1996 440 2
16: Denmark 1992 819 1
17: Denmark 1993 819 1
18: Denmark 1994 819 1
19: Denmark 1995 819 1
20: Denmark 1996 870 -15
Data (with date conversion, Ian Campbell's data fix, and a fix for the small typo in the word "beginning"):
country <- c("Albania","Albania","Albania","Albania","Albania","Belgium","Belgium","Belgium","Belgium","Belgium","Canada","Canada","Canada","Canada","Canada","Denmark","Denmark","Denmark","Denmark","Denmark")
year <- c(1992, 1993, 1994, 1995, 1996, 1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996,1992, 1993, 1994, 1995, 1996)
country.year <- data.frame(country, year)
country <- c("Albania","Albania","Belgium","Belgium","Canada","Canada","Denmark","Denmark","Denmark")
cabinet <- c(1200, 1201, 1560, 1566, 220, 440, 880, 819, 870)
cabinet.position <- c(12, 10, 0, 5, -9, 2, NA, 1,-15)
beginning.date <- as.Date(c("1991-12-01", "1996-01-10","1991-05-07", "1995-04-23","1992-01-01", "1996-01-01","1991-08-03", "1992-07-01", "1996-06-01"))
end.date <- as.Date(c("1996-01-09", "2000-02-01","1995-04-01", "1999-04-23","1995-09-01", "1999-11-30","1992-02-03", "1996-05-20", "2000-04-01"))
cabinets <- data.frame(country, cabinet, cabinet.position, beginning.date, end.date)
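For comparison, the same "most time in office within the year" rule can also be sketched in base R without extra packages. This is only an illustration under stated assumptions: the data are rebuilt compactly so the block runs on its own, both end dates are counted as days in office, and the names `pairs`, `days` and `result` are made up for this sketch.

```r
# Rebuild the data from above compactly so this sketch is self-contained
countries <- c("Albania", "Belgium", "Canada", "Denmark")
country.year <- expand.grid(year = 1992:1996, country = countries,
                            stringsAsFactors = FALSE)[, c("country", "year")]
cabinets <- data.frame(
  country = rep(countries, c(2, 2, 2, 3)),
  cabinet = c(1200, 1201, 1560, 1566, 220, 440, 880, 819, 870),
  cabinet.position = c(12, 10, 0, 5, -9, 2, NA, 1, -15),
  beginning.date = as.Date(c("1991-12-01", "1996-01-10", "1991-05-07",
                             "1995-04-23", "1992-01-01", "1996-01-01",
                             "1991-08-03", "1992-07-01", "1996-06-01")),
  end.date = as.Date(c("1996-01-09", "2000-02-01", "1995-04-01",
                       "1999-04-23", "1995-09-01", "1999-11-30",
                       "1992-02-03", "1996-05-20", "2000-04-01")),
  stringsAsFactors = FALSE
)

# Pair every country-year with every cabinet of the same country
pairs <- merge(country.year, cabinets, by = "country")

# Days each cabinet was in office during that year (both end points counted)
yr.start <- as.Date(paste0(pairs$year, "-01-01"))
yr.end   <- as.Date(paste0(pairs$year, "-12-31"))
pairs$days <- as.numeric(pmin(pairs$end.date, yr.end) -
                         pmax(pairs$beginning.date, yr.start)) + 1

# Drop cabinets that do not overlap the year, then keep the one with most days
pairs <- pairs[pairs$days > 0, ]
pairs <- pairs[order(pairs$country, pairs$year, -pairs$days), ]
result <- pairs[!duplicated(pairs[c("country", "year")]),
                c("country", "year", "cabinet", "cabinet.position")]
rownames(result) <- NULL
result
```

The cross join is fine for data of this size; for large tables the non-equi join or foverlaps approaches above avoid building every country-year and cabinet pair.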