将时间间隔转换为 time/person 矩阵
Converting time intervals into time/person matrix
我正在做一个历史项目,我们需要了解每个月的员工人数。对于数据集中的每个人,我都有他们受雇的时期。 fname_code 不同职位的代码。举个例子,Edmont Privat博士在以下各个时期有两个不同的功能:
pname fname_code begin_date end_date
1 Dr. Edmond Privat 3 1921-09-02 1921-10-07
2 Dr. Edmond Privat 2 1921-12-07 1922-03-06
3 Joseph Louis Marie Charles Avenol 1 1923-02-01 1933-07-01
4 Joseph Louis Marie Charles Avenol 1 1933-07-01 1940-08-31
5 Dr. G. G. Kullmann 2 1931-03-30 1938-12-15
我的想法是将此信息转换为 subject/date 句点 dataframe/matrix,其中 N 表示此时此人不在公司,而数字表示他们被雇用并且他们的地位。这是我的想法的一个例子:
1944-07-01 1944-08-01 1944-09-01 1944-10-01
Albert Dan Meurig Evans N N N N
Genevieve Jeanne Leonie Mayor N 2 3 3
我已经做了一个可以完成这项工作的东西 - 它已经完成了上面的几行 - 但是,代码是 definetley 不适合胆小的程序员(很多前循环!)。我的问题是你们中有没有经验丰富的程序员有提高速度的建议,或者以完全不同的方式实现我的目标。
我试过玩一些 dplyr 函数,但我对它们的经验太少,无法使它们工作。我还考虑过创建一个 if 条件来处理一个人刚刚被雇用一段时间的情况,因为在这种情况下不需要 forloop - 但我不确定在哪里最佳实施它。
我的计算灾难的逻辑是查看数据框和就业数据框的月份间隔之间是否存在重叠:
library(lubridate)
library(tidyverse)
#creating sequence of dates for columns
start_date <- as.Date("1919-01-01")
end_date <- as.Date("1948-12-30")
dates <- seq.Date(start_date, end_date, by ="month")
#dates as columns and names on columns
test.df <- matrix(ncol =length(dates), nrow = nlevels(mdl_df$pname))
test.df <- as.data.frame(test.df)
colnames(test.df) <- dates
rownames(test.df) <- levels(mdl_df$pname)
for (name in 1:nlevels(mdl_df$pname)){
#subsetting the rows for each person
person_rows <- mdl_df %>% filter( mdl_df$pname == rownames(test.df)[name])
for (date in 1:(length(dates)-1)) {
#Creating a month interval consisting of the time between two adjecent months
interval1 <- interval(ymd(colnames(test.df)[date]),ymd(colnames(test.df)[date+1]))
for (row in 1:nrow(person_rows)) {
#check if overlap between df month interval and employment intervals.
interval2 <- interval(ymd(person_rows$begin_date[row]),ymd(person_rows$end_date[row]))
if (int_overlaps(interval1, interval2)){
#checking if df period and work period overlap. If so rank is inserted otherwise N is entered
test.df[name,date] <- test_rows$fname_code[row]
break
}else{
test.df[name,date] <- "N"
}
}
}
}
该数据集由大约 3000 名员工组成,我的计算机需要大约 6-7 个小时才能完成这项工作。在接下来的几周内,我需要在各种数据集上多次 运行 和重新 运行 脚本,因此非常感谢您的帮助!
编辑:数据集前 50 行的 dput 输出。
> dput(droplevels(head(mdl_df, 50)))
structure(list(pname = structure(c(7L, 7L, 24L, 24L, 8L, 19L,
16L, 16L, 16L, 4L, 34L, 11L, 17L, 12L, 23L, 10L, 14L, 14L, 14L,
14L, 14L, 32L, 5L, 22L, 29L, 3L, 13L, 25L, 2L, 6L, 26L, 18L,
21L, 27L, 27L, 28L, 20L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 15L, 31L,
33L, 30L, 30L, 1L), .Label = c("A. Gordon Bagnall", "Bertil Gotthard Ohlin",
"Birgit Nissen", "Bryan Fullerton Adams", "C.H. Wykes", "Christian Olsen",
"Dr. Edmond Privat", "Dr. G. G. Kullmann", "Eugène Henri René Vigier",
"Ewan P. Wallis-Jones", "Francis Yeats-Brown", "Francisco Walker-Linares",
"Frank Horsfall Nixon", "Frank Paul Walters", "Franklin Urteaga",
"Gerald Heguerty Furtado Abraham", "Gladys Wade", "Guillaume Théodore Conrad Zwerner",
"Henri Bonnet", "Haakon Vigander", "Ignacio J. Valdes", "Ingvad Nielsen",
"Jessie Irene Wall", "Joseph Louis Marie Charles Avenol", "Julian Nogueira",
"Konni Zilliacus", "Luis Varela-Obregoso", "Marc Veillet-Lavallee",
"Maria Nielsen", "Peter Martin Anker", "Pierre Achille Louis Eugène Quesnay",
"Pierre Henry Watier", "Prof. Fred Alexander", "Robert André Felix Bach"
), class = "factor"), fname_code = c(3L, 2L, 1L, 1L, 2L, 2L,
2L, 0L, 2L, 4L, 2L, 2L, 2L, 4L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 6L,
0L, 0L, 1L, 2L, 3L, 6L, 2L, 2L, 2L, 2L, 2L, 2L), begin_date = structure(c(-17653,
-17557, -17136, -13333, -14157, -17897, -18050, -13789, -8962,
-15010, -11810, -15372, -14003, -14855, -16047, -12900, -18494,
-18254, -14245, -13333, -11172, -12008, -18398, -14360, -15002,
-11802, -17883, -12862, -14245, -17136, -18248, -14975, -13989,
-15494, -15372, -14108, -14738, -18201, -17849, -17849, -11657,
-10592, -10579, -10130, -11436, -16849, -13631, -14033, -11161,
-12620), class = "Date"), end_date = structure(c(-17618, -17468,
-13333, -10715, -11340, -14243, -13789, -11223, -8624, -11178,
-10797, -17543, -13982, -8555, -15628, -12879, -18254, -14245,
-13333, -11172, -10809, -11822, -18255, -14339, -14988, -11781,
-17078, -11158, -13958, -16590, -11401, -14610, -13968, -15434,
-15007, -13920, -14717, -17849, -8524, -8524, -8524, -8524, -8524,
-8524, -11415, -15707, -13613, -11161, -8555, -12614), class = "Date")),
.Names = c("pname", "fname_code", "begin_date", "end_date"), row.names = c(NA, 50L), class = "data.frame")
运行 Rstudio v.1.0.136
附带的包裹:
[1] dplyr_0.7.1 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3 tibble_1.3.3 ggplot2_2.2.1
[7] tidyverse_1.1.1 lubridate_1.6.0
data.table
软件包的 1.9.8 版(2016 年 11 月 25 日在 CRAN 上)引入了非等值连接,这对于查找就业重叠非常有用以月为间隔的时间。 dcast()
然后用于从长格式重塑为宽格式。
library(data.table)
# coerce to data.table
setDT(mdl_df)[
# right join with sequence of monthly intervals
.(mseq = seq(as.Date("1944-01-01"), length.out = 4L, by = "1 month")),
# using non-equi join conditions
on = .(begin_date <= mseq, end_date >= mseq)][
# reshape from wide to long format,
# show rank (concatenate in case of multiple ranks)
, dcast(unique(.SD), pname ~ end_date, toString, value.var = "fname_code")]
pname 1944-01-01 1944-02-01 1944-03-01 1944-04-01
1: Eugène Henri René Vigier 0, 1, 2, 3, 6 0, 1, 2, 3, 6 0, 1, 2, 3, 6 0, 1, 2, 3, 6
2: Francisco Walker-Linares 4 4 4 4
3: Peter Martin Anker 2 2 2 2
编辑
,OP 已要求涵盖 1919-01-01 至 1948-12-30 期间。这里,我们需要修改join参数:
result <- setDT(mdl_df)[
.(mseq = seq(as.Date("1919-01-01"), as.Date("1948-12-30"), by = "1 month")),
on = .(begin_date <= mseq, end_date >= mseq), nomatch = 0L, allow.cartesian = TRUE][
, dcast(unique(.SD), pname ~ end_date, toString, value.var = "fname_code")]
result
共27行328列,只能分段打印:
result[, 1:5]
pname 1919-06-01 1919-07-01 1919-08-01 1919-09-01
1: Bertil Gotthard Ohlin
2: Bryan Fullerton Adams
3: C.H. Wykes 2
4: Christian Olsen
5: Dr. Edmond Privat
6: Dr. G. G. Kullmann
7: Eugène Henri René Vigier
8: Francisco Walker-Linares
9: Frank Horsfall Nixon
10: Frank Paul Walters 3 3 3 3
11: Franklin Urteaga
12: Gerald Heguerty Furtado Abraham
13: Gladys Wade
14: Guillaume Théodore Conrad Zwerner
15: Henri Bonnet
16: Haakon Vigander
17: Ignacio J. Valdes
18: Jessie Irene Wall
19: Joseph Louis Marie Charles Avenol
20: Julian Nogueira
21: Konni Zilliacus
22: Luis Varela-Obregoso
23: Marc Veillet-Lavallee
24: Peter Martin Anker
25: Pierre Achille Louis Eugène Quesnay
26: Pierre Henry Watier
27: Robert André Felix Bach
pname 1919-06-01 1919-07-01 1919-08-01 1919-09-01
请注意,显示的第一个日期是 1919-06-01,因为没有更早的匹配项。同样,最后一列 328 是 1946-08-01.
result[, c(1, 328 - 2:0)]
pname 1946-06-01 1946-07-01 1946-08-01
1: Bertil Gotthard Ohlin
2: Bryan Fullerton Adams
3: C.H. Wykes
4: Christian Olsen
5: Dr. Edmond Privat
6: Dr. G. G. Kullmann
7: Eugène Henri René Vigier 0, 1, 2, 3, 6 0, 1, 2, 3, 6 0, 1, 2, 3, 6
8: Francisco Walker-Linares 4 4
9: Frank Horsfall Nixon
10: Frank Paul Walters
11: Franklin Urteaga
12: Gerald Heguerty Furtado Abraham
13: Gladys Wade
14: Guillaume Théodore Conrad Zwerner
15: Henri Bonnet
16: Haakon Vigander
17: Ignacio J. Valdes
18: Jessie Irene Wall
19: Joseph Louis Marie Charles Avenol
20: Julian Nogueira
21: Konni Zilliacus
22: Luis Varela-Obregoso
23: Marc Veillet-Lavallee
24: Peter Martin Anker 2 2
25: Pierre Achille Louis Eugène Quesnay
26: Pierre Henry Watier
27: Robert André Felix Bach
pname 1946-06-01 1946-07-01 1946-08-01
请注意 mdl_df
有 mdl_df[, uniqueN(pname)]
34 个唯一名称,而 result
只有 27 个。这是由于数据不一致,即 begin_date
后来 end_date
或周期太短:
# anti-join
mdl_df[!result, on = "pname"]
pname fname_code begin_date end_date
1: Francis Yeats-Brown 2 1927-12-01 1921-12-21
2: Ewan P. Wallis-Jones 2 1934-09-07 1934-09-28
3: Ingvad Nielsen 2 1930-09-08 1930-09-29
4: Maria Nielsen 2 1928-12-05 1928-12-19
5: Birgit Nissen 2 1937-09-09 1937-09-30
6: Prof. Fred Alexander 2 1932-09-06 1932-09-24
7: A. Gordon Bagnall 2 1935-06-14 1935-06-20
我正在做一个历史项目,我们需要了解每个月的员工人数。对于数据集中的每个人,我都有他们受雇的时期。 fname_code 不同职位的代码。举个例子,Edmont Privat博士在以下各个时期有两个不同的功能:
pname fname_code begin_date end_date
1 Dr. Edmond Privat 3 1921-09-02 1921-10-07
2 Dr. Edmond Privat 2 1921-12-07 1922-03-06
3 Joseph Louis Marie Charles Avenol 1 1923-02-01 1933-07-01
4 Joseph Louis Marie Charles Avenol 1 1933-07-01 1940-08-31
5 Dr. G. G. Kullmann 2 1931-03-30 1938-12-15
我的想法是将此信息转换为 subject/date 句点 dataframe/matrix,其中 N 表示此时此人不在公司,而数字表示他们被雇用并且他们的地位。这是我的想法的一个例子:
1944-07-01 1944-08-01 1944-09-01 1944-10-01
Albert Dan Meurig Evans N N N N
Genevieve Jeanne Leonie Mayor N 2 3 3
我已经做了一个可以完成这项工作的东西 - 它已经完成了上面的几行 - 但是,代码是 definetley 不适合胆小的程序员(很多前循环!)。我的问题是你们中有没有经验丰富的程序员有提高速度的建议,或者以完全不同的方式实现我的目标。 我试过玩一些 dplyr 函数,但我对它们的经验太少,无法使它们工作。我还考虑过创建一个 if 条件来处理一个人刚刚被雇用一段时间的情况,因为在这种情况下不需要 forloop - 但我不确定在哪里最佳实施它。
我的计算灾难的逻辑是查看数据框和就业数据框的月份间隔之间是否存在重叠:
library(lubridate)
library(tidyverse)
#creating sequence of dates for columns
start_date <- as.Date("1919-01-01")
end_date <- as.Date("1948-12-30")
dates <- seq.Date(start_date, end_date, by ="month")
#dates as columns and names on columns
test.df <- matrix(ncol =length(dates), nrow = nlevels(mdl_df$pname))
test.df <- as.data.frame(test.df)
colnames(test.df) <- dates
rownames(test.df) <- levels(mdl_df$pname)
for (name in 1:nlevels(mdl_df$pname)){
#subsetting the rows for each person
person_rows <- mdl_df %>% filter( mdl_df$pname == rownames(test.df)[name])
for (date in 1:(length(dates)-1)) {
#Creating a month interval consisting of the time between two adjecent months
interval1 <- interval(ymd(colnames(test.df)[date]),ymd(colnames(test.df)[date+1]))
for (row in 1:nrow(person_rows)) {
#check if overlap between df month interval and employment intervals.
interval2 <- interval(ymd(person_rows$begin_date[row]),ymd(person_rows$end_date[row]))
if (int_overlaps(interval1, interval2)){
#checking if df period and work period overlap. If so rank is inserted otherwise N is entered
test.df[name,date] <- test_rows$fname_code[row]
break
}else{
test.df[name,date] <- "N"
}
}
}
}
该数据集由大约 3000 名员工组成,我的计算机需要大约 6-7 个小时才能完成这项工作。在接下来的几周内,我需要在各种数据集上多次 运行 和重新 运行 脚本,因此非常感谢您的帮助!
编辑:数据集前 50 行的 dput 输出。
> dput(droplevels(head(mdl_df, 50)))
structure(list(pname = structure(c(7L, 7L, 24L, 24L, 8L, 19L,
16L, 16L, 16L, 4L, 34L, 11L, 17L, 12L, 23L, 10L, 14L, 14L, 14L,
14L, 14L, 32L, 5L, 22L, 29L, 3L, 13L, 25L, 2L, 6L, 26L, 18L,
21L, 27L, 27L, 28L, 20L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 15L, 31L,
33L, 30L, 30L, 1L), .Label = c("A. Gordon Bagnall", "Bertil Gotthard Ohlin",
"Birgit Nissen", "Bryan Fullerton Adams", "C.H. Wykes", "Christian Olsen",
"Dr. Edmond Privat", "Dr. G. G. Kullmann", "Eugène Henri René Vigier",
"Ewan P. Wallis-Jones", "Francis Yeats-Brown", "Francisco Walker-Linares",
"Frank Horsfall Nixon", "Frank Paul Walters", "Franklin Urteaga",
"Gerald Heguerty Furtado Abraham", "Gladys Wade", "Guillaume Théodore Conrad Zwerner",
"Henri Bonnet", "Haakon Vigander", "Ignacio J. Valdes", "Ingvad Nielsen",
"Jessie Irene Wall", "Joseph Louis Marie Charles Avenol", "Julian Nogueira",
"Konni Zilliacus", "Luis Varela-Obregoso", "Marc Veillet-Lavallee",
"Maria Nielsen", "Peter Martin Anker", "Pierre Achille Louis Eugène Quesnay",
"Pierre Henry Watier", "Prof. Fred Alexander", "Robert André Felix Bach"
), class = "factor"), fname_code = c(3L, 2L, 1L, 1L, 2L, 2L,
2L, 0L, 2L, 4L, 2L, 2L, 2L, 4L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 6L,
0L, 0L, 1L, 2L, 3L, 6L, 2L, 2L, 2L, 2L, 2L, 2L), begin_date = structure(c(-17653,
-17557, -17136, -13333, -14157, -17897, -18050, -13789, -8962,
-15010, -11810, -15372, -14003, -14855, -16047, -12900, -18494,
-18254, -14245, -13333, -11172, -12008, -18398, -14360, -15002,
-11802, -17883, -12862, -14245, -17136, -18248, -14975, -13989,
-15494, -15372, -14108, -14738, -18201, -17849, -17849, -11657,
-10592, -10579, -10130, -11436, -16849, -13631, -14033, -11161,
-12620), class = "Date"), end_date = structure(c(-17618, -17468,
-13333, -10715, -11340, -14243, -13789, -11223, -8624, -11178,
-10797, -17543, -13982, -8555, -15628, -12879, -18254, -14245,
-13333, -11172, -10809, -11822, -18255, -14339, -14988, -11781,
-17078, -11158, -13958, -16590, -11401, -14610, -13968, -15434,
-15007, -13920, -14717, -17849, -8524, -8524, -8524, -8524, -8524,
-8524, -11415, -15707, -13613, -11161, -8555, -12614), class = "Date")),
.Names = c("pname", "fname_code", "begin_date", "end_date"), row.names = c(NA, 50L), class = "data.frame")
运行 Rstudio v.1.0.136
附带的包裹:
[1] dplyr_0.7.1 purrr_0.2.2.2 readr_1.1.1 tidyr_0.6.3 tibble_1.3.3 ggplot2_2.2.1
[7] tidyverse_1.1.1 lubridate_1.6.0
data.table
软件包的 1.9.8 版(2016 年 11 月 25 日在 CRAN 上)引入了非等值连接,这对于查找就业重叠非常有用以月为间隔的时间。 dcast()
然后用于从长格式重塑为宽格式。
library(data.table)
# coerce to data.table
setDT(mdl_df)[
# right join with sequence of monthly intervals
.(mseq = seq(as.Date("1944-01-01"), length.out = 4L, by = "1 month")),
# using non-equi join conditions
on = .(begin_date <= mseq, end_date >= mseq)][
# reshape from wide to long format,
# show rank (concatenate in case of multiple ranks)
, dcast(unique(.SD), pname ~ end_date, toString, value.var = "fname_code")]
pname 1944-01-01 1944-02-01 1944-03-01 1944-04-01 1: Eugène Henri René Vigier 0, 1, 2, 3, 6 0, 1, 2, 3, 6 0, 1, 2, 3, 6 0, 1, 2, 3, 6 2: Francisco Walker-Linares 4 4 4 4 3: Peter Martin Anker 2 2 2 2
编辑
result <- setDT(mdl_df)[
.(mseq = seq(as.Date("1919-01-01"), as.Date("1948-12-30"), by = "1 month")),
on = .(begin_date <= mseq, end_date >= mseq), nomatch = 0L, allow.cartesian = TRUE][
, dcast(unique(.SD), pname ~ end_date, toString, value.var = "fname_code")]
result
共27行328列,只能分段打印:
result[, 1:5]
pname 1919-06-01 1919-07-01 1919-08-01 1919-09-01 1: Bertil Gotthard Ohlin 2: Bryan Fullerton Adams 3: C.H. Wykes 2 4: Christian Olsen 5: Dr. Edmond Privat 6: Dr. G. G. Kullmann 7: Eugène Henri René Vigier 8: Francisco Walker-Linares 9: Frank Horsfall Nixon 10: Frank Paul Walters 3 3 3 3 11: Franklin Urteaga 12: Gerald Heguerty Furtado Abraham 13: Gladys Wade 14: Guillaume Théodore Conrad Zwerner 15: Henri Bonnet 16: Haakon Vigander 17: Ignacio J. Valdes 18: Jessie Irene Wall 19: Joseph Louis Marie Charles Avenol 20: Julian Nogueira 21: Konni Zilliacus 22: Luis Varela-Obregoso 23: Marc Veillet-Lavallee 24: Peter Martin Anker 25: Pierre Achille Louis Eugène Quesnay 26: Pierre Henry Watier 27: Robert André Felix Bach pname 1919-06-01 1919-07-01 1919-08-01 1919-09-01
请注意,显示的第一个日期是 1919-06-01,因为没有更早的匹配项。同样,最后一列 328 是 1946-08-01.
result[, c(1, 328 - 2:0)]
pname 1946-06-01 1946-07-01 1946-08-01 1: Bertil Gotthard Ohlin 2: Bryan Fullerton Adams 3: C.H. Wykes 4: Christian Olsen 5: Dr. Edmond Privat 6: Dr. G. G. Kullmann 7: Eugène Henri René Vigier 0, 1, 2, 3, 6 0, 1, 2, 3, 6 0, 1, 2, 3, 6 8: Francisco Walker-Linares 4 4 9: Frank Horsfall Nixon 10: Frank Paul Walters 11: Franklin Urteaga 12: Gerald Heguerty Furtado Abraham 13: Gladys Wade 14: Guillaume Théodore Conrad Zwerner 15: Henri Bonnet 16: Haakon Vigander 17: Ignacio J. Valdes 18: Jessie Irene Wall 19: Joseph Louis Marie Charles Avenol 20: Julian Nogueira 21: Konni Zilliacus 22: Luis Varela-Obregoso 23: Marc Veillet-Lavallee 24: Peter Martin Anker 2 2 25: Pierre Achille Louis Eugène Quesnay 26: Pierre Henry Watier 27: Robert André Felix Bach pname 1946-06-01 1946-07-01 1946-08-01
请注意 mdl_df
有 mdl_df[, uniqueN(pname)]
34 个唯一名称,而 result
只有 27 个。这是由于数据不一致,即 begin_date
后来 end_date
或周期太短:
# anti-join
mdl_df[!result, on = "pname"]
pname fname_code begin_date end_date 1: Francis Yeats-Brown 2 1927-12-01 1921-12-21 2: Ewan P. Wallis-Jones 2 1934-09-07 1934-09-28 3: Ingvad Nielsen 2 1930-09-08 1930-09-29 4: Maria Nielsen 2 1928-12-05 1928-12-19 5: Birgit Nissen 2 1937-09-09 1937-09-30 6: Prof. Fred Alexander 2 1932-09-06 1932-09-24 7: A. Gordon Bagnall 2 1935-06-14 1935-06-20