将时间间隔转换为 time/person 矩阵

Converting time intervals into time/person matrix

我正在做一个历史项目,我们需要了解每个月的员工人数。对于数据集中的每个人,我都有他们受雇的时期。 fname_code 不同职位的代码。举个例子,Edmont Privat博士在以下各个时期有两个不同的功能:

                             pname  fname_code begin_date   end_date
1                 Dr. Edmond Privat          3 1921-09-02 1921-10-07
2                 Dr. Edmond Privat          2 1921-12-07 1922-03-06
3 Joseph Louis Marie Charles Avenol          1 1923-02-01 1933-07-01
4 Joseph Louis Marie Charles Avenol          1 1933-07-01 1940-08-31
5               Dr. G. G.  Kullmann          2 1931-03-30 1938-12-15

我的想法是将此信息转换为 subject/date 句点 dataframe/matrix,其中 N 表示此时此人不在公司,而数字表示他们被雇用并且他们的地位。这是我的想法的一个例子:

                               1944-07-01 1944-08-01 1944-09-01 1944-10-01 
Albert Dan Meurig Evans                 N          N          N          N               
Genevieve Jeanne Leonie Mayor           N          2          3          3      

我已经做了一个可以完成这项工作的东西 - 它已经完成了上面的几行 - 但是,代码是 definetley 不适合胆小的程序员(很多前循环!)。我的问题是你们中有没有经验丰富的程序员有提高速度的建议,或者以完全不同的方式实现我的目标。 我试过玩一些 dplyr 函数,但我对它们的经验太少,无法使它们工作。我还考虑过创建一个 if 条件来处理一个人刚刚被雇用一段时间的情况,因为在这种情况下不需要 forloop - 但我不确定在哪里最佳实施它。

我的计算灾难的逻辑是查看数据框和就业数据框的月份间隔之间是否存在重叠:

library(lubridate)
library(tidyverse)
#creating sequence of dates for columns
start_date <- as.Date("1919-01-01")
end_date <- as.Date("1948-12-30")
dates <- seq.Date(start_date, end_date, by ="month")

#dates as columns and names on columns
test.df <- matrix(ncol =length(dates), nrow = nlevels(mdl_df$pname))
test.df <- as.data.frame(test.df)
colnames(test.df) <- dates
rownames(test.df) <- levels(mdl_df$pname)

for (name in 1:nlevels(mdl_df$pname)){ 
#subsetting the rows for each person
  person_rows <- mdl_df %>% filter( mdl_df$pname == rownames(test.df)[name]) 


  for (date in 1:(length(dates)-1)) { 

    #Creating a month interval consisting of the time between two adjecent months
    interval1 <- interval(ymd(colnames(test.df)[date]),ymd(colnames(test.df)[date+1]))  

    for (row in 1:nrow(person_rows)) {
        #check if overlap between df month interval and employment intervals.

       interval2 <- interval(ymd(person_rows$begin_date[row]),ymd(person_rows$end_date[row]))    

       if (int_overlaps(interval1, interval2)){
       #checking if df period and work period overlap. If so rank is inserted otherwise N is entered

         test.df[name,date] <- test_rows$fname_code[row]
         break
       }else{
         test.df[name,date] <- "N"
    }
   }    
  }
}

该数据集由大约 3000 名员工组成,我的计算机需要大约 6-7 个小时才能完成这项工作。在接下来的几周内,我需要在各种数据集上多次 运行 和重新 运行 脚本,因此非常感谢您的帮助!

编辑:数据集前 50 行的 dput 输出。

> dput(droplevels(head(mdl_df, 50)))
structure(list(pname = structure(c(7L, 7L, 24L, 24L, 8L, 19L, 
16L, 16L, 16L, 4L, 34L, 11L, 17L, 12L, 23L, 10L, 14L, 14L, 14L, 
14L, 14L, 32L, 5L, 22L, 29L, 3L, 13L, 25L, 2L, 6L, 26L, 18L, 
21L, 27L, 27L, 28L, 20L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 15L, 31L, 
33L, 30L, 30L, 1L), .Label = c("A. Gordon Bagnall", "Bertil Gotthard Ohlin", 
"Birgit Nissen", "Bryan Fullerton Adams", "C.H. Wykes", "Christian Olsen", 
"Dr. Edmond Privat", "Dr. G. G.  Kullmann", "Eugène Henri René  Vigier", 
"Ewan P. Wallis-Jones", "Francis Yeats-Brown", "Francisco Walker-Linares", 
"Frank Horsfall Nixon", "Frank Paul Walters", "Franklin Urteaga", 
"Gerald Heguerty Furtado Abraham", "Gladys Wade", "Guillaume Théodore Conrad Zwerner", 
"Henri Bonnet", "Haakon Vigander", "Ignacio J. Valdes", "Ingvad Nielsen", 
"Jessie Irene Wall", "Joseph Louis Marie Charles Avenol", "Julian Nogueira", 
"Konni Zilliacus", "Luis Varela-Obregoso", "Marc Veillet-Lavallee", 
"Maria Nielsen", "Peter Martin Anker", "Pierre Achille Louis Eugène Quesnay", 
"Pierre Henry Watier", "Prof. Fred Alexander", "Robert André Felix Bach"
), class = "factor"), fname_code = c(3L, 2L, 1L, 1L, 2L, 2L, 
2L, 0L, 2L, 4L, 2L, 2L, 2L, 4L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 2L, 2L, 6L, 
0L, 0L, 1L, 2L, 3L, 6L, 2L, 2L, 2L, 2L, 2L, 2L), begin_date = structure(c(-17653, 
-17557, -17136, -13333, -14157, -17897, -18050, -13789, -8962, 
-15010, -11810, -15372, -14003, -14855, -16047, -12900, -18494, 
-18254, -14245, -13333, -11172, -12008, -18398, -14360, -15002, 
-11802, -17883, -12862, -14245, -17136, -18248, -14975, -13989, 
-15494, -15372, -14108, -14738, -18201, -17849, -17849, -11657, 
-10592, -10579, -10130, -11436, -16849, -13631, -14033, -11161, 
-12620), class = "Date"), end_date = structure(c(-17618, -17468, 
-13333, -10715, -11340, -14243, -13789, -11223, -8624, -11178, 
-10797, -17543, -13982, -8555, -15628, -12879, -18254, -14245, 
-13333, -11172, -10809, -11822, -18255, -14339, -14988, -11781, 
-17078, -11158, -13958, -16590, -11401, -14610, -13968, -15434, 
-15007, -13920, -14717, -17849, -8524, -8524, -8524, -8524, -8524, 
-8524, -11415, -15707, -13613, -11161, -8555, -12614), class = "Date")), 
.Names = c("pname", "fname_code", "begin_date", "end_date"), row.names = c(NA, 50L), class = "data.frame")

运行 Rstudio v.1.0.136

附带的包裹:

[1] dplyr_0.7.1     purrr_0.2.2.2   readr_1.1.1     tidyr_0.6.3     tibble_1.3.3    ggplot2_2.2.1  
[7] tidyverse_1.1.1 lubridate_1.6.0

data.table 软件包的 1.9.8 版(2016 年 11 月 25 日在 CRAN 上)引入了非等值连接,这对于查找就业重叠非常有用以月为间隔的时间。 dcast() 然后用于从长格式重塑为宽格式。

library(data.table)
# coerce to data.table
setDT(mdl_df)[
  # right join with sequence of monthly intervals
  .(mseq = seq(as.Date("1944-01-01"), length.out = 4L, by = "1 month")), 
  # using non-equi join conditions
  on = .(begin_date <= mseq, end_date >= mseq)][
    # reshape from wide to long format,
    # show rank (concatenate in case of multiple ranks)
    , dcast(unique(.SD), pname ~ end_date, toString, value.var = "fname_code")]
                         pname    1944-01-01    1944-02-01    1944-03-01    1944-04-01
1: Eugène Henri René  Vigier 0, 1, 2, 3, 6 0, 1, 2, 3, 6 0, 1, 2, 3, 6 0, 1, 2, 3, 6
2:    Francisco Walker-Linares             4             4             4             4
3:          Peter Martin Anker             2             2             2             2

编辑

,OP 已要求涵盖 1919-01-01 至 1948-12-30 期间。这里,我们需要修改join参数:

result <- setDT(mdl_df)[
  .(mseq = seq(as.Date("1919-01-01"), as.Date("1948-12-30"), by = "1 month")), 
  on = .(begin_date <= mseq, end_date >= mseq), nomatch = 0L, allow.cartesian = TRUE][
    , dcast(unique(.SD), pname ~ end_date, toString, value.var = "fname_code")]

result共27行328列,只能分​​段打印:

result[, 1:5]
                                   pname 1919-06-01 1919-07-01 1919-08-01 1919-09-01
 1:                Bertil Gotthard Ohlin                                            
 2:                Bryan Fullerton Adams                                            
 3:                           C.H. Wykes                                           2
 4:                      Christian Olsen                                            
 5:                    Dr. Edmond Privat                                            
 6:                  Dr. G. G.  Kullmann                                            
 7:          Eugène Henri René  Vigier                                            
 8:             Francisco Walker-Linares                                            
 9:                 Frank Horsfall Nixon                                            
10:                   Frank Paul Walters          3          3          3          3
11:                     Franklin Urteaga                                            
12:      Gerald Heguerty Furtado Abraham                                            
13:                          Gladys Wade                                            
14:   Guillaume Théodore Conrad Zwerner                                            
15:                         Henri Bonnet                                            
16:                      Haakon Vigander                                            
17:                    Ignacio J. Valdes                                            
18:                    Jessie Irene Wall                                            
19:    Joseph Louis Marie Charles Avenol                                            
20:                      Julian Nogueira                                            
21:                      Konni Zilliacus                                            
22:                 Luis Varela-Obregoso                                            
23:                Marc Veillet-Lavallee                                            
24:                   Peter Martin Anker                                            
25: Pierre Achille Louis Eugène Quesnay                                            
26:                  Pierre Henry Watier                                            
27:             Robert André Felix Bach                                            
                                   pname 1919-06-01 1919-07-01 1919-08-01 1919-09-01

请注意,显示的第一个日期是 1919-06-01,因为没有更早的匹配项。同样,最后一列 328 是 1946-08-01.

result[, c(1, 328 - 2:0)]
                                   pname    1946-06-01    1946-07-01    1946-08-01
 1:                Bertil Gotthard Ohlin                                          
 2:                Bryan Fullerton Adams                                          
 3:                           C.H. Wykes                                          
 4:                      Christian Olsen                                          
 5:                    Dr. Edmond Privat                                          
 6:                  Dr. G. G.  Kullmann                                          
 7:          Eugène Henri René  Vigier 0, 1, 2, 3, 6 0, 1, 2, 3, 6 0, 1, 2, 3, 6
 8:             Francisco Walker-Linares             4             4              
 9:                 Frank Horsfall Nixon                                          
10:                   Frank Paul Walters                                          
11:                     Franklin Urteaga                                          
12:      Gerald Heguerty Furtado Abraham                                          
13:                          Gladys Wade                                          
14:   Guillaume Théodore Conrad Zwerner                                          
15:                         Henri Bonnet                                          
16:                      Haakon Vigander                                          
17:                    Ignacio J. Valdes                                          
18:                    Jessie Irene Wall                                          
19:    Joseph Louis Marie Charles Avenol                                          
20:                      Julian Nogueira                                          
21:                      Konni Zilliacus                                          
22:                 Luis Varela-Obregoso                                          
23:                Marc Veillet-Lavallee                                          
24:                   Peter Martin Anker             2             2              
25: Pierre Achille Louis Eugène Quesnay                                          
26:                  Pierre Henry Watier                                          
27:             Robert André Felix Bach                                          
                                   pname    1946-06-01    1946-07-01    1946-08-01

请注意 mdl_dfmdl_df[, uniqueN(pname)] 34 个唯一名称,而 result 只有 27 个。这是由于数据不一致,即 begin_date 后来 end_date 或周期太短:

# anti-join
mdl_df[!result, on = "pname"]
                  pname fname_code begin_date   end_date
1:  Francis Yeats-Brown          2 1927-12-01 1921-12-21
2: Ewan P. Wallis-Jones          2 1934-09-07 1934-09-28
3:       Ingvad Nielsen          2 1930-09-08 1930-09-29
4:        Maria Nielsen          2 1928-12-05 1928-12-19
5:        Birgit Nissen          2 1937-09-09 1937-09-30
6: Prof. Fred Alexander          2 1932-09-06 1932-09-24
7:    A. Gordon Bagnall          2 1935-06-14 1935-06-20