R: Converting consecutive dates from a single column into a 2-column range

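I am trying to figure out how to collapse rows that have a single date column so that the new table/data frame/tibble has two columns: one for the start date and one for the end date, but only for consecutive dates (i.e. any gap in the dates should start a new row in the new table). The result should also be grouped by the other categories.

An example of the kind of data I am working with: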

   Person ID   Department   Date     
   351581      JE           12/1/2019
   351581      JE           12/2/2019
   351581      FR           12/2/2019
   351581      JE           12/3/2019
   598168      GH           12/16/2019
   351581      JE           12/8/2019
   351581      JE           12/9/2019
   615418      AB           12/20/2019
   615418      AB           12/22/2019

The desired result is:

   Person ID   Department   Start Date      End Date
   351581      JE           12/1/2019       12/3/2019
   351581      FR           12/2/2019       12/2/2019
   598168      GH           12/16/2019      12/16/2019
   351581      JE           12/8/2019       12/9/2019
   615418      AB           12/20/2019      12/20/2019
   615418      AB           12/22/2019      12/22/2019

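My searching so far has turned up a few possibly related questions that deal with combining date ranges, but I am not sure how to apply them when there is only a single date column.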

dplyr

Adding this for the benefit of future readers: I ended up applying the accepted solution with dplyr, simply because I prefer its syntax.

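library(dplyr)

df %>%
  # parse the m/d/Y strings; as.Date() without a format would not read them correctly
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  arrange(`Person ID`, Department, Date) %>%
  # g increases by 1 whenever the gap to the previous row's date is not exactly 1 day
  group_by(`Person ID`, Department,
           g = cumsum(c(0, diff(Date)) != 1)) %>%
  summarize(Start = min(Date), End = max(Date)) %>%
  ungroup() %>%
  select(-g)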

We assume here that what is wanted is the minimum and maximum date within each consecutive run of Person_ID and Department.

1) data.table  First convert the Date column to Date class, then group by rleid(Person_ID, Department) and take the minimum and maximum.

library(data.table)
library(lubridate)

DT <- as.data.table(DF0)
DT[, Date := mdy(Date)][
   , list(start = min(Date), end = max(Date)), 
   by = .(rleid(Person_ID, Department), Person_ID, Department)][, -1]  # drop the rleid column

giving:

   Person_ID Department      start        end
1:    351581         GH 2019-12-01 2019-12-03
2:    351581         FR 2019-12-02 2019-12-02
3:    598168         GH 2019-12-16 2019-12-16
4:    351581         JE 2019-12-08 2019-12-09
5:    615418         AB 2019-12-20 2019-12-20

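2) Base R  Convert Date to Date class and then create a grouping variable g using rle. Then define a Range function that outputs the start and end for a given group, and apply it to each group.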

DF <- transform(DF0, Date = as.Date(Date, "%m/%d/%Y"))
g <- with(rle(paste(DF$Person_ID, DF$Department)), rep(seq_along(lengths), lengths))
Range <- function(x) data.frame(x[1, 1:2], start = min(x$Date), end = max(x$Date))
do.call("rbind", by(DF, g, Range))

giving:

  Person_ID Department      start        end
1    351581         GH 2019-12-01 2019-12-03
2    351581         FR 2019-12-02 2019-12-02
3    598168         GH 2019-12-16 2019-12-16
4    351581         JE 2019-12-08 2019-12-09
5    615418         AB 2019-12-20 2019-12-20
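
The run-id idiom used in 2) may be easier to follow on a toy vector. The following is only an illustrative sketch: the vector key is made up and stands in for paste(DF$Person_ID, DF$Department). Note that this kind of id changes only when the label changes between adjacent rows; it does not look at the dates, so it does not by itself split a run at a gap in dates.

```r
# Hypothetical labels standing in for paste(Person_ID, Department)
key <- c("351581 GH", "351581 GH", "351581 FR", "351581 GH")

# rle() finds runs of identical adjacent values; number the runs 1, 2, 3, ...
runs <- rle(key)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# The last "351581 GH" gets its own id because it is not adjacent to the first two
print(run_id)
```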

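3) dplyr/data.table  This is a hybrid approach that uses rleid from data.table and otherwise uses dplyr. Convert the dates with lubridate and group by rleid, Person_ID and Department. The last two are included to ensure that they appear in the output. Compute start and end and then remove the grouping column.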

library(dplyr)
library(data.table)
library(lubridate)

DF0 %>%
  mutate(Date = mdy(Date)) %>%
  group_by(g = rleid(Person_ID, Department), Person_ID, Department) %>%
  summarize(start = min(Date), end = max(Date)) %>%
  ungroup %>%
  select(-g)

giving:

# A tibble: 5 x 4
  Person_ID Department start      end       
      <int> <fct>      <date>     <date>    
1    351581 GH         2019-12-01 2019-12-03
2    351581 FR         2019-12-02 2019-12-02
3    598168 GH         2019-12-16 2019-12-16
4    351581 JE         2019-12-08 2019-12-09
5    615418 AB         2019-12-20 2019-12-20

4) sqldf  Define a group variable Grp in the inner select, then find the minimum and maximum date by Grp.

library(sqldf)

DF <- transform(DF0, Date = as.Date(Date, "%m/%d/%Y"))

sqldf("select Person_ID, Department, min(Date) as start__Date, max(Date) as end__Date
from ( select 
    rowid r, 
    Person_ID, 
    Department, 
    Date, 
    Date - dense_rank() over (partition by Person_ID, Department order by rowid) as Grp
  from DF
) group by Grp order by r", method = "name__class")

giving:

  Person_ID Department      start        end
1    351581         GH 2019-12-01 2019-12-03
2    351581         FR 2019-12-02 2019-12-02
3    598168         GH 2019-12-16 2019-12-16
4    351581         JE 2019-12-08 2019-12-09
5    615418         AB 2019-12-20 2019-12-20

Note

The input is assumed to be:

Lines <- "Person_ID   Department   Date     
   351581      GH           12/1/2019
   351581      GH           12/2/2019
   351581      GH           12/3/2019
   351581      FR           12/2/2019
   598168      GH           12/16/2019
   351581      JE           12/8/2019
   351581      JE           12/9/2019
   615418      AB           12/20/2019"

DF0 <- read.table(text = Lines, header = TRUE)

Assuming you have already filtered out the data with gaps, this seems to me to be a pretty clean solution. Is that what you are looking for?


library(dplyr)

# Note the upper-case %Y: the years are four digits. With %y only "20" is read,
# which would turn every date into a 2020 date.
df <- tibble::tribble(~`Person ID`, ~`Department`,    ~`Date`,
                      "351581"    ,          "GH", as.Date("12/1/2019", format = "%m/%d/%Y"),
                      "351581"    ,          "GH", as.Date("12/2/2019", format = "%m/%d/%Y"),
                      "351581"    ,          "GH", as.Date("12/3/2019", format = "%m/%d/%Y"),
                      "351581"    ,          "FR", as.Date("12/2/2019", format = "%m/%d/%Y"),
                      "598168"    ,          "GH", as.Date("12/16/2019", format = "%m/%d/%Y"),
                      "351581"    ,          "JE", as.Date("12/8/2019", format = "%m/%d/%Y"),
                      "351581"    ,          "JE", as.Date("12/9/2019", format = "%m/%d/%Y"),
                      "615418"    ,          "AB", as.Date("12/20/2019", format = "%m/%d/%Y"))

df %>%
  group_by(`Person ID`, Department) %>%
  summarise(`Start Date` = min(Date),
            `End Date` = max(Date)) %>% 
  ungroup()

#> # A tibble: 5 x 4
#>   `Person ID` Department `Start Date` `End Date`
#>   <chr>       <chr>      <date>       <date>    
#> 1 351581      FR         2019-12-02   2019-12-02
#> 2 351581      GH         2019-12-01   2019-12-03
#> 3 351581      JE         2019-12-08   2019-12-09
#> 4 598168      GH         2019-12-16   2019-12-16
#> 5 615418      AB         2019-12-20   2019-12-20

Using dplyr

Assuming your data is in a data.frame named data, you can get the result by grouping on Person ID and Department:

library(dplyr)
data %>%
  group_by(`Person ID`, Department) %>%
  summarise(`Start Date` = min(as.Date(Date, format = "%m/%d/%Y")), 
            `End Date` = max(as.Date(Date, format = "%m/%d/%Y")))

The output will be:

# A tibble: 5 x 4
# Groups:   Person_id [3]
  Person ID Department `Start Date` `End Date`
      <int> <fct>      <date>       <date>    
1    351581 FR         2019-12-02   2019-12-02
2    351581 GH         2019-12-01   2019-12-03
3    351581 JE         2019-12-08   2019-12-09
4    598168 GH         2019-12-16   2019-12-16
5    615418 AB         2019-12-20   2019-12-20

Hope this helps.

Here is a base R solution

dfout <- do.call(rbind,
                 c(lapply(split(df,cut(1:nrow(df),c(0,cumsum(rle(df$Department)$lengths)))), 
                          function(x) data.frame(unique(x[-3]),
                                                 `Start Date` = head(x[,3],1),
                                                 `End Date` = tail(x[,3],1))),
                   make.row.names = F)
                 )

such that

> dfout
  Person.ID Department Start.Date   End.Date
1    351581         GH  12/1/2019  12/3/2019
2    351581         FR  12/2/2019  12/2/2019
3    598168         GH 12/16/2019 12/16/2019
4    351581         JE  12/8/2019  12/9/2019
5    615418         AB 12/20/2019 12/20/2019

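Here I check whether the difference from the previous date (diff(Date)) is not 1 and, if so, start a new group (taking the cumsum of this indicator means that g increases by 1 each time it is TRUE).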

library(data.table)
setDT(df)

df[, Date := as.Date(Date, format = '%m/%d/%Y')]


df[, .(start = min(Date), end = max(Date)),
   by = .(Person_ID, Department, g = cumsum(c(0, diff(Date)) != 1))]

#    Person_ID Department g      start        end
# 1:    351581         GH 1 2019-12-01 2019-12-03
# 2:    351581         FR 2 2019-12-02 2019-12-02
# 3:    598168         GH 3 2019-12-16 2019-12-16
# 4:    351581         JE 4 2019-12-08 2019-12-09
# 5:    615418         AB 5 2019-12-20 2019-12-20
# 6:    615418         AB 6 2019-12-22 2019-12-22
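
To see how the gap indicator behaves in isolation, here is a small base R sketch on a made-up date vector (dates, new_run and g are illustrative names, not objects from the answer):

```r
# Hypothetical dates with two gaps (after 12/3 and after 12/9)
dates <- as.Date(c("2019-12-01", "2019-12-02", "2019-12-03",
                   "2019-12-08", "2019-12-09", "2019-12-20"))

# TRUE where the gap to the previous date is not exactly 1 day;
# the leading 0 makes the first element start a new group
new_run <- c(0, as.numeric(diff(dates))) != 1

# The running count of TRUEs gives one id per stretch of consecutive dates
g <- cumsum(new_run)
print(g)
```

Each stretch of consecutive dates shares one value of g, which is why adding g to the grouping columns splits the ranges at the gaps.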

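If your data is not already sorted by date within the (Person_ID, Department) groups, you can add an order() call to the i part of df[i, j, k]. Ordering by Date alone can interleave rows from different groups and break up their runs, so order by the grouping columns first, i.e. change the code above to

df[order(Person_ID, Department, Date), .(start = min(Date), end = max(Date)),
   by = .(Person_ID, Department, g = cumsum(c(0, diff(Date)) != 1))]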

Note that for this updated example, this is different from grouping by Person_ID and Department only:

df[, .(start = min(Date), end = max(Date)),
   by = .(Person_ID, Department)]

#    Person_ID Department      start        end
# 1:    351581         GH 2019-12-01 2019-12-03
# 2:    351581         FR 2019-12-02 2019-12-02
# 3:    598168         GH 2019-12-16 2019-12-16
# 4:    351581         JE 2019-12-08 2019-12-09
# 5:    615418         AB 2019-12-20 2019-12-22

Data used:

df <- fread('
   Person_ID   Department   Date     
   351581      GH           12/1/2019
   351581      GH           12/2/2019
   351581      GH           12/3/2019
   351581      FR           12/2/2019
   598168      GH           12/16/2019
   351581      JE           12/8/2019
   351581      JE           12/9/2019
   615418      AB           12/20/2019
  615418      AB           12/22/2019
')
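
For readers who want to avoid extra packages, the same gap-aware grouping can be sketched in base R. This is only one possible arrangement, not the method of any answer above: the data frame below is a hand-typed copy of the data used here, and res is a made-up name. Computing the indicator per group with ave() means rows of different groups cannot affect each other's runs.

```r
# Hand-typed copy of the example data (dates already parsed)
df <- data.frame(
  Person_ID  = c(351581, 351581, 351581, 351581, 598168,
                 351581, 351581, 615418, 615418),
  Department = c("GH", "GH", "GH", "FR", "GH", "JE", "JE", "AB", "AB"),
  Date       = as.Date(c("12/1/2019", "12/2/2019", "12/3/2019", "12/2/2019",
                         "12/16/2019", "12/8/2019", "12/9/2019",
                         "12/20/2019", "12/22/2019"), format = "%m/%d/%Y")
)

# Sort so that dates are increasing within each Person_ID/Department pair
df <- df[order(df$Person_ID, df$Department, df$Date), ]

# Within each pair, start a new run whenever the gap to the previous date is not 1 day
df$g <- ave(as.numeric(df$Date), df$Person_ID, df$Department,
            FUN = function(d) cumsum(c(0, diff(d)) != 1))

# One output row per run: its first and last date
pieces <- split(df, list(df$Person_ID, df$Department, df$g), drop = TRUE)
res <- do.call(rbind, lapply(pieces, function(x)
  data.frame(Person_ID = x$Person_ID[1], Department = x$Department[1],
             start = min(x$Date), end = max(x$Date))))
rownames(res) <- NULL
print(res)
```

The row order of res follows the split() levels rather than the original row order, so sort it afterwards if the order matters.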