R: Converting consecutive dates from a single column into a 2-column range
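I'm trying to figure out how to merge rows that have a single date column so that the new table/data frame/tibble has two columns: one for the start date and one for the end date, but only for consecutive dates (i.e., any gap in the dates should start a new row in the new table). The result should also be grouped by different categories.
An example of the kind of data I'm working with: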
Person ID Department Date
351581 JE 12/1/2019
351581 JE 12/2/2019
351581 FR 12/2/2019
351581 JE 12/3/2019
598168 GH 12/16/2019
351581 JE 12/8/2019
351581 JE 12/9/2019
615418 AB 12/20/2019
615418 AB 12/22/2019
The desired result is:
Person ID Department Start Date End Date
351581 JE 12/1/2019 12/3/2019
351581 FR 12/2/2019 12/2/2019
598168 GH 12/16/2019 12/16/2019
351581 JE 12/8/2019 12/9/2019
615418 AB 12/20/2019 12/20/2019
615418 AB 12/22/2019 12/22/2019
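So far, my searches have turned up several possibly related questions about combining date ranges, but I'm not sure how to apply them to just a single date column.
dplyr
Adding this for the benefit of future readers: I ended up applying the accepted solution using dplyr, simply because I prefer the table syntax.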
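library(dplyr)

df %>%
  # parse month/day/year strings such as "12/1/2019"
  mutate(Date = as.Date(Date, format = "%m/%d/%Y")) %>%
  arrange(`Person ID`, Department, Date) %>%
  group_by(`Person ID`, Department,
           # start a new run whenever the step from the previous row is not one day
           g = cumsum(c(0, diff(Date)) != 1)) %>%
  summarize(Start = min(Date), End = max(Date)) %>%
  ungroup() %>%
  select(-g)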
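We assume here that what is wanted is the minimum and maximum date within each consecutive run of Person_ID and Department.
1) data.table First convert the Date column to Date class, then group by consecutive runs of Person_ID and Department and take the minimum and maximum.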
library(data.table)
library(lubridate)
DT <- as.data.table(DF0)
DT[, Date := mdy(Date)][
  , list(start = min(Date), end = max(Date)),
  by = .(rleid(Person_ID, Department), Person_ID, Department)][, -1]  # drop the rleid column
giving:
Person_ID Department start end
1: 351581 GH 2019-12-01 2019-12-03
2: 351581 FR 2019-12-02 2019-12-02
3: 598168 GH 2019-12-16 2019-12-16
4: 351581 JE 2019-12-08 2019-12-09
5: 615418 AB 2019-12-20 2019-12-20
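2) Base R Convert Date to Date class, then create the grouping variable g from the run lengths of the combined Person_ID and Department key. Then define a Range function that outputs the start and end for a given group and apply it to each group.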
DF <- transform(DF0, Date = as.Date(Date, "%m/%d/%Y"))
g <- with(rle(paste(DF$Person_ID, DF$Department)), rep(seq_along(lengths), lengths))
Range <- function(x) data.frame(x[1, 1:2], start = min(x$Date), end = max(x$Date))
do.call("rbind", by(DF, g, Range))
giving:
Person_ID Department start end
1 351581 GH 2019-12-01 2019-12-03
2 351581 FR 2019-12-02 2019-12-02
3 598168 GH 2019-12-16 2019-12-16
4 351581 JE 2019-12-08 2019-12-09
5 615418 AB 2019-12-20 2019-12-20
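To see what the grouping variable g looks like, here is a minimal sketch on a made-up key vector (the `key` values are illustrative, not taken from DF0). It suggests why a key that reappears after a different key ends up in a separate group: each run of identical neighbouring values gets its own id.

```r
# Hypothetical combined keys, as produced by paste(Person_ID, Department)
key <- c("351581 GH", "351581 GH", "351581 FR", "351581 GH")

# rle() records each run of identical neighbouring values and its length
runs <- rle(key)

# Repeat the run number once per row in that run
g <- rep(seq_along(runs$lengths), runs$lengths)
print(g)
```

The last "351581 GH" gets a different id from the first two because the "351581 FR" row sits in between.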
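3) dplyr/data.table We use a mixed approach: rleid from data.table and dplyr for everything else. Convert the date using lubridate and group by the rleid run id, Person_ID and Department. The last two are only there to ensure they are included in the output. Compute start and end, then drop the grouping column.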
library(dplyr)
library(data.table)
library(lubridate)
DF0 %>%
mutate(Date = mdy(Date)) %>%
group_by(g = rleid(Person_ID, Department), Person_ID, Department) %>%
summarize(start = min(Date), end = max(Date)) %>%
ungroup %>%
select(-g)
giving:
# A tibble: 5 x 4
Person_ID Department start end
<int> <fct> <date> <date>
1 351581 GH 2019-12-01 2019-12-03
2 351581 FR 2019-12-02 2019-12-02
3 598168 GH 2019-12-16 2019-12-16
4 351581 JE 2019-12-08 2019-12-09
5 615418 AB 2019-12-20 2019-12-20
4) sqldf Define the group Grp in the inner select, then find the minimum and maximum date by Grp.
library(sqldf)
DF <- transform(DF0, Date = as.Date(Date, "%m/%d/%Y"))
sqldf("select Person_ID, Department, min(Date) as start__Date, max(Date) as end__Date
from ( select
rowid r,
Person_ID,
Department,
Date,
Date - dense_rank() over (partition by Person_ID, Department order by rowid) as Grp
from DF
) group by Grp order by r", method = "name__class")
giving:
Person_ID Department start end
1 351581 GH 2019-12-01 2019-12-03
2 351581 FR 2019-12-02 2019-12-02
3 598168 GH 2019-12-16 2019-12-16
4 351581 JE 2019-12-08 2019-12-09
5 615418 AB 2019-12-20 2019-12-20
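The idea behind Grp can also be illustrated outside SQL. The sketch below is base R on a made-up data frame (`toy` and its values are assumptions for illustration), and it uses the row position within each person as a stand-in for dense_rank(), assuming rows are already sorted by date within each person. Within a run of consecutive dates, the date and the position both increase by one per row, so their difference stays constant; a gap changes the difference.

```r
# Hypothetical data, already sorted by date within each person
toy <- data.frame(
  Person_ID = c(1, 1, 1, 1),
  Date = as.Date(c("2019-12-01", "2019-12-02", "2019-12-04", "2019-12-05"))
)

# Position of each row within its person (1, 2, 3, ...)
pos <- ave(as.numeric(toy$Date), toy$Person_ID, FUN = seq_along)

# Day number minus position: constant within a run of consecutive dates
toy$Grp <- as.numeric(toy$Date) - pos
print(toy)
```

Rows 1 and 2 share one Grp value and rows 3 and 4 share another, so grouping on Grp (together with the person and department keys) separates the two runs.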
Note
The input is assumed to be:
Lines <- "Person_ID Department Date
351581 GH 12/1/2019
351581 GH 12/2/2019
351581 GH 12/3/2019
351581 FR 12/2/2019
598168 GH 12/16/2019
351581 JE 12/8/2019
351581 JE 12/9/2019
615418 AB 12/20/2019"
DF0 <- read.table(text = Lines, header = TRUE)
Assuming you have already filtered out the data with gaps, this looks to me like a pretty clean solution. Is that what you are looking for?
library(dplyr)
# "%Y" parses the four-digit year; "%y" would read only "20" and give 2020
df <- tibble::tribble(~`Person ID`, ~`Department`, ~`Date`,
  "351581", "GH", as.Date("12/1/2019", format = "%m/%d/%Y"),
  "351581", "GH", as.Date("12/2/2019", format = "%m/%d/%Y"),
  "351581", "GH", as.Date("12/3/2019", format = "%m/%d/%Y"),
  "351581", "FR", as.Date("12/2/2019", format = "%m/%d/%Y"),
  "598168", "GH", as.Date("12/16/2019", format = "%m/%d/%Y"),
  "351581", "JE", as.Date("12/8/2019", format = "%m/%d/%Y"),
  "351581", "JE", as.Date("12/9/2019", format = "%m/%d/%Y"),
  "615418", "AB", as.Date("12/20/2019", format = "%m/%d/%Y"))
df %>%
  group_by(`Person ID`, Department) %>%
  summarise(`Start Date` = min(Date),
            `End Date` = max(Date)) %>%
  ungroup()
#> # A tibble: 5 x 4
#>   `Person ID` Department `Start Date` `End Date`
#>   <chr>       <chr>      <date>       <date>
#> 1 351581      FR         2019-12-02   2019-12-02
#> 2 351581      GH         2019-12-01   2019-12-03
#> 3 351581      JE         2019-12-08   2019-12-09
#> 4 598168      GH         2019-12-16   2019-12-16
#> 5 615418      AB         2019-12-20   2019-12-20
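Since this approach relies on there being no gaps inside a group, it may be worth checking that first. Here is a minimal base R sketch on made-up data (the `toy` data frame and its `key` column are hypothetical) that flags groups whose dates are not all consecutive:

```r
# Hypothetical data: group "A" is consecutive, group "B" has a one-day gap
toy <- data.frame(
  key  = c("A", "A", "B", "B"),
  Date = as.Date(c("2019-12-01", "2019-12-02", "2019-12-20", "2019-12-22"))
)

# For each key, TRUE if any step between sorted dates is not exactly one day
has_gap <- tapply(as.numeric(toy$Date), toy$key,
                  function(d) any(diff(sort(d)) != 1))
print(has_gap)
```

Any group flagged TRUE would be collapsed into a single range by a plain min/max per group, which may not be what is wanted.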
Using dplyr
Assuming your data is in a data.frame, you can get the result by grouping by `Person ID` and Department:
library(dplyr)
data %>%
group_by(`Person ID`, Department) %>%
summarise(`Start Date` = min(as.Date(Date, format = "%m/%d/%Y")),
`End Date` = max(as.Date(Date, format = "%m/%d/%Y")))
The output will be:
# A tibble: 5 x 4
# Groups: Person_id [3]
Person ID Department `Start Date` `End Date`
<int> <fct> <date> <date>
1 351581 FR 2019-12-02 2019-12-02
2 351581 GH 2019-12-01 2019-12-03
3 351581 JE 2019-12-08 2019-12-09
4 598168 GH 2019-12-16 2019-12-16
5 615418 AB 2019-12-20 2019-12-20
Hope this helps.
Here is a base R solution
dfout <- do.call(rbind,
  c(lapply(split(df, cut(1:nrow(df), c(0, cumsum(rle(df$Department)$lengths)))),
           function(x) data.frame(unique(x[-3]),
                                  `Start Date` = head(x[, 3], 1),
                                  `End Date` = tail(x[, 3], 1))),
    make.row.names = FALSE)
)
such that
> dfout
Person.ID Department Start.Date End.Date
1 351581 GH 12/1/2019 12/3/2019
2 351581 FR 12/2/2019 12/2/2019
3 598168 GH 12/16/2019 12/16/2019
4 351581 JE 12/8/2019 12/9/2019
5 615418 AB 12/20/2019 12/20/2019
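Here I check whether the difference from the previous date (diff(Date)) is not 1, and if so, a new group starts (taking the cumsum of this indicator means g increases by 1 whenever it is TRUE).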
library(data.table)
setDT(df)
df[, Date := as.Date(Date, format = '%m/%d/%Y')]
df[, .(start = min(Date), end = max(Date)),
by = .(Person_ID, Department, g = cumsum(c(0, diff(Date)) != 1))]
# Person_ID Department g start end
# 1: 351581 GH 1 2019-12-01 2019-12-03
# 2: 351581 FR 2 2019-12-02 2019-12-02
# 3: 598168 GH 3 2019-12-16 2019-12-16
# 4: 351581 JE 4 2019-12-08 2019-12-09
# 5: 615418 AB 5 2019-12-20 2019-12-20
# 6: 615418 AB 6 2019-12-22 2019-12-22
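The grouping expression can be traced step by step on a tiny vector. This is only a sketch on made-up dates (the `dates` vector is hypothetical); it breaks cumsum(c(0, diff(Date)) != 1) into its parts:

```r
# Hypothetical dates: a run of three days, a one-day gap, then two more days
dates <- as.Date(c("2019-12-01", "2019-12-02", "2019-12-03",
                   "2019-12-05", "2019-12-06"))

# Step in days from the previous date; 0 is a placeholder for the first row
step <- c(0, as.numeric(diff(dates)))

# TRUE wherever a new run starts, i.e. the step is not exactly one day
new_run <- step != 1

# Running count of run starts gives one id per consecutive run
g <- cumsum(new_run)
print(data.frame(dates, step, new_run, g))
```

The first row always starts a run because its placeholder step of 0 is not 1.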
If your data is not already sorted by date within the (Person_ID, Department) groups, you can add order(Date) to the i part of df[i, j, k], i.e., change the code above to
df[order(Date), .(start = min(Date), end = max(Date)),
by = .(Person_ID, Department, g = cumsum(c(0, diff(Date)) != 1))]
Note that, for this updated example, this is not the same as grouping by Person_ID and Department only:
df[, .(start = min(Date), end = max(Date)),
by = .(Person_ID, Department)]
# Person_ID Department start end
# 1: 351581 GH 2019-12-01 2019-12-03
# 2: 351581 FR 2019-12-02 2019-12-02
# 3: 598168 GH 2019-12-16 2019-12-16
# 4: 351581 JE 2019-12-08 2019-12-09
# 5: 615418 AB 2019-12-20 2019-12-22
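One thing to keep in mind with the order(Date) variant: it sorts the whole table by date and computes g across all rows, so rows from different groups that interleave by date can affect each other's run ids. A sketch of an alternative, in base R and on a small made-up subset of the data, sorts by the keys first and computes the run id within each (Person_ID, Department) group. The names `toy`, `key`, `pieces` and `res` are just illustrative.

```r
# Hypothetical subset: the GH run is interrupted (by row order) by an FR row
toy <- data.frame(
  Person_ID  = c(351581, 351581, 351581, 351581, 615418, 615418),
  Department = c("GH", "GH", "FR", "GH", "AB", "AB"),
  Date = as.Date(c("12/1/2019", "12/2/2019", "12/2/2019", "12/3/2019",
                   "12/20/2019", "12/22/2019"), format = "%m/%d/%Y")
)

# Sort by the keys, then by date within each key
toy <- toy[order(toy$Person_ID, toy$Department, toy$Date), ]
key <- paste(toy$Person_ID, toy$Department)

# Run id within each key: a run starts at the first row of a key and
# whenever the step from the previous date is not exactly one day
toy$g <- ave(as.numeric(toy$Date), key,
             FUN = function(d) cumsum(c(TRUE, diff(d) != 1)))

# One output row per (key, run) combination
pieces <- split(toy, list(key, toy$g), drop = TRUE)
res <- do.call(rbind, lapply(pieces, function(x)
  data.frame(x[1, c("Person_ID", "Department")],
             start = min(x$Date), end = max(x$Date))))
rownames(res) <- NULL
print(res)
```

Here the three GH dates end up in a single range even though an FR row sits between them in the input, while the two AB dates stay separate because of the gap.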
Data used:
df <- fread('
Person_ID Department Date
351581 GH 12/1/2019
351581 GH 12/2/2019
351581 GH 12/3/2019
351581 FR 12/2/2019
598168 GH 12/16/2019
351581 JE 12/8/2019
351581 JE 12/9/2019
615418 AB 12/20/2019
615418 AB 12/22/2019
')