Evaluation error of identity in lubridate::interval objects
Suppose I have a df like this:
df <- data.frame(id = c(rep(1:5, each = 2)),
time1 = c("2008-10-12", "2008-08-10", "2006-01-09", "2008-03-13", "2008-09-12", "2007-05-30", "2003-09-29","2003-09-29", "2003-04-01", "2003-04-01"),
time2 = c("2009-03-20", "2009-06-15", "2006-02-13", "2008-04-17", "2008-10-17", "2007-07-04", "2004-01-15", "2004-01-15", "2003-07-04", "2003-07-04"))
id time1 time2
1 1 2008-10-12 2009-03-20
2 1 2008-08-10 2009-06-15
3 2 2006-01-09 2006-02-13
4 2 2008-03-13 2008-04-17
5 3 2008-09-12 2008-10-17
6 3 2007-05-30 2007-07-04
7 4 2003-09-29 2004-01-15
8 4 2003-09-29 2004-01-15
9 5 2003-04-01 2003-07-04
10 5 2003-04-01 2003-07-04
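What I am trying to do is, first, create a lubridate interval between the variables "time1" and "time2". Second, I want to group by "id" and check whether the next row is the same as the current row, and whether the current row is the same as the previous row. I can do it like this: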
library(tidyverse)
library(lubridate)  # provides interval(); not attached by tidyverse before version 2.0.0
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2)) %>%
group_by(id) %>%
mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
cond2 = ifelse(lag(overlap) == overlap, 1, 0))
id time1 time2 overlap cond1 cond2
<int> <date> <date> <S4: Interval> <dbl> <dbl>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 0 NA
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC NA 0
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 1 NA
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC NA 1
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 1 NA
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC NA 1
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 1 NA
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC NA 1
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 1 NA
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC NA 1
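The problem is that, as you can see, for id == 2 and id == 3 both conditions evaluate to TRUE even though the intervals differ. For id == 1 it correctly evaluates to FALSE, and for id == 4 and id == 5 it correctly evaluates to TRUE.
Now, when I convert the interval to character, it evaluates correctly: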
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = as.character(interval(time1, time2))) %>%
group_by(id) %>%
mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
cond2 = ifelse(lag(overlap) == overlap, 1, 0))
id time1 time2 overlap cond1 cond2
<int> <date> <date> <chr> <dbl> <dbl>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 0 NA
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC NA 0
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 0 NA
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC NA 0
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 0 NA
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC NA 0
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 1 NA
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC NA 1
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 1 NA
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC NA 1
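The question is: why does it evaluate some intervals as the same when in fact they are not?
I think it has to do with what lubridate actually computes.
When I calculate the difference between time1 and time2, this is what happens: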
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = time2 - time1)
id time1 time2 overlap
1 1 2008-10-12 2009-03-20 159 days
2 1 2008-08-10 2009-06-15 309 days
3 2 2006-01-09 2006-02-13 35 days
4 2 2008-03-13 2008-04-17 35 days
5 3 2008-09-12 2008-10-17 35 days
6 3 2007-05-30 2007-07-04 35 days
7 4 2003-09-29 2004-01-15 108 days
8 4 2003-09-29 2004-01-15 108 days
9 5 2003-04-01 2003-07-04 94 days
10 5 2003-04-01 2003-07-04 94 days
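So we can see that several of the intervals have the same length in days.
Now, what is overlap actually computing? To find out, I changed your code slightly to report the lead and lag values instead of 1.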
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2)) %>%
group_by(id) %>%
mutate(cond1 = ifelse(lead(overlap) == overlap, lead(overlap), 0),
cond2 = ifelse(lag(overlap) == overlap, lag(overlap), 0))
# A tibble: 10 x 6
# Groups: id [5]
id time1 time2 overlap cond1 cond2
<int> <date> <date> <S4: Interval> <dbl> <dbl>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 0 NA
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC NA 0
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 3024000 NA
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC NA 3024000
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 3024000 NA
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC NA 3024000
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 9331200 NA
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC NA 9331200
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 8121600 NA
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC NA 8121600
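Here we see that the comparison of lead and lag actually works on the length of each interval in seconds, rather than on the actual start and end dates. That is why it considers some intervals equal, while the character strings are (correctly) not equal.
Digging a bit more:
Let's look at the object produced by interval.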
a <- interval(df$time1, df$time2)
str(a)
#Formal class 'Interval' [package "lubridate"] with 3 slots
#..@ .Data: num [1:10] 13737600 26697600 3024000 3024000 3024000 ...
#..@ start: POSIXct[1:10], format: "2008-10-12" "2008-08-10" "2006-01-09" ...
#..@ tzone: chr "UTC"
This is an S4 class with three slots: .Data, start, and tzone.
Calling a displays the intervals.
a
[1] 2008-10-12 UTC--2009-03-20 UTC 2008-08-10 UTC--2009-06-15 UTC 2006-01-09 UTC--2006-02-13 UTC
[4] 2008-03-13 UTC--2008-04-17 UTC 2008-09-12 UTC--2008-10-17 UTC 2007-05-30 UTC--2007-07-04 UTC
[7] 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC 2003-04-01 UTC--2003-07-04 UTC
[10] 2003-04-01 UTC--2003-07-04 UTC
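But when you perform a computation on a, it is performed on .Data, which holds the length of each interval in seconds, counted from its start date (see ?interval).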
a@.Data
#[1] 13737600 26697600 3024000 3024000 3024000 3024000 9331200 9331200 8121600 8121600
For the start dates of the intervals, we need to access the start slot.
a@start
#[1] "2008-10-12 UTC" "2008-08-10 UTC" "2006-01-09 UTC" "2008-03-13 UTC" "2008-09-12 UTC"
#[6] "2007-05-30 UTC" "2003-09-29 UTC" "2003-09-29 UTC" "2003-04-01 UTC" "2003-04-01 UTC"
And the time zone...
a@tzone
#[1] "UTC"
We can also look at how the elements relate to each other. The last and second-to-last elements have the same interval.
a[9] == a[10]
#[1] TRUE
And they are identical objects.
identical(a[9], a[10])
#[1] TRUE
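But when you check elements for equality, what is really being checked? Elements 3 and 4 have the same time difference but are not the same interval. So when you check whether their lags/leads are equal, it returns TRUE. Since they have different interval dates, it should not. Only when we check whether they are identical do we get the result we expect.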
a[3] == a[4]
#[1] TRUE
a[3]@.Data == a[4]@.Data
#[1] TRUE
identical(a[3], a[4])
#[1] FALSE
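So what is happening? a[3] == a[4] is really checking a[3]@.Data == a[4]@.Data, so it checks whether 3024000 equals 3024000. It does, so it returns TRUE. identical, however, checks all the slots and finds that the objects differ, because the start slot is different in each.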
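As a rough illustration of the distinction above, the sketch below compares intervals by their endpoints instead of with ==. It relies on lubridate's int_start(), int_end() and int_length() accessors; same_interval() is a hypothetical helper name introduced here for illustration, not a lubridate function.

```r
library(lubridate)

# Hypothetical helper (not part of lubridate): two intervals match only
# when both their start and their end instants match.
same_interval <- function(x, y) {
  int_start(x) == int_start(y) & int_end(x) == int_end(y)
}

starts <- as.Date(c("2006-01-09", "2008-03-13", "2003-04-01", "2003-04-01"))
ends   <- as.Date(c("2006-02-13", "2008-04-17", "2003-07-04", "2003-07-04"))
a <- interval(starts, ends)

int_length(a)              # lengths in seconds; elements 1 and 2 share a length
same_interval(a[1], a[2])  # same length, different dates
same_interval(a[3], a[4])  # same start and same end
```

Comparing endpoints this way avoids relying on how == treats the .Data slot.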
Then I considered using identical with lead/lag so that we could build that logic into the code, but look at this.
a[9]
#[1] 2003-04-01 UTC--2003-07-04 UTC
# now lead
lead(a[9])
#2003-04-01 UTC--NA
The output does not look like a[10], as one might expect.
#now lag
lag(a[9])
#[1] NA
#attr(,"start")
#[1] "2003-04-01 UTC"
#attr(,"tzone")
#[1] "UTC"
#attr(,"class")
#[1] "Interval"
#attr(,"class")attr(,"package")
#[1] "lubridate"
So lead and lag behave differently on S4 objects. To get a better handle on what your first attempt was producing, I did this:
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2)) %>%
group_by(id) %>%
mutate(cond1 = lead(overlap),
cond2 = lag(overlap))
I get a lot of warning messages saying
#In mutate_impl(.data, dots) :
# Vectorizing 'Interval' elements may not preserve their attributes
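I don't know enough about R objects to understand how the data in an S4 class is stored, but it certainly looks different from a typical S3 object.
It seems that using as.character, as you did, is the way to go.
Update
If you look at the code for the Interval class, you will see that when the object is created it stores the start date, then computes the difference between start and end and stores that as .Data.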
interval <- function(start, end = NULL, tzone = tz(start)) {
if (is.null(tzone)) {
tzone <- tz(end)
if (is.null(tzone))
tzone <- "UTC"
}
if (is.character(start) && is.null(end)) {
return(parse_interval(start, tzone))
}
if (is.Date(start)) start <- date_to_posix(start)
if (is.Date(end)) end <- date_to_posix(end)
start <- as_POSIXct(start, tzone)
end <- as_POSIXct(end, tzone)
span <- as.numeric(end) - as.numeric(start)
starts <- start + rep(0, length(span))
if (tzone != tz(starts)) starts <- with_tz(starts, tzone)
new("Interval", span, start = starts, tzone = tzone)
}
That is, the returned object has no notion of an "end date". The end argument defaults to NULL, which means you can even create an interval without an end date.
interval("2019-03-29")
[1] 2019-03-29 UTC--NA
The "end date" is just text that is computed when the Interval object is formatted for printing.
format.Interval <- function(x, ...) {
if (length(x@.Data) == 0) return("Interval(0)")
paste(format(x@start, tz = x@tzone, usetz = TRUE), "--",
format(x@start + x@.Data, tz = x@tzone, usetz = TRUE), sep = "")
}
int_end <- function(int) int@start + int@.Data
These code snippets are from https://github.com/tidyverse/lubridate/blob/f7a7c2782ba91b821f9af04a40d93fbf9820c388/R/intervals.r.
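The relationship described above (the end is derived from the start plus the stored length) can be checked with lubridate's exported accessors. This is only a small sketch; the variable names x and derived_end are chosen here for illustration.

```r
library(lubridate)

x <- interval(as.Date("2003-04-01"), as.Date("2003-07-04"))

int_start(x)   # the stored start instant
int_length(x)  # the stored length in seconds (the .Data slot)

# The end is not stored; it can be rebuilt as start + length.
derived_end <- int_start(x) + int_length(x)
derived_end == int_end(x)
```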
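Accessing the underlying slots of overlap lets you do the comparison without converting to character. You have to check that both start and .Data are equal. Converting to character is much cleaner, but if you want to avoid it, this is how you can do it.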
ifelse(lead(overlap@start) == overlap@start & lead(overlap@.Data) == overlap@.Data, 1, 0)
Putting it all together:
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2),
overlap_char = as.character(interval(time1, time2))) %>%
group_by(id) %>%
mutate(cond1_original = ifelse(lead(overlap_char) == overlap_char, 1, 0),
cond1_new = ifelse(lead(overlap@start) == overlap@start & lead(overlap@.Data) == overlap@.Data, 1, 0),
cond2_original = ifelse(lag(overlap_char) == overlap_char, 1, 0),
cond2_new = ifelse(lag(overlap@start) == overlap@start & lag(overlap@.Data) == overlap@.Data, 1, 0))
id time1 time2 overlap overlap_char cond1_original cond1_new cond2_original cond2_new
<int> <date> <date> <S4: Interval> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 2008-10-12 UTC--2009-03-20 UTC 0 0 NA NA
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC 2008-08-10 UTC--2009-06-15 UTC NA NA 0 0
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 2006-01-09 UTC--2006-02-13 UTC 0 0 NA NA
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC 2008-03-13 UTC--2008-04-17 UTC NA NA 0 0
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 2008-09-12 UTC--2008-10-17 UTC 0 0 NA NA
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC 2007-05-30 UTC--2007-07-04 UTC NA NA 0 0
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC 1 1 NA NA
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC NA NA 1 1
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 2003-04-01 UTC--2003-07-04 UTC 1 1 NA NA
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 2003-04-01 UTC--2003-07-04 UTC NA NA 1 1
You can read more about Interval here: https://lubridate.tidyverse.org/reference/Interval-class.html
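Another option, sketched here as an assumption-laden alternative rather than the approach used above, is to skip the Interval object for the comparison and compare the two Date columns directly. This avoids calling lead/lag on an S4 object. The column names follow the question; res is a name introduced for illustration.

```r
library(dplyr)

df <- data.frame(
  id = rep(1:5, each = 2),
  time1 = as.Date(c("2008-10-12", "2008-08-10", "2006-01-09", "2008-03-13",
                    "2008-09-12", "2007-05-30", "2003-09-29", "2003-09-29",
                    "2003-04-01", "2003-04-01")),
  time2 = as.Date(c("2009-03-20", "2009-06-15", "2006-02-13", "2008-04-17",
                    "2008-10-17", "2007-07-04", "2004-01-15", "2004-01-15",
                    "2003-07-04", "2003-07-04"))
)

# Two rows describe the same interval only if both endpoints agree.
res <- df %>%
  group_by(id) %>%
  mutate(
    cond1 = as.numeric(lead(time1) == time1 & lead(time2) == time2),
    cond2 = as.numeric(lag(time1) == time1 & lag(time2) == time2)
  ) %>%
  ungroup()

res
```

Rows without a following (or preceding) row in their group get NA, matching the behaviour of the character-based comparison.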
I believe your exact case has to do with the == comparison. As you can see above, "overlap" is a list, not a vector. From ?==, it says:
At least one of x and y must be an atomic vector, but if the other is
a list R attempts to coerce it to the type of the atomic vector: this
will succeed if the list is made up of elements of length one that can
be coerced to the correct type.
If the two arguments are atomic vectors of different types, one is
coerced to the type of the other, the (decreasing) order of precedence
being character, complex, numeric, integer, logical and raw.
We can coerce "overlap" to numeric and to character to see the difference.
df %>%
mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
mutate(overlap = interval(time1, time2)) %>%
group_by(id) %>%
mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
cond2 = ifelse(lag(overlap) == overlap, 1, 0)) %>%
mutate(overlap.n = as.numeric(overlap),
overlap.c = as.character(overlap))
# A tibble: 10 x 8
# Groups: id [5]
id time1 time2 overlap cond1 cond2 overlap.n overlap.c
<int> <date> <date> <S4: Interval> <dbl> <dbl> <dbl> <chr>
1 1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 0 NA 13737600 2008-10-12 U…
2 1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC NA 0 26697600 2008-08-10 U…
3 2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 1 NA 3024000 2006-01-09 U…
4 2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC NA 1 3024000 2008-03-13 U…
5 3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 1 NA 3024000 2008-09-12 U…
6 3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC NA 1 3024000 2007-05-30 U…
7 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 1 NA 9331200 2003-09-29 U…
8 4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC NA 1 9331200 2003-09-29 U…
9 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 1 NA 8121600 2003-04-01 U…
10 5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC NA 1 8121600 2003-04-01 U…
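Based on the output above, I believe that using == coerces the "overlap" interval to a numeric vector, which leads to the comparison of durations that @hmhensen mentioned above. When you coerce to character instead of numeric, you get the result you want.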