Evaluation error of identity in lubridate::interval objects

Question

Suppose a df like this:

df <- data.frame(id = c(rep(1:5, each = 2)),
time1 = c("2008-10-12", "2008-08-10", "2006-01-09", "2008-03-13", "2008-09-12", "2007-05-30", "2003-09-29","2003-09-29", "2003-04-01", "2003-04-01"),
time2 = c("2009-03-20", "2009-06-15", "2006-02-13", "2008-04-17", "2008-10-17", "2007-07-04", "2004-01-15", "2004-01-15", "2003-07-04", "2003-07-04"))

   id      time1      time2
1   1 2008-10-12 2009-03-20
2   1 2008-08-10 2009-06-15
3   2 2006-01-09 2006-02-13
4   2 2008-03-13 2008-04-17
5   3 2008-09-12 2008-10-17
6   3 2007-05-30 2007-07-04
7   4 2003-09-29 2004-01-15
8   4 2003-09-29 2004-01-15
9   5 2003-04-01 2003-07-04
10  5 2003-04-01 2003-07-04

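What I'm trying to do is, first, create a lubridate interval between the variables "time1" and "time2". Second, grouping by "id", I want to check whether the next row is the same as the current row, and whether the previous row is the same as the current row. I can do it like this: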

library(tidyverse)
library(lubridate)  # provides interval(); attached explicitly in case tidyverse does not load it

df %>%
 mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
 mutate(overlap = interval(time1, time2)) %>%
 group_by(id) %>%
 mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
        cond2 = ifelse(lag(overlap) == overlap, 1, 0))

      id time1      time2      overlap                        cond1 cond2
   <int> <date>     <date>     <S4: Interval>                 <dbl> <dbl>
 1     1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC     0    NA
 2     1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC    NA     0
 3     2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC     1    NA
 4     2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC    NA     1
 5     3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC     1    NA
 6     3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC    NA     1
 7     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC     1    NA
 8     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC    NA     1
 9     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC     1    NA
10     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC    NA     1

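The problem is that, as you can see, for id == 2 and id == 3 both conditions are evaluated as TRUE even though the intervals are not the same. For id == 1 it is correctly evaluated as FALSE, and for id == 4 and id == 5 it is correctly evaluated as TRUE.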

Now, when I convert the interval to character, it evaluates correctly:

df %>%
 mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
 mutate(overlap = as.character(interval(time1, time2))) %>%
 group_by(id) %>%
 mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
        cond2 = ifelse(lag(overlap) == overlap, 1, 0)) 

      id time1      time2      overlap                        cond1 cond2
   <int> <date>     <date>     <chr>                          <dbl> <dbl>
 1     1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC     0    NA
 2     1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC    NA     0
 3     2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC     0    NA
 4     2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC    NA     0
 5     3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC     0    NA
 6     3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC    NA     0
 7     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC     1    NA
 8     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC    NA     1
 9     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC     1    NA
10     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC    NA     1

The question is: why does it evaluate some intervals as the same when in fact they are not?

Answer 1

I think this has to do with what lubridate actually computes.

When I calculate the difference between time1 and time2, this happens:

df %>%
  mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
  mutate(overlap = time2 - time1)

   id      time1      time2  overlap
1   1 2008-10-12 2009-03-20 159 days
2   1 2008-08-10 2009-06-15 309 days
3   2 2006-01-09 2006-02-13  35 days
4   2 2008-03-13 2008-04-17  35 days
5   3 2008-09-12 2008-10-17  35 days
6   3 2007-05-30 2007-07-04  35 days
7   4 2003-09-29 2004-01-15 108 days
8   4 2003-09-29 2004-01-15 108 days
9   5 2003-04-01 2003-07-04  94 days
10  5 2003-04-01 2003-07-04  94 days

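So we can see that several of these intervals have the same length in days, even though their dates differ.

Now, what is overlap actually comparing? To find out, I changed your code slightly to report the lead and lag values instead of 1.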

df %>%
  mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
  mutate(overlap = interval(time1, time2)) %>%
  group_by(id) %>%
  mutate(cond1 = ifelse(lead(overlap) == overlap, lead(overlap), 0),
         cond2 = ifelse(lag(overlap) == overlap, lag(overlap), 0))

# A tibble: 10 x 6
# Groups:   id [5]
      id time1      time2      overlap                          cond1   cond2
   <int> <date>     <date>     <S4: Interval>                   <dbl>   <dbl>
 1     1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC       0      NA
 2     1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC      NA       0
 3     2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 3024000      NA
 4     2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC      NA 3024000
 5     3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 3024000      NA
 6     3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC      NA 3024000
 7     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 9331200      NA
 8     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC      NA 9331200
 9     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 8121600      NA
10     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC      NA 8121600

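Here we see that the values behind lead and lag are the lengths of the intervals in seconds, not the actual start and end dates. That is why some intervals are considered equal here while their string forms are (correctly) not equal.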
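As a quick sanity check on those numbers, the values in cond1 and cond2 are just the day differences from the earlier table expressed in seconds. A minimal base-R sketch (the variable names are only illustrative):

```r
# Two intervals from id == 2: different dates, same 35-day length
len_a <- difftime(as.Date("2006-02-13"), as.Date("2006-01-09"), units = "secs")
len_b <- difftime(as.Date("2008-04-17"), as.Date("2008-03-13"), units = "secs")

as.numeric(len_a)                       # 3024000, i.e. 35 * 24 * 60 * 60
as.numeric(len_a) == as.numeric(len_b)  # TRUE: the lengths match, the dates do not
```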

Digging a little more:

Let's look at the object that interval produces.

a <- interval(df$time1, df$time2)

str(a)
#Formal class 'Interval' [package "lubridate"] with 3 slots
#..@ .Data: num [1:10] 13737600 26697600 3024000 3024000 3024000 ...
#..@ start: POSIXct[1:10], format: "2008-10-12" "2008-08-10" "2006-01-09" ...
#..@ tzone: chr "UTC"

This is an S4 class with three slots: .Data, start, and tzone.

Calling a shows the intervals.

a
 [1] 2008-10-12 UTC--2009-03-20 UTC 2008-08-10 UTC--2009-06-15 UTC 2006-01-09 UTC--2006-02-13 UTC
 [4] 2008-03-13 UTC--2008-04-17 UTC 2008-09-12 UTC--2008-10-17 UTC 2007-05-30 UTC--2007-07-04 UTC
 [7] 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC 2003-04-01 UTC--2003-07-04 UTC
[10] 2003-04-01 UTC--2003-07-04 UTC

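But when you perform calculations on a, they are performed on .Data, which holds the length of each interval in seconds, counted from its start date (see ?interval).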

a@.Data
#[1] 13737600 26697600  3024000  3024000  3024000  3024000  9331200  9331200  8121600  8121600

For the start dates of the intervals, we need to access the start slot.

a@start
#[1] "2008-10-12 UTC" "2008-08-10 UTC" "2006-01-09 UTC" "2008-03-13 UTC" "2008-09-12 UTC"
#[6] "2007-05-30 UTC" "2003-09-29 UTC" "2003-09-29 UTC" "2003-04-01 UTC" "2003-04-01 UTC"

And the time zone...

a@tzone
#[1] "UTC"

We can also look at how the elements relate to each other. The last and second-to-last elements have the same interval.

a[9] == a[10]
#[1] TRUE

And they are identical objects.

identical(a[9], a[10])
#[1] TRUE

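But when you check whether elements are equal, what is really being checked? Elements 3 and 4 have the same time difference but are not the same interval. So when you check their lags/leads for equality, it returns TRUE, although it shouldn't, because the interval dates differ. Only when we check whether they are identical do we get the result we expect.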

a[3] == a[4]
#[1] TRUE

a[3]@.Data == a[4]@.Data
#[1] TRUE

identical(a[3], a[4])
#[1] FALSE

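So what happened? What a[3] == a[4] really checks is a[3]@.Data == a[4]@.Data, so it checks whether 3024000 equals 3024000, and that returns TRUE. identical, however, checks all the slots and finds that they differ, because the start slot of each object is different.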
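If the goal is to compare the intervals themselves rather than their lengths, one option is to compare both endpoints explicitly. Here is a minimal sketch using lubridate's int_start() and int_end() accessors; same_interval is a hypothetical helper name, not part of lubridate:

```r
library(lubridate)

a <- interval(as.Date(c("2006-01-09", "2008-03-13", "2003-04-01", "2003-04-01")),
              as.Date(c("2006-02-13", "2008-04-17", "2003-07-04", "2003-07-04")))

# Hypothetical helper: two intervals match only if both endpoints match
same_interval <- function(x, y) {
  int_start(x) == int_start(y) & int_end(x) == int_end(y)
}

same_interval(a[1], a[2])  # FALSE: same length, different dates
same_interval(a[3], a[4])  # TRUE: same start and same end
```

Unlike identical, this stays vectorized, so it can be applied element-wise to whole Interval vectors of equal length.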

Then I thought about using identical together with lead/lag so that we could put that logic into the code, but look at this.

a[9]
#[1] 2003-04-01 UTC--2003-07-04 UTC

# now lead
lead(a[9])
#2003-04-01 UTC--NA

The output does not look like a[10], as one might have expected.

#now lag
lag(a[9])
#[1] NA
#attr(,"start")
#[1] "2003-04-01 UTC"
#attr(,"tzone")
#[1] "UTC"
#attr(,"class")
#[1] "Interval"
#attr(,"class")attr(,"package")
#[1] "lubridate"

So lead and lag behave differently on S4 class objects. To get a better handle on what your first attempt was producing, I did this:

df %>%
     mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
     mutate(overlap = interval(time1, time2)) %>%
     group_by(id) %>%
     mutate(cond1 = lead(overlap),
            cond2 = lag(overlap))

I got a lot of warning messages saying

#In mutate_impl(.data, dots) :
#  Vectorizing 'Interval' elements may not preserve their attributes

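I don't know enough about R objects to understand how the data in an S4 class is stored, but it certainly looks different from a typical S3 object.

It seems that using as.character, as you did, is the way to go.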
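Another way to sidestep the problem, sketched here under the assumption that the interval is only needed for this equality check, is to apply lead and lag to the plain Date columns instead of to the Interval object. Date vectors are ordinary atomic vectors, so lead/lag should behave as usual:

```r
library(dplyr)

df <- data.frame(
  id = rep(1:5, each = 2),
  time1 = as.Date(c("2008-10-12", "2008-08-10", "2006-01-09", "2008-03-13", "2008-09-12",
                    "2007-05-30", "2003-09-29", "2003-09-29", "2003-04-01", "2003-04-01")),
  time2 = as.Date(c("2009-03-20", "2009-06-15", "2006-02-13", "2008-04-17", "2008-10-17",
                    "2007-07-04", "2004-01-15", "2004-01-15", "2003-07-04", "2003-07-04"))
)

res <- df %>%
  group_by(id) %>%
  # Rows match only when both the start and the end date match
  mutate(cond1 = as.integer(lead(time1) == time1 & lead(time2) == time2),
         cond2 = as.integer(lag(time1) == time1 & lag(time2) == time2)) %>%
  ungroup()

res$cond1  # 0 NA 0 NA 0 NA 1 NA 1 NA
```

This gives the same cond1/cond2 pattern as the as.character version in the question, without building the Interval at all.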

Answer 2

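Update

If you look at the code for the Interval class, you will see that when the object is created it stores the start date, then computes the difference between start and end and stores that as .Data.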

interval <- function(start, end = NULL, tzone = tz(start)) {

  if (is.null(tzone)) {
    tzone <- tz(end)
    if (is.null(tzone))
      tzone <- "UTC"
  }

  if (is.character(start) && is.null(end)) {
    return(parse_interval(start, tzone))
  }

  if (is.Date(start)) start <- date_to_posix(start)
  if (is.Date(end)) end <- date_to_posix(end)

  start <- as_POSIXct(start, tzone)
  end <- as_POSIXct(end, tzone)

  span <- as.numeric(end) - as.numeric(start)
  starts <- start + rep(0, length(span))
  if (tzone != tz(starts)) starts <- with_tz(starts, tzone)

  new("Interval", span, start = starts, tzone = tzone)
}

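That is, the returned object has no notion of an "end date". The default value of the end argument is NULL, which means you can even create an interval without an end date.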

interval("2019-03-29")
[1] 2019-03-29 UTC--NA

The "end date" is just text that is computed when the Interval object is formatted for printing.

format.Interval <- function(x, ...) {
  if (length(x@.Data) == 0) return("Interval(0)")
  paste(format(x@start, tz = x@tzone, usetz = TRUE), "--",
        format(x@start + x@.Data, tz = x@tzone, usetz = TRUE), sep = "")
}

int_end <- function(int) int@start + int@.Data

Both code snippets come from https://github.com/tidyverse/lubridate/blob/f7a7c2782ba91b821f9af04a40d93fbf9820c388/R/intervals.r.
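To see the same thing from the user side, the end of an interval can be rebuilt from its start and its length. A small sketch using the exported accessors int_start(), int_length() and int_end() rather than the slots directly:

```r
library(lubridate)

x <- interval(as.Date("2006-01-09"), as.Date("2006-02-13"))

int_length(x)                                # length in seconds: 3024000
int_start(x) + int_length(x) == int_end(x)   # TRUE: the end is start + length
```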

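Accessing the underlying slots of overlap lets you do the comparison without converting to character. You have to check that both start and .Data are equal. Converting to character is much cleaner, but if you want to avoid it, this is how you can do it.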

ifelse(lead(overlap@start) == overlap@start & lead(overlap@.Data) == overlap@.Data, 1, 0)

All together:

df %>%
  mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
  mutate(overlap = interval(time1, time2),
         overlap_char = as.character(interval(time1, time2))) %>%
  group_by(id) %>%
  mutate(cond1_original = ifelse(lead(overlap_char) == overlap_char, 1, 0),
         cond1_new = ifelse(lead(overlap@start) == overlap@start & lead(overlap@.Data) == overlap@.Data, 1, 0),
         cond2_original = ifelse(lag(overlap_char) == overlap_char, 1, 0),
         cond2_new = ifelse(lag(overlap@start) == overlap@start & lag(overlap@.Data) == overlap@.Data, 1, 0)) 

      id time1      time2      overlap                        overlap_char                   cond1_original cond1_new cond2_original cond2_new
   <int> <date>     <date>     <S4: Interval>                 <chr>                                   <dbl>     <dbl>          <dbl>     <dbl>
 1     1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC 2008-10-12 UTC--2009-03-20 UTC              0         0             NA        NA
 2     1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC 2008-08-10 UTC--2009-06-15 UTC             NA        NA              0         0
 3     2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC 2006-01-09 UTC--2006-02-13 UTC              0         0             NA        NA
 4     2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC 2008-03-13 UTC--2008-04-17 UTC             NA        NA              0         0
 5     3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC 2008-09-12 UTC--2008-10-17 UTC              0         0             NA        NA
 6     3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC 2007-05-30 UTC--2007-07-04 UTC             NA        NA              0         0
 7     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC              1         1             NA        NA
 8     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC 2003-09-29 UTC--2004-01-15 UTC             NA        NA              1         1
 9     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 2003-04-01 UTC--2003-07-04 UTC              1         1             NA        NA
10     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC 2003-04-01 UTC--2003-07-04 UTC             NA        NA              1         1

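You can read more about Interval here: https://lubridate.tidyverse.org/reference/Interval-class.html

I believe your exact case has to do with the == comparison. As you can see above, "overlap" is not a plain atomic vector (it is an S4 Interval object). From ?==, it says: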

At least one of x and y must be an atomic vector, but if the other is a list R attempts to coerce it to the type of the atomic vector: this will succeed if the list is made up of elements of length one that can be coerced to the correct type.

If the two arguments are atomic vectors of different types, one is coerced to the type of the other, the (decreasing) order of precedence being character, complex, numeric, integer, logical and raw.

We can coerce "overlap" to numeric and to character to see the difference.

df %>%
  mutate_at(2:3, funs(as.Date(., format = "%Y-%m-%d"))) %>%
  mutate(overlap = interval(time1, time2)) %>%
  group_by(id) %>%
  mutate(cond1 = ifelse(lead(overlap) == overlap, 1, 0),
         cond2 = ifelse(lag(overlap) == overlap, 1, 0)) %>%
  mutate(overlap.n = as.numeric(overlap),
         overlap.c = as.character(overlap))

# A tibble: 10 x 8
# Groups:   id [5]
      id time1      time2      overlap                        cond1 cond2 overlap.n overlap.c
   <int> <date>     <date>     <S4: Interval>                 <dbl> <dbl>     <dbl> <chr>
 1     1 2008-10-12 2009-03-20 2008-10-12 UTC--2009-03-20 UTC     0    NA  13737600 2008-10-12 U…
 2     1 2008-08-10 2009-06-15 2008-08-10 UTC--2009-06-15 UTC    NA     0  26697600 2008-08-10 U…
 3     2 2006-01-09 2006-02-13 2006-01-09 UTC--2006-02-13 UTC     1    NA   3024000 2006-01-09 U…
 4     2 2008-03-13 2008-04-17 2008-03-13 UTC--2008-04-17 UTC    NA     1   3024000 2008-03-13 U…
 5     3 2008-09-12 2008-10-17 2008-09-12 UTC--2008-10-17 UTC     1    NA   3024000 2008-09-12 U…
 6     3 2007-05-30 2007-07-04 2007-05-30 UTC--2007-07-04 UTC    NA     1   3024000 2007-05-30 U…
 7     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC     1    NA   9331200 2003-09-29 U…
 8     4 2003-09-29 2004-01-15 2003-09-29 UTC--2004-01-15 UTC    NA     1   9331200 2003-09-29 U…
 9     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC     1    NA   8121600 2003-04-01 U…
10     5 2003-04-01 2003-07-04 2003-04-01 UTC--2003-07-04 UTC    NA     1   8121600 2003-04-01 U…

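Based on the output above, I believe that using == coerces the "overlap" intervals to a numeric vector, which results in the comparison of durations that @hmhensen mentioned above. When you force the coercion to character instead of numeric, you get the desired result.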
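The two coercions can also be compared directly on a pair of intervals, outside the pipeline. A minimal sketch, assuming as.numeric() and as.character() behave on Interval objects as they do in the output above:

```r
library(lubridate)

x <- interval(as.Date("2006-01-09"), as.Date("2006-02-13"))
y <- interval(as.Date("2008-03-13"), as.Date("2008-04-17"))

as.numeric(x) == as.numeric(y)      # TRUE: both are 3024000 seconds long
as.character(x) == as.character(y)  # FALSE: the printed start and end dates differ
```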
