Choosing first instance of a value by year in a time series

I have a dataset that looks like this (but with many more years of data):

dat <- data.frame(date = as.Date(c("2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                   "2001-01-01","2001-03-31","2001-07-01","2001-09-30")),
                  value = c(0.8,1,0.2,0,0.7,1,0.2,0))

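I want to select the first instance in each year where `value` >= 0.8.

So for the dataset above, I would expect the output to be a data frame with two rows and two columns: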

new_dat <- data.frame(date = as.Date(c("2000-01-01", "2001-03-31")),
                      value = c(0.8,1))
print(new_dat)

I have been trying to do this with dplyr:

dat_grouped <- dat %>%
  mutate(year = year(date))%>%
  group_by(year) %>%
  distinct(value >= 0.8, date = date) #wanted to keep the date column

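This gives TRUE/FALSE values for the `value` column, but I can't find a good way to select the first TRUE value. I've tried wrapping distinct() in first(), and I've tried piping to which.min(), but neither worked.

I found this entry, but I was hoping for a tidy solution. I also had trouble adapting that code to my dataset. I get "Error in apply(x, 2, my.first) : dim(X) must have a positive length".

I would also like to make the same request for the first time the value is <= 0.2. I assume that would be the same process with a different logical condition. Or maybe logical operators are not the way to go?

Any suggestions are much appreciated. Thank you.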

You can use dplyr::filter to keep only the values >= 0.8, then group by year (which you can get with lubridate::year), and use dplyr::slice_min to get the first date.

dat <- data.frame(date = as.Date(c("2000-01-01","2000-03-31","2000-07-01","2000-09-30", 
                                   "2001-01-01","2001-03-31","2001-07-01","2001-09-30")),
                  value = c(0.8,1,0.2,0,0.7,1,0.2,0))

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dat %>% 
  filter(value >= .8) %>% 
  group_by(year = year(date)) %>% 
  slice_min(date)
#> # A tibble: 2 x 3
#> # Groups:   year [2]
#>   date       value  year
#>   <date>     <dbl> <dbl>
#> 1 2000-01-01   0.8  2000
#> 2 2001-03-31   1    2001

Created on 2021-07-06 by the reprex package (v2.0.0)

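If your data is already sorted by date, you can skip the filter and use the approach below (or one of Ronak's). Note that this returns one row for every year, including any year in which no value reaches 0.8.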

dat %>% 
  group_by(year = year(date)) %>% 
  slice_max(value >= 0.8, with_ties = FALSE)
#> # A tibble: 2 x 3
#> # Groups:   year [2]
#>   date       value  year
#>   <date>     <dbl> <dbl>
#> 1 2000-01-01   0.8  2000
#> 2 2001-03-31   1    2001

Created on 2021-07-07 by the reprex package (v2.0.0)
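
The question also asks for the first date in each year where the value is <= 0.2. The filter-then-slice_min approach shown earlier in this answer should carry over with only the condition changed. The sketch below wraps it in a helper so both thresholds share one code path; `first_by_year` is a hypothetical name introduced here for illustration, not a dplyr function, and it assumes the date column is called `date`.

```r
library(dplyr)
library(lubridate)

dat <- data.frame(date = as.Date(c("2000-01-01","2000-03-31","2000-07-01","2000-09-30",
                                   "2001-01-01","2001-03-31","2001-07-01","2001-09-30")),
                  value = c(0.8,1,0.2,0,0.7,1,0.2,0))

# Hypothetical helper: keep the rows meeting `condition`, then take the
# earliest date within each calendar year.
first_by_year <- function(data, condition) {
  data %>%
    filter({{ condition }}) %>%
    group_by(year = year(date)) %>%
    slice_min(date, n = 1, with_ties = FALSE) %>%
    ungroup()
}

first_high <- first_by_year(dat, value >= 0.8)  # 2000-01-01 and 2001-03-31
first_low  <- first_by_year(dat, value <= 0.2)  # 2000-07-01 and 2001-07-01
```

Because the rows are filtered before grouping, a year with no qualifying value simply does not appear in the result.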

Three base R solutions:

# Repeatedly subsetting: data.frame => stdout(console)
subset(
  subset(
    with(
      dat,
      dat[order(date),]
    ),
    value >= 0.8
  ), 
  ave(
    substr(
      date, 
      1, 
      4
    ), 
    substr(
      date, 
      1, 
      4
    ), 
    FUN = seq.int
  ) == 1
)

# All in one base R using `Split-Apply-Combine`: 
# data.frame => stdout(console)
data.frame(
  do.call(
    rbind, 
    lapply(
      with(
        dat, 
        split(
          dat, 
          substr(date, 1, 4)
        )
      ),
      function(x){
        head(
          subset(
            with(x, x[order(date),]),
            value >= 0.8
          ), 
          1
        )  
      }
    )
  ),
  row.names = NULL
)

# In stages Base R: 
# Subset out values not meeting the threshold value
# criteria: above_threshold_df => data.frame
above_threshold_df <- subset(
  with(dat, dat[order(date),]), 
  value >= 0.8
)

# Extract the year from the date variable: 
# grp => character vector 
grp <- with(above_threshold_df, substr(date, 1, 4))

# Use the group vector to extract the first entry from 
# each year that meets the threshold: 
# res => data.frame
res <- subset(
  above_threshold_df,
  ave(
    grp, 
    grp, 
    FUN = seq.int
  ) == 1
)
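
The staged version above can also be condensed. As a sketch of an alternative (not the method used in the code above), `duplicated()` can stand in for the `ave()` counter: after sorting and thresholding, the first row seen for each year is the one that is not a duplicate of an earlier year. The names `ordered_df`, `hits` and `res_compact` are made up for this example.

```r
dat <- data.frame(date = as.Date(c("2000-01-01","2000-03-31","2000-07-01","2000-09-30",
                                   "2001-01-01","2001-03-31","2001-07-01","2001-09-30")),
                  value = c(0.8,1,0.2,0,0.7,1,0.2,0))

# Sort by date so that "first" means "earliest"
ordered_df <- dat[order(dat$date), ]

# Keep only the rows meeting the threshold
hits <- ordered_df[ordered_df$value >= 0.8, ]

# Keep the first remaining row of each year
res_compact <- hits[!duplicated(format(hits$date, "%Y")), ]
```

For the `<= 0.2` case from the question, only the comparison on the `hits` line would need to change.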

You can use slice:

library(dplyr)
library(lubridate)

dat %>%
  group_by(year = year(date)) %>%
  slice(match(TRUE, value >= 0.8)) %>%
  ungroup

#   date       value  year
#  <date>     <dbl> <int>
#1 2000-01-01   0.8  2000
#2 2001-03-31   1    2001

If every year is guaranteed to have at least one value >= 0.8, you can also use which.max:

dat %>%
  group_by(year = year(date)) %>%
  slice(which.max(value >= 0.8)) %>%
  ungroup
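
The reason for that guarantee can be seen from the indices themselves. The base R sketch below uses a hypothetical extension of the example data (`dat2`, in which 2002 has no value >= 0.8) and compares the per-year index returned by which.max with the one returned by match.

```r
# Hypothetical data: 2002 never reaches the 0.8 threshold
dat2 <- data.frame(
  date  = as.Date(c("2000-01-01", "2000-03-31", "2001-01-01", "2001-03-31",
                    "2002-01-01", "2002-03-31")),
  value = c(0.8, 1, 0.7, 1, 0.1, 0.3)
)

# One logical vector per year
hits <- split(dat2$value >= 0.8, format(dat2$date, "%Y"))

# which.max() treats FALSE as 0, so an all-FALSE year still yields index 1
idx_which_max <- sapply(hits, which.max)

# match() yields NA when there is no TRUE, so the empty year is detectable
idx_match <- sapply(hits, function(h) match(TRUE, h))
```

Here `idx_which_max` is 1, 2, 1 while `idx_match` is 1, 2, NA, so the which.max version would pick the first row of 2002 even though it does not meet the condition.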