How to use summarise to take the value of a variable that corresponds to the max. value of another variable?

Question

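How can I use summarise to take the value of one variable that corresponds to the maximum value of another variable?

Data: I have a simplified dataset below.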

df <- read.table(text = "
                 ID SBP DATE
                 1 90 20210102
                 1 106 20210111
                 2 80 20210513
                 2 87 20210513
                 2 88 20210413", header = TRUE)

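I want to take the value of SBP that corresponds to the latest DATE (i.e. the most recent systolic blood pressure measurement). There can be ties, i.e. more than one measurement on the same day (as for ID = 2), in which case I want to take the first row. In addition, I may need to derive other variables, such as the mean SBP and the number of SBP measurements, so I would like to use only summarise(). The desired output is shown below.

Expected output: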

df <- read.table(text = "
                 ID SBP 
                 1 106 
                 2 80", header = TRUE)

This is what I have tried so far.

1) Using summarise with [ and which.max

df %>% group_by(ID) %>% summarise(SBP = SBP[which.max(DATE)])
## A tibble: 2 x 2
#     ID   SBP
#  <int> <int>
#1     1   106
#2     2    80

2) Using slice_max

df %>% group_by(ID) %>% slice_max(DATE, with_ties = FALSE)
## A tibble: 2 x 3
## Groups:   ID [2]
#     ID   SBP     DATE
#  <int> <int>    <int>
#1     1   106 20210111
#2     2    80 20210513

3) Using summarise with last

df %>% group_by(ID) %>% summarise(SBP = last(SBP, DATE))
## A tibble: 2 x 2
#     ID   SBP
#  <int> <int>
#1     1   106
#2     2    87
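
The 87 for ID = 2 can be reproduced with a rough base R sketch. It assumes that last(SBP, DATE) behaves like sorting SBP by DATE in ascending order, with tied dates kept in their original row order, and then taking the final element; that assumption is consistent with the output above but is not taken from the dplyr source. The vectors below are just the ID = 2 rows of the example data.

```r
# ID = 2 rows from the example data
sbp  <- c(80, 87, 88)
date <- c(20210513, 20210513, 20210413)

# Ascending order; order() keeps tied values in their original order
ord <- order(date)                   # 3 1 2
last_value <- sbp[ord][length(ord)]  # 87: the second of the two tied rows

# Descending order puts the tied latest dates first, still in row order
ord_desc <- order(-date)             # 1 2 3
first_value <- sbp[ord_desc][1]      # 80: the first of the two tied rows
```

Under this assumption, taking the last element after an ascending sort picks the later tied row, while taking the first element after a descending sort picks the earlier one.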

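I think (3) is ideal in terms of readability, but it takes the last row among ties rather than the first (not what I want). If I use (2), I have to create the other variables of interest (number of measurements, mean, etc.) with mutate before calling slice_max. (1) would confuse other R readers/users.

My question: how can I write something like (3) that takes the first row when there are ties?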
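
For reference, the mutate-then-slice_max workaround mentioned for (2) might look roughly like the sketch below. The column names SBP_mean and n_SBP are made up for illustration and do not appear in the original post.

```r
library(dplyr)

df <- read.table(text = "
ID SBP DATE
1 90 20210102
1 106 20210111
2 80 20210513
2 87 20210513
2 88 20210413", header = TRUE)

res_slice <- df %>%
  group_by(ID) %>%
  # Group-level summaries must be attached to every row first
  # (SBP_mean and n_SBP are hypothetical names)
  mutate(SBP_mean = mean(SBP), n_SBP = n()) %>%
  # Then keep one row per ID: the latest DATE, first row on ties
  slice_max(DATE, with_ties = FALSE) %>%
  ungroup()
```

This works, but the summaries are computed on every row only to be discarded for all but one row per group, which is the awkwardness described above.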

Answer 1

I would use either 1) arrange + distinct or 2) group_by + summarise + first. The first approach is less readable, but on large datasets it is actually more efficient than using group_by.

library(tidyverse)

df %>%
  arrange(ID, -DATE) %>% 
  distinct(ID, .keep_all = TRUE)
#>   ID SBP     DATE
#> 1  1 106 20210111
#> 2  2  80 20210513

df %>% 
  group_by(ID) %>% 
  summarise(
    SBP = first(SBP, -DATE)
  )
#> # A tibble: 2 x 2
#>      ID   SBP
#> * <int> <int>
#> 1     1   106
#> 2     2    80
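
Since the question also asks for other summaries such as the mean and the number of measurements, the second approach could probably be extended along the following lines. This is only a sketch: the output column names SBP_latest, SBP_mean and n_SBP are invented for illustration, and desc(DATE) is used here in place of -DATE on the assumption that it orders the same way for numeric dates.

```r
library(dplyr)

df <- read.table(text = "
ID SBP DATE
1 90 20210102
1 106 20210111
2 80 20210513
2 87 20210513
2 88 20210413", header = TRUE)

res <- df %>%
  group_by(ID) %>%
  summarise(
    # SBP at the latest DATE; the first row is taken when dates tie
    SBP_latest = first(SBP, order_by = desc(DATE)),
    # Mean of all SBP measurements in the group
    SBP_mean = mean(SBP),
    # Number of measurements in the group
    n_SBP = n()
  )
```

Everything stays inside a single summarise() call, so no mutate step is needed beforehand.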

Created on 2021-05-18 by the reprex package (v1.0.0)

Tags: r, data-manipulation, dplyr, tidyr, tidyverse