Why does cur_data() within summarize() return a df_slice() error?

Question

I ran into trouble today using cur_data() within summarize().

Example data:

library(tidyverse)

dat <- tibble(id = 1:6,
              type = c(1, 1, 2, 2, 3, 3),
              value = c(2, 4, 6, 8, 7, NA))

The first pipeline throws an error that mentions df_slice():

dat %>%
  group_by(type) %>%
  summarize(mean = mean(value),
            n = length(cur_data() %>% filter(!is.na(value)) %>% pull(id) %>% unique()),
            .groups = "drop")
#> Error in `summarize()`:
#> ! Problem while computing `n = length(...)`.
#> ℹ The error occurred in group 1: type = 1.
#> Caused by error:
#> ! Internal error in `df_slice()`: Columns must match the data frame size.

However, switching the order of the summary statistics inside summarize() avoids the error:

dat %>%
  group_by(type) %>%
  summarize(n = length(cur_data() %>% filter(!is.na(value)) %>% pull(id) %>% unique()),
            mean = mean(value),
            .groups = "drop")
#> # A tibble: 3 × 3
#>    type     n  mean
#>   <dbl> <int> <dbl>
#> 1     1     2     3
#> 2     2     2     7
#> 3     3     1    NA

Piping cur_data() into as.data.frame() also avoids the error:

dat %>%
  group_by(type) %>%
  summarize(mean = mean(value),
            n = length(cur_data() %>% as.data.frame() %>% filter(!is.na(value)) %>% pull(id) %>% unique()),
            .groups = "drop")
#> # A tibble: 3 × 3
#>    type  mean     n
#>   <dbl> <dbl> <int>
#> 1     1     3     2
#> 2     2     7     2
#> 3     3    NA     1
Created on 2022-02-15 by the reprex package (v2.0.1)

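Why can't I use the syntax from the first example? In the end I computed everything that needed cur_data() inside mutate() and then kept the first() observation in a later summarize() call, but I would like to know what I am missing about summarize().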
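A minimal sketch of the mutate()-then-summarize() workaround described above, assuming the dat tibble from the example. It is an illustration of the idea rather than the exact code that was used, and the name res_mutate is hypothetical:

```r
library(dplyr, warn.conflicts = FALSE)

dat <- tibble(id = 1:6,
              type = c(1, 1, 2, 2, 3, 3),
              value = c(2, 4, 6, 8, 7, NA))

res_mutate <- dat %>%
  group_by(type) %>%
  # Step 1: n is the first new column, so cur_data() still contains only
  # full-length columns here; the length-1 result is recycled within the group.
  mutate(n = length(cur_data() %>% filter(!is.na(value)) %>% pull(id) %>% unique())) %>%
  # Step 2: collapse to one row per group; n is constant within each group,
  # so first(n) recovers it.
  summarize(mean = mean(value),
            n = first(n),
            .groups = "drop")

res_mutate
```

This gives one row per type with the same mean and n columns as the working examples above.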

Additional session info:

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: aarch64-apple-darwin20.6.0 (64-bit)
Running under: macOS Monterey 12.1

Matrix products: default
LAPACK: /opt/homebrew/Cellar/r/4.1.2/lib/R/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] reprex_2.0.1         palmerpenguins_0.1.0 forcats_0.5.1        stringr_1.4.0        readr_2.1.2         
 [6] tibble_3.1.6         ggplot2_3.3.5        tidyverse_1.3.1      tidyr_1.2.0          purrr_0.3.4         
[11] dplyr_1.0.8

Answer 1

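This is an open issue in dplyr: https://github.com/tidyverse/dplyr/issues/6138

To paraphrase the discussion in the GitHub issue: the problem is caused by cur_data() including the previously summarized columns (mean, in this case), which are not recycled to match the number of rows in the data frame. That makes cur_data() essentially a malformed data frame.

In your case, using as.data.frame() works around the problem because it recycles mean to match the length of the other columns, and reordering the statements works because at that point cur_data() does not yet contain any new columns.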

library(dplyr, warn.conflicts = FALSE)

dat <- tibble(
  id = 1:6,
  type = c(1, 1, 2, 2, 3, 3),
  value = c(2, 4, 6, 8, 7, NA)
)

dat %>%
  group_by(type) %>%
  summarize(
    mean = mean(value),
    str(cur_data())
  )
#> tibble [2 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ id   : int [1:2] 1 2
#>  $ value: num [1:2] 2 4
#>  $ mean : num 3
#> tibble [2 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ id   : int [1:2] 3 4
#>  $ value: num [1:2] 6 8
#>  $ mean : num 7
#> tibble [2 x 3] (S3: tbl_df/tbl/data.frame)
#>  $ id   : int [1:2] 5 6
#>  $ value: num [1:2] 7 NA
#>  $ mean : num NA
#> # A tibble: 3 x 2
#>    type  mean
#>   <dbl> <dbl>
#> 1     1     3
#> 2     2     7
#> 3     3    NA
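
If the goal is only to count distinct non-missing ids per group, one way to avoid the issue is to not call cur_data() at all. The following is a sketch under that assumption, not something taken from the GitHub issue; it uses n_distinct() on a subset of the id column, and the name res is hypothetical:

```r
library(dplyr, warn.conflicts = FALSE)

dat <- tibble(
  id = 1:6,
  type = c(1, 1, 2, 2, 3, 3),
  value = c(2, 4, 6, 8, 7, NA)
)

res <- dat %>%
  group_by(type) %>%
  summarize(
    mean = mean(value),
    # Count distinct ids among rows where value is not missing.
    # Column references work regardless of the order of the summaries.
    n = n_distinct(id[!is.na(value)]),
    .groups = "drop"
  )

res
```

Because this refers to the columns directly, it does not depend on whether earlier summaries such as mean have already been added to the group's data.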

