For loop over a group_by when passing in a variable name

Question

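I'm trying to write an R function that extracts the proportion of positives in a community based on column values. More specifically, I have a dataset in which each row is a person. For simplicity, columns 1-5 contain information on personal characteristics, column 6 contains the zip code, column 7 contains the phone number they reported positive on, column 8 contains the day of the week, and column 9 has the state. The goal is to calculate the proportion and number of positives at the aggregate level for zip code, phone number, day of the week, and state. For any single category, I've successfully used code from https://edwinth.github.io/blog/dplyr-recipes/ to build a group-and-summarize function (below). Given a data frame and a column name, it groups by the distinct values in that column and summarizes the number and proportion of positives.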

group_and_summarize <- function(x, ...) {
  grouping = rlang::quos(...)
  temp = x %>% group_by(!!!grouping) %>% summarise(proportion = mean(positive, na.rm = TRUE), number = n()) 
  temp = temp %>% filter(!is.na(!!!grouping))
  colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
  colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
  return(temp)
}

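The problem is that this code fails completely when I try to aggregate across multiple columns. I currently have four fields to group by, but I expect to have ~15 columns once the data is fully collected. My strategy is to store each result as a separate element of a list for later use. I tried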

output = vector(mode = "list", length = length(aggregate_cols)) #aggregate_cols lists columns needing count and proportion.
    #aggregate_cols = c("ZIP_CODE", "PHONE_NUMBER", "DAY", "STATE")
for(i in 1:length(aggregate_cols)){
output[i] = group_and_summarize(df,aggregate_cols[i])
          }

but received the following warnings

Warning messages:
1: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
  number of items to replace is not a multiple of replacement length
2: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
  number of items to replace is not a multiple of replacement length
3: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
  number of items to replace is not a multiple of replacement length
4: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
  number of items to replace is not a multiple of replacement length

Testing the first value:

> i=1
> group_and_summarize(df,aggregate_cols[i])
# A tibble: 1 x 3
  `aggregate_cols[i]`  proportion number
  <chr>                 <dbl>  <int>
1 ZIP_CODE              0.168   5600

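Are there any solutions? I can't think of a good approach using the map or apply families of functions, though I'm open to those.

Edit:

Reproducible code is below.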

group_and_summarize_demo <- function(x, ...) {
  grouping = quos(...)
  temp = x %>% group_by(!!!grouping) %>% summarise(proportion = mean(am, na.rm = TRUE), number = n()) 
  temp = temp %>% filter(!is.na(!!!grouping))
  colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
  colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
  return(temp)
}

cars_cols = c("gear", "cyl")
output = vector(mode = "list", length = length(cars_cols))
for(i in 1:length(cars_cols)){
  output[i] = group_and_summarize_demo(df,cars_cols[i]) #group_and_summarize gets count and proportion
}


> group_and_summarize_demo(mtcars, cyl)
# A tibble: 3 x 3
    cyl cyl_proportion cyl_count
  <dbl>          <dbl>     <int>
1     4          0.727        11
2     6          0.429         7
3     8          0.143        14
> cars_cols = c("gear", "cyl")
> output = vector(mode = "list", length = length(cars_cols))
> for(i in 1:length(cars_cols)){
+   output[i] = group_and_summarize_demo(df,cars_cols[i]) #group_and_summarize gets count and proportion
+ }
 Error in UseMethod("group_by_") : 
  no applicable method for 'group_by_' applied to an object of class "function" 
> cars_cols[1]
[1] "gear"
> group_and_summarize_demo(mtcars, cars_cols[1])
# A tibble: 1 x 3
  `cars_cols[1]` `cars_cols[1]_proportion` `cars_cols[1]_count`
  <chr>                              <dbl>                <int>
1 gear                               0.406                   32

I don't understand why this differs from running group_and_summarize_demo(mtcars, cyl); I suspect that understanding the difference would resolve this error.

Answer 1

Outside the loop, you pass the name directly to the function:

group_and_summarize_demo(mtcars, cyl)

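Inside the loop, however, you pass the name as a string:

group_and_summarize_demo(mtcars, "cyl") # no error, but groups by the literal string rather than the cyl column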

Strings are indeed easier to work with in this setup. To make them work, use syms() instead of quos():

group_and_summarize_demo <- function(x, ..., quosure=TRUE) {
  if(quosure)
    grouping = quos(...)
  else
    grouping = syms(...)
  temp = x %>% 
    group_by(!!!grouping) %>% 
    summarise(proportion = mean(am, na.rm = TRUE), number = n()) 
  temp = temp %>% filter(!is.na(!!!grouping))
  colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
  colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
  return(temp)
}

group_and_summarize_demo(mtcars, cyl)
group_and_summarize_demo(mtcars, "cyl", quosure = FALSE)

Obviously, in your final code you should pick one of the two and stick with it.
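To see why the bare name and the string behave differently, here is a minimal sketch on mtcars. The object names by_name, by_string and by_symbol are only illustrative, and the sketch assumes dplyr and rlang are installed.

```r
library(dplyr)

# Bare name: group_by() looks up the cyl column, so there is one group per value.
by_name <- mtcars %>% group_by(cyl) %>% summarise(number = n())

# String: group_by() treats "cyl" as a constant, so every row lands in one group.
by_string <- mtcars %>% group_by("cyl") %>% summarise(number = n())

# Converting the string to a symbol first restores the column lookup.
by_symbol <- mtcars %>% group_by(!!rlang::sym("cyl")) %>% summarise(number = n())

nrow(by_name)
nrow(by_string)
nrow(by_symbol)
```

The string version matches the one-row tibble shown in the question, while the symbol version gives the same three groups as the bare name.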

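Edit:

If you only ever pass one variable at a time, using the ellipsis looks like overkill and complicates things. Also, your example does not seem to work with multiple variables (group_and_summarize_demo(mtcars, cyl, vs)). You might want to consider the following improvements: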

library(tidyverse)

group_and_summarize_demo <- function(x, gp_col) {
  gp_col = sym(gp_col)
  temp = x %>% 
    group_by(!!gp_col) %>% 
    summarise("{{gp_col}}_proportion" := mean(am, na.rm = TRUE), 
              "{{gp_col}}_count" := n()) %>% 
    filter(!is.na(!!gp_col))
  temp
}

c("gear", "cyl") %>%  
  map(~group_and_summarize_demo(mtcars, .x)) #try map_dfc() also
#> [[1]]
#> # A tibble: 3 x 3
#>    gear gear_proportion gear_count
#>   <dbl>           <dbl>      <int>
#> 1     3           0             15
#> 2     4           0.667         12
#> 3     5           1              5
#> 
#> [[2]]
#> # A tibble: 3 x 3
#>     cyl cyl_proportion cyl_count
#>   <dbl>          <dbl>     <int>
#> 1     4          0.727        11
#> 2     6          0.429         7
#> 3     8          0.143        14

Created on 2021-04-27 by the reprex package (v2.0.0)

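Here, I used the := operator to take advantage of the name-templating feature of dplyr::summarise(). I also used purrr::map() instead of the for loop, with the current element of the iteration referred to as .x.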
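If you would rather keep the for loop, two details in the question's loop are worth noting. First, output[i] <- ... assigns into a one-element slice of the list, so a three-column tibble on the right-hand side produces the "number of items to replace is not a multiple of replacement length" warning; output[[i]] stores the whole tibble as a single element. Second, the reproducible example passes df rather than mtcars; assuming no data frame called df exists in the session, df resolves to the base function stats::df, which would explain the "applied to an object of class "function"" error. The sketch below uses a hypothetical helper, summarize_by_string, modelled on the string-based function above; it is one possible arrangement, not the only one.

```r
library(dplyr)

# Hypothetical string-based helper, modelled on the function in this answer.
summarize_by_string <- function(x, col) {
  gp_col <- rlang::sym(col)            # turn the string into a symbol
  out <- x %>%
    filter(!is.na(!!gp_col)) %>%       # drop rows with a missing group value
    group_by(!!gp_col) %>%
    summarise(proportion = mean(am, na.rm = TRUE), number = n())
  # Rename the two summary columns after the grouping column.
  names(out)[2:3] <- paste0(col, c("_proportion", "_count"))
  out
}

cars_cols <- c("gear", "cyl")
output <- vector(mode = "list", length = length(cars_cols))
names(output) <- cars_cols

for (i in seq_along(cars_cols)) {
  # [[i]] stores the whole tibble as one list element.
  output[[i]] <- summarize_by_string(mtcars, cars_cols[i])
}

output$gear
```

seq_along() is used instead of 1:length() so that the loop body is skipped when cars_cols is empty.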
