如何 group_by 一个因子并为多个列取多数列值

Question

我有一个 COVID-19 政策数据集，其中包含日期列和许多包含虚拟变量的列。我使用的确切数据如下：

IRPP_Index <- tbl_df(read.csv("https://raw.githubusercontent.com/charlesbreton/COVID19-Canada-Provinces/master/Data/Canada-COVID19-Stringency.csv"))

IRPP_Index$date <- ymd(IRPP_Index$date)
IRPP_Index[is.na(IRPP_Index)] <- 0
        
BC_IRPP <- IRPP_Index %>%
              filter(Province.Territory=="British Columbia") %>%
              select(date, S1_Gathering, S3_Schools, S5_Care.home.visitation, 
                     S6_Dining.and.restaurants, S7_Non.essential.retail.business, 
                     S8_Non.essential.services, S9_Cultural.services.and.venues, S10_Intra.Travel,
                     S11_Inter.Travel, S12_Curfew, stringencyIndex) %>%
                     ## Omit Masks in public, Masks in schools
              select(date, stringencyIndex, S1_Gathering:S12_Curfew) %>%
              mutate_at(vars(S1_Gathering:S12_Curfew), as.character) %>%
              fastDummies::dummy_cols() %>%
              select(date, stringencyIndex, S1_Gathering_0:S12_Curfew_0)
    BC_IRPP$week <- floor_date(BC_IRPP$date, "week")

我希望现在按生成的周列进行分组，并且每周在虚拟变量中进行多数观察。以下使用 count() 和 top_n() 的操作将对单个列执行此操作，但我无法对数据框中的所有列应用此类计算。

BC_IRPP <- BC_IRPP %>%
                      select(-date) %>%
                      group_by(week) %>%
                      count(S1_Gathering_0) %>%
                      top_n(1)

想到的方法是lapply() 操作，具有多个函数，即count() 和top_n()。

Answer 1

这里有一个方法，长整形，选择模式，再整形宽。

library(tidyverse)
BC_IRPP %>%
  pivot_longer(-c(date:stringencyIndex, week)) %>%
  group_by(week, name) %>%
  count(value, sort = T) %>%
  slice(1) %>%
  select(-n) %>%
  pivot_wider(names_from = name, values_from = value)

Answer 2

{dplyr} 带有 across() 函数，用于处理 dataframe/tibble.

的（范围内的）列

在你的情况下，你对你的观察（列）进行操作，因此我们可以定义一个函数来识别最大观察。
有很多方法可以做到这一点，让我们使用另一个很酷的包 {vctrs} 和这里的函数 vec_count().

library(vctrs)

# ------------- check what vec_count does ---------------
c(0,0,1,1,1,0,0,1,0,0,1,1,1) %>% 
   vec_count()

# returns data frame with key and count ordered by count, i.e. max key is first element
#  key count
#1   1     7
#2   0     6

# -------------- define our helper function --------------
extract_max <- function(.vec){
   my_max <- .vec %>% vec_count()
   my_max <- my_max$key[1]   # extract max key
}

我们现在可以将其插入 across()。

# ------------- give it a try with a single observation column
BC_IRPP %>% 
    select(week, S1_Gathering_0) %>% 
    group_by(week) %>% 
    summarise(across( .cols = contains("S1_")  # limit this to S1 vars for demo, adapt for full data set
                     , .fns = extract_max
             )

这会产生

# A tibble: 64 x 2
   week       S1_Gathering_0
   <date>              <int>
 1 2020-03-01              1
 2 2020-03-08              1
 3 2020-03-15              0
 4 2020-03-22              0
 5 2020-03-29              0

现在将所有内容放在一起并应用于所有相关列为此，您删除限制 select() 调用并调整 contains() 助手。出于演示目的，我继续将解决方案限制在您的 S1_xxx 列。

BC_IRPP %>% 
   group_by(week) %>% 
   summarise(across(.cols = contains("S1_")  # still restricting demo to S1 vars
                                             # will result in the 5 S1_xxx columns
                   , .fns = extract_max
   ))

瞧：

# A tibble: 64 x 5
   week       S1_Gathering_0 S1_Gathering_2 S1_Gathering_4 S1_Gathering_5
   <date>              <int>          <int>          <int>          <int>
 1 2020-03-01              1              0              0              0
 2 2020-03-08              1              0              0              0
 3 2020-03-15              0              0              1              0
 4 2020-03-22              0              0              1              0
 5 2020-03-29              0              0              1              0
 6 2020-04-05              0              0              1              0
 7 2020-04-12              0              0              1              0
 8 2020-04-19              0              0              1              0
 9 2020-04-26              0              0              1              0
10 2020-05-03              0              0              1              0

across() 是 mutate() 或 summarise() 的强大调用。整理 vector/column 上更复杂的操作的接线可能需要一段时间。在这里它有助于 - 至少我 - 在 across() 调用之外定义辅助函数。

如何 group_by 一个因子并为多个列取多数列值

How to group_by a factor and take majority column values for multiple columns

group-by

r

count

lapply

dplyr