R: count sum of partial string matches over multiple columns

Question

I am working with a messy summer camp registration form. The form output looks like this:

          leaders         teen_adventure
1 camp, overnight                   <NA>
2            <NA>                   <NA>
3 camp, overnight camp, float, overnight

I want to generate new columns that count, per row, how many times each possible answer appears across all columns.

          leaders         teen_adventure camps overnights floats
1 camp, overnight                   <NA>     1          1      0
2            <NA>                   <NA>     0          0      0
3 camp, overnight camp, float, overnight     2          2      1

My gut tells me there is a dplyr solution, something like:

reprex %>%
  mutate(camps = sum(case_when(
    str_detect(select(., everything()), "camp") ~ 1,
    TRUE ~ 0
  )))

Or possibly something using across().

Here is the example dataset:

# data
reprex <- structure(list(leaders = c("camp, overnight", NA, "camp, overnight"), 
          teen_adventure = c(NA, NA, "camp, float, overnight")), 
          row.names = c(NA, -3L), class = "data.frame")

Answer 1

One way:

library(dplyr)
library(stringr)
library(tidyr)

reprex %>%
  # replace NA with a placeholder so str_detect() returns FALSE instead of NA
  replace_na(list(leaders = 'unknown', teen_adventure = 'unknown')) %>%
  mutate(camp = as.numeric(str_detect(leaders, 'camp') + str_detect(teen_adventure, 'camp')),
         float = as.numeric(str_detect(leaders, 'float') + str_detect(teen_adventure, 'float')),
         overnight = as.numeric(str_detect(leaders, 'overnight') + str_detect(teen_adventure, 'overnight')))

Output:

          leaders         teen_adventure camp float overnight
1 camp, overnight                unknown    1     0         1
2         unknown                unknown    0     0         0
3 camp, overnight camp, float, overnight    2     1         2
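When the possible answers are known in advance, the repeated str_detect() calls can be folded into a loop over keywords and columns. The following is only a minimal sketch, not part of the answer above: the `keywords` vector is a hypothetical list of answers, and the data frame is rebuilt from the question so the block runs on its own.

```r
library(stringr)

# Example data from the question
reprex <- data.frame(
  leaders = c("camp, overnight", NA, "camp, overnight"),
  teen_adventure = c(NA, NA, "camp, float, overnight"),
  stringsAsFactors = FALSE
)

# Hypothetical keyword list; adjust it to the answers the form allows
keywords <- c("camp", "float", "overnight")

# For each keyword, detect it in every column (a rows-by-columns logical
# matrix), then count the matches per row. str_detect() returns NA for
# missing cells, so na.rm = TRUE makes them contribute 0.
counts <- sapply(keywords, function(k) {
  rowSums(sapply(reprex, str_detect, pattern = k), na.rm = TRUE)
})

# Append the count columns (named after the keywords) to the original data
result <- cbind(reprex, counts)
```

Because this relies on partial matching, a keyword such as "camp" would also count a value like "campfire"; wrapping the pattern in word boundaries (for example "\\bcamp\\b") is one way to tighten it if that matters.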

Answer 2

This solution works for any number of columns and values:

library(dplyr)
library(purrr)
library(tidyr)

reprex %>%
 as_tibble %>%
 # split the values by `, `
 mutate_all(strsplit, ", ") %>%
 # map through each column, then each cell, to turn it into a named vector
 # for example the first cell: c("camp", "overnight") => c("camp"=1, "overnight"=1)
 # then pivot it longer by row number (this makes summing the values quick)
 map_dfr( function(x) x %>% map_dfr( ~ set_names(rep(1, length(.x<-.x[!is.na(.x)])), .x)) %>%
     mutate(id = row_number()) %>% 
     pivot_longer(!id) ) %>%
 # group by id and name so group the same variables that are found in the same row
 group_by(id, name) %>%
 # get the sum
 summarise_all(sum, na.rm = TRUE) %>%
 ungroup %>%
 # return the tibble to wide format
 pivot_wider %>%
 # remove the id column
 select(-id) %>%
 # add the original data.frame to it
 tibble(reprex, .)

# A tibble: 3 x 5
  leaders         teen_adventure          camp float overnight
  <chr>           <chr>                  <dbl> <dbl>     <dbl>
1 camp, overnight NA                         1     0         1
2 NA              NA                         0     0         0
3 camp, overnight camp, float, overnight     2     1         2
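The same split-then-pivot idea can also be sketched more compactly with tidyr::separate_rows(). This is an alternative under stated assumptions rather than the code above: the helper column names `id`, `column`, and `answer` are made up for illustration, and the data frame is rebuilt from the question so the block is self-contained.

```r
library(dplyr)
library(tidyr)

# Example data from the question
reprex <- data.frame(
  leaders = c("camp, overnight", NA, "camp, overnight"),
  teen_adventure = c(NA, NA, "camp, float, overnight"),
  stringsAsFactors = FALSE
)

counts <- reprex %>%
  mutate(id = row_number()) %>%
  # one row per (original row, column) pair
  pivot_longer(-id, names_to = "column", values_to = "answer") %>%
  # one row per individual answer; NA cells stay NA and are dropped next
  separate_rows(answer, sep = ",\\s*") %>%
  filter(!is.na(answer)) %>%
  # count how often each answer occurs within each original row
  count(id, answer) %>%
  pivot_wider(names_from = answer, values_from = n, values_fill = 0) %>%
  # rows with no answers were filtered out, so restore them with zero counts
  complete(id = seq_len(nrow(reprex))) %>%
  mutate(across(-id, ~ replace_na(.x, 0L))) %>%
  arrange(id)

# Attach the counts to the original columns
result <- bind_cols(reprex, select(counts, -id))
```

Unlike a fixed keyword list, this discovers the answer values from the data, so a new answer in the form shows up as a new column without any code change.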

Answer 3

A base R option:

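# all distinct answers across the whole data frame
v <- unique(unlist(strsplit(na.omit(unlist(reprex)), ",\\s+")))
reprex <- cbind(
  reprex,
  do.call(
    rbind,
    lapply(
      1:nrow(reprex),
      # tabulate the answers in row k, keeping zero counts via the factor levels
      function(k) table(factor(unlist(strsplit(na.omit(unlist(reprex[k, ])), ",\\s+")), levels = v))
    )
  )
)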

This gives:

          leaders         teen_adventure camp overnight float
1 camp, overnight                   <NA>    1         1     0
2            <NA>                   <NA>    0         0     0
3 camp, overnight camp, float, overnight    2         2     1

Answer 4

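We can loop over the columns (map), extract the words with str_extract_all, use mtabulate to get the frequency counts, bind the list elements together, and then summarise to get the sum of the numeric columns.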

library(dplyr)
library(tidyr)
library(purrr)
library(stringr)
library(qdapTools)
library(data.table)
reprex %>% 
   map_dfr(~ str_extract_all(.x, "\\w+") %>%
             mtabulate, .id = 'grp') %>%
   group_by(grp = rowid(grp)) %>% 
   summarise(across(everything(), ~ sum(.x, na.rm = TRUE)), 
       .groups = 'drop') %>%
   select(-grp) %>% 
   bind_cols(reprex, .)

Output:

#            leaders         teen_adventure camp overnight float
#1 camp, overnight                   <NA>    1         1     0
#2            <NA>                   <NA>    0         0     0
#3 camp, overnight camp, float, overnight    2         2     1
