Separate a single column containing varying numbers of responses into multiple columns in R

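I have a column reporting the results of one survey question, in which respondents could tick as many of seven predetermined answers as they wished and could also type in their own free-text response. At present all responses are stored in a single column, with each selected and/or typed response separated by a semicolon. I would like to split them into eight columns: one for each of the seven possible checkboxes and one for the free-text response.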

A small reprex to show what I'm aiming for (with fewer columns than my real data, but the basic idea is the same):

library(tidyverse)

# where `fruit` is the input data
fruit <- tibble(id = 1:5,
                fruit = c('apple;banana',
                          'apple',
                          NA,
                          'banana;a free response which isn\'t any of the other columns',
                          'banana;apple;orange')
                )
fruit
#>      id fruit                                                      
#>   <int> <chr>                                                      
#> 1     1 apple;banana                                               
#> 2     2 apple                                                      
#> 3     3 NA                                                         
#> 4     4 banana;a free response which isn't any of the other columns
#> 5     5 banana;apple;orange     

# the output I'm trying to get:
tibble(id = 1:5,
       fruit = c('apple;banana','apple',NA,'banana','banana;apple;orange'),
       apple = c(T,T,F,F,T),
       banana = c(T,F,F,T,T),
       orange = c(F,F,F,F,T),
       other = c(F,F,F,'a free response which isn\'t any of the other columns',F))

#>      id fruit               apple banana orange other                                               
#>   <int> <chr>               <lgl> <lgl>  <lgl>  <chr>                                               
#> 1     1 apple;banana        TRUE  TRUE   FALSE  FALSE                                               
#> 2     2 apple               TRUE  FALSE  FALSE  FALSE                                               
#> 3     3 NA                  FALSE FALSE  FALSE  FALSE                                               
#> 4     4 banana              FALSE TRUE   FALSE  a free response which isn't any of the other columns
#> 5     5 banana;apple;orange TRUE  TRUE   TRUE   FALSE    

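I've tried various approaches, including tidyr::separate() and various permutations of tidyr::pivot_wider(), but haven't managed to get particularly close to the desired result. Every approach I've looked at so far expects the column to split into the same number of responses in every cell (and in the same order), which isn't the case in my data.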

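We can use mtabulate from the qdapTools package.

  1. Split the 'fruit' column at ; with strsplit, which gives a list.
  2. Replace the elements that are neither 'apple', 'banana', 'orange' nor missing with 'other'.
  3. Apply mtabulate to get the frequency of the elements in the list, convert the counts to logical (> 0), and cbind the result with the original data.
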
library(qdapTools)
lst1 <- lapply(strsplit(fruit$fruit, ";"), function(x) 
      replace(x, (! x %in% c("apple", "banana", "orange")) & !is.na(x), "other"))

cbind(fruit, mtabulate(lst1) > 0)

Output (here 'other' is a logical flag rather than the free text itself):

 id                                                       fruit apple banana orange other
1  1                                                apple;banana  TRUE   TRUE  FALSE FALSE
2  2                                                       apple  TRUE  FALSE  FALSE FALSE
3  3                                                        <NA> FALSE  FALSE  FALSE FALSE
4  4 banana;a free response which isn't any of the other columns FALSE   TRUE  FALSE  TRUE
5  5                                         banana;apple;orange  TRUE   TRUE   TRUE FALSE
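
The output above marks 'other' as TRUE/FALSE, whereas the question asked for the free text itself in that column. The following is a minimal base R sketch of one way to keep the text, not part of the original answer. The names `known`, `parts`, `flags` and `result` are illustrative, and using NA (instead of the string "FALSE") where no free text was given is an assumption that departs from the desired output in the question.

```r
# Input data from the question, rebuilt as a plain data frame
fruit <- data.frame(
  id = 1:5,
  fruit = c("apple;banana", "apple", NA,
            "banana;a free response which isn't any of the other columns",
            "banana;apple;orange"),
  stringsAsFactors = FALSE
)

known <- c("apple", "banana", "orange")             # the predetermined answers
parts <- strsplit(fruit$fruit, ";", fixed = TRUE)   # list of character vectors; NA stays NA

# One logical column per known answer (rows = respondents, columns = answers)
flags <- vapply(
  known,
  function(k) vapply(parts, function(p) k %in% p, logical(1)),
  logical(nrow(fruit))
)

# Free text: whatever remains after dropping known answers and NA
other <- vapply(parts, function(p) {
  extra <- p[!is.na(p) & !(p %in% known)]
  if (length(extra) == 0) NA_character_ else paste(extra, collapse = ";")
}, character(1))

result <- cbind(fruit, flags, other = other, stringsAsFactors = FALSE)
result
```

If a respondent typed more than one free-text entry, this sketch pastes them back together with a semicolon so that 'other' stays a single character column.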

Or use the tidyverse:

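  1. Split the 'fruit' column into one row per response with separate_rows.
  2. Create a new column ('fruit1') in which elements other than 'apple', 'banana', 'orange' are replaced with 'other'.
  3. Reshape from 'long' to 'wide' with pivot_wider. Specify values_fn as a lambda function that keeps the elements of 'fruit' that are not 'apple', 'banana', 'orange' as their own value and otherwise returns a logical (converted to character).
  4. Drop the column named NA that the missing response creates.
  5. Change the column types automatically with type.convert.
  6. Join with the original data using left_join.
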
library(dplyr)
library(tidyr)
fruit %>% 
    # one row per individual response
    separate_rows(fruit, sep = ";") %>% 
    # label each response: a known answer, NA, or "other"
    mutate(fruit1 = case_when(
        fruit %in% c("apple", "banana", "orange") ~ fruit,
        is.na(fruit) ~ NA_character_,
        TRUE ~ "other")) %>%
    # free text keeps its own value; known answers become "TRUE"
    pivot_wider(names_from = fruit1, values_from = fruit, 
        values_fn = function(x) ifelse(! x %in% 
            c("apple", "banana", "orange"), x, as.character(length(x) > 0)),
        values_fill = "FALSE") %>% 
    # drop the column created by the missing response
    select(-`NA`) %>% 
    type.convert(as.is = TRUE) %>% 
    left_join(fruit, by = "id")

Output:

# A tibble: 5 x 6
     id apple banana other                                                orange fruit                                                      
  <int> <lgl> <lgl>  <chr>                                                <lgl>  <chr>                                                      
1     1 TRUE  TRUE   FALSE                                                FALSE  apple;banana                                               
2     2 TRUE  FALSE  FALSE                                                FALSE  apple                                                      
3     3 FALSE FALSE  FALSE                                                FALSE  <NA>                                                       
4     4 FALSE TRUE   a free response which isn't any of the other columns FALSE  banana;a free response which isn't any of the other columns
5     5 TRUE  TRUE   FALSE                                                TRUE   banana;apple;orange
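
The pivot_wider approach above relies on each respondent having at most one free-text entry. As a hedged alternative, not part of the original answer, the sketch below builds the logical flags and the free text separately and then joins both back onto the input. It assumes dplyr >= 1.0 (for across and any_of) and a tidyr version that accepts a scalar values_fill. The names `known`, `long`, `flags` and `free_text` are illustrative, and rows without free text get NA in 'other' rather than "FALSE".

```r
library(dplyr)
library(tidyr)

# Input data from the question
fruit <- tibble(
  id = 1:5,
  fruit = c("apple;banana", "apple", NA,
            "banana;a free response which isn't any of the other columns",
            "banana;apple;orange")
)

known <- c("apple", "banana", "orange")   # the predetermined answers

# Long format: one row per individual response (the NA row is kept)
long <- fruit %>% separate_rows(fruit, sep = ";")

# Logical flags for the known answers only
flags <- long %>%
  filter(fruit %in% known) %>%
  distinct(id, fruit) %>%                 # guard against repeated ticks
  mutate(value = TRUE) %>%
  pivot_wider(names_from = fruit, values_from = value, values_fill = FALSE)

# Free text: everything that is neither a known answer nor NA
free_text <- long %>%
  filter(!is.na(fruit), !fruit %in% known) %>%
  group_by(id) %>%
  summarise(other = paste(fruit, collapse = ";"), .groups = "drop")

result <- fruit %>%
  left_join(flags, by = "id") %>%
  left_join(free_text, by = "id") %>%
  # respondents with no known answer at all (e.g. id 3) get FALSE, not NA
  mutate(across(any_of(known), ~ coalesce(.x, FALSE)))
result
```

A known answer that nobody selected will not appear as a column in this sketch, since the column names come from the data; any_of is used so that the final step does not fail in that case.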