R tidyr: use separate function to separate character column with comma-separated text into multiple columns using RegEx

Question

I have the following data frame

df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three"))

which looks like this

                x
1             one
2        one, two
3      two, three
4 one, two, three

I would like to split this x column into several columns, one for each distinct word in column x. Basically, I want the final result to look like this

    one  two  three
1    1    0     0
2    1    1     0
3    0    1     1
4    1    1     1

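I think that, to obtain that data frame, I probably need to use the separate function provided by tidyr and documented here. However, this requires knowledge of regular expressions, and I am not good with them. Can anyone help me obtain this data frame?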

IMPORTANT: I don't know the number or the spelling of the words a priori.

Important example

It should also work for empty strings. For example, if we have

df <- data.frame(x=c("one", "one, two", "two, three", "one, two, three", ""))

then it should also work.

Answer 1

With the tidyverse, we can split the 'x' column with separate_rows, create a sequence column, and reshape to wide format with pivot_wider from tidyr

library(dplyr)
library(tidyr)
df %>% 
   filter(!(is.na(x)|x==""))%>% 
   mutate(rn = row_number()) %>% 
   separate_rows(x) %>%
   mutate(i1 = 1) %>% 
   pivot_wider(names_from = x, values_from = i1, values_fill = list(i1 = 0)) %>%
   select(-rn)
# A tibble: 4 x 3
#    one   two three
#  <dbl> <dbl> <dbl>
#1     1     0     0
#2     1     1     0
#3     0     1     1
#4     1     1     1

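In the above code, the rn column is added so that every original row keeps a distinct identifier after separate_rows expands the rows; otherwise, pivot_wider returns list columns when there are duplicate elements. The 'i1' column with value 1 is added for use in values_from. Another option is to specify values_fn = length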
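
The values_fn = length option mentioned above could look roughly like the following. This is a minimal sketch, assuming the four-row df from the question; the result name res and the explicit sep are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Example data from the question (kept as character, not factor)
df <- data.frame(x = c("one", "one, two", "two, three", "one, two, three"),
                 stringsAsFactors = FALSE)

res <- df %>%
  mutate(rn = row_number()) %>%            # one identifier per original row
  separate_rows(x, sep = ",\\s*") %>%      # one word per row
  mutate(i1 = 1) %>%                       # helper column to aggregate
  pivot_wider(names_from = x, values_from = i1,
              values_fn = list(i1 = length),   # count entries per cell
              values_fill = list(i1 = 0L)) %>% # absent words become 0
  select(-rn)

res
```

Because length counts entries, a word repeated within one string would yield a count above 1 rather than a 0/1 indicator.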

Or we can use table in base R after splitting the 'x' column

table(stack(setNames(strsplit(as.character(df$x), ",\\s+"), seq_len(nrow(df))))[2:1])
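
For the empty-string example from the question, the table approach can be sketched as below. This is a hedged illustration, assuming the five-row df; the names words, tab and res, and the conversion with as.data.frame.matrix, are additions for illustration only.

```r
# Five-row example from the question, including an empty string
df <- data.frame(x = c("one", "one, two", "two, three", "one, two, three", ""),
                 stringsAsFactors = FALSE)

# Split each string on a comma followed by optional whitespace;
# the empty string yields a zero-length character vector
words <- strsplit(as.character(df$x), ",\\s*")
names(words) <- seq_len(nrow(df))

# stack() keeps every list name as a factor level, so the empty row
# still appears in the cross-tabulation
tab <- table(stack(words)[2:1])

# Convert the two-way table to a plain data frame of counts
res <- as.data.frame.matrix(tab)
res
```

Note that table orders the columns by sorted word rather than by first appearance, and the empty string should come through as a row of zeros.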

Answer 2

Here is a base R solution

# split strings by ", " and save the pieces in a list `lst`
# (strsplit always returns a list, whereas apply may return a matrix)
lst <- strsplit(as.character(df$x), ", ")

# a common set including all distinct words
common <- Reduce(union, lst)

# build a 0/1 matrix by checking which words in `common` occur in each element of `lst`
dfout <- `names<-`(data.frame(do.call(rbind, lapply(lst, function(x) +(common %in% x))), row.names = NULL), common)

which gives

> dfout
  one two three
1   1   0     0
2   1   1     0
3   0   1     1
4   1   1     1
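
The indicator step relies on unary + turning a logical vector into integers. A small sketch of that step in isolation, with row_words and flags as illustrative names:

```r
common <- c("one", "two", "three")   # all distinct words
row_words <- c("two", "three")       # words found in one row

# %in% gives one TRUE/FALSE per word in `common`
present <- common %in% row_words

# unary plus coerces the logical vector to integer 0/1
flags <- +present
flags
```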

Answer 3

You can build a pattern from your columns and use it with tidyr::extract():

library(tidyverse)
cols <- c("one","two","three")
pattern <- paste0("(",cols,")*", collapse= "(?:, )*")
df %>% 
  extract(x, into = c("one","two","three"), regex = pattern) %>%
  mutate_all(~as.numeric(!is.na(.)))
#>   one two three
#> 1   1   0     0
#> 2   1   1     0
#> 3   0   1     1
#> 4   1   1     1
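
To make the generated regular expression easier to read, this short sketch rebuilds the pattern on its own, using the same cols as above:

```r
cols <- c("one", "two", "three")

# Wrap each word in an optional capture group and join the groups
# with an optional non-capturing ", " separator
pattern <- paste0("(", cols, ")*", collapse = "(?:, )*")
pattern
```

Each word gets its own capture group, and the groups are matched in the order given by cols, so this approach assumes the words are known beforehand and may not capture words that appear in a different order.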
