How to separate rows into columns based on a variable number of pattern matches per row

I have a data frame like this:

df <- data.frame(
  id = c("A","B"),
  date = c("31/07/2019", "31/07/2020"),
  x = c('random stuff "A":88876, more stuff',
         'something, "A":1234, more "A":456, random "A":32078, more'),
  stringsAsFactors = FALSE
)

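I want to create as many new columns as there are pattern matches; the pattern is (?<="A":)\d+(?=,), i.e. "match the number if you see the string "A": on its left and a comma , on its right".

The problems: (i) the number of matches may vary from row to row, and (ii) the maximum number of new columns is not known in advance.

What I have done so far is: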

df[paste("A", 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))), sep = "")] <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')

While 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))) takes care of the unknown number of new columns, I get a warning:

`Warning message:
In `[<-.data.frame`(`*tmp*`, paste("A", 1:max(lengths(str_extract_all(df$x,  :
  replacement element 2 has 3 rows to replace 2 rows`

and the values are clearly assigned incorrectly:

df
  id       date                                                         x    A1   A2    A3
1  A 31/07/2019                        random stuff "A":88876, more stuff 88876 1234 88876
2  B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more 88876  456 88876

The correct output would look like this:

df
  id       date                                                         x    A1   A2    A3
1  A 31/07/2019                        random stuff "A":88876, more stuff 88876   NA    NA
2  B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more  1234  456 32078

Any ideas?

Here's a somewhat pedestrian stringr solution:

library(stringr)
library(dplyr)

matches <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')
ncols   <- max(sapply(matches, length))

matches %>%
  lapply(function(y)  c(y, rep(NA, ncols - length(y)))) %>%
  do.call(rbind, .) %>%
  data.frame() %>%
  setNames(paste0("A", seq(ncols))) %>%
  cbind(df, .) %>%
  tibble()
#> # A tibble: 2 x 6
#>   id    date      x                                            A1    A2    A3   
#>   <chr> <chr>     <chr>                                        <chr> <chr> <chr>
#> 1 A     31/07/20~ "random stuff \"A\":88876, more stuff"       88876 <NA>  <NA> 
#> 2 B     31/07/20~ "something, \"A\":1234, more \"A\":456, ran~ 1234  456   32078

Created on 2020-07-06 by the reprex package (v0.3.0)
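
For what it's worth, a shorter variant may be possible with the simplify = TRUE argument of str_extract_all(), which returns a character matrix (one row per input string) and pads rows that have fewer matches with empty strings. The sketch below is not the approach used above, only an alternative under that assumption; the names m and out are just illustrative, and it assumes a genuine match is never an empty string (true here, since \d+ needs at least one digit).

```r
library(stringr)

df <- data.frame(
  id = c("A", "B"),
  date = c("31/07/2019", "31/07/2020"),
  x = c('random stuff "A":88876, more stuff',
        'something, "A":1234, more "A":456, random "A":32078, more'),
  stringsAsFactors = FALSE
)

# simplify = TRUE gives a character matrix; short rows are padded with ""
m <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)', simplify = TRUE)

# turn the "" padding into NA (assumes real matches are never empty)
m[m == ""] <- NA

# name the columns A1, A2, ... according to the widest row
colnames(m) <- paste0("A", seq_len(ncol(m)))

# bind the new columns to the original data frame
out <- cbind(df, as.data.frame(m, stringsAsFactors = FALSE))
out
```

The new columns are still character; wrapping them in as.integer() would be a separate step if numeric values are needed.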