How to separate rows into columns based on a variable number of pattern matches per row
I have a data frame like this:
df <- data.frame(
  id = c("A", "B"),
  date = c("31/07/2019", "31/07/2020"),
  x = c('random stuff "A":88876, more stuff',
        'something, "A":1234, more "A":456, random "A":32078, more'),
  stringsAsFactors = FALSE
)
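I want to create as many new columns as there are matches of a pattern. The pattern is (?<="A":)\d+(?=,), i.e. "match the digits if you see the string "A": on the left and a comma on the right".
The problems: (i) the number of matches may differ from row to row, and (ii) the maximum number of new columns is not known in advance.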
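To make the pattern and problem (i) concrete, here is a minimal sketch that applies the same regular expression with base R instead of stringr. The variable names `x`, `pattern`, and `matches` are illustrative, and `perl = TRUE` is needed because lookbehind and lookahead are PCRE features.

```r
# The two strings from the question's column x.
x <- c('random stuff "A":88876, more stuff',
       'something, "A":1234, more "A":456, random "A":32078, more')

# Digits preceded by "A": and followed by a comma; the backslash is doubled
# because it sits inside an R string literal.
pattern <- '(?<="A":)\\d+(?=,)'

# gregexpr() finds every match per string; regmatches() extracts them,
# returning a list with one character vector per input string.
matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))

# Number of matches per row: this is what varies from row to row.
lengths(matches)
```

The result is a list with one element per row, so its shape does not yet fit a rectangular set of columns.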
What I have done so far is:
df[paste("A", 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))), sep = "")] <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')
While 1:max(lengths(str_extract_all(df$x, '(?<="A":)\\d+(?=,)'))) solves the problem of the unknown number of new columns, I get a warning:
Warning message:
In `[<-.data.frame`(`*tmp*`, paste("A", 1:max(lengths(str_extract_all(df$x, :
  replacement element 2 has 3 rows to replace 2 rows
and the values are clearly not assigned correctly:
df
id date x A1 A2 A3
1 A 31/07/2019 random stuff "A":88876, more stuff 88876 1234 88876
2 B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more 88876 456 88876
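The misalignment in this output is consistent with the data frame assignment treating each list element as a whole column rather than as a row. The sketch below reproduces that with a toy data frame; `toy` and `per_row` are hypothetical names, and the exact recycling behaviour is inferred from the output shown in the question rather than from documentation.

```r
# Toy stand-in for the question's data: two rows, plus a per-row list of matches.
toy <- data.frame(id = c("A", "B"), stringsAsFactors = FALSE)
per_row <- list("88876", c("1234", "456", "32078"))

# The list has one element per ROW, but the assignment uses each element as a
# COLUMN. suppressWarnings() hides the "replacement element 2 has 3 rows"
# warning so that the resulting columns can be inspected.
suppressWarnings(toy[c("A1", "A2", "A3")] <- per_row)
toy
```

Under that reading, A1 and A3 both come from the first row's single match, and A2 holds the first two matches of the second row, which is why the list has to be reshaped into one padded row per observation before it is bound to the data frame.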
The correct output would look like this:
df
id date x A1 A2 A3
1 A 31/07/2019 random stuff "A":88876, more stuff 88876 NA NA
2 B 31/07/2020 something, "A":1234, more "A":456, random "A":32078, more 1234 456 32078
Any ideas?
Here's a somewhat pedestrian stringr solution:
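library(stringr)
library(dplyr)

# One character vector of matches per row of df
matches <- str_extract_all(df$x, '(?<="A":)\\d+(?=,)')
# The row with the most matches determines the number of new columns
ncols <- max(lengths(matches))

matches %>%
  lapply(function(y) c(y, rep(NA, ncols - length(y)))) %>%  # pad short rows with NA
  do.call(rbind, .) %>%                                     # stack rows into a matrix
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(paste0("A", seq_len(ncols))) %>%
  cbind(df, .) %>%
  tibble()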
#> # A tibble: 2 x 6
#> id date x A1 A2 A3
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 A 31/07/20~ "random stuff \"A\":88876, more stuff" 88876 <NA> <NA>
#> 2 B 31/07/20~ "something, \"A\":1234, more \"A\":456, ran~ 1234 456 32078
Created on 2020-07-06 by the reprex package (v0.3.0)
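For comparison, the same pad-and-bind idea can be sketched in base R without stringr or dplyr. This is an alternative under stated assumptions, not the method used in the answer: it relies on the fact that indexing a vector past its end yields NA, and the names `padded`, `new_cols`, and `result` are illustrative. It also assumes at least one row has a match.

```r
df <- data.frame(
  id = c("A", "B"),
  date = c("31/07/2019", "31/07/2020"),
  x = c('random stuff "A":88876, more stuff',
        'something, "A":1234, more "A":456, random "A":32078, more'),
  stringsAsFactors = FALSE
)

# Extract all matches per row (perl = TRUE enables the lookarounds).
pattern <- '(?<="A":)\\d+(?=,)'
matches <- regmatches(df$x, gregexpr(pattern, df$x, perl = TRUE))
ncols <- max(lengths(matches))

# Indexing each vector with 1..ncols pads shorter rows with NA,
# and rbind() stacks the padded vectors into a character matrix.
padded <- do.call(rbind, lapply(matches, `[`, seq_len(ncols)))

# Name the new columns A1, A2, ... and attach them to the original data.
new_cols <- as.data.frame(padded, stringsAsFactors = FALSE)
names(new_cols) <- paste0("A", seq_len(ncols))
result <- cbind(df, new_cols)
result
```

The new columns are character vectors, as in the tibble above; if numbers are needed, each column can be converted with as.numeric() afterwards.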