R tidyr regex: extract ordered numbers from character column
Suppose I have a data frame like this:
df <- data.frame(x=c("This script outputs 10 visualizations.",
"This script outputs 1 visualization.",
"This script outputs 5 data files.",
"This script outputs 1 data file.",
"This script doesn't output any visualizations or data files",
"This script outputs 9 visualizations and 28 data files.",
"This script outputs 1 visualization and 1 data file."))
which looks like this:
x
1 This script outputs 10 visualizations.
2 This script outputs 1 visualization.
3 This script outputs 5 data files.
4 This script outputs 1 data file.
5 This script doesn't output any visualizations or data files
6 This script outputs 9 visualizations and 28 data files.
7 This script outputs 1 visualization and 1 data file.
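Is there a simple way, possibly using the tidyverse, to extract the number of visualizations and the number of data files from each row? When a row mentions no visualizations (or no data files, or neither), I would like to extract 0. Essentially, I want the final result to look like this: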
viz files
1 10 0
2 1 0
3 0 5
4 0 1
5 0 0
6 9 28
7 1 1
I tried something like
str_extract(df$x, "(?<=This script outputs )(.*)(?= visualizatio(n\\.$|ns\\.$))")
but I got lost.
We can use a regex lookahead in str_extract to extract one or more digits (\d+) that are followed by a space and either 'vis' or 'data file(s)', once for each of the two columns:
library(dplyr)
library(stringr)
library(tidyr)  # replace_na() comes from tidyr
df %>%
  transmute(viz = as.numeric(str_extract(x, "\\d+(?= vis)")),
            files = as.numeric(str_extract(x, "\\d+(?= data files?)"))) %>%
  mutate_all(replace_na, 0)
# viz files
#1 10 0
#2 1 0
#3 0 5
#4 0 1
#5 0 0
#6 9 28
#7 1 1
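In the first case, the pattern matches one or more digits (\d+) followed by a lookahead ((?=...)) that requires a space and then 'vis'. For the second column, it extracts the digits that are followed by a space and 'data file' or 'data files'. Rows without a match give NA, which replace_na then turns into 0.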
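To see what the lookahead does in isolation, here is a minimal sketch in base R, where perl = TRUE enables lookarounds. The helper name first_number_before is made up for illustration and is not part of any package.

```r
# Hypothetical helper: return the first run of digits that is immediately
# followed by `suffix` (a regex fragment), or NA when there is no such run.
first_number_before <- function(x, suffix) {
  pattern <- paste0("\\d+(?=", suffix, ")")
  m <- regexpr(pattern, x, perl = TRUE)   # -1 where nothing matched
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)        # regmatches() returns matches only
  as.numeric(out)
}

s <- c("This script outputs 9 visualizations and 28 data files.",
       "This script outputs 1 data file.",
       "This script doesn't output any visualizations or data files")

first_number_before(s, " vis")          # 9 NA NA
first_number_before(s, " data files?")  # 28 1 NA
```

The lookahead only checks what follows the digits and is not part of the match, so the result is just the number and can be passed straight to as.numeric().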
A base R approach...
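# sub() returns the input unchanged when there is no match, so as.numeric()
# yields NA (with a coercion warning) for those rows; they are set to 0 below.
# The lazy .*? keeps multi-digit numbers such as 28 intact.
df$viz <- suppressWarnings(as.numeric(sub(".*?(\\d+) visualization.*", "\\1", df$x, perl = TRUE)))
df$files <- suppressWarnings(as.numeric(sub(".*?(\\d+) data file.*", "\\1", df$x, perl = TRUE)))
df[is.na(df)] <- 0
df
# x viz files
# 1 This script outputs 10 visualizations. 10 0
# 2 This script outputs 1 visualization. 1 0
# 3 This script outputs 5 data files. 0 5
# 4 This script outputs 1 data file. 0 1
# 5 This script doesn't output any visualizations or data files 0 0
# 6 This script outputs 9 visualizations and 28 data files. 9 28
# 7 This script outputs 1 visualization and 1 data file. 1 1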
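One thing to keep in mind with sub() is that it leaves a string untouched when the pattern does not match. A rough sketch of a variant that checks for a match first and falls back to 0 could look like this. The helper name count_or_zero is hypothetical, and the lazy .*? with perl = TRUE is my own choice so that multi-digit numbers are captured whole.

```r
# Hypothetical helper: capture group 1 of `pattern` as a number,
# or 0 for strings where the pattern does not match at all.
count_or_zero <- function(x, pattern) {
  out <- numeric(length(x))                 # start with all zeros
  hit <- grepl(pattern, x, perl = TRUE)     # which strings match?
  out[hit] <- as.numeric(sub(pattern, "\\1", x[hit], perl = TRUE))
  out
}

x <- c("This script outputs 5 data files.",
       "This script outputs 9 visualizations and 28 data files.",
       "This script doesn't output any visualizations or data files")

count_or_zero(x, ".*?(\\d+) visualization.*")  # 0 9 0
count_or_zero(x, ".*?(\\d+) data file.*")      # 5 28 0
```

Because only the matching strings are converted, there are no coercion warnings and no NA values to replace afterwards.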
You can use the unglue package for a readable solution, since you have a limited number of possible patterns, and then replace NA with 0:
library(unglue)
patterns <-
c("This script outputs {viz} visualization{=s{0,1}} and {files} data file{=s{0,1}}.",
"This script outputs {viz} visualization{=s{0,1}}.",
"This script outputs {files} data file{=s{0,1}}.")
res <- unglue_unnest(df, x, patterns, convert = TRUE)
res[is.na(res)] <- 0
res
#> viz files
#> 1 10 0
#> 2 1 0
#> 3 0 5
#> 4 0 1
#> 5 0 0
#> 6 9 28
#> 7 1 1