Extract text in two columns from a string

Question

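I have a table in which one column contains data like this:

table$test_string <- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"

1.) I am trying to extract the first part of this string, the text inside the square brackets, into a column of its own, i.e.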

table$project_name <- "projectname"

using the regex:

project_name <- "^\\[|(?:[a-zA-Z]|[0-9])+|\\]$"
table$project_name <- str_extract(table$test_string, project_name)

If I test the regex on a single value of the table (one row on its own), it works with str_extract_all(table$test_string, project_name)[[1]][2].

However, when I apply the regex pattern to the whole table I get NA, and if I use str_extract_all I get an error.

2.) The second part of the string, the URL, should go into another column,

table$url_link <- "https://somewebsite.com/projectname/Abc/xyz-09"

I am using the following regex for the URL:

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*,]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

table$url_link <- str_extract(table$test_string, url_pattern)

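This works on the whole table, but I still get the trailing parenthesis ')' in the url link.

What am I missing here? Why does the first regex work on a single value but not on the whole table? And for the URL, how do I avoid getting the trailing parenthesis?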
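On the trailing ')': a likely cause is the character class [$-_@.&+] in url_pattern. Inside brackets, $-_ reads as a range from "$" to "_", and ")" falls inside that range. The sketch below checks this in base R with perl = TRUE, so the range is compared by code point. The name fixed_url_pattern and the exact set of allowed punctuation are assumptions for illustration, not something from the thread.

```r
test_string <- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"

# ")" (0x29) lies between "$" (0x24) and "_" (0x5F), so the range "$-_" matches it.
range_matches_paren <- grepl("[$-_]", ")", perl = TRUE)
print(range_matches_paren)

# Hypothetical variant: list the punctuation explicitly and put "-" last,
# where it is taken literally instead of forming a range.
fixed_url_pattern <- "https?://(?:[a-zA-Z0-9$_@.&+!*,/-]|%[0-9a-fA-F]{2})+"

# regexpr() finds the first match; regmatches() pulls out the matched text.
url_link <- regmatches(test_string, regexpr(fixed_url_pattern, test_string, perl = TRUE))
print(url_link)
```

Because ")" is no longer in the class, the match stops just before the closing parenthesis. URLs that legitimately contain other punctuation would need those characters added to the class.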

Answer 1

I think things can be simplified a lot by using parentheses as capture groups. For example:

test_string <- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"

regex <- "\\[(.*)\\]\\((.*)\\)"

gsub(regex, "\\1", test_string)
#> [1] "projectname"

gsub(regex, "\\2", test_string)
#> [1] "https://somewebsite.com/projectname/Abc/xyz-09"
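The same idea should carry over to a whole column, since sub() and gsub() are vectorised. Here is a minimal sketch, assuming a hypothetical data frame df whose second row is made up for illustration. The name link_pattern is also hypothetical, and the negated classes [^]]* and [^)]* are swapped in for .* so that a greedy match cannot run past the first closing bracket.

```r
# Hypothetical sample data; the second row is invented for illustration.
df <- data.frame(
  test_string = c(
    "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)",
    "[other1](https://example.com/other1)"
  ),
  stringsAsFactors = FALSE
)

# One pattern, two capture groups: the text inside [...] and the text inside (...).
link_pattern <- "^\\[([^]]*)\\]\\(([^)]*)\\)$"

# sub() works element-wise, so every row of the column is processed in one call.
df$project_name <- sub(link_pattern, "\\1", df$test_string)
df$url_link     <- sub(link_pattern, "\\2", df$test_string)

print(df[, c("project_name", "url_link")])
```

One caveat: when a row does not match the pattern, sub() returns that row unchanged rather than NA, so it may be worth checking with grepl(link_pattern, df$test_string) first.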

Answer 2

We can make use of the convenience functions in qdapRegex

library(qdapRegex)
rm_round(test_string, extract = TRUE)[[1]]
#[1] "https://somewebsite.com/projectname/Abc/xyz-09"

rm_square(test_string, extract = TRUE)[[1]]
#[1] "projectname"

Data

test_string <- "[projectname](https://somewebsite.com/projectname/Abc/xyz-09)"
