Regex for gsub to match line until and through newline \n character

Question

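I'm trying to build a regular expression for R's gsub that matches a string up to and through a newline character, so that the match can be removed.

Example string: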

text <- "categories: crime, punishment, france\nTags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."

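The ideal result is to use gsub to replace the first two blocks of text, so that only the trailing text remains:

At the end of the day, the criminal Valjean escaped once more.

with the categories and tags removed.

This is the pattern I'm using: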

^categor*.\n{1}

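It should match the start of the line and everything after the word fragment until it reaches the first newline, but it only matches the fragment. What am I doing wrong?

Also, is there a better way to approach this than two gsub calls?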

Answer 1

1) The question as asked is about removing the first two lines, so the first option does exactly that:

sub("^categor([^\n]*\n){2}", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."

If the categor part does not matter, then do this:

tail(strsplit(text, "\n")[[1]], -2)
## [1] "At the end of the day, the criminal Valjean escaped once more."

2) If you want to remove any line of the form ...:....\n, where the characters before the colon on each line must be word characters:

gsub("\\w+:[^\n]+\n", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."

or

gsub("\\w+:.+?\n", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."

or

grep("^\\w+:", unlist(strsplit(text, "\n")), invert = TRUE, value = TRUE)
## [1] "At the end of the day, the criminal Valjean escaped once more."
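The line-based variants return a character vector with one element per remaining line. When the body spans several lines, the pieces can be joined back into a single string. A minimal sketch, assuming a hypothetical multi-line input (`text2` is invented for illustration and is not part of the question):

```r
# Hypothetical input with a two-line body (not from the original question)
text2 <- "categories: crime\nTags: valjean\nFirst body line.\nSecond body line."

# Split into lines, treating "\n" as a literal separator
lines <- strsplit(text2, "\n", fixed = TRUE)[[1]]

# Keep only lines that do NOT start with "word characters + colon"
kept <- grep("^\\w+:", lines, invert = TRUE, value = TRUE)

# Re-join the remaining lines into one string
body_text <- paste(kept, collapse = "\n")
body_text
```

Note that this filter drops any line starting with `word:`, so a body line that happens to begin that way would be removed as well.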

3) Or, if we want to remove only the lines with specific tags:

gsub("(categories|Tags):.+?\n", "", text)
## [1] "At the end of the day, the criminal Valjean escaped once more."
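If the set of labels grows, the alternation in option 3 could be assembled from a character vector instead of being typed by hand. A minimal sketch, assuming a hypothetical `header_labels` vector (not part of the original answer) whose entries contain no regex metacharacters:

```r
text <- "categories: crime, punishment, france\nTags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."

# Hypothetical list of labels to strip; assumed free of regex metacharacters
header_labels <- c("categories", "Tags")

# Build "(categories|Tags):[^\n]*\n" from the vector
pattern <- paste0("(", paste(header_labels, collapse = "|"), "):[^\n]*\n")

# Remove every line that starts with one of the labels followed by a colon
remaining <- gsub(pattern, "", text)
remaining
```

Using `[^\n]*` rather than `.` keeps the behavior the same whether or not `perl = TRUE` is set.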

4) If you also want to capture the tags, read.dcf may be of interest:

s <- unlist(strsplit(text, "\n"))
ix <- grep("^\\w+:", s, invert = TRUE)
s[ix] <- paste("Content", s[ix], sep = ": ")
out <- read.dcf(textConnection(s))

giving this 3-column matrix:

> out
     categories                  Tags                     
[1,] "crime, punishment, france" "valjean, javert,les mis"
     Content                                                         
[1,] "At the end of the day, the criminal Valjean escaped once more."
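Because the result is a character matrix with the field names as column names, individual fields can be pulled out by name. A short sketch of how that might look; the variable names `con` and `tags` are illustrative, and the connection is closed explicitly here as a precaution:

```r
text <- "categories: crime, punishment, france\nTags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."

# Label the unlabelled line so that every line has the "field: value" shape
s <- unlist(strsplit(text, "\n"))
ix <- grep("^\\w+:", s, invert = TRUE)
s[ix] <- paste("Content", s[ix], sep = ": ")

# Parse the lines as a DCF record
con <- textConnection(s)
out <- read.dcf(con)
close(con)

# Address a field by its column name
out[1, "Content"]

# Split the comma-separated tags, tolerating optional spaces after each comma
tags <- strsplit(out[1, "Tags"], ",\\s*")[[1]]
tags
```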

Answer 2

Perhaps the following regular expression:

sub("^categor.*\n([^\n]*$)", "\\1", text)
#[1] "At the end of the day, the criminal Valjean escaped once more."

Answer 3

Try this (the \n matches the newline):

gsub("^categor.*\n", "", text)
# [1] "At the end of the day, the criminal Valjean escaped once more."
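The pattern in the question fails because the quantifier and the dot are in the wrong order: `r*.` means zero or more `r` characters followed by exactly one arbitrary character, whereas `.*` means any run of characters. A small check of both forms, offered as a sketch using only base R:

```r
text <- "categories: crime, punishment, france\nTags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."

# Question's pattern: "catego", optional r's, ONE character, then a newline.
# No newline sits that close to the start of the string, so nothing matches.
original_matches <- grepl("^categor*.\n{1}", text)

# With ".*" the match can extend across the rest of the line to a newline
fixed_matches <- grepl("^categor.*\n", text)

c(original = original_matches, fixed = fixed_matches)
```

The `{1}` quantifier is redundant in either form, since a single `\n` already matches exactly once.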

Answer 4

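There is no need to use [^\n], since you can just use . to match any character other than a newline. Note that you need the (?n) modifier with TRE (the default regex engine used by (g)sub/(g)regexpr); with perl=TRUE, this is the default behavior of .: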

text <- "categories: crime, punishment, france\nTags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."
sub("(?n)^categor(?:.*\n){2}", "", text)
sub("^categor(?:.*\n){2}", "", text, perl=TRUE)

Here, the first two lines are removed if the string starts with categor.

See the R demo online.

Pattern details

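^ - start of string anchor
categor - a literal substring
(?:.*\n){2} - exactly 2 consecutive occurrences ({2}) of: any characters other than a newline (.), zero or more times (*), followed by an LF character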
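The engine difference described in this answer can be seen by running one pattern three ways. This is a minimal sketch; the variable names `tre`, `pcre`, and `tre_n` are illustrative only:

```r
text <- "categories: crime, punishment, france\nTags: valjean, javert,les mis\nAt the end of the day, the criminal Valjean escaped once more."

# Default engine (TRE): "." also matches "\n", so the greedy ".*" runs
# through to the LAST newline and both header lines are removed
tre <- sub("^categor.*\n", "", text)

# PCRE: "." stops at "\n", so only the first line is removed
pcre <- sub("^categor.*\n", "", text, perl = TRUE)

# TRE with (?n): "." no longer matches "\n", mirroring the PCRE result
tre_n <- sub("(?n)^categor.*\n", "", text)

cat(tre, pcre, tre_n, sep = "\n---\n")
```

This is why the explicit `{2}` is needed once `.` is prevented from crossing line breaks.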
