Extract certain number of words or special characters after a string in R

Question

I am trying to extract a certain number of words after a specific string.

library(stringr)

x <- data.frame(end = c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))

For example, to extract the 4 words after "source", I learned from another question to use this code:

trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4}'))

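This works well. However, if I try to select 8 words, it does not recognize the "/" in the first string and returns NA.

trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){8}'))

The question is: is there a regex to include the special characters (or bypass them), so I can still extract the needed words? I noticed that the same happens with other characters (e.g. -) or with double white space.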

The expected output for 8 words would be something like this:

from animal origin as Vitamin A / all-trans-Retinol

It does not matter whether / and - are counted as words, because I can always raise the quantifier (in my case, I do not mind extracting more than I need).

Thanks

Answer 1

We can specify a range with {4,8}:

trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4,8}'))
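To make the effect of the range concrete, here is a small sketch run on the three strings from the question. As a simplification of my own, x is defined as a plain character vector rather than a data frame column. Because the quantifier is greedy, each string yields as many of the 4 to 8 tokens as match consecutively.

```r
library(stringr)

# The three strings from the question, as a plain character vector
x <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

# Greedy range: take between 4 and 8 consecutive "word, optional comma, whitespace" tokens
res <- trimws(str_extract(x, "(?<=source:\\s)(\\w+,?\\s){4,8}"))
print(res)
# [1] "from animal origin as Vitamin A"
# [2] "Eggs, liver, certain fish species such as sardines,"
# [3] "Leafy green vegetables such as"
```

The first string stops after 6 words because "/" is not matched by \w+, and the third stops after 5 because of the ";" after "spinach".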

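Or, if specific counts are needed, build one pattern per count:

pat <- sprintf('(?<=source:\\s)(\\w+,?\\s){%d}', c(8, 4))

Then extract with each pattern and coalesce the results, so that the first non-NA match per string is kept: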

library(dplyr)
do.call(coalesce, lapply(pat, function(y) trimws(stringr::str_extract(x$end, y))))
#[1] "from animal origin as"     
#[2] "Eggs, liver, certain fish species such as sardines,"
#[3] "Leafy green vegetables such"

Answer 2

The question is: is there a regex to include the special characters (or bypass them), so I can still extract the needed words? I noticed that the same happens with other characters (eg - ) or with double white space.

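Side note: this is a kind of XY problem (https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem)

Your problem is not that the regex is not working. The regex is working, but you expect something different. You use it to select the next 8 words after a certain string, but there are only 6 words before the non-word character (/), so the string does not match your pattern.

So, to get an "answer" to your problem, you should first rework your question:

What exactly do you expect?

akrun's solution matches any run of 4 to 8 words, but I doubt that this is what you really need.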
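One way to check this is to let the token group repeat freely and count how many tokens it consumes before it reaches a character it cannot match. The sketch below is only a diagnostic under that assumption; the variable names are mine, and x is a plain character vector holding the question's strings.

```r
library(stringr)

x <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

# Longest run of consecutive tokens the original group can match after "source: "
run <- str_extract(x, "(?<=source:\\s)(\\w+,?\\s)*")

# Number of words in that run; a {8} quantifier needs at least 8 here
n_tokens <- str_count(run, "\\w+")
print(n_tokens)
# [1]  6 13  5
```

Only the second string has 8 or more matchable tokens in a row, which is why {8} returns NA for the other two.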

Answer 3

Here is a base R option using regmatches + gsub:

lapply(regmatches(u <- gsub(".*?source:\\s+?","",x$end),gregexpr("\\w+",u)),`[`,1:4)

which gives

[[1]]
[1] "from"   "animal" "origin" "as"

[[2]]
[1] "Eggs"    "liver"   "certain" "fish"

[[3]]
[1] "Leafy"      "green"      "vegetables" "such"
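If a single string per row is wanted rather than a list of words, the same base R idea can be sketched as follows. This is a variation of mine, not the answer's code: it uses head() instead of `[`(w, 1:4), since indexing past the end of a short vector pads the result with NA. Note that punctuation is dropped, because only \w+ runs are kept.

```r
# Shortened sample strings (assumed for illustration)
x <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general",
  "source: Eggs, liver, certain fish species such as sardines",
  "source: one two"
)

# Drop everything up to and including "source:" plus the following whitespace
u <- sub(".*?source:\\s+", "", x)

# Split each remainder into word runs
words <- regmatches(u, gregexpr("\\w+", u))

# Keep at most 4 words per string and join them with single spaces
first4 <- vapply(words, function(w) paste(head(w, 4), collapse = " "), character(1))
print(first4)
# [1] "from animal origin as"   "Eggs liver certain fish" "one two"
```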

Answer 4

You may rely on the \S shorthand character class, which matches any non-whitespace character:

(?<=source:\s)\S+(?:\s+\S+){3,7}\b

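Details:

(?<=source:\s) - a position immediately preceded by source: and a whitespace character
\S+ - one or more non-whitespace characters
(?:\s+\S+){3,7} - three to seven occurrences of 1+ whitespace characters followed by 1+ non-whitespace characters
\b - a word boundary.

R demo: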

library(stringr)
x <- data.frame(end = c("source: from animal origin as Vitamin A / alltrans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))
stringr::str_extract(x$end, "(?<=source:\\s)\\S+(?:\\s+\\S+){3,7}\\b")

输出：

[1] "from animal origin as Vitamin A / alltrans-Retinol"
[2] "Eggs, liver, certain fish species such as sardines"
[3] "Leafy green vegetables such as spinach; egg yolks"
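The token count can also be made a parameter. The following is a minimal sketch, not part of the answer above: extract_after is a hypothetical helper name, key is assumed to contain no regex metacharacters, and a capture group with str_match replaces the lookbehind so that more than one whitespace character after the key is tolerated.

```r
library(stringr)

# Hypothetical helper: return up to `n` whitespace-separated tokens following `key`.
# `key` is inserted into the regex as-is, so it must not contain regex metacharacters.
extract_after <- function(text, key, n) {
  pattern <- sprintf("%s\\s+(\\S+(?:\\s+\\S+){0,%d})\\b", key, as.integer(n) - 1L)
  # Column 2 of the match matrix holds the first capture group (NA when there is no match)
  str_match(text, pattern)[, 2]
}

# Double space after "source:" on purpose
s <- "source:  from animal origin as Vitamin A / all-trans-Retinol: Fish in general"
print(extract_after(s, "source:", 8))
# [1] "from animal origin as Vitamin A / all-trans-Retinol"
```

As in the pattern above, "/" counts as one token here, and the trailing \b trims punctuation such as ":" from the end of the result.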
