Extract certain number of words or special characters after a string in R

I am trying to extract a certain number of words after a specific string.

library(stringr)

x <- data.frame(end = c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))

For example, to extract the 4 words after "source", I use this code, which I learned from another question:

trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4}'))

This works well. However, if I try to select 8 words, it does not recognize the "/" in the first string and returns NA.

trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){8}'))

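The question is: is there a regex to include the special characters (or bypass them), so I can still extract the needed words? I noticed that the same happens with other characters (e.g. -) or with double white space.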

The expected output for 8 words should be something like this:

from animal origin as Vitamin A / all-trans-Retinol  

It does not matter whether / and - are counted as words, because I can always raise the quantifier (in my case, I don't mind extracting more than I need).

Thanks

We can specify a range with {4,8}:

trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4,8}'))
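To see what the range version actually returns, here is a minimal base R sketch. It uses regexpr/regmatches with perl = TRUE instead of stringr, which is my substitution and not part of the answer above. The greedy {4,8} takes as many tokens as it can, but it still stops at the first token that \w+,?\s cannot match, such as "/" or "spinach;".

```r
# The three sample strings from the question, as a plain character vector.
x <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

# Same pattern as above; perl = TRUE is needed for the lookbehind in base R.
pattern <- "(?<=source:\\s)(\\w+,?\\s){4,8}"
m <- regexpr(pattern, x, perl = TRUE)

# Keep NA for strings without a match, fill in the trimmed matches otherwise.
res <- rep(NA_character_, length(x))
res[m != -1] <- trimws(regmatches(x, m))
res
# [1] "from animal origin as Vitamin A"
# [2] "Eggs, liver, certain fish species such as sardines,"
# [3] "Leafy green vegetables such as"
```

So the first string yields 6 words rather than 8, because the run of plain words ends at "/".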

Or, if specific counts are required, build one pattern per count:

pat <- sprintf('(?<=source:\\s)(\\w+,?\\s){%d}', c(8, 4))

Then extract the words with each pattern and coalesce the results:

library(dplyr)
do.call(coalesce, lapply(pat, function(y) trimws(stringr::str_extract(x$end, y))))
#[1] "from animal origin as"     
#[2] "Eggs, liver, certain fish species such as sardines,"
#[3] "Leafy green vegetables such"  
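For readers unfamiliar with coalesce: it takes, element by element, the first value that is not NA. A rough base R sketch of that idea follows. The helper name first_non_na and the sample vectors are made up for illustration, and this is not how dplyr implements it.

```r
# Hypothetical stand-ins for the results of the {8} and {4} patterns.
eight <- c(NA, "eight-word match", NA)
four  <- c("four-word match 1", "four-word match 2", "four-word match 3")

# Element-wise: keep a where it is not NA, otherwise fall back to b.
first_non_na <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

first_non_na(eight, four)
# [1] "four-word match 1" "eight-word match"  "four-word match 3"
```

This is why the patterns are ordered c(8, 4): the longer match wins when it exists, and the 4-word match is only a fallback.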

The question is: is there a regex to include the special characters (or bypass them), so I can still extract the needed words? I noticed that the same happens with other characters (eg - ) or with double white space.

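Side note: this is a kind of XY problem (https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem)

Your problem is not that the regex isn't working. Your problem is that the regex is working, but you expect something different. You use it to select the next 8 words after a certain string, but there are only 6 words before the non-word character (/), so the string does not match your pattern.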
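The point can be checked directly. The sketch below is base R with a hypothetical helper extract_n (not from the question) that applies the question's pattern with exactly n repetitions:

```r
# First sample string from the question.
s <- "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;"

# Return the match of the question's pattern with exactly n repetitions, or NA.
extract_n <- function(text, n) {
  pattern <- sprintf("(?<=source:\\s)(\\w+,?\\s){%d}", n)
  m <- regexpr(pattern, text, perl = TRUE)  # perl = TRUE enables the lookbehind
  if (m == -1) NA_character_ else trimws(regmatches(text, m))
}

extract_n(s, 6)  # "from animal origin as Vitamin A": six plain words precede "/"
extract_n(s, 7)  # NA: the seventh token is "/", which \w+ cannot match
```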

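So, to get an "answer" to your problem, you should first rework your question:

What exactly do you expect?

akrun's solution matches any run of 4 to 8 words, but I doubt that this is what you really need.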

Here is a base R option using regmatches + gsub:

lapply(regmatches(u <- gsub(".*?source:\\s+?","",x$end),gregexpr("\\w+",u)),`[`,1:4)

This gives

[[1]]
[1] "from"   "animal" "origin" "as"

[[2]]
[1] "Eggs"    "liver"   "certain" "fish"

[[3]]
[1] "Leafy"      "green"      "vegetables" "such"
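If single strings are wanted instead of word vectors, the same idea can be wrapped as below. This is only a sketch: first_words is a hypothetical helper name, and head() replaces the 1:4 index so that strings with fewer than n words do not produce NA entries. Since \w+ keeps only word characters, punctuation is dropped and "all-trans-Retinol" is split into three words.

```r
# The three sample strings from the question, as a plain character vector.
x <- c(
  "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
  "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
  "source: Leafy green vegetables such as spinach; egg yolks; liver"
)

# Strip everything up to "source:", collect runs of word characters,
# and paste the first n of them back together.
first_words <- function(text, n) {
  rest <- sub(".*?source:\\s+", "", text)
  words <- regmatches(rest, gregexpr("\\w+", rest))
  vapply(words, function(w) paste(head(w, n), collapse = " "), character(1))
}

first_words(x, 8)
# [1] "from animal origin as Vitamin A all trans"
# [2] "Eggs liver certain fish species such as sardines"
# [3] "Leafy green vegetables such as spinach egg yolks"
```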

You may rely on the \S shorthand character class, which matches any non-whitespace character:

(?<=source:\s)\S+(?:\s+\S+){3,7}\b

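See the regex demo. Details:

  • (?<=source:\s) - a position immediately preceded by source: and a whitespace character
  • \S+ - one or more non-whitespace characters
  • (?:\s+\S+){3,7} - three to seven occurrences of 1+ whitespace characters followed by 1+ non-whitespace characters
  • \b - a word boundary.

See the R demo online: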

library(stringr)
x <- data.frame(end = c("source: from animal origin as Vitamin A / alltrans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))
stringr::str_extract(x$end, "(?<=source:\\s)\\S+(?:\\s+\\S+){3,7}\\b")

Output:

[1] "from animal origin as Vitamin A / alltrans-Retinol"
[2] "Eggs, liver, certain fish species such as sardines"
[3] "Leafy green vegetables such as spinach; egg yolks"
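The pattern above can also be parameterized by the number of tokens. The following is a minimal base R sketch, not a definitive implementation. The helper extract_tokens and its key argument are hypothetical names, and it assumes key contains no regex metacharacters and is followed by exactly one whitespace character.

```r
# Extract up to n whitespace-separated tokens that follow `key`.
# Assumption: `key` is plain text without regex metacharacters.
extract_tokens <- function(text, n, key = "source:") {
  # \S+ is the first token; {0,n-1} allows up to n - 1 further tokens.
  pattern <- sprintf("(?<=%s\\s)\\S+(?:\\s+\\S+){0,%d}\\b", key, n - 1)
  m <- regexpr(pattern, text, perl = TRUE)  # perl = TRUE enables the lookbehind
  out <- rep(NA_character_, length(text))
  out[m != -1] <- regmatches(text, m)
  out
}

# "/" and "-" count as (parts of) tokens, and the final \b drops the trailing ":".
extract_tokens("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general", 8)
# [1] "from animal origin as Vitamin A / all-trans-Retinol"

# Double whitespace between tokens is absorbed by \s+.
extract_tokens("source: a  b - c", 4)
# [1] "a  b - c"
```

Because tokens are defined by whitespace rather than by \w, special characters no longer end the match early. They do count toward n, which the question says is acceptable.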