Extract certain number of words or special characters after a string in R
I am trying to extract a certain number of words after a specific string.
library(stringr)
x <- data.frame(end = c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))
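For example, to extract the 4 words after "source", I use this code, which I learned from another question:
trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4}'))
This works well. However, if I try to select 8 words, it does not recognize the "/" in the first string and returns NA.
trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){8}'))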
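The question is: is there a regex to include the special characters (or bypass them), so I can still extract the needed words? I noticed that the same happens with other characters (e.g. - ) or with double white space.
The expected output for 8 words would be something like this: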
from animal origin as Vitamin A / all-trans-Retinol
It does not matter whether / and - are counted as words, because I can always raise the quantifier (in my case, I don't mind extracting more than I need).
Thanks
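We can specify a range with {4,8}
trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4,8}'))
Or, if specific counts are needed, loop over those counts
pat <- sprintf('(?<=source:\\s)(\\w+,?\\s){%d}', c(8, 4))
Then extract the words with each pattern and coalesce the results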
library(dplyr)
do.call(coalesce, lapply(pat, function(y) trimws(stringr::str_extract(x$end, y))))
#[1] "from animal origin as"
#[2] "Eggs, liver, certain fish species such as sardines,"
#[3] "Leafy green vegetables such"
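For comparison, here is a minimal base R sketch of what the {4,8} range returns on the same three strings. It assumes regexpr/regmatches with perl = TRUE as a stand-in for str_extract, and the object name res is only for illustration. The greedy quantifier takes as many "word plus whitespace" units as it can, up to 8, and stops at the first token that is not a plain word.

```r
# Same sample strings as in the question
x <- c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;",
       "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake",
       "source: Leafy green vegetables such as spinach; egg yolks; liver")

# Between 4 and 8 repetitions of: word, optional comma, one whitespace
pat <- "(?<=source:\\s)(\\w+,?\\s){4,8}"

# perl = TRUE is needed for the lookbehind; every string has at least 4 words,
# so regmatches() returns one match per element here
res <- trimws(regmatches(x, regexpr(pat, x, perl = TRUE)))
res
# [1] "from animal origin as Vitamin A"
# [2] "Eggs, liver, certain fish species such as sardines,"
# [3] "Leafy green vegetables such as"
```

Note that the match length differs per string (6, 8 and 5 words), which is the trade-off of using a range instead of a fixed count.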
The question is: is there a regex to include the special characters (or bypass them), so I can still extract the needed words? I noticed that the same happens with other characters (eg - ) or with double white space.
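Side note: this is a kind of XY problem (https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem)
Your problem is not that the regex is not working. The regex is working, but you expect something different. You use it to select the next 8 words after a certain string, but there are only 6 words before the non-word token (/), so the string does not match your pattern.
So, to get an "answer" to your problem, you should first rework your question:
What exactly do you expect?
akrun's solution matches any run of 4 to 8 words, but I doubt that this is what you really need.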
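The "only 6 words before the /" point can be checked directly. The following is a small base R sketch, where the helper name matches_n is made up for illustration: it tests the question's pattern with a fixed count from 4 to 8 against the first sample string.

```r
s <- "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;"

# Hypothetical helper: does the question's pattern match with exactly n repetitions?
matches_n <- function(n) {
  pat <- sprintf("(?<=source:\\s)(\\w+,?\\s){%d}", n)
  grepl(pat, s, perl = TRUE)
}

ok <- vapply(4:8, matches_n, logical(1))
ok
# [1]  TRUE  TRUE  TRUE FALSE FALSE
```

Counts 4 to 6 match, while 7 and 8 fail because the seventh token is "/", which \w+ cannot match.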
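Here is a base R option using regmatches + gsub
# drop everything up to and including "source:" and the whitespace after it
u <- gsub(".*?source:\\s+?", "", x$end)
# collect all word tokens per string and keep the first four
lapply(regmatches(u, gregexpr("\\w+", u)), `[`, 1:4)
This gives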
[[1]]
[1] "from" "animal" "origin" "as"
[[2]]
[1] "Eggs" "liver" "certain" "fish"
[[3]]
[1] "Leafy" "green" "vegetables" "such"
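If a single string per row is preferred over a list of tokens, the same idea can be wrapped in a small function. This is only a sketch: the name first_words and its argument n are made up here, and head() is used instead of `[` so that strings with fewer than n words are not padded with NA.

```r
# Hypothetical helper: first n word tokens after "source:", pasted back together
first_words <- function(x, n) {
  rest <- sub(".*?source:\\s+", "", x, perl = TRUE)
  tokens <- regmatches(rest, gregexpr("\\w+", rest))
  vapply(tokens, function(w) paste(head(w, n), collapse = " "), character(1))
}

first_words("source: Eggs, liver, certain fish species", 4)
# [1] "Eggs liver certain fish"
first_words("source: from animal origin as Vitamin A / all-trans-Retinol: Fish", 8)
# [1] "from animal origin as Vitamin A all trans"
```

Because only \w+ tokens are kept, punctuation such as "," and "/" is dropped and "all-trans-Retinol" is split into three words, so this variant bypasses the special characters rather than including them.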
You may rely on the \S shorthand character class, which matches any non-whitespace character:
(?<=source:\s)\S+(?:\s+\S+){3,7}\b
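See the regex demo. Details:
(?<=source:\s) - a position immediately preceded by source: and a whitespace character
\S+ - one or more non-whitespace characters
(?:\s+\S+){3,7} - three to seven occurrences of 1+ whitespace characters followed by 1+ non-whitespace characters
\b - a word boundary.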
library(stringr)
x <- data.frame(end = c("source: from animal origin as Vitamin A / alltrans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))
stringr::str_extract(x$end, "(?<=source:\\s)\\S+(?:\\s+\\S+){3,7}\\b")
Output:
[1] "from animal origin as Vitamin A / alltrans-Retinol"
[2] "Eggs, liver, certain fish species such as sardines"
[3] "Leafy green vegetables such as spinach; egg yolks"
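The sample data in this answer spells the first string as "alltrans-Retinol". As a quick check that the pattern also copes with the hyphenated spelling from the question and with double white space, here is a base R sketch. It assumes regexpr/regmatches with perl = TRUE in place of str_extract, and the second input string is invented for the test.

```r
pat <- "(?<=source:\\s)\\S+(?:\\s+\\S+){3,7}\\b"

# First string from the question, with the original "all-trans-Retinol"
s1 <- "source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;"
# Made-up string with a double space and a stand-alone hyphen
s2 <- "source: one  two - three four five"

r1 <- regmatches(s1, regexpr(pat, s1, perl = TRUE))
r2 <- regmatches(s2, regexpr(pat, s2, perl = TRUE))
r1
# [1] "from animal origin as Vitamin A / all-trans-Retinol"
r2
# [1] "one  two - three four five"
```

The first result is the expected output stated in the question. The trailing \b makes the match back off from the final ":" so that it ends on a word character.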