Partial extraction of unstructured data on 2nd instance

Question

I have a huge text file from EDGAR. I only want to extract part of the text, from the business risk section.

For example, if the text is as follows:

Bshehebvegegeveghdhebejejrjbfbfk

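I want to extract the text that starts at he (second instance) and ends at ge (second instance).

So my output would be: hebvegege

I would like the code in R. I am particularly interested in the business risk section.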

Answer 1

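One option is to use gregexpr to find the index of the first character of every match of the patterns 'he' and 'ge', then pass the positions of the second matches to substr as start and stop to extract the substring. Because gregexpr reports where a match begins, 1 is added to the 'ge' index so that the substring ends on the 'e'.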

i1 <- gregexpr("he", str1)[[1]][2]
i2 <- gregexpr("ge", str1)[[1]][2] +1
substr(str1, i1, i2)
#[1] "hebvegege"
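The two steps above can be wrapped in a small function that also copes with a pattern that occurs fewer times than requested. This is only a sketch: extract_between and its arguments are hypothetical names, not part of the original answer, and it treats both patterns as literal strings (fixed = TRUE) by default.

```r
# Hypothetical helper: text from the n-th match of start_pat through the
# end of the m-th match of end_pat; NA if either match does not exist.
extract_between <- function(x, start_pat, end_pat, n = 2, m = 2, fixed = TRUE) {
  starts <- gregexpr(start_pat, x, fixed = fixed)[[1]]
  ends   <- gregexpr(end_pat, x, fixed = fixed)[[1]]
  # gregexpr returns -1 when the pattern is not found at all
  if (starts[1] == -1 || ends[1] == -1 ||
      length(starts) < n || length(ends) < m) {
    return(NA_character_)
  }
  i1 <- starts[n]
  # move from the first character of the end match to its last character
  i2 <- ends[m] + attr(ends, "match.length")[m] - 1
  if (i2 < i1) return(NA_character_)
  substr(x, i1, i2)
}

str1 <- "Bshehebvegegeveghdhebejejrjbfbfk"
extract_between(str1, "he", "ge")
#[1] "hebvegege"
```

Using the match.length attribute instead of a hard-coded +1 keeps the end position right when the end pattern is longer than two characters.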

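Or in a single step, using a lookbehind so that the second pattern matches the e that directly follows a g: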

do.call(substr, c(str1, lapply(c("he", "(?<=g)e"), 
     function(pat) gregexpr(pat, str1, perl=TRUE)[[1]][2]) ))
#[1] "hebvegege"

where str1 is the example string from the question:

str1 <- "Bshehebvegegeveghdhebejejrjbfbfk"
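For the business risk section itself, the same idea might look like the sketch below. Everything in it is an assumption for illustration: the toy filing text, the heading strings, and the premise that each heading appears once in a table of contents and a second time at the section itself. Real filings vary, so the patterns and the instance number would need checking against the actual file.

```r
# Toy stand-in for a filing; real EDGAR text and headings will differ.
filing <- paste(
  "Table of contents: Item 1A. Risk Factors ... Item 1B. Unresolved Staff Comments",
  "Item 1. Business. We make widgets.",
  "Item 1A. Risk Factors. Demand for widgets may fall.",
  "Item 1B. Unresolved Staff Comments. None.",
  sep = "\n"
)

# Assumed headings, matched literally (fixed = TRUE) so "." is not a wildcard
start_pat <- "Item 1A. Risk Factors"
end_pat   <- "Item 1B. Unresolved Staff Comments"

# Second instance of each heading, skipping the table-of-contents mention
i1 <- gregexpr(start_pat, filing, fixed = TRUE)[[1]][2]
# Stop one character before the next section's heading
i2 <- gregexpr(end_pat, filing, fixed = TRUE)[[1]][2] - 1

risk <- trimws(substr(filing, i1, i2))
risk
#[1] "Item 1A. Risk Factors. Demand for widgets may fall."
```

If a heading occurs fewer than two times, the [2] index yields NA and substr returns NA, so the result should be checked before it is used.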
