How to extract text that has been added to a string in R
I have a string with a known format, for example:
"This string will have additional text here *, and it will have more here ^, and finally there will be more here ~ with some text after."
A piece of data would be:
"This string will have additional text here about things, and it will have more here regarding other stuff, and finally there will be more here near the end with some text after."
where the inserted text will not always be the same length. I need a way to identify what each of *, ^, and ~ stands for in the second string:
* = "about things"
^ = "regarding other stuff"
~ = "near the end"
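The inserted text in the new string won't be delimited by anything, but the template string should have enough unique text between each of the optional bits that you can identify it every time.
I have tried looking around, but can't find anything similar to what I'm asking, even to get started. Any packages or functions would be really helpful!
I don't know whether it's the best solution, but I would replace the known parts with a delimiter (or with nothing at the beginning and the end), and then split the resulting text by that delimiter.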
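text <- "This string will have additional text here about things, and it will have more here regarding other stuff, and finally there will be more here near the end with some text after."
# fixed = TRUE matches each known part literally, so "." is not treated as a regex wildcard
temp <- gsub("This string will have additional text here ", "", text, fixed = TRUE)
# Turn the known middle parts into a "^" delimiter (assumes "^" never occurs in the inserted text)
temp <- gsub(", and it will have more here ", "^", temp, fixed = TRUE)
temp <- gsub(", and finally there will be more here ", "^", temp, fixed = TRUE)
temp <- gsub(" with some text after.", "", temp, fixed = TRUE)
# Split on the literal "^" delimiter
solution <- unlist(strsplit(temp, "^", fixed = TRUE))
solution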
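The replace-and-split idea above can also be driven by the template string itself, so the known parts don't have to be typed out by hand. The following is a minimal base R sketch, not part of the original answer: `extract_inserts` is a hypothetical helper name, and it assumes each placeholder is a single character that occurs exactly once in the template and that the fixed parts never contain the sequence `\E`.

```r
# Hypothetical helper (illustration only): recover the text that replaced
# each placeholder by turning the template into a regular expression.
extract_inserts <- function(template, text, placeholders = c("*", "^", "~")) {
  # Position of each placeholder in the template (literal search, no regex)
  pos <- vapply(
    placeholders,
    function(p) as.integer(regexpr(p, template, fixed = TRUE)),
    integer(1)
  )
  stopifnot(all(pos > 0))
  ord <- order(pos)
  pos <- pos[ord]

  # The fixed parts are the stretches of template between the placeholders
  starts <- c(1, pos + 1)
  ends <- c(pos - 1, nchar(template))
  fixed_parts <- substring(template, starts, ends)

  # Quote each fixed part with \Q...\E (PCRE) and join with lazy capture groups
  pattern <- paste0(
    "^",
    paste0("\\Q", fixed_parts, "\\E", collapse = "(.*?)"),
    "$"
  )

  m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)  # text does not follow the template

  # Drop the full match and label each capture with its placeholder
  result <- m[-1]
  names(result) <- placeholders[ord]
  result
}

template <- "This string will have additional text here *, and it will have more here ^, and finally there will be more here ~ with some text after."
text <- "This string will have additional text here about things, and it will have more here regarding other stuff, and finally there will be more here near the end with some text after."
res <- extract_inserts(template, text)
res
```

Because the groups are lazy and the pattern is anchored at both ends, this still depends on the fixed parts being distinctive enough, which is the same condition the question already states.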
Just a slight variation using the stringr package, which keeps the known parts and their replacements (visually) closer together.
library(stringr)
text <- "This string will have additional text here about things, and it will have more here regarding other stuff, and finally there will be more here near the end with some text after."
text_repl <-
  str_replace_all(
    text,
    c(
      "This string will have additional text here " = "",
      ", and it will have more here " = "^",
      ", and finally there will be more here " = "^",
      # names are regex patterns, so the final "." is escaped
      " with some text after\\." = ""
    )
  )
# "^" is a regex metacharacter, so it has to be escaped in the split pattern
str_split(text_repl, "\\^", simplify = TRUE)
#>      [,1]           [,2]                    [,3]
#> [1,] "about things" "regarding other stuff" "near the end"
str_split() returns a list of character vectors (simplify = FALSE) or a character matrix (simplify = TRUE), which can easily be converted to a data.frame.
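As a rough illustration of that conversion, here is a small sketch that starts from two already-delimited strings. The second string and the column names `star`, `caret`, and `tilde` are made up for the example; they are not part of the original answer.

```r
library(stringr)

# Two strings after the known parts were replaced by "^" (second one is invented)
text_repl <- c(
  "about things^regarding other stuff^near the end",
  "on this topic^to follow up^to finish"
)

# fixed("^") splits on the literal character, so no escaping is needed
parts <- str_split(text_repl, fixed("^"), simplify = TRUE)

# One row per input string, one column per placeholder
df <- as.data.frame(parts, stringsAsFactors = FALSE)
names(df) <- c("star", "caret", "tilde")
df
```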
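Maybe you could look at the unique word patterns that come before and after the ~, *, and ^ placeholders, and put them into vectors like this: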
priorstrings <- c("text here", "have more here", "be more here")
afterstrings <- c("and it", "and finally", "with some")
Then check that these really are unique by confirming that
length(unique(priorstrings)) == length(priorstrings)
length(unique(afterstrings)) == length(afterstrings)
both evaluate to TRUE.
Then paste them together with a lazy capture group in between, like this:
fullsearches <- paste0(priorstrings, " (.*? )" , afterstrings)
I used your example string again, naming it y, and added another one named z:
y <- "This string will have additional text here about things, and it will have more here regarding other stuff, and finally there will be more here near the end with some text after."
z <- "This string will have additional text here on this topic, and it will have more here to follow up, and finally there will be more here to finish with some text after."
Then, finally, do something like this:
library(stringr)  # for str_match()
sapply(list(y,z), function(x) str_match(x, fullsearches)[,2])
This gives:
     [,1]                      [,2]
[1,] "about things, "          "on this topic, "
[2,] "regarding other stuff, " "to follow up, "
[3,] "near the end "           "to finish "
I think you could add more priorstrings, afterstrings, and fullsearches this way, and apply it to a larger list of strings.
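The captures above keep the trailing comma and space from the template. One possible way to clean them up and collect everything in a data frame is sketched below; the `sub()` cleanup step and the column names are assumptions added for illustration, not part of the original answer.

```r
library(stringr)

priorstrings <- c("text here", "have more here", "be more here")
afterstrings <- c("and it", "and finally", "with some")
fullsearches <- paste0(priorstrings, " (.*? )", afterstrings)

texts <- c(
  "This string will have additional text here about things, and it will have more here regarding other stuff, and finally there will be more here near the end with some text after.",
  "This string will have additional text here on this topic, and it will have more here to follow up, and finally there will be more here to finish with some text after."
)

# For each text, take the capture group of every pattern and
# strip an optional trailing comma plus any trailing whitespace
extracted <- sapply(
  texts,
  function(x) sub(",?\\s*$", "", str_match(x, fullsearches)[, 2]),
  USE.NAMES = FALSE
)

# sapply() returns one column per text; transpose to get one row per text
result <- as.data.frame(t(extracted), stringsAsFactors = FALSE)
names(result) <- c("star", "caret", "tilde")
result
```

Note that the cleanup would also strip a comma that genuinely ends an inserted value, so it only suits data where that cannot happen.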