Extract the letters between the 2nd and 3rd period in R

I have a vector named Identifier:

c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b", 
"NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4"
)

I want to extract OA.

I tried:

gsub(".*\\.(.*)\\..*", "\\1", Identifier)

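Basically, I want to extract the text between the second and third period. If there are only two periods (NC.1.OA), I want to extract everything after the second period.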

Here is an alternative to sub, using strsplit and sapply:

sapply(Identifier, function(x) unlist(strsplit(x, "\\."))[3])

NC.1.OA   NC.1.OA.0   NC.1.OA.1 NC.1.OA.1.a NC.1.OA.1.b NC.1.OA.1.c 
    "OA"        "OA"        "OA"        "OA"        "OA"        "OA" 
NC.1.OA.2 NC.1.OA.2.0   NC.1.OA.3   NC.1.OA.4 
    "OA"        "OA"        "OA"        "OA" 

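As a variation on the strsplit approach, here is a minimal sketch that splits on a literal period with fixed = TRUE (so no regex escaping is needed) and uses vapply to get an unnamed character vector. The shortened Identifier vector and the names parts and third are only for illustration.

```r
Identifier <- c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1.a")

# Split each string on a literal period; fixed = TRUE treats "." as plain text
parts <- strsplit(Identifier, ".", fixed = TRUE)

# Take the third piece of each split; vapply guarantees a character vector
# and yields NA for strings with fewer than three pieces
third <- vapply(parts, `[`, character(1), 3)
third
# [1] "OA" "OA" "OA"
```
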
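Repeat (non-periods followed by a period) twice, then capture the next run of non-periods; the substring you want is in that capture group: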

Identifier = c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b", 
"NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4"
)
gsub("(?:[^.]+\\.){2}([^.]+).*", "\\1", Identifier)

Output:

[1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"

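Specifically, (?:[^.]+\.) is a non-capturing group that matches one or more non-period characters followed by a single period. The {2} after the group means the preceding token (the group) is repeated twice, that is, "non-periods, followed by a period, followed by non-periods, followed by a period." The final ([^.]+) then matches as many non-period characters as possible after the second period, which is exactly the text between the second period and the third period (or the end of the string). The trailing .* consumes the rest of the string so that the whole input is replaced by the captured group.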
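To make the behaviour concrete, here is a small sketch of the same pattern applied with sub. One thing worth noting: when a string contains fewer than two periods the pattern does not match, so sub returns the input unchanged rather than NA. The input "NC.1" and the names pattern, x and result are made up for illustration, and perl = TRUE is an assumption added here so the non-capturing group is handled by the PCRE engine.

```r
pattern <- "(?:[^.]+\\.){2}([^.]+).*"
x <- c("NC.1.OA", "NC.1.OA.1.a", "NC.1")

# Replace the whole match with the first capture group;
# strings that do not match are returned as-is
result <- sub(pattern, "\\1", x, perl = TRUE)
result
# [1] "OA"   "OA"   "NC.1"
```

If unmatched inputs should become NA instead, one option is to check them first with grepl(pattern, x, perl = TRUE).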

We can also try stringr:

Identifier = c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b", 
               "NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4"
)
library(stringr)
str_extract(Identifier, ".OA.")
# [1] NA     ".OA." ".OA." ".OA." ".OA." ".OA." ".OA." ".OA." ".OA." ".OA."
str_extract(Identifier, "OA")
# [1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"
gsub('\\.', '', str_extract(Identifier, ".OA.?"))
# [1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"
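# Base R alternative; this returns a list with one element per input string
regmatches(Identifier, gregexpr("OA", Identifier))

If you need a vector, wrap it in unlist (see ?unlist):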

unlist(
    regmatches(Identifier, gregexpr("OA", Identifier))
)
# [1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"
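
The calls above search for the literal text "OA", so they only work when the third field is known in advance. As a hedged sketch of a more general base R variant, regexec records capture groups, and regmatches can then return whatever sits between the second and third period. The shortened Identifier vector and the names m and captured are illustrative only.

```r
Identifier <- c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1.a")

# regexec() records the capture group as well as the full match
m <- regmatches(Identifier, regexec("^[^.]+\\.[^.]+\\.([^.]+)", Identifier))

# In each element, position 1 is the full match and position 2 is the
# first capture group; strings with no match give NA
captured <- vapply(m, function(v) v[2], character(1))
captured
# [1] "OA" "OA" "OA"
```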