Extract letters between the 2nd and 3rd period in R
I have a vector named Identifier:
c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b",
"NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4"
)
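I want to extract OA.
I tried:
gsub(".*\\.(.*)\\..*", "\\1", Identifier)
Basically, I want to extract the text between the second and third periods. If there are only two periods (as in NC.1.OA), I want to extract everything after the second period.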
Here is an alternative to sub, using strsplit and sapply:
sapply(Identifier, function(x) unlist(strsplit(x, "\\."))[3])
    NC.1.OA   NC.1.OA.0   NC.1.OA.1 NC.1.OA.1.a NC.1.OA.1.b NC.1.OA.1.c
       "OA"        "OA"        "OA"        "OA"        "OA"        "OA"
  NC.1.OA.2 NC.1.OA.2.0   NC.1.OA.3   NC.1.OA.4
       "OA"        "OA"        "OA"        "OA"
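Repeat (non-periods followed by a period) twice, then capture the non-periods that follow. The substring you want is in that capture group: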
Identifier = c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b",
"NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4"
)
gsub("(?:[^.]+\\.){2}([^.]+).*", "\\1", Identifier)
Output:
[1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"
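Specifically, (?:[^.]+\\.) is a group that matches one or more non-period characters followed by a single period. The {2} after the group means the preceding token (the group) is repeated twice, i.e. "non-periods, followed by a period, followed by non-periods, followed by a period." Finally, ([^.]+) matches as many non-period characters as possible after the second period, which captures the text between the second period and the third period (or the end of the string). The trailing .* consumes the rest of the string, so the whole string is replaced by the captured group.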
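To see that the pattern picks out the third field whatever it contains, here is a minimal sketch on hypothetical identifiers (the `ids` vector is not from the question). It makes three assumptions of its own: the pattern is anchored with ^, sub is used because each string needs only one replacement, and perl = TRUE is set so the non-capturing group is handled by PCRE.

```r
# Hypothetical identifiers whose third field differs from element to element
ids <- c("NC.1.OA", "NC.2.NBT.3", "NC.K.G.1.a")

# ^ anchors at the start; (?:[^.]+\\.){2} skips the first two fields;
# ([^.]+) captures the third; .* consumes whatever is left
third_field <- sub("^(?:[^.]+\\.){2}([^.]+).*", "\\1", ids, perl = TRUE)

print(third_field)
```

A string with fewer than two periods does not match the pattern, and sub returns it unchanged, so such inputs may need a separate check.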
We can try stringr:
Identifier = c("NC.1.OA", "NC.1.OA.0", "NC.1.OA.1", "NC.1.OA.1.a", "NC.1.OA.1.b",
"NC.1.OA.1.c", "NC.1.OA.2", "NC.1.OA.2.0", "NC.1.OA.3", "NC.1.OA.4"
)
library(stringr)
str_extract(Identifier, ".OA.")
# [1] NA ".OA." ".OA." ".OA." ".OA." ".OA." ".OA." ".OA." ".OA." ".OA."
str_extract(Identifier, "OA")
# [1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"
gsub('\\.', '', str_extract(Identifier, ".OA.?"))
# [1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"
regmatches(Identifier, gregexpr("OA", Identifier))
If you need a vector, wrap the call in unlist (see ?unlist):
unlist(
regmatches(Identifier, gregexpr("OA", Identifier))
)
# [1] "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA" "OA"
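The calls above match the literal text OA, so they only help when the wanted value is known in advance. As a hedged sketch of a position-based variant, the example below combines regmatches with regexpr and the PCRE \K escape. The `ids` vector is hypothetical sample data, and the pattern is an assumption rather than part of the original answer.

```r
# Hypothetical identifiers whose third field differs from element to element
ids <- c("NC.1.OA", "NC.2.NBT.3", "NC.K.G.1.a")

# \\K discards everything matched so far, so only the third field is reported
m <- regexpr("^(?:[^.]+\\.){2}\\K[^.]+", ids, perl = TRUE)
third_field <- regmatches(ids, m)

print(third_field)
```

Note that regmatches drops elements that have no match, so the result can be shorter than the input.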