Using gsub or sub function to only get part of a string?
Col
WBU-ARGU*06:03:04
WBU-ARDU*08:01:01
WBU-ARFU*11:03:05
WBU-ARFU*03:456
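I have a column with 75 rows of values like the ones in Col above. I am not sure how to use gsub or sub to keep everything up to and including the integer after the first colon.
Expected output: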
Col
WBU-ARGU*06:03
WBU-ARDU*08:01
WBU-ARFU*11:03
WBU-ARFU*03:456
I tried this, but it does not seem to work:
gsub("*..:","", df$col)
You can use
df$col <- sub("(\\d:\\d+):\\d+$", "\\1", df$col)
Details
(\d:\d+) - Capturing group 1 (its value is accessible via \1 in the replacement pattern): a digit, a colon, and 1+ digits
: - a colon
\d+ - 1+ digits
$ - end of string.
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("(\\d:\\d+):\\d+$", "\\1", col)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
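As a minimal sketch of the same substitution applied to a data frame column: the names `df`, `col`, `trimmed`, and `expected` are illustrative assumptions chosen to match the question. It also shows that backslashes have to be doubled inside an ordinary R string literal.

```r
# Illustrative data frame; `df` and `col` are assumed names matching the question.
df <- data.frame(
  col = c("WBU-ARGU*06:03:04", "WBU-ARDU*08:01:01",
          "WBU-ARFU*11:03:05", "WBU-ARFU*03:456"),
  stringsAsFactors = FALSE
)

# In an R string literal the backslash itself must be escaped, hence "\\d" and "\\1".
trimmed <- sub("(\\d:\\d+):\\d+$", "\\1", df$col)

# Values without a second colon-delimited field do not match and are left unchanged.
expected <- c("WBU-ARGU*06:03", "WBU-ARDU*08:01",
              "WBU-ARFU*11:03", "WBU-ARFU*03:456")
stopifnot(identical(trimmed, expected))
```

The last element has only one colon, so the pattern does not match it and `sub` returns it as-is.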
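An alternative approach:
df$col <- sub("^(.*?:\\d+).*", "\\1", df$col)
Here,
^ - start of string
(.*?:\d+) - Group 1: any 0+ chars, as few as possible (due to the lazy *? quantifier), then : and 1+ digits
.* - the rest of the string.
However, it should be used with the PCRE regex engine, by passing perl=TRUE: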
col <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456")
sub("^(.*?:\\d+).*", "\\1", col, perl=TRUE)
## => [1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
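To illustrate why the lazy quantifier matters here, the following sketch compares it with a greedy variant on a single value. The greedy pattern is not part of the answer above; it is included only for contrast, and the variable names `x`, `lazy`, and `greedy` are illustrative.

```r
x <- "WBU-ARGU*06:03:04"

# Lazy .*? stops at the first colon, so group 1 ends after the first ":digits".
lazy <- sub("^(.*?:\\d+).*", "\\1", x, perl = TRUE)

# Greedy .* runs to the last colon, so group 1 captures the whole string.
greedy <- sub("^(.*:\\d+).*", "\\1", x, perl = TRUE)

stopifnot(lazy == "WBU-ARGU*06:03")
stopifnot(greedy == "WBU-ARGU*06:03:04")
```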
sub("(\\d+:\\d+):\\d+$", "\\1", df$Col)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456"
Or with stringi:
Match what you want (instead of replacing what you don't want)
stringi::stri_extract_first(df$Col, regex = "[A-Z-\\*]+\\d+:\\d+")
Slightly more concise with stringr:
stringr::str_extract(df$Col, "[A-Z-\\*]+\\d+:\\d+")
# or
stringr::str_extract(df$Col, "[\\w-*]+\\d+:\\d+")
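If adding a package is not an option, the same "match what you want" idea can be sketched in base R with `regexpr` and `regmatches`. This is a swapped-in base R equivalent rather than something from the answers above; the pattern is a slight rewrite of the character class, and the names `Col`, `pattern`, and `extracted` are illustrative.

```r
Col <- c("WBU-ARGU*06:03:04", "WBU-ARDU*08:01:01",
         "WBU-ARFU*11:03:05", "WBU-ARFU*03:456")

# Uppercase letters, "*" or "-", then digits, a colon, and more digits.
pattern <- "[A-Z*-]+\\d+:\\d+"

# regexpr() locates the first match in each element; regmatches() pulls it out.
extracted <- regmatches(Col, regexpr(pattern, Col))

stopifnot(identical(
  extracted,
  c("WBU-ARGU*06:03", "WBU-ARDU*08:01", "WBU-ARFU*11:03", "WBU-ARFU*03:456")
))
```

Note that `regmatches` silently drops elements with no match, so the result can be shorter than the input when some values do not fit the pattern.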
The following may also help you.
sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
The output is as follows.
> sub("([^:]*):([^:]*).*","\\1:\\2",df$dat)
[1] "WBU-ARGU*06:03" "WBU-ARDU*08:01" "WBU-ARFU*11:03" "WBU-ARFU*03:456b"
Where the input for the data frame is as follows.
dat <- c("WBU-ARGU*06:03:04","WBU-ARDU*08:01:01","WBU-ARFU*11:03:05","WBU-ARFU*03:456b")
df <- data.frame(dat)
Explanation: the annotated breakdown below is for illustration only and is not meant to be run as-is.
sub("        ## sub() replaces only the first match in each element (gsub() would replace every match).
([^:]*)      ## Capture group 1: everything before the first colon.
:            ## A literal colon.
([^:]*)      ## Capture group 2: everything up to the next colon or the end of the string.
.*"          ## Match the rest of the string so that the replacement drops it.
,"\\1:\\2"   ## Replacement: group 1, a colon, then group 2.
,df$dat)     ## The dat column of the data frame df.
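Since comments cannot sit inside a string literal, one way to keep the annotations and still have runnable code is to assemble the pattern with `paste0`. This is only a sketch; `pattern` and `result` are illustrative names.

```r
dat <- c("WBU-ARGU*06:03:04", "WBU-ARDU*08:01:01",
         "WBU-ARFU*11:03:05", "WBU-ARFU*03:456b")

pattern <- paste0(
  "([^:]*)",  # group 1: everything before the first colon
  ":",        # the first colon itself
  "([^:]*)",  # group 2: everything up to the next colon (or end of string)
  ".*"        # whatever remains, which the replacement discards
)

result <- sub(pattern, "\\1:\\2", dat)

stopifnot(identical(
  result,
  c("WBU-ARGU*06:03", "WBU-ARDU*08:01", "WBU-ARFU*11:03", "WBU-ARFU*03:456b")
))
```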