How to separate a column in dplyr based on regex
I have the following data frame:
df <- structure(list(X2 = c("BB_137.HVMSC", "BB_138.combined.HVMSC",
"BB_139.combined.HVMSC", "BB_140.combined.HVMSC", "BB_141.HVMSC",
"BB_142.combined.HMSC-bm")), .Names = "X2", row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
It looks like this:
> df
# A tibble: 6 x 1
X2
<chr>
1 BB_137.HVMSC
2 BB_138.combined.HVMSC
3 BB_139.combined.HVMSC
4 BB_140.combined.HVMSC
5 BB_141.HVMSC
6 BB_142.combined.HMSC-bm
What I want to do is split it into two columns (using . as the separator), keeping the last field as the second column:
col1 col2
BB_137 HVMSC
BB_138.combined HVMSC
BB_139.combined HVMSC
BB_140.combined HVMSC
BB_141 HVMSC
BB_142.combined HMSC-bm
What is the right way to do this?
My attempt was this:
> df %>% separate(X2, into = c("sid","status", "tiss"), sep = "[.]")
# A tibble: 6 x 3
sid status tiss
* <chr> <chr> <chr>
1 BB_137 HVMSC <NA>
2 BB_138 combined HVMSC
3 BB_139 combined HVMSC
4 BB_140 combined HVMSC
5 BB_141 HVMSC <NA>
6 BB_142 combined HMSC-bm
Warning message:
Too few values at 2 locations: 1, 5
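We can use a negative lookahead as the separator in the separate() function, so that the split happens only at the last dot. Note that the backslashes must be doubled inside an R string.
library(tidyr)
separate(data = df, col = X2, into = c("col1", "col2"), sep = "(\\.)(?!.*\\.)")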
# col1 col2
# <chr> <chr>
#1 BB_137 HVMSC
#2 BB_138.combined HVMSC
#3 BB_139.combined HVMSC
#4 BB_140.combined HVMSC
#5 BB_141 HVMSC
#6 BB_142.combined HMSC-bm
The regex is taken from this answer.
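To see what the lookahead does on its own, here is a minimal base R sketch that applies the same pattern with strsplit(). The names x, last_dot, parts and res are illustrative and not part of the original answer, and perl = TRUE is needed because base R's default regex engine does not support lookahead.

```r
# A few of the sample values from the question.
x <- c("BB_137.HVMSC", "BB_138.combined.HVMSC", "BB_142.combined.HMSC-bm")

# "\\.(?!.*\\.)" matches a literal dot that is NOT followed by another dot
# anywhere later in the string, i.e. only the last dot.
last_dot <- "\\.(?!.*\\.)"

# perl = TRUE enables PCRE, which supports the (?! ... ) lookahead.
parts <- strsplit(x, last_dot, perl = TRUE)

# Each element has exactly two pieces, so bind them into a 2-column matrix.
res <- do.call(rbind, parts)
colnames(res) <- c("col1", "col2")
res
```

Every earlier dot is followed by at least one more dot, so the lookahead rejects it and only the final one is used as the split point.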
We can also use tidyr::extract():
extract(df, X2, c("col1","col2"), "(.*)\\.(H.*)")
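The extract() pattern above relies on the last field starting with H, which happens to hold for this data. As a hedged variation, the sketch below anchors on "the final run of non-dot characters" instead, so it should not depend on that letter. The small data frame and the name out are illustrative, and the pattern is an assumption of mine rather than something from the original answer.

```r
library(tidyr)

# A reduced copy of the question's data, rebuilt here so the block is self-contained.
df <- data.frame(
  X2 = c("BB_137.HVMSC", "BB_138.combined.HVMSC", "BB_142.combined.HMSC-bm"),
  stringsAsFactors = FALSE
)

# ^(.*)     greedy: everything up to the last dot
# \\.       the last literal dot
# ([^.]+)$  the final field, which contains no dots
out <- tidyr::extract(df, X2, into = c("col1", "col2"),
                      regex = "^(.*)\\.([^.]+)$")
out
```

Rows that contain no dot at all would not match this pattern and would come back as NA in both columns.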