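Remove string in parentheses and add that as a new column
Possible duplicate: Here
I have a data frame with two columns. I want to pull out the strings in parentheses and add them as a new column. The data frame is shown below.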
structure(list(ID = 1:12, Gene.Name = structure(c(3L, 11L, 9L,
5L, 1L, 8L, 2L, 4L, 6L, 12L, 10L, 7L), .Label = c(" ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA",
" heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA", " NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA",
" ribosomal protein L34 (RPL34), transcript variant 1, mRNA",
" ribosomal protein S11 (RPS11), mRNA", "ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA",
"clone MGC:10120 IMAGE:3900723, mRNA, complete cds", "cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA",
"farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA", "homeobox protein from AL590526 (LOC84528), mRNA",
"mitochondrial S33 (MRPS33), transcript variant 1, nuclear gene, mRNA",
"ribosomal protein S15a (RPS15A), mRNA"), class = "factor")), .Names = c("ID",
"Gene.Name"), row.names = c(NA, -12L), class = "data.frame")
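If no string in parentheses is found, leave that row empty. I have two cases:
1) Get all the strings in parentheses and add them as a new column, separated by ","
2) Get only the last string in parentheses and add it as a new column
I tried something like df$Symbol <- sapply(df, function(x) sub("\\).*", "", sub(".*\\(", "", x)))
but it did not give the expected output.
Case 1 output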
ID Gene.Name Symbol
1 NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA ubiquinone, (9kD, MLRQ),NDUFA4
2 mitochondrial S33 (MRPS33), transcript variant 1, nuclear gene, mRNA MRPS33
3 farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA FDFT1
4 ribosomal protein S11 (RPS11), mRNA RPS11
5 ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA oligomycin sensitivity conferring protein,ATP5O
6 cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA CMAS
7 heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA HNRPF
8 ribosomal protein L34 (RPL34), transcript variant 1, mRNA RPL34
9 ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA subunit 9,ATP5G3
10 ribosomal protein S15a (RPS15A), mRNA RPS15A
11 homeobox protein from AL590526 (LOC84528), mRNA LOC84528
12 clone MGC:10120 IMAGE:3900723, mRNA, complete cds NA
Case 2 output
ID Gene.Name Symbol
1 NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA NDUFA4
2 mitochondrial S33 (MRPS33), transcript variant 1, nuclear gene, mRNA MRPS33
3 farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA FDFT1
4 ribosomal protein S11 (RPS11), mRNA RPS11
5 ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA ATP5O
6 cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA CMAS
7 heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA HNRPF
8 ribosomal protein L34 (RPL34), transcript variant 1, mRNA RPL34
9 ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA ATP5G3
10 ribosomal protein S15a (RPS15A), mRNA RPS15A
11 homeobox protein from AL590526 (LOC84528), mRNA LOC84528
12 clone MGC:10120 IMAGE:3900723, mRNA, complete cds <NA>
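I guess I took a shortcut, but if you can get away with it, match only the things in parentheses that look like gene symbols, i.e. only uppercase letters and digits.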
dd <- structure(list(ID = 1:12, Gene.Name = structure(c(3L, 11L, 9L, 5L, 1L, 8L, 2L, 4L, 6L, 12L, 10L, 7L), .Label = c(" ATP synt, H+ tran, O subunit (oligomycin sensitivity conferring protein) (ATP5O), mRNA", " heterogeneous nuclear ribonucleoprotein F (HNRPF), mRNA", " NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA", " ribosomal protein L34 (RPL34), transcript variant 1, mRNA", " ribosomal protein S11 (RPS11), mRNA", "ATP synthase, H+ tran, mitochondrial F0, subunit c (subunit 9) isoform 3 (ATP5G3), mRNA", "clone MGC:10120 IMAGE:3900723, mRNA, complete cds", "cytidine monophosphate N-acetylneuraminic acid synthetase (CMAS), mRNA", "farnesyl-diphosphate farnesyltransferase 1 (FDFT1), mRNA", "homeobox protein from AL590526 (LOC84528), mRNA", "mitochondrial S33 (MRPS33), transcript variant 1, nuclear gene, mRNA", "ribosomal protein S15a (RPS15A), mRNA"), class = "factor")), .Names = c("ID", "Gene.Name"), row.names = c(NA, -12L), class = "data.frame")
dd$Gene.Name <- as.character(dd$Gene.Name)
## case 1
mm <- gregexpr('(?<=\\()(.*?)(?=\\))', dd$Gene.Name, perl = TRUE)
mm <- regmatches(dd$Gene.Name, mm)
dd <- cbind(dd, case1 = sapply(mm, function(x)
ifelse(length(x), paste(x, collapse = ', '), NA)))
dd[, c(1,3)]
# ID case1
# 1 1 ubiquinone, 9kD, MLRQ, NDUFA4
# 2 2 MRPS33
# 3 3 FDFT1
# 4 4 RPS11
# 5 5 oligomycin sensitivity conferring protein, ATP5O
# 6 6 CMAS
# 7 7 HNRPF
# 8 8 RPL34
# 9 9 subunit 9, ATP5G3
# 10 10 RPS15A
# 11 11 LOC84528
# 12 12 <NA>
## case 2
mm <- gregexpr('(?<=\\()([A-Z0-9]+)(?=\\))', dd$Gene.Name, perl = TRUE)
mm <- regmatches(dd$Gene.Name, mm)
dd <- cbind(dd, case2 = sapply(mm, function(x) ifelse(length(x), x, NA)))
dd[, c(1,4)]
# ID case2
# 1 1 NDUFA4
# 2 2 MRPS33
# 3 3 FDFT1
# 4 4 RPS11
# 5 5 ATP5O
# 6 6 CMAS
# 7 7 HNRPF
# 8 8 RPL34
# 9 9 ATP5G3
# 10 10 RPS15A
# 11 11 LOC84528
# 12 12 <NA>
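The case 2 code above relies on gene symbols being made up only of uppercase letters and digits. If that assumption does not hold, a minimal sketch that keeps the last parenthesised group of each string, whatever it contains, could look like this. The vector `x` and the name `last_in_parens` are illustrative, not part of the original answer.

```r
# Illustrative input: three of the question's gene names
x <- c("NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA",
       "ribosomal protein S11 (RPS11), mRNA",
       "clone MGC:10120 IMAGE:3900723, mRNA, complete cds")

# All text found between "(" and ")" in each element (a list of character vectors)
m <- regmatches(x, gregexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE))

# Keep the last match per element, or NA when nothing was found
last_in_parens <- vapply(
  m,
  function(v) if (length(v)) v[length(v)] else NA_character_,
  character(1)
)
last_in_parens
```

Using `vapply` with `character(1)` guarantees a plain character vector, so the result can be assigned directly as a new column.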
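An option using sub to get the word inside the parentheses closest to the end of the string.
# df1 is the data frame from the question
Symbol <- sub('.*\\(([^\\)]+)\\)[^\\(]+$', '\\1', df1[,2])
df1$Symbol <- Symbol[1:nrow(df1)*NA^(!grepl('\\(', df1[,2]))]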
df1$Symbol
#[1] "NDUFA4" "MRPS33" "FDFT1" "RPS11" "ATP5O" "CMAS"
#[7] "HNRPF" "RPL34" "ATP5G3" "RPS15A" "LOC84528" NA
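The indexing step above leans on the fact that `NA^0` is 1 while `NA^1` is `NA`, so rows without a parenthesis get an `NA` index. A more explicit sketch of the same idea, using `ifelse` on a small illustrative vector (the names `x`, `has_paren`, `idx_mult` and `symbol` are made up for this example, and the trailing `[^(]*` is slightly more permissive than the pattern above):

```r
# Illustrative input: one name with a symbol, one without
x <- c("ribosomal protein S11 (RPS11), mRNA",
       "clone MGC:10120 IMAGE:3900723, mRNA, complete cds")

has_paren <- grepl("\\(", x)

# The multiplier used in the one-liner: 1 where a "(" exists, NA otherwise
idx_mult <- NA^(!has_paren)

# Same result written out explicitly: extract the last group, else NA
symbol <- ifelse(has_paren,
                 sub(".*\\(([^)]+)\\)[^(]*$", "\\1", x),
                 NA_character_)
symbol
```

The explicit version is a little longer but makes the handling of rows without parentheses easier to read.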
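Update
For the first case, i.e. extracting all the characters inside the parentheses and pasting them together with ",", one option is rm_round from qdapRegex. The output of rm_round is a list, so we use lapply/sapply to loop over the list. The strings that contain "," are identified with grep, we paste parentheses around them, and then join the strings with paste and collapse=', '. A convenient wrapper function for that is toString.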
library(qdapRegex)
df1$allSymbol <- sapply(rm_round(df1[,2],extract=TRUE), function(x) {
indx <- grep(',', x)
x[indx] <-paste0("(", x[indx], ")")
toString(x)})
is.na(df1$allSymbol) <- df1$allSymbol=='NA'
df1[3:4]
# allSymbol Symbol
#1 ubiquinone, (9kD, MLRQ), NDUFA4 NDUFA4
#2 MRPS33 MRPS33
#3 FDFT1 FDFT1
#4 RPS11 RPS11
#5 oligomycin sensitivity conferring protein, ATP5O ATP5O
#6 CMAS CMAS
#7 HNRPF HNRPF
#8 RPL34 RPL34
#9 subunit 9, ATP5G3 ATP5G3
#10 RPS15A RPS15A
#11 LOC84528 LOC84528
#12 <NA> <NA>
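If installing qdapRegex is not an option, the same output can probably be reproduced in base R. The following is a sketch under that assumption, not the answer's original method: it swaps `rm_round` for `gregexpr`/`regmatches`, and the names `x`, `groups` and `all_symbols` are illustrative.

```r
# Illustrative input: one name with several groups, one with none
x <- c("NADH (ubiquinone) 1 alpha subcomplex, 4 (9kD, MLRQ) (NDUFA4), mRNA",
       "clone MGC:10120 IMAGE:3900723, mRNA, complete cds")

# Text between each pair of parentheses, one character vector per element
groups <- regmatches(x, gregexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE))

all_symbols <- vapply(groups, function(v) {
  if (!length(v)) return(NA_character_)        # no parentheses at all
  has_comma <- grepl(",", v, fixed = TRUE)
  v[has_comma] <- paste0("(", v[has_comma], ")")  # re-wrap groups containing ","
  toString(v)                                   # join with ", "
}, character(1))
all_symbols
```

Returning `NA_character_` directly avoids the later step of converting the string "NA" back into a missing value.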