Regex with multiple patterns in a given character vector
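I have a character vector with many different formats (x, see below). They are all positions on the genome, but they are named differently. The names were given to me and belong to a list of about 6 million entries, so changing them by hand is not practical. This is a subset; other forms such as X1 or chr 13 are also part of the list: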
x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T","rs62224614.16053862.C.T","X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G" , "chr22.17030620.G.A")
I would like all of the strings to look like this:
y <- c("rs62224609", "rs376238049", "rs62224614", "chr22:17028719", "rs4535153", "chr22:17028719", "kgp3171179", "rs375850426", "chr22:17030620")
I tried the following, but it removes everything after the first "." in every string, which is not what I want.
x.test = gsub(pattern = "\\.\\S+$", replacement = "", x = x)
Any help would be appreciated!
If all of your data follow the examples you gave:
x = c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T","rs62224614.16053862.C.T","X22.17028719.G.A", "rs4535153", "X22.17028719.G.A", "kgp3171179", "rs375850426.17029070.GCAGTGGC.G" , "chr22.17030620.G.A")
There are two types of id: SNP ids (starting with rs or kgp) and chromosome positions (starting with the chromosome name).
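Before rewriting anything, it may be worth checking that every entry really falls into one of these two groups. The following is a minimal sketch, assuming the prefixes seen in the sample (rs, kgp, chr, X); the names is_snp, is_pos and unmatched are only illustrative.

```r
x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T",
       "rs62224614.16053862.C.T", "X22.17028719.G.A", "rs4535153",
       "X22.17028719.G.A", "kgp3171179",
       "rs375850426.17029070.GCAGTGGC.G", "chr22.17030620.G.A")

# TRUE where the string starts with an SNP id (rs or kgp followed by digits)
is_snp <- grepl("^(rs|kgp)\\d+", x)

# TRUE where the string starts with a chromosome prefix (assumed: chr or X)
is_pos <- grepl("^(chr|X)", x)

# Entries that match neither pattern and would need a closer look
unmatched <- x[!is_snp & !is_pos]
print(table(is_snp, is_pos))
print(unmatched)
```

On the full list, anything left in unmatched points to a format that the patterns below do not cover.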
You can start by picking out the SNP ids, for example:
x1 = gsub("((rs|kgp)\\d+).*","\\1",x)
This returns:
[1] "rs62224609" "rs376238049" "rs62224614" "X22.17028719.G.A" "rs4535153" "X22.17028719.G.A" "kgp3171179" "rs375850426" "chr22.17030620.G.A"
Then format the chromosome positions (I am assuming your chromosomes run from 1 to 22 plus X, Y and M, but that depends on your data):
## We look for [(chr OR X) (1 or 2 digits or X or Y or M) 1 or more punctuation marks (1 or more digits) anything] and
## we transform it into: [chr (the second captured element) : (the third captured element)]
x2 = gsub("(chr|X)(\\d{1,2}|X|Y|M)[[:punct:]]+(\\d+).*","chr\\2:\\3",x1)
This returns:
[1] "rs62224609" "rs376238049" "rs62224614" "chr22:17028719" "rs4535153" "chr22:17028719" "kgp3171179" "rs375850426" "chr22:17030620"
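The two steps above can be wrapped into a single function so they are easy to apply to the full list. This is a sketch rather than a definitive solution: normalize_ids is a hypothetical name, the patterns are anchored with ^ and $ as an extra precaution, and the optional whitespace after the prefix is an assumption meant to cover forms such as "chr 13" mentioned in the question.

```r
# Hypothetical helper: normalise mixed SNP / position identifiers.
normalize_ids <- function(ids) {
  # Step 1: keep only the leading SNP id (rs... or kgp...) when there is one.
  out <- sub("^((rs|kgp)\\d+).*$", "\\1", ids)
  # Step 2: rewrite chromosome positions as chr<name>:<position>.
  # "[[:space:]]*" tolerates a space after the prefix (assumption about the data).
  sub("^(chr|X)[[:space:]]*(\\d{1,2}|X|Y|M)[[:punct:]]+(\\d+).*$",
      "chr\\2:\\3", out)
}

x <- c("rs62224609.16051249.T.C", "rs376238049.16052962.C.T",
       "rs62224614.16053862.C.T", "X22.17028719.G.A", "rs4535153",
       "X22.17028719.G.A", "kgp3171179",
       "rs375850426.17029070.GCAGTGGC.G", "chr22.17030620.G.A")
y <- c("rs62224609", "rs376238049", "rs62224614", "chr22:17028719",
       "rs4535153", "chr22:17028719", "kgp3171179", "rs375850426",
       "chr22:17030620")

result <- normalize_ids(x)
print(identical(result, y))
```

Both sub() calls are vectorised, so the function can be applied to the whole vector in one call without an explicit loop.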