Separate two letter string in 4th column
I have a data frame, df, containing genomic data. The last column holds two-letter variants.
id crm pos allele
160841 rs2237282 11 1273948 AG
160842 rs6417577 11 1276796 AC
165677 rs2151342 11 1199626 GT
165678 rs2749240 11 1258025 AG
I would like to split the last column into two columns, one letter in each:
id crm pos allele allele2
160841 rs2237282 11 1273948 A G
160842 rs6417577 11 1276796 A C
165677 rs2151342 11 1199626 G T
165678 rs2749240 11 1258025 A G
I have tried the following with dplyr and tidyr in RStudio 1.1.419, R 3.4.3, without success:
- separate(df, allele, into=c("allele", "allele2"))
- separate(df, allele, into=c("allele", "allele2"), sep="")
- separate(df, allele, into=c("allele", "allele2"), sep="\c")
- separate(df, allele, into=c("allele", "allele2"), sep=".")
- separate(df, allele, into=c("allele", "allele2"), sep=.)
- separate(df, allele, into=c("allele", "allele2"), sep=\c)
How can I get the desired split?
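Using base R:
# prototype giving the names and types of the captured columns
HERE <- data.frame(A1 = character(), A2 = character())
# each "(.)" captures one letter of allele
cbind(df, strcapture("(.)(.)", df$allele, HERE))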
id crm pos allele A1 A2
160841 rs2237282 11 1273948 AG A G
160842 rs6417577 11 1276796 AC A C
165677 rs2151342 11 1199626 GT G T
165678 rs2749240 11 1258025 AG A G
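For anyone who wants to run the base R approach end to end, here is a minimal sketch. The data frame is rebuilt by hand from the sample rows in the question, `proto` plays the role of `HERE` above, and `stringsAsFactors = FALSE` is an assumption added here so that `allele` is a character vector on R 3.x.

```r
# Rebuild the sample data from the question (first field used as row names).
df <- data.frame(
  id = c("rs2237282", "rs6417577", "rs2151342", "rs2749240"),
  crm = c(11L, 11L, 11L, 11L),
  pos = c(1273948L, 1276796L, 1199626L, 1258025L),
  allele = c("AG", "AC", "GT", "AG"),
  row.names = c("160841", "160842", "165677", "165678"),
  stringsAsFactors = FALSE  # assumption: keep allele as character
)

# Prototype: names and types of the columns strcapture() should return.
proto <- data.frame(A1 = character(), A2 = character(),
                    stringsAsFactors = FALSE)

# Each "(.)" is a capture group holding exactly one character of allele.
res <- cbind(df, strcapture("(.)(.)", df$allele, proto))
print(res)
```

Note that `strcapture` was added in R 3.4.0, so this should not be expected to work on older installations.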
library(tidyverse)
# take the second letter first, then overwrite allele with the first letter
df %>%
  mutate(allele2 = substr(allele, 2, 2)) %>%
  mutate(allele = substr(allele, 1, 1))
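The order of the two steps matters: the second letter has to be read before `allele` is overwritten. A minimal sketch of the same idea in a single `mutate()` call, using a cut-down data frame with only the `allele` column (an assumption made here to keep the example short):

```r
library(dplyr)

# Cut-down sample data: only the column being split.
df <- data.frame(allele = c("AG", "AC", "GT", "AG"),
                 stringsAsFactors = FALSE)

# mutate() evaluates its arguments in order, so allele2 is computed from the
# original two-letter string before allele is replaced by its first letter.
res <- df %>%
  mutate(allele2 = substr(allele, 2, 2),
         allele  = substr(allele, 1, 1))
print(res)
```

Swapping the two arguments would leave `allele2` empty, because `substr(allele, 2, 2)` would then see a one-letter string.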
In separate, the sep argument can be a number giving the character position at which to split, so:
separate(df, allele, into = c("allele1", "allele2"), sep = 1)
gives:
id crm pos allele1 allele2
160841 rs2237282 11 1273948 A G
160842 rs6417577 11 1276796 A C
165677 rs2151342 11 1199626 G T
165678 rs2749240 11 1258025 A G
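A likely reason the `sep="."` attempt in the question fails is that a character `sep` is treated as a regular expression, and `.` matches any character, so both letters are consumed as separators. The sketch below shows the numeric form on a cut-down data frame (an assumption for brevity) and uses base R `strsplit` only to illustrate the regex pitfall, not to reproduce what `separate` does internally.

```r
library(tidyr)

# Cut-down sample data: only the column being split.
df <- data.frame(allele = c("AG", "AC", "GT", "AG"),
                 stringsAsFactors = FALSE)

# Numeric sep: split after the first character.
res <- separate(df, allele, into = c("allele1", "allele2"), sep = 1)
print(res)

# Base R illustration of the regex pitfall: "." matches any character,
# so every letter acts as a separator and only empty strings are left.
pieces <- strsplit("AG", ".")[[1]]
print(pieces)
```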
Besides separate, extract is another option in the tidyr package. It works by specifying capture groups in the regex argument.
library(tidyr)
df %>%
  extract(allele, into = c("allele1", "allele2"), regex = "([ATCG])([ATCG])")
# id crm pos allele1 allele2
# 160841 rs2237282 11 1273948 A G
# 160842 rs6417577 11 1276796 A C
# 165677 rs2151342 11 1199626 G T
# 165678 rs2749240 11 1258025 A G
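One thing to keep in mind with the stricter `[ATCG]` pattern is that rows that do not match it come back as `NA` rather than raising an error. A minimal sketch, where the value `"AN"` is made up for illustration and is not part of the question's data:

```r
library(tidyr)

# "AN" is a hypothetical value that the [ATCG] character classes reject.
df <- data.frame(allele = c("AG", "AC", "GT", "AN"),
                 stringsAsFactors = FALSE)

# tidyr::extract is spelled out to avoid clashes with other packages
# that also export a function named extract.
res <- tidyr::extract(df, allele, into = c("allele1", "allele2"),
                      regex = "([ATCG])([ATCG])")
print(res)
```

If codes outside A, T, C and G can occur in the data, a looser pattern such as `"(.)(.)"` may be the safer choice.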