How to replace specific string values for several files in R?
I have 50 files (each with 1 to 2 million rows), and all of them have a variant_id column that I want to change. The files are laid out like this:
variant_id ...
chr1_665098_G_A_b38 ...
chr2_665097_C_T_b38 ...
chr3_665094_A_GG_b38 ...
chr10_23458_TTTCAAG_C_b38 ...
I want to edit the variant_id column to:
variant_id
1:665098
2:665097
3:665094
10:23458
I am trying to make this change to all of my files at once with:
library(data.table)
library(dplyr)
#Read in all files
temp = list.files(pattern="*.txt")
for (i in 1:length(temp)) assign(temp[i], fread(temp[i]))
#Edit variant_id strings for every dataset in environment
my_func <- function(x) {
  x <- x %>%
    select(variant_id, pval_nominal) %>%
    mutate(variant_id = sub("^([^-]*-[^-]*).*", "\\1", variant_id))
}
e <- .GlobalEnv
nms <- ls(pattern = ".txt$", envir = e)
for(nm in nms) e[[nm]] <- my_func(e[[nm]])
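I am stuck at mutate(variant_id = sub("^([^-]*-[^-]*).*", "\\1", variant_id)). I am not sure how best to use sub to make all the changes I need: chr is removed, the first _ becomes :, and every character after the second numeric value is deleted. How can I get this to work? Is there a better function to try? Any help is appreciated.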
Example input data:
df <- structure(list(variant_id = c("chr1_665098_G_A_b38", "chr2_665097_C_T_b38",
"chr3_665094_A_GG_b38", "chr10_23458_TTTCAAG_C_b38\xca")), row.names = c(NA,
-4L), class = c("data.table", "data.frame"))
We can use sub to capture the characters and replace them with backreferences to the capture groups:
library(data.table)
df[, variant_id := sub("chr(\\d+)_(\\d+)_.*", "\\1:\\2", variant_id)]
Output:
> df
variant_id
1: 1:665098
2: 2:665097
3: 3:665094
4: 10:23458
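To make the pattern easier to follow, here is a minimal base R sketch of the same substitution on a plain character vector, with each part of the regex annotated. The vector name ids is only for illustration, and the trailing \xca byte from the sample data is left out here on the assumption that it is an encoding artifact. The last lines show why the pattern from the question changes nothing: it looks for "-", while the IDs are separated by "_".

```r
# Sample IDs taken from the question (without the trailing "\xca" byte)
ids <- c("chr1_665098_G_A_b38", "chr2_665097_C_T_b38",
         "chr3_665094_A_GG_b38", "chr10_23458_TTTCAAG_C_b38")

# chr      literal prefix; outside the groups, so it is dropped
# (\\d+)   group 1: chromosome number
# _        first underscore; the replacement puts ":" here instead
# (\\d+)   group 2: position
# _.*      everything after the second number; dropped
new_ids <- sub("chr(\\d+)_(\\d+)_.*", "\\1:\\2", ids)
print(new_ids)
# [1] "1:665098" "2:665097" "3:665094" "10:23458"

# The pattern from the question requires a "-", which never occurs,
# so sub() finds no match and returns the input unchanged
unchanged <- sub("^([^-]*-[^-]*).*", "\\1", ids)
print(identical(unchanged, ids))
# [1] TRUE
```

Note that \\d+ only matches numeric chromosome names. If the real files also contain IDs such as chrX_..., those rows would not match and would be returned as they are.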
If there are multiple files, read them into a list and keep the results in that list:
lst1 <- lapply(temp, function(x) fread(x)[,
  variant_id := sub("chr(\\d+)_(\\d+)_.*", "\\1:\\2", variant_id)][])
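If the converted tables also need to go back to disk, one possible continuation is sketched below. Everything about the setup is an assumption for the sake of a runnable example: the folder names variant_demo and converted, the two small demo files, and the tab separator are all hypothetical and should be adapted to the real files.

```r
library(data.table)

# Hypothetical setup: two small tab-separated files in a temporary folder
in_dir <- file.path(tempdir(), "variant_demo")
dir.create(in_dir, showWarnings = FALSE)
demo <- data.table(
  variant_id   = c("chr1_665098_G_A_b38", "chr10_23458_TTTCAAG_C_b38"),
  pval_nominal = c(0.005, 0.01)
)
fwrite(demo, file.path(in_dir, "a.txt"), sep = "\t")
fwrite(demo, file.path(in_dir, "b.txt"), sep = "\t")

# list.files() takes a regular expression, so anchor on the ".txt" suffix
files <- list.files(in_dir, pattern = "\\.txt$", full.names = TRUE)

# Read and convert each file, then name each element after its source file
lst1 <- lapply(files, function(x) {
  fread(x, sep = "\t")[, variant_id := sub("chr(\\d+)_(\\d+)_.*", "\\1:\\2", variant_id)][]
})
names(lst1) <- basename(files)

# Write each converted table to a separate folder so the originals stay intact
out_dir <- file.path(in_dir, "converted")
dir.create(out_dir, showWarnings = FALSE)
for (nm in names(lst1)) {
  fwrite(lst1[[nm]], file.path(out_dir, nm), sep = "\t")
}
```

Naming the list elements keeps track of which table came from which file, and writing to a separate folder avoids overwriting the inputs if the pattern turns out to miss some rows.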
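Here is a fully reproducible example for your case.
The goal is to show you not only another possible regex solution but also another way to set up your code.
I noticed that you select 2 specific columns in your function, so I added that option to the code.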
# reproducible example
df <- data.frame(variant_id = c("chr1_665098_G_A_b38", "chr2_665097_C_T_b38",
"chr3_665094_A_GG_b38", "chr10_23458_TTTCAAG_C_b38\xca"),
pval_nominal = c(0.005,0.01),
filler = letters[1:2])
folder <- tempdir()
write.csv(df, file.path(folder, "test1.txt"))
write.csv(df, file.path(folder, "test2.txt"))
# library
library(data.table)
# read all files: use full paths! you'll avoid a lot of issues
temp <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)
# read the files with lapply and collect them in a list
l <- lapply(temp, fread, sep = ",")
# select columns and modify variant_id
# if you use data.table you generally want to stick with it rather than mixing it with dplyr, and vice versa (but that is up to you)
l <- lapply(l, function(d) d[, .(variant_id = sub("^\\D+(\\d+)_(\\d+).*", "\\1:\\2", variant_id), pval_nominal)])
l
#> [[1]]
#> variant_id pval_nominal
#> 1: 1:665098 0.005
#> 2: 2:665097 0.010
#> 3: 3:665094 0.005
#> 4: 10:23458 0.010
#>
#> [[2]]
#> variant_id pval_nominal
#> 1: 1:665098 0.005
#> 2: 2:665097 0.010
#> 3: 3:665094 0.005
#> 4: 10:23458 0.010
Created on 2021-11-18 by the reprex package (v2.0.0)
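On the question of whether there is a better function to try: one regex-free option is data.table's tstrsplit, which splits on a fixed separator and can keep only selected pieces. The sketch below is an alternative to the answers above, not a replacement for them. The table dt is made up for illustration and leaves out the trailing \xca byte from the sample data.

```r
library(data.table)

# Illustrative table modeled on the question's data
dt <- data.table(
  variant_id   = c("chr1_665098_G_A_b38", "chr2_665097_C_T_b38",
                   "chr3_665094_A_GG_b38", "chr10_23458_TTTCAAG_C_b38"),
  pval_nominal = c(0.005, 0.01, 0.005, 0.01)
)

# Split on the literal "_" and keep only pieces 1 and 2 (chromosome, position)
parts <- tstrsplit(dt$variant_id, "_", fixed = TRUE, keep = 1:2)

# Strip the "chr" prefix from the first piece and join the two with ":"
dt[, variant_id := paste0(sub("^chr", "", parts[[1]]), ":", parts[[2]])]
print(dt)
```

Because this approach does not require the chromosome to be numeric, an ID such as chrX_100_A_T_b38 would become X:100, which may or may not be what the downstream tools expect.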