我可以在 R 中进行子集化操作列吗?
Can I manipulate a column whilst subsetting in R?
我有一个包含列名的逻辑回归汇总统计数据框
"CHR" "SNP" "BP" "A1" "TEST" "NMISS" "OR" "STAT" "P"
我想制作一个包含三列的新数据框:
"SNP" "A1"
和 "logOR"
执行此操作的明显方法是创建一个新列 logOR,然后简单地对这 3 列进行子集化。
但是,我想知道是否可以在子集过程中执行log(OR)?
我试过:
raw<-c("SNP","A1","log(OR)")
data.raw<-data[,raw]
R对此印象不深。
提前致谢!
使用 with
是一种很好且简单的方法:
dat.raw <- with(data, data.frame(SNP,A1,log(OR)))
最快最干净的方法 (imo) 是使用基函数 transform
transform(data,logOR = log(OR))[c("SNP","A1","logOR")]
奖金
还有其他方法可以做到这一点,我已经将一些方法相互进行了基准测试,并给出了大小数据集(1000 行或 100000)的结果。
transform
在任何情况下都是最快的。在这种情况下,它是一个基本函数,其行为与 mutate 完全相同。
with
在这里对我来说意义不大 "philosophically",但它是 data.table
之后最短的一行,并且当大小增加时,它的性能几乎与 transform
相当。
小data.frames
library(microbenchmark)
n <- 1000
data <- data.frame("CHR"=sample(1:n),"SNP"=sample(1:n),"BP"=sample(1:n),"A1"=sample(1:n),"TEST"=sample(1:n),
"NMISS"=sample(1:n),"OR"=sample(1:n),"STAT"=sample(1:n),"P"=sample(1:n))
data2 <- as.data.table(data)
microbenchmark(
transform = transform(data,logOR = log(OR))[c("SNP","A1","logOR")],
within = within (data,logOR <- log(OR))[c("SNP","A1","logOR")],
with = with (data, data.frame(SNP,A1,logOR=log(OR))), # jkt's solution
mutate = mutate(data,logOR = log(OR))[c("SNP","A1","logOR")], # mutate will behave exactly the same as transform in this case
mutate_p = data %>% mutate(logOR = log(OR)) %>% select(SNP, A1, logOR), # same function but with the pipe syntax as formulated by Craig did in the comments
data.table = as.data.table(data)[,logOR := log(OR)][,.(SNP,A1,logOR)], # data.table with conversion
data.table2 = data2[,logOR := log(OR)][,.(SNP,A1,logOR)], # data.table without conversion, this adds logOR to data2 however
times = 1000)
# Unit: microseconds
# expr min lq mean median uq max neval
# transform 202.086 243.4945 281.1694 263.3140 286.6725 6781.367 1000
# within 290.919 353.2080 395.3183 373.5580 397.4480 7039.017 1000
# with 279.948 337.8130 406.2508 361.8790 392.1390 7601.388 1000
# mutate 912.040 1056.2610 1215.2035 1107.4010 1185.4395 8148.541 1000
# mutate_p 1283.297 1516.7040 1741.8224 1584.3020 1710.2950 33254.564 1000
# data.table 938.584 1058.5610 1175.6758 1116.7795 1214.4605 5079.035 1000
# data.table2 819.314 935.5755 1086.9992 993.6175 1084.0425 27160.856 1000
更大data.frames
n <- 100000
...
# Unit: milliseconds
# expr min lq mean median uq max neval
# transform 3.005094 3.320254 3.978661 3.548707 3.815381 14.87116 1000
# within 3.252126 3.618074 4.542457 3.929165 4.275118 99.77254 1000
# with 3.102066 3.413511 4.229389 3.653466 3.937482 89.80346 1000
# mutate 3.803171 4.221853 4.931597 4.474195 4.815546 26.43214 1000
# mutate_p 4.283788 4.754672 5.622917 4.996396 5.366238 92.74237 1000
# data.table 4.831649 6.336141 9.911754 8.212245 12.283330 102.13386 1000
# data.table2 3.997825 4.749894 6.677897 5.456840 6.125562 116.99369 1000
编辑:添加了data.table解决方案
我有一个包含列名的逻辑回归汇总统计数据框
"CHR" "SNP" "BP" "A1" "TEST" "NMISS" "OR" "STAT" "P"
我想制作一个包含三列的新数据框:
"SNP" "A1"
和 "logOR"
执行此操作的明显方法是创建一个新列 logOR,然后简单地对这 3 列进行子集化。
但是,我想知道是否可以在子集过程中执行log(OR)?
我试过:
raw<-c("SNP","A1","log(OR)")
data.raw<-data[,raw]
R对此印象不深。
提前致谢!
使用 with
是一种很好且简单的方法:
dat.raw <- with(data, data.frame(SNP,A1,log(OR)))
最快最干净的方法 (imo) 是使用基函数 transform
transform(data,logOR = log(OR))[c("SNP","A1","logOR")]
奖金
还有其他方法可以做到这一点,我已经将一些方法相互进行了基准测试,并给出了大小数据集(1000 行或 100000)的结果。
transform
在任何情况下都是最快的。在这种情况下,它是一个基本函数,其行为与 mutate 完全相同。
with
在这里对我来说意义不大 "philosophically",但它是 data.table
之后最短的一行,并且当大小增加时,它的性能几乎与 transform
相当。
小data.frames
library(microbenchmark)
n <- 1000
data <- data.frame("CHR"=sample(1:n),"SNP"=sample(1:n),"BP"=sample(1:n),"A1"=sample(1:n),"TEST"=sample(1:n),
"NMISS"=sample(1:n),"OR"=sample(1:n),"STAT"=sample(1:n),"P"=sample(1:n))
data2 <- as.data.table(data)
microbenchmark(
transform = transform(data,logOR = log(OR))[c("SNP","A1","logOR")],
within = within (data,logOR <- log(OR))[c("SNP","A1","logOR")],
with = with (data, data.frame(SNP,A1,logOR=log(OR))), # jkt's solution
mutate = mutate(data,logOR = log(OR))[c("SNP","A1","logOR")], # mutate will behave exactly the same as transform in this case
mutate_p = data %>% mutate(logOR = log(OR)) %>% select(SNP, A1, logOR), # same function but with the pipe syntax as formulated by Craig did in the comments
data.table = as.data.table(data)[,logOR := log(OR)][,.(SNP,A1,logOR)], # data.table with conversion
data.table2 = data2[,logOR := log(OR)][,.(SNP,A1,logOR)], # data.table without conversion, this adds logOR to data2 however
times = 1000)
# Unit: microseconds
# expr min lq mean median uq max neval
# transform 202.086 243.4945 281.1694 263.3140 286.6725 6781.367 1000
# within 290.919 353.2080 395.3183 373.5580 397.4480 7039.017 1000
# with 279.948 337.8130 406.2508 361.8790 392.1390 7601.388 1000
# mutate 912.040 1056.2610 1215.2035 1107.4010 1185.4395 8148.541 1000
# mutate_p 1283.297 1516.7040 1741.8224 1584.3020 1710.2950 33254.564 1000
# data.table 938.584 1058.5610 1175.6758 1116.7795 1214.4605 5079.035 1000
# data.table2 819.314 935.5755 1086.9992 993.6175 1084.0425 27160.856 1000
更大data.frames
n <- 100000
...
# Unit: milliseconds
# expr min lq mean median uq max neval
# transform 3.005094 3.320254 3.978661 3.548707 3.815381 14.87116 1000
# within 3.252126 3.618074 4.542457 3.929165 4.275118 99.77254 1000
# with 3.102066 3.413511 4.229389 3.653466 3.937482 89.80346 1000
# mutate 3.803171 4.221853 4.931597 4.474195 4.815546 26.43214 1000
# mutate_p 4.283788 4.754672 5.622917 4.996396 5.366238 92.74237 1000
# data.table 4.831649 6.336141 9.911754 8.212245 12.283330 102.13386 1000
# data.table2 3.997825 4.749894 6.677897 5.456840 6.125562 116.99369 1000
编辑:添加了data.table解决方案