Can I manipulate a column whilst subsetting in R?

I have a data frame of logistic regression summary statistics with the following column names:

"CHR" "SNP" "BP" "A1" "TEST" "NMISS" "OR" "STAT" "P"

I would like to make a new data frame with three columns:

"SNP" "A1" "logOR"

The obvious way to do this is to create a new column, logOR, and then simply subset those 3 columns.

However, I was wondering whether it is possible to compute log(OR) during the subsetting itself?

I tried:

raw<-c("SNP","A1","log(OR)")

data.raw<-data[,raw]

R wasn't impressed with this.

Thanks in advance!

Using with is a nice and simple way to do this:

dat.raw <- with(data, data.frame(SNP, A1, logOR = log(OR)))
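
To see what this kind of call returns, here is a minimal sketch on a made-up three-row data frame (the name `toy` and its values are invented for illustration). It also shows why naming the computed argument is worthwhile: an unnamed expression such as log(OR) gets an automatically generated column name instead of logOR.

```r
# Toy stand-in for the summary statistics (invented values, not real data)
toy <- data.frame(SNP = c("rs1", "rs2", "rs3"),
                  A1  = c("A", "G", "T"),
                  OR  = c(0.5, 1, 2))

# Unnamed expression: data.frame() derives a syntactic name from "log(OR)"
unnamed <- with(toy, data.frame(SNP, A1, log(OR)))
names(unnamed)

# Named argument: the new column is called logOR
named <- with(toy, data.frame(SNP, A1, logOR = log(OR)))
names(named)
```

Printing the two name vectors shows the difference; the values in the third column are the same either way.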

The fastest and cleanest way (imo) is to use the base function transform:

transform(data,logOR =  log(OR))[c("SNP","A1","logOR")]
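
As a small sketch of what this one-liner does, the example below runs it on an invented data frame (`toy` and its values are assumptions for illustration only). transform returns a modified copy, and indexing with a character vector keeps just the named columns, so the input is left as it was.

```r
# Toy stand-in for the summary statistics (invented values, not real data)
toy <- data.frame(SNP = c("rs1", "rs2", "rs3"),
                  A1  = c("A", "G", "T"),
                  OR  = c(0.5, 1, 2),
                  P   = c(0.01, 0.5, 0.03))

# Add logOR to a copy of toy, then keep only the three requested columns
res <- transform(toy, logOR = log(OR))[c("SNP", "A1", "logOR")]

names(res)  # the three requested columns
names(toy)  # toy itself has no logOR column
```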

Bonus

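There are other ways to do this. I benchmarked several of them against each other, and the results for a small and a large data set (1,000 and 100,000 rows) are given below.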

transform is the fastest in both cases. It is a base function that, in this case, behaves exactly the same as mutate.

with makes less sense to me here "philosophically", but it is the shortest line after data.table, and its performance is almost on par with transform as the size increases.

Small data.frames

library(microbenchmark)
library(dplyr)       # mutate, select, %>%
library(data.table)  # as.data.table, :=
n <- 1000
data <- data.frame("CHR"=sample(1:n),"SNP"=sample(1:n),"BP"=sample(1:n),"A1"=sample(1:n),"TEST"=sample(1:n),
                   "NMISS"=sample(1:n),"OR"=sample(1:n),"STAT"=sample(1:n),"P"=sample(1:n))
data2 <- as.data.table(data)

microbenchmark(
  transform   = transform(data,logOR =  log(OR))[c("SNP","A1","logOR")],
  within      = within   (data,logOR <- log(OR))[c("SNP","A1","logOR")],
  with        = with     (data, data.frame(SNP,A1,logOR=log(OR))),               # jkt's solution
  mutate      = mutate(data,logOR = log(OR))[c("SNP","A1","logOR")],             # mutate will behave exactly the same as transform in this case
  mutate_p    = data %>% mutate(logOR = log(OR)) %>% select(SNP, A1, logOR),     # same function with the pipe syntax, as suggested by Craig in the comments
  data.table  = as.data.table(data)[,logOR :=  log(OR)][,.(SNP,A1,logOR)],       # data.table with conversion
  data.table2 = data2[,logOR :=  log(OR)][,.(SNP,A1,logOR)],                     # data.table without conversion, this adds logOR to data2 however
  times = 1000)

# Unit: microseconds
#       expr      min        lq      mean    median        uq       max neval
#   transform  202.086  243.4945  281.1694  263.3140  286.6725  6781.367  1000
#      within  290.919  353.2080  395.3183  373.5580  397.4480  7039.017  1000
#        with  279.948  337.8130  406.2508  361.8790  392.1390  7601.388  1000
#      mutate  912.040 1056.2610 1215.2035 1107.4010 1185.4395  8148.541  1000
#    mutate_p 1283.297 1516.7040 1741.8224 1584.3020 1710.2950 33254.564  1000
#  data.table  938.584 1058.5610 1175.6758 1116.7795 1214.4605  5079.035  1000
# data.table2  819.314  935.5755 1086.9992  993.6175 1084.0425 27160.856  1000

Larger data.frames

n <- 100000
...
# Unit: milliseconds
#        expr      min       lq     mean   median        uq       max neval
#   transform 3.005094 3.320254 3.978661 3.548707  3.815381  14.87116  1000
#      within 3.252126 3.618074 4.542457 3.929165  4.275118  99.77254  1000
#        with 3.102066 3.413511 4.229389 3.653466  3.937482  89.80346  1000
#      mutate 3.803171 4.221853 4.931597 4.474195  4.815546  26.43214  1000
#    mutate_p 4.283788 4.754672 5.622917 4.996396  5.366238  92.74237  1000
#  data.table 4.831649 6.336141 9.911754 8.212245 12.283330 102.13386  1000
# data.table2 3.997825 4.749894 6.677897 5.456840  6.125562 116.99369  1000

Edit: added the data.table solutions
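
As noted in the benchmark comments, the data.table2 variant uses `:=`, which adds logOR to data2 by reference. If that side effect is unwanted, one possible alternative is to compute the column directly inside `j`. This is only a sketch on an invented table (`toy` and its values are assumptions), it needs the data.table package, and it was not part of the benchmark above, so nothing is claimed here about its speed.

```r
library(data.table)  # third-party package, assumed to be installed

# Toy stand-in for the summary statistics (invented values, not real data)
toy <- data.table(SNP = c("rs1", "rs2", "rs3"),
                  A1  = c("A", "G", "T"),
                  OR  = c(0.5, 1, 2))

# Compute and select in one step inside j; without := toy is not modified
res <- toy[, .(SNP, A1, logOR = log(OR))]

names(res)  # the three requested columns
names(toy)  # toy still has only its original columns
```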