Apply function by row in data.table using columns as arguments
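I am trying to apply a function by row in a data.table, using columns as arguments. I am currently using apply, as suggested here
However, my data.table is 27 million rows by 7 columns, so when I run it recursively over many input files, the apply operation takes a very long time and the job consumes all available RAM (32 GB). I am probably copying the data.table several times, although I am not sure.
Each input file is about 30 million rows by 7 columns, and there are 30 input files to process, so I would like help making this code more memory efficient. I am fairly sure the lines using apply slow down the whole script, so an alternative that is more memory efficient or uses a vectorized function would probably be a better choice.
I have had a lot of trouble writing a vectorized function with data.table that takes 4 columns as arguments and operates row by row. The apply solution in my example code works, but it is slow. One option I tried is: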
cols=c("C","T","A","G")
func1<-function(x)x[max1(x)]
datU[,high1a:=func1(cols),by=1:nrow(datU)]
But the first 6 rows of the resulting datU data.table look like this:
Cycle Tab    ID     colA     colB     colC     colG    high1 high1a
    1   0 45513 -233.781  -84.087   -3.141 3740.916 3740.916   colC
    2   0 45513 -103.561 -347.382 2900.866  357.071 2900.866   colC
    3   0 45513  153.383 4036.636  353.479  -42.736 4036.636   colC
    4   0 45513 -147.941   28.994 4354.994  384.945 4354.994   colC
    5   0 45513  -89.719 -504.643 1298.476   131.32 1298.476   colC
    6   0 45513  -250.11  -30.862 1877.049 -184.772 1877.049   colC
Here is my working code using apply (it produced the high1 column above), but it is too slow and uses a lot of memory:
library(data.table)
#Get input files from top directory, searching through all subdirectories
file_list <- list.files(pattern = "*.test.txt", recursive=TRUE, full.names=TRUE)
#Make a loop to recursively read files from subdirectories, determine highest and second highest values in specified columns, create new column with those values
savelist=NULL
for (i in file_list) {
  datU <- fread(i)
  name=dirname(i)
  #Compute highest and second highest for each row (cols 4,5,6,7) and the difference between highest and second highest values
  #maxn(n) returns a function giving the position of the n-th largest element of x
  maxn <- function(n) function(x) order(x, decreasing = TRUE)[n]
  max1 <- maxn(1)
  max2 <- maxn(2)
  colNum=c(4,5,6,7)
  datU[,high1:=apply(datU[,colNum,with=FALSE],1,function(x)x[max1(x)])]
  datU[,high2:=apply(datU[,colNum,with=FALSE],1,function(x)x[max2(x)])]
  datU[,difference:=high1-high2,by=1:nrow(datU)]
  datU[,folder:=name]
  savelist[[i]]<-datU
}
#Create loop to iterate over folders and output data
sigout=NULL
for (i in savelist) {
  # Do some stuff to manipulate data frames, then merge them for output
  setkey(i,Cycle,folder)
  Sums<-i[,sum(colA,colB,colC,colG),by=list(Cycle,folder)]
  MeanTot<-Sums[,round(mean(V1),3),by=list(Cycle,folder)]
  MeanTotsd<-Sums[,round(sd(V1),3),by=list(Cycle,folder)]
  Meandiff<-i[,list(meandiff=mean(difference)),by=list(Cycle,folder)]
  Meandiffsd<-i[,list(meandiff=sd(difference)),by=list(Cycle,folder)]
  df1out<-merge(MeanTot,MeanTotsd,by=c("Cycle","folder"))
  df2out<-merge(Meandiff,Meandiffsd,by=c("Cycle","folder"))
  sigout<-merge(df1out,df2out)
  #Output values
  write.table(sigout,"Sigout.txt",append=TRUE,quote=FALSE,sep=",",row.names=FALSE,col.names=TRUE)
}
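I would love some examples of alternatives to apply that give me the highest and second-highest values in each row across columns 4, 5, 6 and 7, where the columns can be identified by index or by name.
Thanks!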
You can do this:
DF <- read.table(text = " Cycle Tab ID colA colB colC colG high1 high1a
1 0 45513 -233.781 -84.087 -3.141 3740.916 3740.916 colC
2 0 45513 -103.561 -347.382 2900.866 357.071 2900.866 colC
3 0 45513 153.383 4036.636 353.479 -42.736 4036.636 colC
4 0 45513 -147.941 28.994 4354.994 384.945 4354.994 colC
5 0 45513 -89.719 -504.643 1298.476 131.32 1298.476 colC
6 0 45513 -250.11 -30.862 1877.049 -184.772 1877.049 colC", header = TRUE)
library(data.table)
setDT(DF)
maxTwo <- function(x) {
  #the index is equal for all rows,
  #so it could be made a function parameter
  #for better efficiency
  ind <- length(x) - (1:0)
  #partial sorting; returns list(second-highest, highest)
  as.list(sort.int(x, partial = ind)[ind])
}
DF[, paste0("max", 1:2) := maxTwo(unlist(.SD)),
   by = seq_len(nrow(DF)), .SDcols = 4:7]
DF[, diffMax := max2 - max1]
# Cycle Tab ID colA colB colC colG high1 high1a max1 max2 diffMax
#1: 1 0 45513 -233.781 -84.087 -3.141 3740.916 3740.916 colC -3.141 3740.916 3744.057
#2: 2 0 45513 -103.561 -347.382 2900.866 357.071 2900.866 colC 357.071 2900.866 2543.795
#3: 3 0 45513 153.383 4036.636 353.479 -42.736 4036.636 colC 353.479 4036.636 3683.157
#4: 4 0 45513 -147.941 28.994 4354.994 384.945 4354.994 colC 384.945 4354.994 3970.049
#5: 5 0 45513 -89.719 -504.643 1298.476 131.320 1298.476 colC 131.320 1298.476 1167.156
#6: 6 0 45513 -250.110 -30.862 1877.049 -184.772 1877.049 colC -30.862 1877.049 1907.911
However, you are still looping over the rows, which means nrow calls to the function. You could try Rcpp to do the looping in compiled code.
It depends on how you want to handle duplicates. For example, if you don't have any, or want to group them together, you can do this:
d = data.table(a = 1:4, b = 4:1, c = c(2,1,1,4))
# a b c
#1: 1 4 2
#2: 2 3 1
#3: 3 2 1
#4: 4 1 4
high1 = do.call(pmax, d)
#[1] 4 3 3 4
high2 = do.call(pmax, d * (d != high1))
#[1] 2 2 2 1
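One caveat with the multiplication trick: it turns the row maximum into 0, so it only gives the right answer when the second-highest value is non-negative, and the sample in the question has plenty of negative values. Below is a minimal sketch of a variant that masks the maximum with -Inf instead, still grouping ties together. The toy table and the names m, high1 and high2 are made up for illustration.

```r
library(data.table)

# Hypothetical toy data; the first row contains only negative values.
d <- data.table(a = c(-1, 2, 3), b = c(-4, 3, 2), c = c(-2, 1, 1))

# Row-wise maximum across all columns.
high1 <- do.call(pmax, d)

# Copy to a matrix and overwrite every cell equal to its row maximum with -Inf.
# Matrices are stored column by column, so high1 (one value per row)
# is recycled correctly down each column in the comparison.
m <- as.matrix(d)
m[m == high1] <- -Inf

# The row-wise maximum of what is left is the second-highest distinct value.
high2 <- do.call(pmax, as.data.frame(m))

high1
#[1] -1  3  3
high2
#[1] -2  2  2
```

If every value in a row is the same, high2 comes out as -Inf for that row. Note that as.matrix() copies the selected columns, so this costs extra memory on very large tables.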
Otherwise, you can add some jitter below your precision (I chose a large amount to make it visible):
d.jitter = d + runif(nrow(d) * ncol(d), 0, 1e-4)
# a b c
#1: 1.000044 4.000090 2.000008
#2: 2.000076 3.000029 1.000034
#3: 3.000007 2.000029 1.000036
#4: 4.000001 1.000069 4.000041
high1.j = do.call(pmax, d.jitter)
high2 = do.call(pmax, d * (d.jitter != high1.j))
#[1] 2 2 2 4
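If random jitter feels fragile, a deterministic way to count duplicates separately might be to locate one maximum per row with base R's max.col(), blank out only that cell, and take the maximum again. This is a sketch on the same toy table, not something I have benchmarked; rows, idx1 and idx2 are names chosen here for illustration.

```r
library(data.table)

d <- data.table(a = 1:4, b = 4:1, c = c(2, 1, 1, 4))
m <- as.matrix(d)
rows <- seq_len(nrow(m))

# (row, column) position of the first maximum in each row.
idx1 <- cbind(rows, max.col(m, ties.method = "first"))
high1 <- m[idx1]
#[1] 4 3 3 4

# Blank out only that single cell, so a tied value in another column survives.
m[idx1] <- -Inf
idx2 <- cbind(rows, max.col(m, ties.method = "first"))
high2 <- m[idx2]
#[1] 2 2 2 4
```

As a side effect, idx1[, 2] holds the column index of each row's maximum, and colnames(m)[idx1[, 2]] gives the corresponding column name.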
Translating this to the relevant .SD and .SDcols semantics is left as a simple exercise for the reader.
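For what it's worth, here is one way that translation might look, adding the columns by reference with := so the table itself is not copied. It is a sketch on three rows of the question's sample and has not been timed on 27 million rows. It groups ties together and masks the row maximum with -Inf rather than multiplying by zero, because the sample contains negative values. DT and cols are names chosen here for illustration.

```r
library(data.table)

# First three rows of the sample from the question.
DT <- data.table(
  Cycle = 1:3, Tab = 0L, ID = 45513L,
  colA = c(-233.781, -103.561, 153.383),
  colB = c(-84.087, -347.382, 4036.636),
  colC = c(-3.141, 2900.866, 353.479),
  colG = c(3740.916, 357.071, -42.736)
)
cols <- c("colA", "colB", "colC", "colG")

# Highest value per row: pmax() is vectorized across the selected columns.
DT[, high1 := do.call(pmax, .SD), .SDcols = cols]

# Second-highest: in each selected column, replace the cells that equal the
# row maximum with -Inf, then take the row-wise maximum again.
DT[, high2 := do.call(pmax, lapply(.SD, function(x) replace(x, x == high1, -Inf))),
   .SDcols = cols]

# Plain vectorized subtraction; no by = 1:nrow(DT) is needed.
DT[, difference := high1 - high2]
```

The lapply() step still creates temporary copies of the four selected columns, but there is no per-row function call and no by-row grouping.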