Average across duplicate rows only if the values in a certain column have the same sign (conditional)
I have data like this:
gene_id logFC logCPM LR PValue FDR
FBgn0000422 -1.875410209 4.429477429 25.16243497 5.27E-07 9.46E-05
FBgn0000422 1.262578335 4.429477429 11.65196417 0.000641348 0.022693702
FBgn0000422 -1.55793362 4.429477429 18.01707407 2.19E-05 0.00235694
FBgn0000565 -1.225082505 6.984450503 22.91546921 1.69E-06 0.000232455
FBgn0000565 -0.989958212 6.984450503 15.45759475 8.44E-05 0.006343374
FBgn0000565 -0.947467121 6.984450503 14.06298678 0.000176789 0.010290503
FBgn0001257 -1.135767061 6.745553159 33.67172953 6.52E-09 2.83E-06
FBgn0001257 -0.806003432 6.745553159 17.36036853 3.09E-05 0.003015214
FBgn0001257 -0.90371115 6.745553159 21.8449115 2.96E-06 0.000523406
FBgn0001291 -0.850144165 5.096971424 42.18504599 8.30E-11 8.08E-08
FBgn0001291 -0.892576562 5.096971424 47.27263627 6.18E-12 2.08E-08
FBgn0001291 -0.629617901 5.096971424 24.12565834 9.02E-07 0.000195886
FBgn0001301 -0.72615833 3.849906562 20.61723199 5.61E-06 0.000634277
FBgn0001301 -0.647614044 3.849906562 16.55276488 4.73E-05 0.004244782
FBgn0001301 -0.700985769 3.849906562 19.62582463 9.42E-06 0.001242629
FBgn0002719 0.39714033 8.153175244 9.467307643 0.002091661 0.045180557
FBgn0002719 -0.566665823 8.153175244 19.77575512 8.71E-06 0.001137708
FBgn0002719 0.509820318 8.153175244 15.96243465 6.46E-05 0.005084696
Each gene_id has 3 replicates, and I want to average across them. I can do this with plyr using the following code:
library(plyr)
AvL_univ_DOD_AVG <- ddply(AvL_univ_DOD, .(gene_id), colwise(mean, c("logFC", "logCPM", "LR", "PValue", "FDR")))
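However, I only want to do this if the three values in "logFC" have the same sign within a gene_id (all negative or all positive).
I don't need to keep the genes that don't meet this criterion.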
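How about filtering out, before using plyr, the rows whose gene_id is neither all negative nor all positive in the logFC column?
For example, with data.table: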
library(data.table)
library(plyr)
AvL_univ_DOD <- data.table(AvL_univ_DOD)
# flag positive logFC values (note: a logFC of exactly 0 counts as non-positive here)
AvL_univ_DOD[, sign := logFC > 0]
# count how many duplicates you have for each gene_id
AvL_univ_DOD[, number_of_duplicates := .N, by = gene_id]
# count how many positives you have for each gene_id
AvL_univ_DOD[, number_of_pos := sum(sign), by = gene_id]
# keep only cases where you have all positives or all negatives
AvL_univ_DOD2 <- AvL_univ_DOD[number_of_pos == 0 | number_of_pos == number_of_duplicates]
# apply plyr
AvL_univ_DOD_AVG <- ddply(AvL_univ_DOD2, .(gene_id), colwise(mean, c("logFC", "logCPM", "LR", "PValue", "FDR")))
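If you would rather not depend on plyr or data.table, the same filter-then-average idea can be sketched in base R. This is only an illustration: the toy data frame `df` is a rounded subset of the rows in the question, and the names `df`, `same_sign` and `avg` are made up for this example. Unlike the snippet above, a logFC of exactly 0 is treated here as matching neither sign.

```r
# Toy data: a rounded subset of the rows from the question (hypothetical object name `df`)
df <- data.frame(
  gene_id = rep(c("FBgn0000422", "FBgn0000565", "FBgn0002719"), each = 3),
  logFC   = c(-1.875,  1.263, -1.558,
              -1.225, -0.990, -0.947,
               0.397, -0.567,  0.510),
  logCPM  = rep(c(4.429, 6.984, 8.153), each = 3)
)

# For each row, TRUE if every logFC value of its gene_id is positive,
# or every one is negative. ave() recycles the per-group result (coerced
# to 1/0) back onto the rows of that group.
same_sign <- ave(df$logFC, df$gene_id,
                 FUN = function(x) all(x > 0) || all(x < 0)) == 1

# Average the numeric columns over the genes that passed the filter
avg <- aggregate(cbind(logFC, logCPM) ~ gene_id,
                 data = df[same_sign, ], FUN = mean)
print(avg)
```

In this subset only FBgn0000565 passes the filter, because the other two genes mix positive and negative logFC values. For the full data you would list all five numeric columns inside `cbind()`.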