Cumulative Calculations (e.g. cumulative correlation) with data.table in R
In R, I have a data.table with two measurements, red and green, and I would like to calculate their cumulative correlation.
library(data.table)
DT <- data.table(red = c(1, 2, 3, 4, 5, 6.5, 7.6, 8.7),
green = c(2, 4, 6, 8, 10, 12, 14, 16),
id = 1:8)
How can I get the following output in a single data.table command?
...
> DT[1:5, cor(red, green)]
[1] 1 # should go into row 5
> DT[1:6, cor(red, green)]
[1] 0.9970501 # should go into row 6, and so on ...
> DT[1:7, cor(red, green)]
[1] 0.9976889
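Edit:
I know this can be solved with a loop, but my data.table has about 1 million rows split into smaller chunks, so a loop is rather slow and I thought there might be other options.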
How about creating a cumcor function?
library(data.table)
DT <- data.table(red = c(1, 2, 3, 4, 5, 6.5, 7.6, 8.7),
green = c(2, 4, 6, 8, 10, 12, 14, 16),
id = 1:8)
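cumcor <- function(x, y, start = 5, ...) {
  # Pad the rows before `start` with NA, then correlate each prefix 1..k.
  # Extra arguments in ... are forwarded to cor(), not to sapply().
  c(rep(NA, start - 1), sapply(start:length(x), function(k) cor(x[1:k], y[1:k], ...)))
}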
DT[, list(red, green, cumcor = cumcor(red, green))]
red green cumcor
1: 1.0 2 NA
2: 2.0 4 NA
3: 3.0 6 NA
4: 4.0 8 NA
5: 5.0 10 1.0000000
6: 6.5 12 0.9970501
7: 7.6 14 0.9976889
8: 8.7 16 0.9983762
Note that the cumcor function above needs more input checking at the start (x and y have the same length, start is greater than 0, and so on).
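The checks mentioned above could look roughly like the following. This is only a sketch: the name cumcor_checked is hypothetical, and requiring start >= 2 (rather than just start > 0) is an assumption made here because a correlation over a single point is undefined.

```r
# Hypothetical variant of cumcor with basic input validation.
cumcor_checked <- function(x, y, start = 5, ...) {
  stopifnot(
    is.numeric(x), is.numeric(y),
    length(x) == length(y),        # both series must line up row by row
    length(start) == 1, start >= 2 # assumption: need at least two points
  )
  n <- length(x)
  # Not enough rows to reach `start`: return all NA instead of counting down.
  if (start > n) return(rep(NA_real_, n))
  c(rep(NA_real_, start - 1),
    vapply(start:n, function(k) cor(x[1:k], y[1:k], ...), numeric(1)))
}

red   <- c(1, 2, 3, 4, 5, 6.5, 7.6, 8.7)
green <- c(2, 4, 6, 8, 10, 12, 14, 16)
res   <- cumcor_checked(red, green)
```

Using vapply instead of sapply guarantees a numeric vector even in edge cases; the explicit start > n branch avoids start:length(x) silently running backwards.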
Building on my answer to a similar question about cumulative variance, you can compute the cumulative covariance:
library(dplyr) # for cummean
cum_cov <- function(x, y){
n <- 1:length(x)
res <- cumsum(x*y) - cummean(x)*cumsum(y) - cummean(y)*cumsum(x) + n*cummean(x)*cummean(y)
res / (n-1)
}
cum_var <- function(x){# copy-pasted from previous answer
n <- 1:length(x)
(cumsum(x^2) - n*cummean(x)^2) / (n-1)
}
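If you would rather not load dplyr just for cummean, the same quantity can be sketched in base R. The names cummean_base and cum_cov_base are hypothetical. The four-term expression above collapses algebraically to sum(x*y) - n*mean(x)*mean(y), because cummean(x)*cumsum(y) and cummean(y)*cumsum(x) both equal n*mean(x)*mean(y); the sketch uses that shorter form and compares it with cov() on each prefix.

```r
# Base R running mean: cumulative sum divided by the running count.
cummean_base <- function(x) cumsum(x) / seq_along(x)

# Cumulative sample covariance via sum(x*y) - n * mean(x) * mean(y).
cum_cov_base <- function(x, y) {
  n  <- seq_along(x)
  mx <- cummean_base(x)
  my <- cummean_base(y)
  (cumsum(x * y) - n * mx * my) / (n - 1)  # first element is 0/0 = NaN
}

red   <- c(1, 2, 3, 4, 5, 6.5, 7.6, 8.7)
green <- c(2, 4, 6, 8, 10, 12, 14, 16)

# Reference values computed the slow way, one prefix at a time.
direct <- sapply(2:8, function(k) cov(red[1:k], green[1:k]))
fast   <- cum_cov_base(red, green)[2:8]
agree  <- isTRUE(all.equal(fast, direct))
```

One caveat of this style of formula: it subtracts two potentially large sums, so it can lose precision when the mean is large relative to the spread of the data.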
Then the cumulative correlation is:
cum_cor <- function(x, y) cum_cov(x, y)/sqrt(cum_var(x)*cum_var(y))
DT[, cumcor:=cum_cor(red, green),]
red green id cumcor
1: 1.0 2 1 NaN
2: 2.0 4 2 1.0000000
3: 3.0 6 3 1.0000000
4: 4.0 8 4 1.0000000
5: 5.0 10 5 1.0000000
6: 6.5 12 6 0.9970501
7: 7.6 14 7 0.9976889
8: 8.7 16 8 0.9983762
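Since the question mentions that the table is split into smaller chunks, the vectorised function can presumably be applied per chunk with by. The sketch below is an assumption about how the chunks are identified: the column name chunk and the table DT2 are made up for illustration, and cum_cor_base is a compact base R restatement of cum_cor (the n - 1 factors cancel in the ratio).

```r
library(data.table)

# Cumulative correlation from running sums; base R only.
cum_cor_base <- function(x, y) {
  n   <- seq_along(x)
  mx  <- cumsum(x) / n
  my  <- cumsum(y) / n
  sxy <- cumsum(x * y) - n * mx * my  # running cross-product of deviations
  sxx <- cumsum(x^2)   - n * mx^2     # running sum of squares for x
  syy <- cumsum(y^2)   - n * my^2     # running sum of squares for y
  sxy / sqrt(sxx * syy)               # first element is 0/0 = NaN
}

# Hypothetical data: the question's rows split into two chunks of four.
DT2 <- data.table(chunk = rep(c("a", "b"), each = 4),
                  red   = c(1, 2, 3, 4, 5, 6.5, 7.6, 8.7),
                  green = c(2, 4, 6, 8, 10, 12, 14, 16))

# The cumulative correlation restarts at the first row of every chunk.
DT2[, cumcor := cum_cor_base(red, green), by = chunk]
```

With by, each group is passed to the function separately, so no explicit loop over chunks is needed.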
Hopefully this is fast enough:
x <- rnorm(1e6)
y <- rnorm(1e6)+x
system.time(cum_cor(x, y))
# user system elapsed
# 0.319 0.020 0.339