How to find linear regression on all combinations of columns using data.table
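I am trying to fit a linear regression, by group, for every possible combination of variables in the iris dataset. Since this is a toy example, it is easy to run the regression on each pair of variables separately and concatenate the results. However, for a data.table with a large number of columns, doing this by hand for every combination quickly becomes impractical.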
library(data.table)
dt = copy(iris)
setDT(dt)[, .(model1 = lm(Sepal.Length ~ Petal.Width, .SD)$coeff[2], model2 = lm(Petal.Width ~ Sepal.Length, .SD)$coeff[2]), by = Species]
Species model1 model2
1: setosa 0.9301727 0.08314444
2: versicolor 1.4263647 0.20935719
3: virginica 0.6508306 0.12141646
setDT(dt)[, .(model1 = lm(Sepal.Width ~ Petal.Length, .SD)$coeff[2], model2 = lm(Petal.Length ~ Sepal.Width, .SD)$coeff[2]), by = Species]
Species model1 model2
1: setosa 0.3878739 0.0814112
2: versicolor 0.3743068 0.8393782
3: virginica 0.2343482 0.6863153
setDT(dt)[, .(model1 = lm(Sepal.Width ~ Sepal.Length, .SD)$coeff[2], model2 = lm(Sepal.Length ~ Sepal.Width, .SD)$coeff[2]), by = Species]
Species model1 model2
1: setosa 0.7985283 0.6904897
2: versicolor 0.3197193 0.8650777
3: virginica 0.2318905 0.9015345
setDT(dt)[, .(model1 = lm(Petal.Width ~ Petal.Length, .SD)$coeff[2], model2 = lm(Petal.Length ~ Petal.Width, .SD)$coeff[2]), by = Species]
Species model1 model2
1: setosa 0.2012451 0.5464903
2: versicolor 0.3310536 1.8693247
3: virginica 0.1602970 0.6472593
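Rather than running the linear regression separately for each pair of variables, is there an easy way to do this with data.table? My desired output is as follows: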
Species Variable1 Variable2 model1 model2
setosa Sepal.Length Petal.Width 0.9301727 0.08314444
versicolor Sepal.Length Petal.Width 1.4263647 0.20935719
virginica Sepal.Length Petal.Width 0.6508306 0.12141646
setosa Sepal.Width Petal.Length 0.3878739 0.0814112
versicolor Sepal.Width Petal.Length 0.3743068 0.8393782
virginica Sepal.Width Petal.Length 0.2343482 0.6863153
setosa Sepal.Width Sepal.Length 0.7985283 0.6904897
versicolor Sepal.Width Sepal.Length 0.3197193 0.8650777
virginica Sepal.Width Sepal.Length 0.2318905 0.9015345
setosa Petal.Width Petal.Length 0.2012451 0.5464903
versicolor Petal.Width Petal.Length 0.3310536 1.8693247
virginica Petal.Width Petal.Length 0.1602970 0.6472593
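We can use combn on the column names of 'iris' other than 'Species' to create a list of formulas with reformulate, then loop over that list within the data grouped by 'Species', apply lm, and extract the coefficients. Because $coeff returns both the intercept and the slope, each species gets two rows in the result: the intercept first, then the slope.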
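As a small illustration of the two building blocks on their own, the sketch below shows what combn and reformulate return before any model is fitted. The names vars, pairs and f are illustrative only and do not appear in the answer's code.

```r
# All unordered pairs of the numeric column names (Species is column 5)
vars <- names(iris)[-5]
pairs <- combn(vars, 2, simplify = FALSE)
length(pairs)          # choose(4, 2) = 6 pairs

# reformulate(termlabels, response) builds a formula from character strings
f <- reformulate(pairs[[1]][1], response = pairs[[1]][2])
f                      # Sepal.Width ~ Sepal.Length
class(f)               # "formula"
```

Each pair is turned into a formula with the second name as the response and the first as the predictor, which is why the column headers in the result read like Sepal.Width ~ Sepal.Length.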
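library(data.table)
# one formula per pair of numeric columns, e.g. Sepal.Width ~ Sepal.Length
lst1 <- combn(names(iris)[-5], 2, FUN =
function(x) reformulate(x[1], x[2]), simplify = FALSE)
dt = copy(iris)
# fit every formula within each Species; $coeff holds intercept and slope
out <- setDT(dt)[, lapply(lst1, function(fmla)
lm(fmla, .SD)$coeff),
by = Species]
# label the result columns with the formula each one came from
setnames(out, -1, sapply(lst1, deparse))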
-output
out
Species Sepal.Width ~ Sepal.Length Petal.Length ~ Sepal.Length Petal.Width ~ Sepal.Length Petal.Length ~ Sepal.Width Petal.Width ~ Sepal.Width
1: setosa -0.5694327 0.8030518 -0.17022108 1.1829224 0.02417907
2: setosa 0.7985283 0.1316317 0.08314444 0.0814112 0.06470856
3: versicolor 0.8721460 0.1851155 0.08325571 1.9349223 0.16690570
4: versicolor 0.3197193 0.6864698 0.20935719 0.8393782 0.41844560
5: virginica 1.4463054 0.6104680 1.22610837 3.5108983 0.66405950
6: virginica 0.2318905 0.7500808 0.12141646 0.6863153 0.45794906
Petal.Width ~ Petal.Length
1: -0.04822033
2: 0.20124509
3: -0.08428835
4: 0.33105360
5: 1.13603130
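The result above is in wide format with one column per formula, whereas the question asks for one row per species and variable pair, with the slope in both directions. A minimal sketch of one way to get that long layout follows. It assumes 'Species' is the only non-numeric column, and the helper slope and the names pairs and res are hypothetical, introduced only for illustration.

```r
library(data.table)

dt <- as.data.table(iris)
# every unordered pair of numeric columns (assumes Species is the only non-numeric one)
pairs <- combn(setdiff(names(dt), "Species"), 2, simplify = FALSE)

# hypothetical helper: slope of the simple regression y ~ x fitted on data d
slope <- function(y, x, d) {
  unname(coef(lm(reformulate(x, response = y), data = d))[2])
}

# one small table per pair, stacked into a single long table
res <- rbindlist(lapply(pairs, function(p) {
  dt[, .(Variable1 = p[1], Variable2 = p[2],
         model1 = slope(p[1], p[2], .SD),   # Variable1 ~ Variable2
         model2 = slope(p[2], p[1], .SD)),  # Variable2 ~ Variable1
     by = Species, .SDcols = p]
}))
res
```

This yields 18 rows (6 pairs times 3 species). The pairs follow the column order produced by combn, so for some pairs Variable1 and Variable2 (and with them model1 and model2) appear swapped relative to the table in the question.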