Apply linear models to subsets of data defined by several ranges
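I have a specific data-wrangling task in R: I need to divide a predictor variable into overlapping "neighborhoods" (or ranges) and fit a simple (bivariate) linear model to each "neighborhood" in order to obtain the fitted value associated with the middle predictor value in that "neighborhood". I approach the task as follows:
- I create a dummy variable for each "neighborhood" (one column each)
- I apply the lm() function to subsets of the data, i.e. the rows where the dummy variable equals 1, excluding the rows where it equals 0
- I extract the fitted value associated with the middle predictor value in each "neighborhood"
- I end up with a vector of fitted values whose length equals the number of "neighborhoods"
My approach works when the number of overlapping neighborhoods is small, but it becomes rather verbose when that number is large. Here is a reproducible example (using mock data that I created; in this example the number of neighborhoods is 7):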
# Mock data
library(dplyr)  # provides tibble(), mutate(), between(), filter() and %>%
data <- tibble(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = seq.int(0, 9))
# Create dummies
new_data <- data %>%
  mutate(neighborhood1 = ifelse(between(x, 0, 2.5), 1, 0),
         neighborhood2 = ifelse(between(x, 0.5, 3.5), 1, 0),
         neighborhood3 = ifelse(between(x, 1.5, 4.5), 1, 0),
         neighborhood4 = ifelse(between(x, 2.5, 5.5), 1, 0),
         neighborhood5 = ifelse(between(x, 3.5, 6.5), 1, 0),
         neighborhood6 = ifelse(between(x, 4.5, 7.5), 1, 0),
         neighborhood7 = ifelse(between(x, 5.5, 8.5), 1, 0))
# Run regression model on subsets of data
# Obtain fitted value Y at the middle X
# (in this example there are three obs per neighborhood, so we want the middle fitted value, i.e. the 2nd)
Y_hat_1 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood1 == 1))[["fitted.values"]][[2]]
Y_hat_2 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood2 == 1))[["fitted.values"]][[2]]
Y_hat_3 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood3 == 1))[["fitted.values"]][[2]]
Y_hat_4 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood4 == 1))[["fitted.values"]][[2]]
Y_hat_5 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood5 == 1))[["fitted.values"]][[2]]
Y_hat_6 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood6 == 1))[["fitted.values"]][[2]]
Y_hat_7 <- lm(y ~ x,
              data = filter(.data = new_data,
                            neighborhood7 == 1))[["fitted.values"]][[2]]
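I have been wondering whether there is a more efficient way to handle this task (perhaps with nested data frames, loops, or any dplyr or data.table functions that would make it easier). Any suggestions would be very helpful. Apologies for the rather long question; I was trying to be specific. Thank you very much!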
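The fact that you want observations where x equals a breakpoint to fall into both adjacent categories makes this a bit more complicated, but the following should work and be very fast. In general, if you are repeating a task across your data, it is easier to have the data in long format rather than spread across columns.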
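library(data.table)
data <- data.table(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = seq.int(0, 9))
# One row per neighborhood, giving its range
lookup <- data.table(start = c(0, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5), end = c(2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5), neighborhood = 1:7)
# Not just an interval join: it also expands the data to one row per (observation, neighborhood) pair.
# In a non-equi join the result column named x would hold the values of start,
# so x.x is used to keep the observed x values of data.
new <- data[lookup, .(neighborhood, x = x.x, y), on = .(x >= start, x <= end)]
# Fitted value at the 2nd (middle) observation of each neighborhood
new[, fitted(lm(y ~ x, data = .SD))[2], by = "neighborhood"]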
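The same lookup-table idea can also be sketched in base R, without hard-coding the position of the middle observation. This is only an illustration under stated assumptions: middle_fit() is a hypothetical helper that is not part of the answer above, set.seed(1) is added purely for reproducibility, and the "middle" observation is taken to be the one at position ceiling(n / 2) after sorting by x.

```r
# Hypothetical helper (not from the original post): fit y ~ x on the rows of
# df whose x lies in [start, end] and return the fitted value at the middle row.
middle_fit <- function(df, start, end) {
  sub <- df[df$x >= start & df$x <= end, ]
  sub <- sub[order(sub$x), ]          # make sure rows are ordered by x
  fit <- lm(y ~ x, data = sub)
  mid <- ceiling(nrow(sub) / 2)       # assumed definition of "middle"
  unname(fitted(fit)[mid])
}

set.seed(1)  # assumption: added only so the sketch is reproducible
data <- data.frame(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = 0:9)

# Same ranges as in the example: one row per neighborhood
lookup <- data.frame(start = c(0, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5),
                     end   = c(2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5))

# One fitted value per neighborhood
Y_hat <- mapply(middle_fit, start = lookup$start, end = lookup$end,
                MoreArgs = list(df = data))
```

Y_hat is then a numeric vector with one element per row of lookup, so adding neighborhoods only means adding rows to the lookup table.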
Henrik's solution using by = .EACHI:
library(data.table)
data <- data.table(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = seq.int(0, 9))
lookup <- data.table(start = c(0, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5), end = c(2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5), neighborhood = 1:7)
# by = .EACHI evaluates j once per row of lookup; the x. prefix in x.x refers explicitly to the x column of data
data[lookup, list(.GRP, fitted(lm(y ~ x.x, data = .SD))[2]), on = c("x>=start", "x<=end"), by = .EACHI]
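When the number of neighborhoods is large, the lookup table itself does not have to be typed by hand. A minimal sketch, assuming the ranges follow the pattern of the example above (centers 1 to 7, a half-width of 1.5, and a lower bound clipped at 0); centers and half_width are names introduced here for illustration only:

```r
# Assumed pattern: each neighborhood is centered on an x value and extends
# half_width to either side, with the lower bound clipped at 0.
centers    <- 1:7
half_width <- 1.5

lookup <- data.frame(
  start        = pmax(centers - half_width, 0),  # 0, 0.5, 1.5, ...
  end          = centers + half_width,           # 2.5, 3.5, 4.5, ...
  neighborhood = seq_along(centers)
)
```

The resulting data frame has the same start, end and neighborhood columns as the hand-written lookup, and could be converted with setDT() before being used in a join.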