Apply linear models to subsets of data defined by several ranges

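I have a specific data-wrangling task in R: I need to partition a predictor variable into overlapping "neighborhoods" (or ranges) and fit a linear model (simple/bivariate) to each "neighborhood", to obtain the fitted value associated with the middle predictor value in that "neighborhood". I approach the task as follows: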

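  1. I create a dummy variable for each "neighborhood" (one per column)
  2. I apply the lm() function to a subset of the data, i.e. the rows where the dummy variable is 1, excluding those where it is 0
  3. I extract the fitted value associated with the middle predictor value in each "neighborhood"
  4. I end up with a vector of fitted values whose length equals the number of "neighborhoods"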

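My approach works when the number of overlapping neighborhoods is small. When the number is large, it becomes quite verbose. Here is a reproducible example (using mock data I created; in this example, the number of neighborhoods is 7):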

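# Packages used below: tibble() from tibble; mutate(), between(), filter() and %>% from dplyr
library(tibble)
library(dplyr)
# Mock data
data <- tibble(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = seq.int(0,9))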
# Create dummies
new_data <- data %>% 
  mutate(neighborhood1 = ifelse(between(x, 0, 2.5), 1, 0),
         neighborhood2 = ifelse(between(x, 0.5, 3.5), 1, 0),
         neighborhood3 = ifelse(between(x, 1.5, 4.5), 1, 0),
         neighborhood4 = ifelse(between(x, 2.5, 5.5), 1, 0),
         neighborhood5 = ifelse(between(x, 3.5, 6.5), 1, 0),
         neighborhood6 = ifelse(between(x, 4.5, 7.5), 1, 0),
         neighborhood7 = ifelse(between(x, 5.5, 8.5), 1, 0))
# Run regression model on subsets of data 
# Obtain fitted value Y at the middle X 
# (in this example there are three obs per neighborhood and so we want the middle fitted value # 2)
Y_hat_1 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood1 == 1))[["fitted.values"]][[2]]
Y_hat_2 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood2 == 1))[["fitted.values"]][[2]]
Y_hat_3 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood3 == 1))[["fitted.values"]][[2]]
Y_hat_4 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood4 == 1))[["fitted.values"]][[2]]
Y_hat_5 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood5 == 1))[["fitted.values"]][[2]]
Y_hat_6 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood6 == 1))[["fitted.values"]][[2]]
Y_hat_7 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood7 == 1))[["fitted.values"]][[2]]

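I have been wondering whether there is a more efficient way to handle this task (perhaps nested data frames, a loop, or any dplyr/data.table functions that would make it easier). Any suggestions would be very helpful, thanks a lot! And sorry for the rather lengthy question; I was trying to be specific.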

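The fact that you want observations whose x equals a breakpoint to fall into both adjacent categories makes this more complicated, but this should work and be quite fast. In general, if you are repeating a task across your data, it is easier to have the data in long format rather than spread across columns.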

library(data.table)
data <- data.table(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = seq.int(0,9))

setDT(data) 
lookup <- data.table(start=c(0,  0.5, 1.5, 2.5, 3.5, 4.5, 5.5), end=c(2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5),neighborhood=1:7)
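# Not just an interval join: it also expands the data (one row per observation/neighborhood pair).
# In a non-equi join the result's x column holds lookup's start value, so keep the original x via x.x
new <- data[lookup, .(neighborhood, y, x = x.x), on=c("x>=start","x<=end")]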

new[, fitted(lm(y~x,data=.SD))[2],by="neighborhood"]

Henrik's solution, using by = .EACHI:

library(data.table)
library(tibble)  # needed for tibble(); setDT() below converts it to a data.table
data <- tibble(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = seq.int(0,9))

setDT(data) 
lookup <- data.table(start=c(0,  0.5, 1.5, 2.5, 3.5, 4.5, 5.5), end=c(2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5),neighborhood=1:7)
data[lookup, list(.GRP,fitted(lm(y~x.x,data=.SD))[2]),on=c("x>=start","x<=end"),by=.EACHI]
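For comparison with the join-based answers, here is a minimal base R sketch of the same idea that needs no extra packages. It is not from the original posts: the helper name middle_fit, the seed, and the choice of ceiling(n / 2) as the "middle" row are assumptions made for illustration. It loops over the rows of the lookup table, fits y ~ x on each neighborhood, and returns the fitted value at the middle observation.

```r
# Mock data as in the question; the seed is an assumption, added only for reproducibility
set.seed(1)
data <- data.frame(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = 0:9)

# One row per neighborhood, same ranges as in the answers
lookup <- data.frame(start = c(0,   0.5, 1.5, 2.5, 3.5, 4.5, 5.5),
                     end   = c(2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5))

# Hypothetical helper: fit y ~ x on rows with start <= x <= end and
# return the fitted value at the middle row (by position, after sorting on x)
middle_fit <- function(start, end, d) {
  sub <- d[d$x >= start & d$x <= end, ]
  sub <- sub[order(sub$x), ]
  fit <- lm(y ~ x, data = sub)
  unname(fitted(fit)[ceiling(nrow(sub) / 2)])
}

# One fitted value per neighborhood
Y_hat <- mapply(middle_fit, lookup$start, lookup$end, MoreArgs = list(d = data))
```

Because the middle row is picked by position rather than hard-coded as [2], the sketch does not depend on every neighborhood containing exactly three observations. For neighborhoods with an even number of rows it takes the lower of the two central rows, which may or may not be what you want.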