Apply linear models to subsets of data defined by several ranges

Question

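I have a specific data wrangling task in R: I need to partition a predictor variable into overlapping "neighborhoods" (or ranges) and fit a simple (bivariate) linear model to each neighborhood, in order to obtain the fitted value associated with the middle predictor value in that neighborhood. I approach the task as follows: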

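I create a dummy variable for each "neighborhood" (one per column)
I apply the lm() function to a subset of the data, i.e. the rows where the dummy variable is 1, excluding the rows where it is 0
I take the fitted value at the middle x of each "neighborhood"
I end up with a vector of fitted values whose length equals the number of "neighborhoods"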

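My approach works when the number of overlapping neighborhoods is small. When that number is large, it becomes quite verbose. Here is a reproducible example (using mock data I created; in this example the number of neighborhoods is 7):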

# Mock data
library(dplyr)  # attaches tibble(), mutate(), between(), filter() and %>%
data <- tibble(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = seq.int(0,9))
# Create dummies
new_data <- data %>% 
  mutate(neighborhood1 = ifelse(between(x, 0, 2.5), 1, 0),
         neighborhood2 = ifelse(between(x, 0.5, 3.5), 1, 0),
         neighborhood3 = ifelse(between(x, 1.5, 4.5), 1, 0),
         neighborhood4 = ifelse(between(x, 2.5, 5.5), 1, 0),
         neighborhood5 = ifelse(between(x, 3.5, 6.5), 1, 0),
         neighborhood6 = ifelse(between(x, 4.5, 7.5), 1, 0),
         neighborhood7 = ifelse(between(x, 5.5, 8.5), 1, 0))
# Run regression model on subsets of data 
# Obtain fitted value Y at the middle X 
# (in this example there are three obs per neighborhood and so we want the middle fitted value # 2)
Y_hat_1 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood1 == 1))[["fitted.values"]][[2]]
Y_hat_2 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood2 == 1))[["fitted.values"]][[2]]
Y_hat_3 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood3 == 1))[["fitted.values"]][[2]]
Y_hat_4 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood4 == 1))[["fitted.values"]][[2]]
Y_hat_5 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood5 == 1))[["fitted.values"]][[2]]
Y_hat_6 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood6 == 1))[["fitted.values"]][[2]]
Y_hat_7 <- lm(y ~ x, 
                    data = filter(.data = new_data, 
                                  neighborhood7 == 1))[["fitted.values"]][[2]]

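I have been wondering whether there is a more efficient way to handle this task (perhaps using nested data frames, loops, or any dplyr or data.table function that would make it easier). Any suggestion would be very helpful. Apologies for the rather lengthy question; I was trying to be specific. Thanks a lot!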

Answer 1

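The fact that you want observations whose x equals a breakpoint to fall into two adjacent categories makes this more complicated, but this should work and be very fast. In general, if you are repeating a task across your data, it is easier to have the data in long format rather than spread across columns.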

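library(data.table)
data <- data.table(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = seq.int(0,9))

# One row per neighborhood: lower bound, upper bound, id
lookup <- data.table(start=c(0,  0.5, 1.5, 2.5, 3.5, 4.5, 5.5), end=c(2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5),neighborhood=1:7)

# Not just an interval join, but also expands the data: a row appears once per
# neighborhood it falls into. In a non-equi join the result column named x would
# hold `start`, so select x.x (the x from `data`) explicitly.
new <- data[lookup, .(neighborhood, x = x.x, y), on=c("x>=start","x<=end")]

# Fit one model per neighborhood and keep the second (middle) fitted value
new[, fitted(lm(y~x,data=.SD))[2],by="neighborhood"]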
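If a join feels like too much machinery, the same per-neighborhood fit can be sketched in base R by looping over the bounds. This is only a minimal sketch: `fit_middle` is a hypothetical helper name, the seed is added purely for reproducibility, and the bounds are copied from the `lookup` table.

```r
# Mock data as in the question; set.seed() is an added assumption
set.seed(1)
data <- data.frame(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = 0:9)

# Neighborhood bounds, same values as in `lookup`
starts <- c(0, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5)
ends   <- c(2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5)

# Hypothetical helper: fit y ~ x on the rows with lo <= x <= hi and
# return the fitted value of the middle observation
fit_middle <- function(lo, hi, d) {
  sub <- d[d$x >= lo & d$x <= hi, ]
  sub <- sub[order(sub$x), ]
  mid <- ceiling(nrow(sub) / 2)  # row 2 when a neighborhood has three rows
  unname(fitted(lm(y ~ x, data = sub))[mid])
}

# One fitted value per neighborhood
Y_hat <- mapply(fit_middle, starts, ends, MoreArgs = list(d = data))
Y_hat
```

`Y_hat` is a plain numeric vector with one element per neighborhood, which matches the shape the question asks for. For many neighborhoods the join-based version is likely to scale better, since this loop subsets the data once per range.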

Henrik's solution, using by = .EACHI:

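library(data.table)
# data.frame() instead of tibble(), so no package beyond data.table is needed
data <- data.frame(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = seq.int(0,9))

setDT(data) # convert to a data.table by reference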
lookup <- data.table(start=c(0,  0.5, 1.5, 2.5, 3.5, 4.5, 5.5), end=c(2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5),neighborhood=1:7)
data[lookup, list(.GRP,fitted(lm(y~x.x,data=.SD))[2]),on=c("x>=start","x<=end"),by=.EACHI]
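To illustrate the long-versus-wide point from the start of this answer, here is a rough sketch that starts from dummy columns like the ones in the question, reshapes them to long format with `melt()`, and then fits by group. The names `wide`, `long`, `member` and `res` are made up for the example, and the seed is an added assumption.

```r
library(data.table)

set.seed(1)  # added assumption, only for reproducibility
wide <- data.table(y = as.integer(rnorm(10, mean = 100, sd = 20)), x = 0:9)

# Rebuild the question's seven dummy columns from their bounds
starts <- c(0, 0.5, 1.5, 2.5, 3.5, 4.5, 5.5)
ends   <- c(2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5)
for (k in seq_along(starts)) {
  set(wide, j = paste0("neighborhood", k),
      value = as.integer(wide$x >= starts[k] & wide$x <= ends[k]))
}

# Wide -> long: one row per (observation, neighborhood) pair, members only
long <- melt(wide, id.vars = c("y", "x"),
             variable.name = "neighborhood", value.name = "member")
long <- long[member == 1L]

# One model per neighborhood; [2] picks the middle of the three fitted values
res <- long[, .(Y_hat = unname(fitted(lm(y ~ x, data = .SD))[2])),
            by = neighborhood]
res
```

Once the data is long, adding more neighborhoods only adds rows, so the fitting step does not change.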

statistics

r

dplyr

data.table

data-wrangling