R中的手动交互图线性回归

Manual interaction plot linear regression in R

我正在尝试使用对数转换的丰度数据(更适合)和其他一些变量来预测在不同月相(因子)期间看到的动物的平均丰度。最好的模型(最低 AIC)结果包括阶段和调查持续时间的相互作用以及云层覆盖(都是连续的):

LMoon<-lm(ln~Phase*Duration+Clouds, data=abund)

summary(LMoon)

Call:
lm(formula = ln ~ Phase * Duration + Clouds, data = abund)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.75416 -0.46311  0.09522  0.46591  1.85978 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       0.382031   0.876865   0.436 0.664125    
Phase2            2.130065   1.226305   1.737 0.085851 .  
Phase3            1.971060   1.818542   1.084 0.281351    
Phase4            0.608043   1.140122   0.533 0.595146    
Phase5            4.786674   1.151850   4.156 7.44e-05 ***
Phase6            0.958706   1.046831   0.916 0.362238    
Phase7            0.254711   3.425214   0.074 0.940888    
Phase8            0.865995   1.043916   0.830 0.409005    
Duration          0.069153   0.035407   1.953 0.053952 .  
Clouds           -0.004259   0.002401  -1.774 0.079494 .  
Phase2:Duration  -0.087843   0.047818  -1.837 0.069545 .  
Phase3:Duration  -0.089908   0.069652  -1.291 0.200109    
Phase4:Duration  -0.005424   0.046675  -0.116 0.907749    
Phase5:Duration  -0.172016   0.049369  -3.484 0.000768 ***
Phase6:Duration  -0.035597   0.041435  -0.859 0.392583    
Phase7:Duration   0.024084   0.176773   0.136 0.891939    
Phase8:Duration  -0.033424   0.042064  -0.795 0.428963    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7721 on 89 degrees of freedom
Multiple R-squared:  0.3368,    Adjusted R-squared:  0.2176 
F-statistic: 2.825 on 16 and 89 DF,  p-value: 0.0009894

现在,由于这种相互作用,我需要制作一个相互作用图(CI 在绘制 lsmeans 时太宽了)。 我尝试使用那里提到的不同功能,但其中 none 有效。 显然我需要手动计算和绘图,我是这样做的:

intercepts <- c(coef(LMoon)["(Intercept)"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase2"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase3"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase4"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase5"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase6"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase7"],
                coef(LMoon)["(Intercept)"] + coef(LMoon)["Phase8"])

lines.df <- data.frame(intercepts = intercepts,
                       slopes = c(coef(LMoon)["Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase2:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase3:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase4:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase5:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase6:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase7:Duration"],
                                  coef(LMoon)["Duration"]+coef(LMoon)["Phase8:Duration"]),
                       Phase2 = levels(abund$Phase))

qplot(x = Duration, y = Sp2, color = Phase, data = abund) + 
  geom_abline(aes(intercept = intercepts, 
                  slope = slopes, 
                  color = Phase), data = lines.df)

我得到的图是错误的,因为 y 值是原始的真实比例,但线条是基于使用对数转换数据的 lm。

interaction plot abundance, duration, lunar phases

为了反向转换这个,有人告诉我,实际上我最终不会得到直线。 而不是使用 abline(),我应该创建一组例如覆盖持续时间数据范围的 100 个新 x 值,并使用系数计算您的预测 y 值。然后使用 lines() 绘制这些,它应该看起来像一条平滑的曲线。

这就是我迷路的地方。

所以我为调查持续时间范围(最小 15 最大 45 分钟)创建了一组新的 x 值: dur2 <- seq(from = 15, to = 45, length.out=100)

然后,一旦我得到这些值,我就应该使用我的 LM 的系数为每个 x 值获得预测的 y 值。之后,将 y 值反向转换为原始比例。然后使用 x 值和反向转换的 y 值将线条添加到图中。

现在如何准确得到预测值?我不能使用任何 pred type/function,我都试过了。它只是不适用于我的模型,所以手动是唯一的方法,但我不知道如何...

希望有人能帮我解决这个问题,我已经尝试了几个星期,现在已经绝望了,几乎要放弃了。

干杯!

PS 这里的数据:

> dput(subset(abund, Phase %in% c("Phase1", "Phase2")))

structure(list(Year = integer(0), Date = structure(integer(0), .Label = c("01/08/2009", 
"01/08/2016", "02/07/2019", "02/08/2009", "02/08/2012", "02/08/2016", 
"02/09/2007", "03/08/2007", "03/08/2009", "03/08/2014", "03/08/2015", 
"04/07/2019", "04/08/2009", "04/08/2013", "05/08/2009", "05/08/2014", 
"05/08/2015", "06/07/2008", "06/07/2019", "07/08/2009", "08/07/2010", 
"09/07/2010", "09/08/2015", "10/08/2009", "11/08/2009", "12/08/2009", 
"13/08/2009", "13/08/2014", "14/08/2009", "14/08/2012", "16/07/2006", 
"18/07/2009", "18/08/2015", "19/07/2011", "20/08/2009", "21/07/2011", 
"21/09/2009", "22/07/2011", "22/07/2016", "22/07/2017", "23/07/2007", 
"23/07/2016", "23/07/2017", "24/07/2017", "25/07/2007", "25/07/2010", 
"25/07/2017", "25/08/2016", "26/07/2010", "26/07/2011", "27/07/2006", 
"27/07/2011", "27/07/2012", "28/07/2016", "29/06/2019", "29/07/2005", 
"29/07/2009", "29/07/2010", "29/07/2016", "29/07/2019", "30/07/2005", 
"30/07/2007", "30/07/2016", "30/08/2005", "31/07/2005", "31/07/2009", 
"31/07/2014", "31/07/2016"), class = "factor"), NrSurvey = integer(0), 
    Duration = integer(0), Sp2 = integer(0), Phase = structure(integer(0), .Label = c("1", 
    "2", "3", "4", "5", "6", "7", "8"), class = "factor"), Clouds = integer(0), 
    Visibility = integer(0), ln = numeric(0)), row.names = integer(0), class = "data.frame")

使用predict获得预测值。不要手动计算。

使用 expand.grid() 生成一个数据框,其中包含您的 dur2 序列和您想要绘制的值的其他预测变量的所有组合。像这样:

prediction_data = expand.grid(
  Duration = dur2,
  Phase= unique(abund$Phase),
  Clouds = mean(abund$Clouds) # Hold Clouds constant at some value
)

# column names in prediction_data need to match those in the model formula
prediction_data$pred = predict(LMoon, newdata = prediction_data)
prediction_data$pred_orig = exp(prediction_data$pred)

# plot
ggplot(prediction_data, aes(x = Duration, y = pred_orig, color = Phase)) +
  geom_line() +
  geom_point(data = abund)

类似的东西应该有用。

另一个不错的选择是使用 broom::augment 生成预测。这也可以很容易地给出每个预测点的标准误差和残差。

library(broom)
prediction_data = augment(LMoon, newdata = prediction_data)