R GLM function omitting data
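I am building a logistic regression model to predict a binary outcome variable stored as a factor (yes/no), but I am running into a strange problem with missing data. Basically, I get a very different R-squared when I manually filter observations out of the data before running glm() than when I let glm() apply its own na.action. See the example code below: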
library(rcompanion)  # provides nagelkerke()
# Simulate a binary outcome and three predictors; var3 has 12 missing values
outcome <- ifelse(rnorm(100) <= 0.5, 0, 1)
var1 <- rnorm(100)
var2 <- rnorm(100)
var3 <- c(rnorm(88), rep(NA, 12))
df <- data.frame(outcome, var1, var2, var3)
df$outcome <- factor(df$outcome)
# Fit on the full data and let glm() handle the NAs through na.action
model_1 <- glm(outcome ~ ., data = df, family = "binomial")
nagelkerke(model_1)
Results for model_1:
$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                             0.160916
Cox and Snell (ML)                   0.192093
Nagelkerke (Cragg and Uhler)         0.261581
Now I filter out those cases beforehand and get a completely different R-squared:
# Drop the rows where var3 is missing before fitting (filter() is from dplyr)
df_clean <- dplyr::filter(df, !is.na(var3))
model_2 <- glm(outcome ~ ., data = df_clean, family = "binomial")
nagelkerke(model_2)
Results for model_2:
$Pseudo.R.squared.for.model.vs.null
                             Pseudo.R.squared
McFadden                            0.0110171
Cox and Snell (ML)                  0.0123142
Nagelkerke (Cragg and Uhler)        0.0182368
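Given that glm() defaults to na.action = na.omit (which I take to mean that cases with missing values are dropped), why does this happen? Isn't that essentially the same as filtering those cases out beforehand and then running the model?
I also tried setting na.action to "na.omit" and "na.exclude" explicitly and got the same output. Thanks for your help!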
You are correct: na.omit drops the rows with missing values and then fits your model. In fact, you should see the same output when you run summary(model_1) and summary(model_2).
However, the nagelkerke function you are using runs into a problem when one of the variables in the original dataset contains NA values. From its documentation:
The fitted model and the null model should be properly nested. That is, the terms of one need to be a subset of the other, and they should have the same set of observations. One issue arises when there are NA values in one variable but not another, and observations with NA are removed in the model fitting. The result may be fitted and null models with different sets of observations. Setting restrictNobs to TRUE ensures that only observations in the fit model are used in the null model. This appears to work for lm and some glm models, but causes the function to fail for other model object types.
If you set restrictNobs to TRUE, for example nagelkerke(model_1, restrictNobs = TRUE), you should see the same output for both models.
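To make the mechanism concrete, here is a rough base R sketch of what the documentation describes. It is not the rcompanion implementation: it only computes McFadden's pseudo R-squared by hand, once against a null model fitted on all 100 rows and once against a null model fitted on the 88 rows the full model actually used. The seed and every name in it (demo_df, fit_full, mcfadden_by_hand, and so on) are made up for illustration, so the numbers will not match the ones in the question.

```r
# Hypothetical reproduction in base R only; all object names are illustrative
# and do not come from the original post or from rcompanion.
set.seed(42)  # arbitrary seed, results will differ from the question's output

n <- 100
demo_df <- data.frame(
  outcome = factor(ifelse(rnorm(n) <= 0.5, 0, 1)),
  var1    = rnorm(n),
  var2    = rnorm(n),
  var3    = c(rnorm(88), rep(NA, 12))
)

# Full-data fit: glm() drops the 12 incomplete rows through na.action
fit_full <- glm(outcome ~ ., data = demo_df, family = "binomial")

# Pre-filtered fit: the same 88 rows, with the incomplete ones removed by hand
demo_clean <- demo_df[!is.na(demo_df$var3), ]
fit_clean  <- glm(outcome ~ ., data = demo_clean, family = "binomial")

# Both fits use the same observations and give the same coefficients
nobs(fit_full)
all.equal(coef(fit_full), coef(fit_clean))

# McFadden pseudo R-squared: 1 - logLik(fitted) / logLik(null)
mcfadden_by_hand <- function(fit, null) {
  1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
}

# Null model on all 100 rows: outcome itself has no NAs, so nothing is dropped
null_all_rows  <- glm(outcome ~ 1, data = demo_df, family = "binomial")
# Null model restricted to the 88 rows the fitted model actually used
null_same_rows <- glm(outcome ~ 1, data = demo_clean, family = "binomial")

r2_mismatched <- mcfadden_by_hand(fit_full, null_all_rows)   # 88 vs 100 rows
r2_matched    <- mcfadden_by_hand(fit_full, null_same_rows)  # 88 vs 88 rows
r2_clean      <- mcfadden_by_hand(fit_clean, null_same_rows)

c(mismatched = r2_mismatched, matched = r2_matched, clean = r2_clean)
```

The mismatched value comes out larger because the 100-row null model has a more negative log-likelihood than the 88-row one, which is the same direction as the gap between model_1 and model_2 in the question. The matched and clean values agree, which is presumably what restrictNobs = TRUE achieves for model_1.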