Properly Handle NA in Single Column when Running Several Regressions

Question

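I am trying to use lm to compute the mean of each column when a dummy variable is true. I have a data frame with three columns (Sepal.Length, Sepal.Width, and Dummy). When one of the columns contains NA, the entire row is excluded from every regression (even though I am in effect running two separate regressions), which leads to incorrect means. How can I run several regressions (without a for loop) so that a row is not dropped from all of them when only one column contains NA?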

# setup mydata
mydata <- structure(list(Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4), 
    Sepal.Width = c(NA, NA, 3.2, 3.1, 3.6, 3.9), Dummy = c(1, 
    1, 1, 0, 0, 0)), row.names = c(NA, 6L), class = "data.frame")

mydata

# Sepal.Length Sepal.Width Dummy
# 1          5.1          NA     1
# 2          4.9          NA     1
# 3          4.7         3.2     1
# 4          4.6         3.1     0
# 5          5.0         3.6     0
# 6          5.4         3.9     0

# reg Sepal.Length ~ Dummy, Sepal.Width ~ Dummy    
fit <- lm(data.matrix(mydata) ~ data.matrix(mydata["Dummy"]))

intercepts <- fit$coefficients[1,]
betas <- fit$coefficients[2,]

# calculate average when Dummy==1
intercepts + betas

# Sepal.Length  Sepal.Width        Dummy 
#         4.7          3.2          1.0 

# calculate average when Dummy==1 directly (does not match the lm result above)
library(dplyr)  # provides %>% and filter()
apply(data.matrix(mydata %>% filter(Dummy==1)), 2, mean, na.rm=TRUE)

# Sepal.Length  Sepal.Width        Dummy 
#         4.9          3.2          1.0
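
The mismatch is consistent with how lm builds its model frame: assuming the default na.action option (na.omit), any row with an NA in any variable of the formula is dropped before fitting, and with a matrix response that one reduced set of rows is shared by every response column. A minimal sketch that counts the rows each fit actually uses (the names fit_joint, n_joint, n_length and n_width are illustrative, not from the original post):

```r
# Same data as in the question
mydata <- data.frame(
  Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
  Sepal.Width  = c(NA, NA, 3.2, 3.1, 3.6, 3.9),
  Dummy        = c(1, 1, 1, 0, 0, 0)
)

# Matrix response: a single model frame is shared by both response columns,
# so rows 1 and 2 (NA in Sepal.Width) are dropped for Sepal.Length as well
fit_joint <- lm(cbind(Sepal.Length, Sepal.Width) ~ Dummy, data = mydata)
n_joint <- nrow(model.frame(fit_joint))

# Separate fits: each regression only drops rows that are NA in its own response
n_length <- nrow(model.frame(lm(Sepal.Length ~ Dummy, data = mydata)))
n_width  <- nrow(model.frame(lm(Sepal.Width ~ Dummy, data = mydata)))

c(joint = n_joint, Sepal.Length = n_length, Sepal.Width = n_width)
```

With this data the joint fit should use 4 rows, while the separate Sepal.Length fit keeps all 6, which is why fitting each column on its own recovers the expected mean.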

Answer 1

This seems to work if you follow this example and use map from purrr, which fits a separate regression for each column.

library("dplyr")
library("purrr")

mydata %>% map(~lm(.x ~ Dummy, data=mydata)) %>% map("coefficients") %>% map(sum)

# $Sepal.Length
# [1] 4.9

# $Sepal.Width
# [1] 3.2

# $Dummy
# [1] 1
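
If you would rather avoid the extra packages, the same idea can be sketched in base R with sapply. This is an alternative to the purrr pipeline above, not part of the original answer; the names responses and means_when_dummy_1 are illustrative, and the Dummy column is left out of the responses on the assumption that regressing it on itself is not needed.

```r
# Same data as in the question
mydata <- data.frame(
  Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5, 5.4),
  Sepal.Width  = c(NA, NA, 3.2, 3.1, 3.6, 3.9),
  Dummy        = c(1, 1, 1, 0, 0, 0)
)

# Columns to use as responses (assumption: every column except Dummy)
responses <- c("Sepal.Length", "Sepal.Width")

# Fit one regression per response column; each fit drops only the rows
# that are NA in its own response, then intercept + beta gives the
# fitted mean when Dummy == 1
means_when_dummy_1 <- sapply(mydata[responses], function(y) {
  fit <- lm(y ~ Dummy, data = mydata)
  sum(coef(fit))
})

means_when_dummy_1
```

The result is a named numeric vector rather than a list, and it should agree with the group means computed directly with na.rm=TRUE in the question.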
