Impute missing values for dummy variables in R by the ratio of non-missing values

Question

I am new to R. I am having trouble imputing missing values and need your help. I have a data frame df like this:

a  <- c(0,0,0,1,1,1,NA)
b  <- c(1,0,1,0,1,0,NA)
c  <- c(0,1,NA,0,1,0,1)
df <- data.frame(a,b,c)

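I want to impute the missing values of these variables according to the ratio of their non-NA values. For example, variable a is 50% 0s and 50% 1s, so its NA values should be filled with 0 or 1 in a way that keeps that ratio. Here is my code: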

    ratio0 <- function(x) {  # share of 0s among the non-NA values
           table(x)[1]/sum(table(x)[1],table(x)[2])
    }
    ratio1 <- function(x) {  # share of 1s among the non-NA values
           table(x)[2]/sum(table(x)[1],table(x)[2])
    }

    for(i in 1:ncol(df)) {
        df[is.na(df[,i]), i] <- sample(c(0,1),sum(is.na(df[,i])),replace=TRUE,prob=c(ratio0(df[,i]),ratio1(df[,i])))
    }

When I run the code above, I get the error: "Error in sample.int(length(x), size, replace, prob) : NA in probability vector".

Could you tell me where my mistake is?

When I apply the code to a single variable, it works. For example, the code below imputes the missing values in column 3 of the data frame df:

df[is.na(df[,3]), 3] <- sample(c(0,1), sum(is.na(df[,3])), replace=TRUE, prob=c(ratio0(df[,3]), ratio1(df[,3])))

Thank you very much for your help.

Answer 1

We can write a custom function and then use apply() to run it over the columns of the data.frame.

# Function to replace NAs
replacer <- function(x) {
  probs <- prop.table(table(x))  # proportions of 0 and 1 among the non-NA values
  # Draw one value per NA, weighted by those proportions
  y <- sample(c(0, 1), sum(is.na(x)), prob = probs, replace = TRUE)
  x[is.na(x)] <- y  # fill the NAs with the sampled values
  return(x)
}

apply(df, 2, replacer)
# One possible result (the draws are random):
#     a b c
#[1,] 0 1 0
#[2,] 0 0 1
#[3,] 0 1 1
#[4,] 1 0 0
#[5,] 1 1 1
#[6,] 1 0 0
#[7,] 1 1 1
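
Two details of the approach above may matter on real data: apply() returns a matrix rather than a data.frame (note the [1,] row labels in the output), and table(x) only lists values that actually occur, so a column containing only 0s or only 1s yields a single proportion. The following is a minimal sketch of a variant that addresses both, not the answerer's own code. The name replacer2 and the seed are made up for illustration, and it assumes every column has at least one non-NA value.

```r
a <- c(0, 0, 0, 1, 1, 1, NA)
b <- c(1, 0, 1, 0, 1, 0, NA)
c <- c(0, 1, NA, 0, 1, 0, 1)
df <- data.frame(a, b, c)

# Hypothetical variant of replacer(); assumes at least one non-NA value per column
replacer2 <- function(x) {
    miss <- is.na(x)
    if (!any(miss)) return(x)  # nothing to impute
    # Fixing the levels guarantees one proportion each for 0 and 1,
    # even if one of the two values never occurs in x
    probs <- prop.table(table(factor(x, levels = c(0, 1))))
    x[miss] <- sample(c(0, 1), sum(miss), replace = TRUE, prob = as.numeric(probs))
    x
}

set.seed(1)  # arbitrary seed, only so the draws are repeatable
df[] <- lapply(df, replacer2)  # df[] keeps df a data.frame
```

Assigning into df[] replaces the columns in place, so the result keeps its column names and stays a data.frame.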

Answer 2

If you want a ratio function, I would do something like this:

ratio <- function(x, which) {
    b <- !is.na(x)
    sum(x[b] == which) / sum(b)
}
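
As a quick illustration of how this ratio() behaves, here is a small sketch. The vector only_ones is made up and not part of the question's data. It shows one plausible source of the "NA in probability vector" error, though the thread does not confirm it: indexing table(x) by position gives NA when a value never occurs, whereas ratio() gives 0.

```r
ratio <- function(x, which) {
    b <- !is.na(x)
    sum(x[b] == which) / sum(b)
}

a <- c(0, 0, 0, 1, 1, 1, NA)
r0 <- ratio(a, 0)  # 3 zeros out of 6 non-NA values
r1 <- ratio(a, 1)  # 3 ones out of 6 non-NA values

# Hypothetical column that contains only 1s
only_ones <- c(1, 1, NA, 1)
by_position <- table(only_ones)[2]  # NA: the table has no second entry
by_ratio0 <- ratio(only_ones, 0)    # 0, a valid probability
by_ratio1 <- ratio(only_ones, 1)    # 1

# A zero probability is accepted by sample(); an NA is not
filled <- sample(c(0, 1), 1, replace = TRUE, prob = c(by_ratio0, by_ratio1))
```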

But if I understand correctly, you can simply sample directly from the vector of non-NA values:

fun <- function(x) {
    b <- is.na(x)
    x[b] <- sample(x[!b], sum(b), replace=TRUE)
    x
}

as.data.frame(lapply(df, fun), stringsAsFactors = FALSE)
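
Sampling with replacement from the observed values draws each value with probability equal to its observed share, so the ratio is preserved on average without computing it explicitly. A rough check of that claim, using made-up data (30% zeros, 70% ones) and an arbitrary seed:

```r
fun <- function(x) {
    b <- is.na(x)
    x[b] <- sample(x[!b], sum(b), replace = TRUE)
    x
}

set.seed(42)  # arbitrary seed, only so the draws are repeatable
# Made-up data: 100 observed values (30 zeros, 70 ones) followed by 10000 NAs
x <- c(rep(0, 30), rep(1, 70), rep(NA, 10000))
filled <- fun(x)

observed_share <- mean(x[1:100])          # share of 1s among the observed values
imputed_share <- mean(filled[101:10100])  # share of 1s among the imputed values
```

With this many imputed values, imputed_share should land close to observed_share. With only a handful of NAs, as in the question's df, the match holds only in expectation.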
