R: NA substitution between data frames based on index

I have this data frame, df.1:

    month  a   b            c
    1      0   0.000000000  0.000000000
    2      0   0.000000000  0.001503194
    3      0   0.000000000  0.000000000
    4      0   0.000000000  0.000000000
    5      0   0.000000000  0.000000000
    6      0   0.000000000  0.000000000
    7      0   0.000000000  0.000000000
    8      0   0.000000000  0.000000000
    9      0   0.000000000  0.000000000
    10     0   0.000000000  0.000000000
    11     NA  NA           NA
    12     NA  NA           NA
    1      0   0.000000000  0.000000000
    2      0   0.001537279  0.006917756
    3      0   0.000000000  0.003669725
    4      0   0.000000000  0.000000000
    5      0   0.000000000  0.000000000
    6      0   0.000000000  0.000000000
    7      0   0.000000000  0.000000000
    8      0   0.000000000  0.000000000
    9      0   0.000000000  0.000000000
    10     0   0.000000000  0.000000000
    11     0   0.000000000  0.013513514
    12     NA  NA           NA

and this data frame, df.2:

month     a         b         c
    1  0.03842077 0.002266291 0.000000000 
    2  0.01359501 0.001027937 0.000000000 
    3  0.08631519 0.008732519 0.001376147 
    4  0.26564710 0.083635347 0.019053692 
    5  0.34839088 0.152203121 0.021010075 
    6  0.31767367 0.152029019 0.029397773 
    7  0.31507761 0.110973916 0.023445471 
    8  0.29773872 0.096458381 0.026745770 
    9  0.31226976 0.109342562 0.023996392 
    10 0.23841220 0.081582743 0.021674228 
    11 0.04379016 0.003519300 0.000000000 
    12 0.02244389 0.002493766 0.000000000 

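I want to replace the NA values in df.1 with the values from df.2 wherever the index in column 1 (month) is the same. I tried this code: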

res_new <- data.frame(matrix(nrow=nrow(df.1),ncol=3))
for (n in 1:12){
res_new <- data.frame(ifelse(is.na(df.1[which(df.1[,1] == n),2:4])==TRUE,df.2[which(df.2[,1] == n),2:4],df.1[,n]))

  }

But the result is a new, large matrix in which each NA value in df.1 is replaced with all of the values in df.2.

How can I do this? (My real data frames are much larger.)

Maybe this is not the best approach, but something like this could work!

df1 <- data.frame(month = 1:12,
                  a = c(rep(1, 10), NA, NA),
                  b = c(rep(2, 11), NA))

df2 <- data.frame(month = 1:12,
                  a = rnorm(12),
                  b = rnorm(12))

# first, merge both data frames by the key (in this case, month)
new_df <- merge(df1, df2, by = "month")

# then use a vectorized operation with the ifelse function
new_df$imp_a <- ifelse(!is.na(new_df$a.x), new_df$a.x, new_df$a.y)

# then drop the temporary columns, or keep only a subset with the
# newly imputed columns
new_df

If you need to impute many columns, you could wrap the ifelse step in a function, like this:

impute <- function(df, col1, col2) {
  # impute NA values in col1 with the values in col2, creating a new column
  new_name <- paste("new", col1, sep = "_")
  df[[new_name]] <- ifelse(!is.na(df[[col1]]), df[[col1]], df[[col2]])
  df
}

impute(new_df, "a.x", "a.y")
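The merge-then-ifelse idea can be applied to every value column in one pass. The following is only a sketch under some assumptions: the names `merged`, `value_cols` and `result` are made up for illustration, df2 is filled with constants instead of rnorm so the outcome is easy to check, and it relies on merge() adding its default ".x" and ".y" suffixes to the columns that both data frames share.

```r
df1 <- data.frame(month = 1:12,
                  a = c(rep(1, 10), NA, NA),
                  b = c(rep(2, 11), NA))

# constant filler values (illustrative only)
df2 <- data.frame(month = 1:12,
                  a = rep(10, 12),
                  b = rep(20, 12))

merged <- merge(df1, df2, by = "month")

# every column except the key is a candidate for imputation
value_cols <- setdiff(names(df1), "month")

for (col in value_cols) {
  x <- merged[[paste0(col, ".x")]]  # values from df1
  y <- merged[[paste0(col, ".y")]]  # values from df2
  merged[[col]] <- ifelse(is.na(x), y, x)
}

# keep only the key and the imputed columns, dropping the .x/.y helpers
result <- merged[, c("month", value_cols)]
result
```

Keep in mind that merge() sorts the result by the key by default, so when the key repeats (as month does in the question's df.1) the original row order is not preserved.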

Given that you have a larger data frame, I would try to avoid merging the tables. You can use ifelse to do the job.

month <- c(1:12, 1:12)
a <- c(rep(0,10), NA, NA, rep(0,11), NA)
b <- c(rep(0,10), NA, NA, 0,.0015,rep(0,9), NA)
c <- c(0,.0015,rep(0,8), NA, NA, 0,.0069, .0036,rep(0,7), .0135, NA)
df.1 <- data.frame(month,a,b,c)

df.2 <- data.frame(month=c(1:12), a=rep(1,12), b=rep(2,12), c=rep(3,12))

df.1$a <- ifelse(is.na(df.1$a), df.2$a[match(df.1$month, df.2$month)], df.1$a)
df.1$b <- ifelse(is.na(df.1$b), df.2$b[match(df.1$month, df.2$month)], df.1$b)
df.1$c <- ifelse(is.na(df.1$c), df.2$c[match(df.1$month, df.2$month)], df.1$c)

> df.1
   month a      b      c
1      1 0 0.0000 0.0000
2      2 0 0.0000 0.0015
3      3 0 0.0000 0.0000
4      4 0 0.0000 0.0000
5      5 0 0.0000 0.0000
6      6 0 0.0000 0.0000
7      7 0 0.0000 0.0000
8      8 0 0.0000 0.0000
9      9 0 0.0000 0.0000
10    10 0 0.0000 0.0000
11    11 1 2.0000 3.0000
12    12 1 2.0000 3.0000
13     1 0 0.0000 0.0000
14     2 0 0.0015 0.0069
15     3 0 0.0000 0.0036
16     4 0 0.0000 0.0000
17     5 0 0.0000 0.0000
18     6 0 0.0000 0.0000
19     7 0 0.0000 0.0000
20     8 0 0.0000 0.0000
21     9 0 0.0000 0.0000
22    10 0 0.0000 0.0000
23    11 0 0.0000 0.0135
24    12 1 2.0000 3.0000
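With many columns, writing one ifelse line per column gets tedious. As a rough sketch of the same match-based idea, the lookup can be computed once and reused in a loop over the value columns. The sample values below are made up, and `idx` and `na_rows` are illustrative names.

```r
# made-up sample data: 24 rows in df.1, one row per month in df.2
df.1 <- data.frame(
  month = c(1:12, 1:12),
  a = c(rep(0, 10), NA, NA, rep(0, 11), NA),
  b = c(rep(0, 10), NA, NA, rep(0, 11), NA),
  c = c(rep(0, 10), NA, NA, rep(0, 11), NA)
)
df.2 <- data.frame(month = 1:12, a = rep(1, 12), b = rep(2, 12), c = rep(3, 12))

# for each row of df.1, the row of df.2 that has the same month
idx <- match(df.1$month, df.2$month)

for (col in setdiff(names(df.1), "month")) {
  na_rows <- is.na(df.1[[col]])
  # overwrite only the NA entries with the value from the matching month
  df.1[[col]][na_rows] <- df.2[[col]][idx[na_rows]]
}
df.1
```

If a month in df.1 has no counterpart in df.2, `idx` is NA for that row and the value simply stays NA.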

Data for the first 12 rows:

df.1 <- data.frame(
  month = 1:12, 
  a = c(rep(0, 10), NA, NA), 
  b = c(rep(0, 10), NA, NA), 
  c = c(0, 0.001503194, rep(0, 8), NA, NA)
)

df.2 <- data.frame(
  month = 1:12,
  a = c(0.03842077, 0.01359501, 0.08631519, 0.2656471, 0.34839088, 0.31767367, 
        0.31507761, 0.29773872, 0.31226976, 0.2384122, 0.04379016, 0.02244389), 
  b = c(0.002266291, 0.001027937, 0.008732519, 0.083635347, 0.152203121, 
        0.152029019, 0.110973916, 0.096458381, 0.109342562, 0.081582743, 
        0.0035193, 0.002493766 ), 
  c = c(0, 0, 0.001376147, 0.019053692, 0.021010075, 0.029397773, 0.023445471,
        0.02674577, 0.023996392, 0.021674228, 0, 0)
)

Solution

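This solution also works when only some of the columns in a row are NA. It may take a while on large data, but it gets the job done. Note that it compares the two data frames row by row, so it assumes that df.1 and df.2 have the same number of rows in the same order, as in the 12-row data above.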

for (row in 1:nrow(df.1)) {
  for (col in names(df.1)[-1]) {
    if (is.na(df.1[row, col]) && df.1[row, "month"] == df.2[row, "month"]) {
      df.1[row, col] <- df.2[row, col]
    }
  }
}
df.1

   month          a           b           c
1      1 0.00000000 0.000000000 0.000000000
2      2 0.00000000 0.000000000 0.001503194
3      3 0.00000000 0.000000000 0.000000000
4      4 0.00000000 0.000000000 0.000000000
5      5 0.00000000 0.000000000 0.000000000
6      6 0.00000000 0.000000000 0.000000000
7      7 0.00000000 0.000000000 0.000000000
8      8 0.00000000 0.000000000 0.000000000
9      9 0.00000000 0.000000000 0.000000000
10    10 0.00000000 0.000000000 0.000000000
11    11 0.04379016 0.003519300 0.000000000
12    12 0.02244389 0.002493766 0.000000000

Explanation

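We use a double loop to check every element in columns a to c. If the element is not NA, we move on to the next one. Otherwise, we check whether the month in the same row of df.2 is the same and, if TRUE, we replace the element with the corresponding element in df.2.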
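The loop above compares row i of df.1 with row i of df.2, which fits the 12-row sample but not the question's full df.1, where the months repeat. One possible variant, shown here only as a sketch on made-up values, looks up the df.2 row by month with match instead. The variable `src` is an illustrative name.

```r
# made-up sample data: months repeat in df.1, df.2 has one row per month
df.1 <- data.frame(
  month = c(1:12, 1:12),
  a = c(rep(0, 10), NA, NA, rep(0, 11), NA),
  b = c(rep(0, 23), NA)
)
df.2 <- data.frame(month = 1:12, a = rep(1, 12), b = rep(2, 12))

for (row in seq_len(nrow(df.1))) {
  # row of df.2 that holds the same month (NA if there is none)
  src <- match(df.1[row, "month"], df.2$month)
  if (is.na(src)) next
  for (col in names(df.1)[-1]) {
    if (is.na(df.1[row, col])) {
      df.1[row, col] <- df.2[src, col]
    }
  }
}
df.1
```

This is still an element-by-element loop, so it is likely to be slow on large data. It only removes the assumption that the rows line up.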

Assuming that entire rows are missing and you want to fill them in, you can use which and match.

This is done in two steps:
# find the location of the missing rows in df.1
missRows <- which(!complete.cases(df.1))
# fill in missing rows with rows in df.2 with matching months
df.1[missRows, ] <- df.2[match(df.1$month[missRows], df.2$month, nomatch=0),]

Note that the missing rows are identified with !complete.cases. In addition, the nomatch=0 argument is used to ignore instances where no match is found.
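To make the two steps concrete, here is a small sketch on made-up values. It differs slightly from the code above, which is an assumption on my part about what is safest: instead of nomatch=0 it keeps the default NA for unmatched months and filters on it, so the rows being replaced and the replacement rows always have the same count. The names `idx` and `found` are illustrative.

```r
# made-up sample data in which whole rows are missing
df.1 <- data.frame(
  month = c(1:12, 1:12),
  a = c(rep(0, 10), NA, NA, rep(0, 11), NA),
  b = c(rep(0, 10), NA, NA, rep(0, 11), NA)
)
df.2 <- data.frame(month = 1:12, a = rep(1, 12), b = rep(2, 12))

# step 1: rows of df.1 that contain at least one NA
missRows <- which(!complete.cases(df.1))

# step 2: matching rows of df.2, skipping months that are not found
idx <- match(df.1$month[missRows], df.2$month)
found <- !is.na(idx)
df.1[missRows[found], ] <- df.2[idx[found], ]
df.1
```

Because the whole row is overwritten, any non-missing values in a partially missing row would also be replaced. This approach also needs both data frames to have the same columns in the same order.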