How to replace missing data in one column from a duplicate row

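This is a sample from a much larger dataset. I would like to replace every missing value with the correct value, taken from another row that has the same value in the ID column.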

Id.data<-data.frame(
  ID = c(564,758,987,1568,4987,413578,987,65647,4895,564,135,1568),
  gender= c("male","female","female","male","male","female","female","male", "female","male","male","male"),
  race= c ("Caucasian","Black","Caucasion","Hispanic","Hispanic","Asian","NA","BLack","Black","NA","Asian","NA"),
  Hours = c(45,54,32,24,56,40,42,25,40,36,56,24),
  stringsAsFactors = FALSE
)

Assuming your missing values are real NA values rather than the string "NA", we can use dplyr and tidyr:

library(dplyr)
library(tidyr)
Id.data %>% group_by(ID) %>%
  fill(everything(), .direction = "downup")

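If your missing values are quoted strings, this is best fixed upstream in your workflow. If you show us the code where these values are created or imported, we can help you create them correctly.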
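If fixing it upstream is not possible, a minimal sketch of converting the quoted "NA" strings before filling could look like this. It assumes only the race column is affected and uses a cut-down, made-up version of the sample data; if the data comes from a file, the na.strings argument of read.csv is the usual place to handle this instead.

```r
library(dplyr)
library(tidyr)

# Cut-down illustrative sample: "NA" here is a string, not a missing value
Id.data <- data.frame(
  ID    = c(564, 758, 987, 987, 564),
  race  = c("Caucasian", "Black", "Hispanic", "NA", "NA"),
  Hours = c(45, 54, 32, 42, 36),
  stringsAsFactors = FALSE
)

filled <- Id.data %>%
  mutate(race = na_if(race, "NA")) %>%    # turn the string "NA" into a real NA
  group_by(ID) %>%
  fill(race, .direction = "downup") %>%   # borrow the value from another row with the same ID
  ungroup()

print(filled)
```

Row order is preserved, so each formerly missing race now carries the value from the other row with the same ID.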

You mention that your dataset is large. It may be worth considering a data.table approach.


library(data.table)
library(zoo)  # provides na.locf()

Id.data <- data.frame(
  ID = c(564,758,987,1568,4987,413578,987,65647,4895,564,135,1568),
  gender = c("male","female","female","male","male","female","female","male","female","male","male","male"),
  race = c("Caucasian","Black","Caucasion","Hispanic","Hispanic","Asian",NA,"BLack","Black",NA,"Asian",NA),
  Hours = c(45,54,32,24,56,40,42,25,40,36,56,24),
  stringsAsFactors = FALSE
)

dt <- as.data.table(Id.data)

# Within each ID, carry the next non-missing value backward over NAs
dt[, lapply(.SD, function(v) na.locf(v, fromLast = TRUE)), by = ID]

Output:


        ID gender      race Hours
 1:    564   male Caucasian    45
 2:    564   male Caucasian    36
 3:    758 female     Black    54
 4:    987 female Caucasion    32
 5:    987 female Caucasion    42
 6:   1568   male  Hispanic    24
 7:   1568   male  Hispanic    24
 8:   4987   male  Hispanic    56
 9: 413578 female     Asian    40
10:  65647   male     BLack    25
11:   4895 female     Black    40
12:    135   male     Asian    56
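One caveat about the na.locf call: na.locf(v, fromLast = TRUE) fills backward only and, with its default na.rm = TRUE, drops any NA it cannot fill, so it can return a vector shorter than the group. It works on this sample because each affected ID has exactly two rows with the NA in the last one, and the resulting length-1 value is recycled across the group. Below is a minimal sketch of a variant that fills in both directions and keeps the group length, mirroring .direction = "downup". The helper name fill_both and the toy data are made up for illustration.

```r
library(data.table)
library(zoo)

# Hypothetical helper: fill forward first, then backward, never dropping rows
fill_both <- function(v) {
  forward <- na.locf(v, na.rm = FALSE)
  na.locf(forward, fromLast = TRUE, na.rm = FALSE)
}

# Toy data: ID 1 has three rows with NAs both before and after the known value
toy <- data.table(
  ID    = c(1, 2, 1, 1, 2),
  race  = c(NA, "Black", "Asian", NA, NA),
  Hours = c(10, 20, 30, 40, 50)
)

res <- toy[, lapply(.SD, fill_both), by = ID]
print(res)
```

A group whose values are all missing stays NA with this helper, since there is nothing to borrow from.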

Timing the two approaches:


library(dplyr)
library(tidyr)
library(microbenchmark)
library(data.table)
library(zoo)  # provides na.locf()

Id.data <- data.frame(
  ID = c(564,758,987,1568,4987,413578,987,65647,4895,564,135,1568),
  gender = c("male","female","female","male","male","female","female","male","female","male","male","male"),
  race = c("Caucasian","Black","Caucasion","Hispanic","Hispanic","Asian",NA,"BLack","Black",NA,"Asian",NA),
  Hours = c(45,54,32,24,56,40,42,25,40,36,56,24),
  stringsAsFactors = FALSE
)

dt <- as.data.table(Id.data)

dt[, lapply(.SD, function(v) na.locf(v, fromLast = TRUE)), by = ID]

microbenchmark(
    dplyr = Id.data %>% group_by(ID) %>% fill(everything(), .direction = "downup"),
    dt = dt[, lapply( .SD, function(v)na.locf(v, fromLast=TRUE) ), by=ID ]
)

Benchmark results:


Unit: milliseconds
  expr      min       lq     mean   median       uq       max neval cld
 dplyr 4.397308 4.493263 4.684294 4.552560 4.639137  9.966245   100   b
    dt 1.238002 1.270062 1.421998 1.303184 1.330787 12.470191   100  a 

By median time, data.table is roughly 3.5 times faster on this 12-row sample.
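Timings on 12 rows say little about a large dataset, so it is worth rerunning the comparison at a more realistic size. The following is a rough sketch under made-up assumptions: the synthetic data, the number of IDs (n_ids), the share of missing values, and the helper fill_both are all illustrative and not taken from the question. No results are shown because they will depend on your machine and data.

```r
library(dplyr)
library(tidyr)
library(data.table)
library(zoo)
library(microbenchmark)

set.seed(1)
n_ids <- 5000  # hypothetical number of distinct IDs, two rows each

big <- data.frame(
  ID    = rep(seq_len(n_ids), each = 2),
  race  = rep(sample(c("Asian", "Black", "Hispanic"), n_ids, replace = TRUE), each = 2),
  Hours = sample(20:60, 2 * n_ids, replace = TRUE),
  stringsAsFactors = FALSE
)

# Blank out race in the second row of roughly a third of the IDs
second_rows <- seq(2, nrow(big), by = 2)
big$race[sample(second_rows, n_ids %/% 3)] <- NA

big_dt <- as.data.table(big)

# Length-preserving fill in both directions (illustrative helper)
fill_both <- function(v) {
  na.locf(na.locf(v, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
}

timing <- microbenchmark(
  dplyr = big %>% group_by(ID) %>% fill(everything(), .direction = "downup"),
  dt    = big_dt[, lapply(.SD, fill_both), by = ID],
  times = 5
)
print(timing)
```

Adjust n_ids and the group sizes to resemble your real data before drawing conclusions from the ratio.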