How to replace missing data in one column from a duplicate row
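This is a sample from a much larger data set. I want to replace every missing value with the correct value, taken from another row that has the same value in the ID column.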
Id.data<-data.frame(
ID = c(564,758,987,1568,4987,413578,987,65647,4895,564,135,1568),
gender= c("male","female","female","male","male","female","female","male", "female","male","male","male"),
race= c ("Caucasian","Black","Caucasion","Hispanic","Hispanic","Asian","NA","BLack","Black","NA","Asian","NA"),
Hours = c(45,54,32,24,56,40,42,25,40,36,56,24),
stringsAsFactors = FALSE
)
Assuming your missing values are real NA values, not the string "NA", we can use dplyr and tidyr:
library(dplyr)
library(tidyr)
Id.data %>% group_by(ID) %>%
fill(everything(), .direction = "downup")
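If your missing values are quoted strings, this is best fixed upstream in your workflow. If you show us the code where those values are created or imported, we can help you create them correctly.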
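If fixing it upstream is not possible, the quoted strings can be converted after the fact. The following is a minimal base R sketch on a toy data frame (`df` is a made-up stand-in for the real data, not the asker's object); it assumes the placeholder is exactly the string "NA" in a character column.

```r
# Toy data where the string "NA" stands in for missing values (hypothetical sample).
df <- data.frame(
  ID   = c(564, 987, 987, 564),
  race = c("Caucasian", "Caucasion", "NA", "NA"),
  stringsAsFactors = FALSE
)

# Replace the literal string "NA" with a real missing value.
df$race[df$race == "NA"] <- NA_character_

# is.na() now recognises the gaps, so fill() and similar tools can act on them.
n_missing <- sum(is.na(df$race))
```

If the data comes from a file, the `na.strings` argument of `read.csv()` is the usual place to declare which strings should be read as missing.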
You mentioned that your data set is large, so a data.table approach may be worth considering.
Id.data<-data.frame( ID = c(564,758,987,1568,4987,413578,987,65647,4895,564,135,1568), gender= c("male","female","female","male","male","female","female","male", "female","male","male","male"), race= c ("Caucasian","Black","Caucasion","Hispanic","Hispanic","Asian",NA,"BLack","Black",NA,"Asian",NA), Hours = c(45,54,32,24,56,40,42,25,40,36,56,24), stringsAsFactors = FALSE )
library(data.table)
library(zoo)  # na.locf() comes from the zoo package
dt <- as.data.table(Id.data)
dt[, lapply( .SD, function(v)na.locf(v, fromLast=TRUE) ), by=ID ]
Output:
ID gender race Hours
1: 564 male Caucasian 45
2: 564 male Caucasian 36
3: 758 female Black 54
4: 987 female Caucasion 32
5: 987 female Caucasion 42
6: 1568 male Hispanic 24
7: 1568 male Hispanic 24
8: 4987 male Hispanic 56
9: 413578 female Asian 40
10: 65647 male BLack 25
11: 4895 female Black 40
12: 135 male Asian 56
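A caveat on the call above, as far as I can tell: `na.locf()` defaults to `na.rm = TRUE`, so with `fromLast = TRUE` a trailing NA in a group is dropped rather than filled, and the output still looks right only because data.table recycles a length-one result across the group. With larger groups that may stop working. A sketch that fills in both directions and keeps the vector length unchanged is shown below; `fill_both` is a hypothetical helper name and the small table is made-up data.

```r
library(data.table)
library(zoo)

# Hypothetical helper: fill NAs forward, then backward, without dropping any element.
fill_both <- function(v) {
  v <- na.locf(v, na.rm = FALSE)              # carry the last observation forward
  na.locf(v, fromLast = TRUE, na.rm = FALSE)  # carry the next observation backward
}

# Made-up example with a trailing NA (ID 1) and a leading NA (ID 2).
toy <- data.table(
  ID   = c(1, 1, 1, 2, 2),
  race = c("Asian", "Asian", NA, NA, "Black")
)

filled <- toy[, lapply(.SD, fill_both), by = ID]
```

This mirrors what `.direction = "downup"` does in the tidyr version: fill down first, then up.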
And timing the two approaches:
library(dplyr)
library(tidyr)
library(microbenchmark)
library(data.table)
library(zoo)  # na.locf() comes from the zoo package
Id.data<-data.frame( ID = c(564,758,987,1568,4987,413578,987,65647,4895,564,135,1568), gender= c("male","female","female","male","male","female","female","male", "female","male","male","male"), race= c ("Caucasian","Black","Caucasion","Hispanic","Hispanic","Asian",NA,"BLack","Black",NA,"Asian",NA), Hours = c(45,54,32,24,56,40,42,25,40,36,56,24), stringsAsFactors = FALSE )
dt <- as.data.table(Id.data)
dt[, lapply( .SD, function(v)na.locf(v, fromLast=TRUE) ), by=ID ]
microbenchmark(
dplyr = Id.data %>% group_by(ID) %>% fill(everything(), .direction = "downup"),
dt = dt[, lapply( .SD, function(v)na.locf(v, fromLast=TRUE) ), by=ID ]
)
Benchmark results:
Unit: milliseconds
expr min lq mean median uq max neval cld
dplyr 4.397308 4.493263 4.684294 4.552560 4.639137 9.966245 100 b
dt 1.238002 1.270062 1.421998 1.303184 1.330787 12.470191 100 a
On this 12-row sample, data.table is about 3.5 times faster (comparing medians).
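Timings on 12 rows mostly measure fixed overhead, so the ratio may look different at realistic sizes. The sketch below builds a larger synthetic data set in base R for re-running the comparison. Everything in it is an assumption for illustration: the size `n_ids`, the seed, the value lists, and the layout in which every ID appears exactly twice with race sometimes missing in the second row.

```r
set.seed(1)      # arbitrary seed, only for reproducibility
n_ids <- 50000   # hypothetical number of distinct IDs

ids <- seq_len(n_ids)
big <- data.frame(
  ID     = rep(ids, each = 2),
  gender = rep(sample(c("male", "female"), n_ids, replace = TRUE), each = 2),
  race   = rep(sample(c("Asian", "Black", "Caucasian", "Hispanic"),
                      n_ids, replace = TRUE), each = 2),
  Hours  = sample(20:60, 2 * n_ids, replace = TRUE),
  stringsAsFactors = FALSE
)

# Blank out race in the second row of half of the ID pairs.
second_rows <- seq(2, nrow(big), by = 2)
to_blank <- sample(second_rows, length(second_rows) %/% 2)
big$race[to_blank] <- NA
```

Passing `big` and `as.data.table(big)` to the same `microbenchmark()` call should give a more representative comparison.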