Replace NAs with Row Minimum for Selected Columns
Suppose I have a data frame with columns of several types (character, numeric, ID, time, etc.). Here is a simple example:
m <- data.frame(LETTERS[1:10], LETTERS[15:24],runif(10),runif(10),runif(10),runif(10),runif(10))
x<-c("Col1","Col2","Col3","Col4","Col5","Col6","Col7")
colnames(m)<-x
m<-as.data.frame(lapply(m, function(x) x[ sample(c(TRUE, NA), prob = c(0.75, 0.25), size = length(x), replace = TRUE) ]))
> m
Col1 Col2 Col3 Col4 Col5 Col6 Col7
1 A O 0.09929126 0.40435352 0.15360830 0.03830400 0.80157985
2 B P 0.50314123 0.81725456 NA 0.07054851 0.65521042
3 C <NA> 0.75798665 NA 0.04483692 0.54671014 NA
4 D R 0.96825047 0.01875140 0.07383107 NA 0.04498563
5 <NA> S 0.47079716 0.04181401 0.21423046 NA 0.55493444
6 F <NA> NA NA NA 0.33702657 0.54989260
7 G U 0.71947656 NA NA 0.99142181 0.69548691
8 <NA> <NA> 0.90518907 0.20661633 0.65788523 0.05534330 0.78420756
9 I W 0.79208514 0.63233902 NA 0.72085080 NA
10 J X 0.39093317 0.97107464 NA 0.86417719 0.39890170
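For Col3-Col7, if a row has fewer than 3 NAs, I want to replace them with that row's minimum over Col3-Col7; otherwise the NAs should stay. So I would like the dataset to look like this: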
> m
Col1 Col2 Col3 Col4 Col5 Col6 Col7
1 A O 0.09929126 0.40435352 0.15360830 0.03830400 0.80157985
2 B P 0.50314123 0.81725456 0.07054851 0.07054851 0.65521042
3 C <NA> 0.75798665 0.04483692 0.04483692 0.54671014 0.04483692
4 D R 0.96825047 0.01875140 0.07383107 0.01875140 0.04498563
5 <NA> S 0.47079716 0.04181401 0.21423046 0.04181401 0.55493444
6 F <NA> NA NA NA 0.33702657 0.54989260
7 G U 0.71947656 0.69548691 0.69548691 0.99142181 0.69548691
8 <NA> <NA> 0.90518907 0.20661633 0.65788523 0.05534330 0.78420756
9 I W 0.79208514 0.63233902 0.63233902 0.72085080 0.63233902
10 J X 0.39093317 0.97107464 0.39093317 0.86417719 0.39890170
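So in every row except row 6, the missing values are imputed with that row's minimum over columns 3 to 7.
In my actual dataset, for each row across columns 18:27, if there are fewer than 4 NAs, they should be replaced with the row minimum of columns 18:27; otherwise all the NAs should be left alone.
I tried the dplyr pipes/mutate/replace approach, but I'm not sure how to do this for a subset of columns (I was under the impression that you can only specify one column at a time). Some of the logic I tried in an if statement included:
rowSums(is.na(.[18:27])) < 4 & rowSums(is.na(.[18:27])) > 0
I saw the rowMins function in the matrixStats package, but I was wondering whether I can do this with dplyr/data frames rather than matrices.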
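I would suggest a tidyverse approach: reshape the data to long format, group by Col1 and Col2, and then reshape it back to wide format. Since we are using pipes, we can also use mutate() to create new variables, and evaluate the condition you want once a Flag variable (the number of NAs in the row) and the row minimum have been computed. Note that this relies on each Col1/Col2 pair identifying exactly one row, which holds for the sample data. Here is the code: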
library(tidyverse)
#Data
m <- structure(list(Col1 = c("A", "B", "C", "D", "<NA>", "F", "G",
"<NA>", "I", "J"), Col2 = c("O", "P", "<NA>", "R", "S", "<NA>",
"U", "<NA>", "W", "X"), Col3 = c(0.09929126, 0.50314123, 0.75798665,
0.96825047, 0.47079716, NA, 0.71947656, 0.90518907, 0.79208514,
0.39093317), Col4 = c(0.40435352, 0.81725456, NA, 0.0187514,
0.04181401, NA, NA, 0.20661633, 0.63233902, 0.97107464), Col5 = c(0.1536083,
NA, 0.04483692, 0.07383107, 0.21423046, NA, NA, 0.65788523, NA,
NA), Col6 = c(0.038304, 0.07054851, 0.54671014, NA, NA, 0.33702657,
0.99142181, 0.0553433, 0.7208508, 0.86417719), Col7 = c(0.80157985,
0.65521042, NA, 0.04498563, 0.55493444, 0.5498926, 0.69548691,
0.78420756, NA, 0.3989017)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
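Code:
# Reshape to long format, fill NAs within each row group, then reshape back to wide
m %>%
  pivot_longer(cols = -c(Col1, Col2)) %>%
  group_by(Col1, Col2) %>%
  mutate(MinVal = min(value, na.rm = TRUE),  # row minimum, ignoring NAs
         Flag = sum(is.na(value))) %>%       # number of NAs in the row
  ungroup() %>%
  mutate(value = ifelse(is.na(value) & Flag < 3, MinVal, value)) %>%
  select(-c(MinVal, Flag)) %>%
  pivot_wider(names_from = name, values_from = value)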
Output:
# A tibble: 10 x 7
Col1 Col2 Col3 Col4 Col5 Col6 Col7
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A O 0.0993 0.404 0.154 0.0383 0.802
2 B P 0.503 0.817 0.0705 0.0705 0.655
3 C <NA> 0.758 0.0448 0.0448 0.547 0.0448
4 D R 0.968 0.0188 0.0738 0.0188 0.0450
5 <NA> S 0.471 0.0418 0.214 0.0418 0.555
6 F <NA> NA NA NA 0.337 0.550
7 G U 0.719 0.695 0.695 0.991 0.695
8 <NA> <NA> 0.905 0.207 0.658 0.0553 0.784
9 I W 0.792 0.632 0.632 0.721 0.632
10 J X 0.391 0.971 0.391 0.864 0.399
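For the real dataset, where the columns are given by position (18:27) and the threshold is 4, a version that skips the reshape may be easier to adapt. The following is only a minimal sketch in base R. The helper name fill_na_row_min and its arguments cols and max_na are made up for illustration and do not come from any package. It works directly on the data frame, computes the row minimum with pmin(), and does not depend on Col1/Col2 identifying rows uniquely.

```r
# Hypothetical helper (not from any package): in the columns `cols`, replace
# NAs with the row minimum over `cols`, but only in rows that have fewer
# than `max_na` NAs in those columns.
fill_na_row_min <- function(df, cols, max_na) {
  block <- df[cols]                      # cols may be positions or names
  n_na  <- rowSums(is.na(block))         # NA count per row
  # Row-wise minimum ignoring NAs; pmin() gives NA when a whole row is NA
  row_min <- do.call(pmin, c(unname(as.list(block)), na.rm = TRUE))
  fill <- n_na < max_na                  # rows that qualify for imputation
  for (j in names(block)) {
    idx <- is.na(block[[j]]) & fill
    block[[j]][idx] <- row_min[idx]
  }
  df[cols] <- block
  df
}

# Small example built from rows 1, 2 and 6 of the question's data
m_small <- data.frame(
  Col1 = c("A", "B", "F"),
  Col2 = c("O", "P", NA),
  Col3 = c(0.09929126, 0.50314123, NA),
  Col4 = c(0.40435352, 0.81725456, NA),
  Col5 = c(0.15360830, NA, NA),
  Col6 = c(0.03830400, 0.07054851, 0.33702657),
  Col7 = c(0.80157985, 0.65521042, 0.54989260)
)
res <- fill_na_row_min(m_small, cols = 3:7, max_na = 3)
```

In this example the NA in Col5 of the second row is filled with 0.07054851, while the third row, which has three NAs, is left as it is. For the real data the call would presumably be fill_na_row_min(df, cols = 18:27, max_na = 4), and because the data frame is the first argument it can also be used in a pipe.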