使用 R 用 NA 替换异常值的函数
Function to replace outliers with NA using R
假设我有以下示例数据集:
set.seed(20130828)
data <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)),
Y = c(runif(1000),
rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))
使用下一个函数,我已经确定了异常值,即 3 sd 之外的观察值:
findOutlier <- function(data, cutoff = 3) {
sds <- apply(data, 2, sd, na.rm = TRUE)
result <- mapply(function(d, s) {
which(d > cutoff * s)
}, data, sds)
result
}
outliers <- findOutlier(data)
现在我需要用 NA 替换所有异常值。我使用了以下功能:
OutliersToNA <- function(data, outliers) {
result <- mapply(function(d, o) {
res <- d
res[o] <- NA
return(res)
}, data, outliers)
return(as.data.frame(result))
}
它returns出现以下错误:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 4, 0, 1, 2, 3
能否请您对函数提出任何改进建议,以用 NA 替换异常值?
我认为有一种更简单的方法可以做到这一点。您可以在一个函数中组合查找和替换异常值,然后简单地使用它来更新数据框列。让我知道这是否适合你 -
library(dplyr) # for mutate_all()
# summary of input data
summary(data)
X Y Z
Min. :-19.774100 Min. :0.000264 Min. :-2.2037
1st Qu.: -0.716794 1st Qu.:0.235934 1st Qu.: 0.4144
Median : 0.007454 Median :0.484328 Median : 1.0390
Mean : -0.027200 Mean :0.516226 Mean : 1.0428
3rd Qu.: 0.702163 3rd Qu.:0.749178 3rd Qu.: 1.6435
Max. : 15.758520 Max. :4.346755 Max. : 4.4933
NA's :1 NA's :1 NA's :1
replaceOutlier <- function(x, cutoff = 3) {
x[abs(x) > cutoff*sd(x, na.rm = T)] <- NA_real_
x
}
result <- data %>%
mutate_all(replaceOutlier)
# summary of result data
summary(result)
X Y Z
Min. :-5.215726 Min. :0.000264 Min. :-2.2037
1st Qu.:-0.688045 1st Qu.:0.234386 1st Qu.: 0.3932
Median : 0.009348 Median :0.476328 Median : 0.9879
Mean : 0.014648 Mean :0.486287 Mean : 0.9571
3rd Qu.: 0.697789 3rd Qu.:0.737633 3rd Qu.: 1.5897
Max. : 4.047586 Max. :0.998272 Max. : 3.0065
NA's :17 NA's :18 NA's :37
这里有一个更简洁的版本,感谢@andrew_reece -
data %>%
mutate_all(list(~if_else(abs(.) > cutoff*sd(., na.rm = T), NA_real_, .)))
假设我有以下示例数据集:
set.seed(20130828)
data <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)),
Y = c(runif(1000),
rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))
使用下一个函数,我已经确定了异常值,即 3 sd 之外的观察值:
findOutlier <- function(data, cutoff = 3) {
sds <- apply(data, 2, sd, na.rm = TRUE)
result <- mapply(function(d, s) {
which(d > cutoff * s)
}, data, sds)
result
}
outliers <- findOutlier(data)
现在我需要用 NA 替换所有异常值。我使用了以下功能:
OutliersToNA <- function(data, outliers) {
result <- mapply(function(d, o) {
res <- d
res[o] <- NA
return(res)
}, data, outliers)
return(as.data.frame(result))
}
它returns出现以下错误:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 4, 0, 1, 2, 3
能否请您对函数提出任何改进建议,以用 NA 替换异常值?
我认为有一种更简单的方法可以做到这一点。您可以在一个函数中组合查找和替换异常值,然后简单地使用它来更新数据框列。让我知道这是否适合你 -
library(dplyr) # for mutate_all()
# summary of input data
summary(data)
X Y Z
Min. :-19.774100 Min. :0.000264 Min. :-2.2037
1st Qu.: -0.716794 1st Qu.:0.235934 1st Qu.: 0.4144
Median : 0.007454 Median :0.484328 Median : 1.0390
Mean : -0.027200 Mean :0.516226 Mean : 1.0428
3rd Qu.: 0.702163 3rd Qu.:0.749178 3rd Qu.: 1.6435
Max. : 15.758520 Max. :4.346755 Max. : 4.4933
NA's :1 NA's :1 NA's :1
replaceOutlier <- function(x, cutoff = 3) {
x[abs(x) > cutoff*sd(x, na.rm = T)] <- NA_real_
x
}
result <- data %>%
mutate_all(replaceOutlier)
# summary of result data
summary(result)
X Y Z
Min. :-5.215726 Min. :0.000264 Min. :-2.2037
1st Qu.:-0.688045 1st Qu.:0.234386 1st Qu.: 0.3932
Median : 0.009348 Median :0.476328 Median : 0.9879
Mean : 0.014648 Mean :0.486287 Mean : 0.9571
3rd Qu.: 0.697789 3rd Qu.:0.737633 3rd Qu.: 1.5897
Max. : 4.047586 Max. :0.998272 Max. : 3.0065
NA's :17 NA's :18 NA's :37
这里有一个更简洁的版本,感谢@andrew_reece -
data %>%
mutate_all(list(~if_else(abs(.) > cutoff*sd(., na.rm = T), NA_real_, .)))