Replace outlier with last known value in R

Question

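Sample data: 2 4 6 10 99 150 14 15 45

From prior knowledge of the problem, I know that anything above 35 is an outlier. Because the data are time-dependent, I want to replace every value above 35 with the most recent known value that is below 35. The dataset has millions of rows, so I need to do this automatically rather than replacing values one by one.

Desired result: 2 4 6 10 10 10 14 15 15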

Answer 1

x <- c(2, 4, 6, 10, 99, 150, 14, 15, 45)

#set outliers to NA
x[x > 35] <- NA

#fill NA values with Last Observation Carried Forward
library(zoo)
x <- na.locf(x)
#[1]  2  4  6 10 10 10 14 15 15
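
One case worth checking on real data is an outlier in the very first position, where there is no earlier value to carry forward. By default na.locf() drops leading NAs (na.rm = TRUE), so the result can come back shorter than the input. A minimal sketch, using a made-up vector y and assuming you want to keep the length unchanged:

```r
library(zoo)

# hypothetical data whose first value is an outlier
y <- c(99, 4, 50, 6)
y[y > 35] <- NA

# default: the leading NA is removed, so the result has 3 elements
dropped <- na.locf(y)

# na.rm = FALSE keeps the leading NA and the original length
kept <- na.locf(y, na.rm = FALSE)
kept
#[1] NA  4  4  6
```

With millions of rows that are later bound back to a time index, keeping the length aligned is usually the safer choice.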

Answer 2

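For those who don't want a dependency on the zoo package, here is a simple version using run-length encoding from base R. The idea is simple: we call rle() and replace each NA in the run values with whatever is to its left (i.e. the preceding value), skipping a leading NA because there is nothing to its left. Then we use inverse.rle() to get the full-length vector back. For the reverse operation, we simply reverse the vector before and after. I haven't done any benchmarking, but since all the operations are vectorized, it should be fast.

For some reason, rle() does not group NAs together. The documentation states "Missing values are regarded as unequal to the previous value, even if that is also missing." That is why I recode NA to a temporary string value and then have to convert the vector back to the correct class. Not perfect, but it works for most cases.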
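
The NA behaviour of rle() is easy to check directly. A small sketch, where the vector v and the placeholder string are arbitrary choices for illustration:

```r
# adjacent NAs are NOT merged: each NA is its own run
v <- c(NA, NA, 1, 1)
r1 <- rle(v)
r1$lengths
#[1] 1 1 2

# after recoding NA to a placeholder string, adjacent placeholders
# collapse into a single run, which is what the function below relies on
w <- as.character(v)
w[is.na(w)] <- "___tmp"
r2 <- rle(w)
r2$lengths
#[1] 2 2
r2$values
#[1] "___tmp" "1"
```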

#' Last observation carried forward
#'
#' @param x A vector
#' @param reverse Whether to do it in reverse
#'
#' @return A vector
#' @export
#'
#' @examples
#' locf(c(NA, 1, NA, 2, NA))
#' locf(c(NA, 1, NA, 2, NA), reverse = TRUE)
locf = function(x, reverse = F) {
  #reverse?
  if (reverse) x = rev(x)

  #recode NA
  #rle() treats every NA as its own run, so recode NA to a placeholder string
  x_class = class(x)
  x_levels = levels(x) #NULL unless x is a factor
  x = as.character(x) #rle() needs a plain atomic vector, so factors are converted too
  x[is.na(x)] = "___tmp"

  #run length encoding
  x_rle = rle(x)

  #swap values for NAs
  which_na = which(x_rle$values == "___tmp")

  #skip 1st
  which_na = setdiff(which_na, 1)

  #replace values
  x_rle$values[which_na] = x_rle$values[which_na - 1]

  #back to normal
  y = inverse.rle(x_rle)

  #NA recode
  y[y == "___tmp"] = NA

  #fix type/class
  if (x_class[1] == "logical") y = as.logical(y)
  if (x_class[1] == "integer") y = as.integer(y)
  if (x_class[1] == "numeric") y = as.double(y)
  if (x_class[1] == "factor") y = factor(y, levels = x_levels)
  if (x_class[1] == "ordered") y = ordered(y, levels = x_levels)

  #reverse?
  if (reverse) y = rev(y)

  y
}

Test (the %>% pipe comes from the magrittr package, so load it first):

> c(NA, 1, NA, 2, NA) %>% rle()
Run Length Encoding
  lengths: int [1:5] 1 1 1 1 1
  values : num [1:5] NA 1 NA 2 NA
> c(NA, 1, NA, 2, NA) %>% rle() %>% str()
List of 2
 $ lengths: int [1:5] 1 1 1 1 1
 $ values : num [1:5] NA 1 NA 2 NA
 - attr(*, "class")= chr "rle"
> #swap the values for one to left
> #reverse rle
> c(NA, 1, NA, 2, NA) %>% locf()
[1] NA  1  1  2  2
> c(NA, 1, NA, 2, NA) %>% locf(reverse = T)
[1]  1  1  2  2 NA
> c(NA, 1, NA, 2, NA, NA, NA) %>% locf()
[1] NA  1  1  2  2  2  2
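
For the data in the question, the two steps (mark values above the threshold, then carry forward) can also be sketched in base R without rle(). Note that this is a different technique from the function above: it carries the index of the last non-NA element forward with cummax(). The helper name locf_idx is made up for this illustration, and the threshold of 35 comes from the question.

```r
# hypothetical helper: last observation carried forward via indices
locf_idx <- function(x) {
  # index of the most recent non-NA element at or before each position;
  # 0 means no non-NA value has been seen yet
  idx <- cummax(seq_along(x) * !is.na(x))
  # indexing with NA yields NA, so leading NAs are kept as NA
  idx[idx == 0] <- NA
  x[idx]
}

x <- c(2, 4, 6, 10, 99, 150, 14, 15, 45)
x[x > 35] <- NA          # set outliers to NA
result <- locf_idx(x)    # carry the last known value forward
result
#[1]  2  4  6 10 10 10 14 15 15
```

Because it only indexes the original vector, this keeps the input's type without any conversion to character and back.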
