Replace a string if it is different from the last one and the next one within a vector

Question

I have a large dataset grouped by agent and date, and the variable I want to clean is a string variable. For example, take the following dataset:

agent_id<-c("1","1","1","2","2","2","2")
date<-c("2007-02-01","2007-02-02","2007-02-05","2000-05-01","2000-05-02","2000-05-10","2000-05-20")
office<-c("A","A","B","C","D","C","C")
mydata<-data.frame(agent_id,date,office)

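I want to replace an outlier in the office vector if it differs from both the previous and the next observation within each agent_id. For example, for agent_id = 1 I do not want to replace anything. For agent_id = 2, I want to replace the "D" in office with "C", because I observe C both before and after it. Is there a way to do this with dplyr? It would be even better if I could define a cutoff for replacing outliers, i.e. replace a value only if I observe n identical values before it and n identical values after it.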

Answer 1

You can do this:

library(dplyr)

mydata %>%
  group_by(agent_id) %>%
  mutate(
    office = replaceOutliers(x = office, window = 1)
  )

where replaceOutliers is a custom function (define it before running the pipeline above):

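replaceOutliers <- function(x, window = 1, fixed_wind = FALSE) {

  x <- as.character(x)
  n <- length(x)

  # With fewer than 3 observations no value has both a predecessor and a successor
  if (n < 3) return(x)

  # Flag position y when every value in the `window` observations before it
  # also occurs in the `window` observations after it; the first and last
  # positions are never flagged
  flag_Outl <- c(FALSE, sapply(2:(n - 1), function(y) length(setdiff(x[pmax(1, y - window):(y - 1)],
                                                     x[(y + 1):pmin(n, y + window)])) == 0), FALSE)

  if (fixed_wind) {

    # Number of observations actually available before and after each position
    len_Lag <- sapply(1:n, function(y) length(x[pmax(1, y - window):pmax(1, y - 1)]))
    len_Lead <- sapply(1:n, function(y) length(x[pmin(n, y + 1):pmin(n, y + window)]))

    # Replace only when a full window exists on both sides
    x <- sapply(1:n, function(y) ifelse(flag_Outl[y] & len_Lag[y] == window & len_Lead[y] == window, x[y - 1], x[y]))

  } else {

    # Replace every flagged value with the observation just before it
    x <- sapply(1:n, function(y) ifelse(flag_Outl[y], x[y - 1], x[y]))

  }

  return(x)

}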

Output:

# A tibble: 7 x 3
# Groups:   agent_id [2]
  agent_id date       office
  <fct>    <fct>      <chr> 
1 1        2007-02-01 A     
2 1        2007-02-02 A     
3 1        2007-02-05 B     
4 2        2000-05-01 C     
5 2        2000-05-02 C     
6 2        2000-05-10 C     
7 2        2000-05-20 C
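
For the window = 1 case shown above, roughly the same result can be sketched directly with dplyr's lag() and lead(), without a helper function. This is an alternative sketch, not the function from this answer: the column names prev_office, next_office and office_clean are made up for illustration, and mydata is rebuilt (with stringsAsFactors = FALSE, an assumption) so the snippet runs on its own.

```r
library(dplyr)

# Rebuild the example data from the question so the snippet is self-contained
mydata <- data.frame(
  agent_id = c("1", "1", "1", "2", "2", "2", "2"),
  date     = c("2007-02-01", "2007-02-02", "2007-02-05",
               "2000-05-01", "2000-05-02", "2000-05-10", "2000-05-20"),
  office   = c("A", "A", "B", "C", "D", "C", "C"),
  stringsAsFactors = FALSE
)

cleaned <- mydata %>%
  group_by(agent_id) %>%
  mutate(
    prev_office = lag(office),   # NA for the first row of each group
    next_office = lead(office),  # NA for the last row of each group
    # Replace only when both neighbours exist, agree with each other,
    # and differ from the current value
    office_clean = if_else(
      !is.na(prev_office) & !is.na(next_office) &
        prev_office == next_office & office != prev_office,
      prev_office,
      office
    )
  ) %>%
  ungroup() %>%
  select(-prev_office, -next_office)

cleaned$office_clean
# "A" "A" "B" "C" "C" "C" "C"
```

Because lag() and lead() are evaluated per group, the first and last row of each agent_id never have both neighbours and are therefore left untouched.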

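As you can see, I included a fixed_wind argument: it lets you decide whether the full number of observations given by window must always be available both before and after a value for it to be treated as an outlier.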

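By default this is FALSE, so when you increase window to 2 in your example the D is still replaced. If you set it to TRUE, the D is left unchanged (because only 1 observation precedes it within the group):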

mydata %>%
  group_by(agent_id) %>%
  mutate(
    office2 = replaceOutliers(x = office, window = 2),
    office3 = replaceOutliers(x = office, window = 2, fixed_wind = TRUE)
  )

Output:

# A tibble: 7 x 5
# Groups:   agent_id [2]
  agent_id date       office office2 office3
  <fct>    <fct>      <fct>  <chr>   <chr>  
1 1        2007-02-01 A      A       A      
2 1        2007-02-02 A      A       A      
3 1        2007-02-05 B      B       B      
4 2        2000-05-01 C      C       C      
5 2        2000-05-02 D      C       D      
6 2        2000-05-10 C      C       C      
7 2        2000-05-20 C      C       C
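
The question also asks for a stricter cutoff: replace a value only when the n observations before it and the n observations after it are all identical. A minimal base R sketch of that rule follows. The helper name replace_isolated and its argument n are hypothetical and not part of the answer above; the sketch also assumes n >= 1 and leaves NA values untouched.

```r
# Hypothetical helper (not from the original answer): replace x[i] only when
# the n values before it and the n values after it are all the same value.
replace_isolated <- function(x, n = 1) {
  stopifnot(n >= 1)
  x   <- as.character(x)
  out <- x
  len <- length(x)

  # Too short for any position to have n neighbours on both sides
  if (len < 2 * n + 1) return(out)

  for (i in (n + 1):(len - n)) {
    # Neighbours are always read from the original vector x, not from out
    neighbours <- c(x[(i - n):(i - 1)], x[(i + 1):(i + n)])
    # All 2n neighbours agree, and the current value differs from them
    if (!anyNA(neighbours) && length(unique(neighbours)) == 1 &&
        !is.na(x[i]) && x[i] != neighbours[1]) {
      out[i] <- neighbours[1]
    }
  }
  out
}

replace_isolated(c("C", "D", "C", "C"), n = 1)
# "C" "C" "C" "C"
replace_isolated(c("C", "D", "C", "C"), n = 2)
# "C" "D" "C" "C"   (only one observation precedes the D)
replace_isolated(c("C", "C", "D", "C", "C"), n = 2)
# "C" "C" "C" "C" "C"
```

Inside a grouped pipeline it could be applied per agent with something like mutate(office = replace_isolated(office, n = 2)) after group_by(agent_id).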
