Calculate Average for Sequence of Strings, Remove anything 1SD or greater, then REPLACE value that is removed with the Mean

I have a large dataset, df, with over 10,000 rows:

  User              duration

  amy                582         
  amy                27
  amy                592
  amy                16
  amy                250
  tom                33
  tom                10
  tom                40
  tom                100

Desired output:

  User              duration

  amy                293.4         
  amy                27
  amy                293.4
  amy                16
  amy                250
  tom                33
  tom                10
  tom                40
  tom                45.75

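Here, any value more than 1 SD from the mean of its user group has been removed and then replaced with that group's mean (computed per unique user name). The mean for the amy group is 293.4 and the mean for the tom group is 45.75.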
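To make the cutoffs concrete, here is a small base R sketch that computes each user's mean and the band of one sample standard deviation around it. The object names (grp_mean, grp_sd, bounds) are only illustrative, and the data frame is rebuilt by hand from the example above rather than taken from the real dataset.

```r
# Rebuild the example data from the question
df <- data.frame(
  User = c(rep("amy", 5), rep("tom", 4)),
  duration = c(582, 27, 592, 16, 250, 33, 10, 40, 100)
)

# Per-user mean and sample standard deviation
grp_mean <- tapply(df$duration, df$User, mean)
grp_sd   <- tapply(df$duration, df$User, sd)

# Lower and upper edge of the "within 1 SD" band for each user
bounds <- data.frame(
  User  = names(grp_mean),
  mean  = as.vector(grp_mean),
  lower = as.vector(grp_mean - grp_sd),
  upper = as.vector(grp_mean + grp_sd)
)
print(bounds)
```

With this data the band comes out to roughly 9.6 to 577.2 for amy and roughly 7.4 to 84.1 for tom, which is why 582, 592 and 100 are the values that get replaced.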

dput:

structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(582L, 
27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)), class = "data.frame", row.names = c(NA, 
-9L))

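This is what I tried, following a suggestion from another member. It works very well for removing the values, but I am not sure how to now replace the removed values with the mean of each group: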

 df %>% 
 group_by(User) %>%
 filter(between(duration, mean(duration) -  1 * sd(duration), 
 mean(duration) +  1 * sd(duration)))

Any suggestions are welcome.

We can use replace:

library(dplyr)
df %>% 
    group_by(User) %>%
    mutate(duration = replace(duration,
        !between(duration, mean(duration) -  1 * sd(duration), 
                 mean(duration) +  1 * sd(duration)), mean(duration)))

# A tibble: 9 x 2
# Groups:   User [2]
#  User  duration
#  <fct>    <dbl>
#1 amy      293. 
#2 amy       27  
#3 amy      293. 
#4 amy       16  
#5 amy      250  
#6 tom       33  
#7 tom       10  
#8 tom       40  
#9 tom       45.8

Or with base R:

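# Flag values more than 1 SD from the mean of x (0/1 so it can be passed through ave())
f1 <- function(x) as.numeric(abs(scale(x)) > 1)
# Apply f1 within each User group (not over the whole column), then swap in the group mean
with(df, ifelse(ave(duration, User, FUN = f1) == 1, ave(duration, User), duration))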