Calculate Average for Sequence of Strings, Remove anything 1SD or greater, then REPLACE value that is removed with the Mean

Question

I have a large dataset, df, with more than 10,000 rows:

  User              duration

  amy                582         
  amy                27
  amy                592
  amy                16
  amy                250
  tom                33
  tom                10
  tom                40
  tom                100

Desired output:

  User              duration

  amy                293.4         
  amy                27
  amy                293.4
  amy                16
  amy                250
  tom                33
  tom                10
  tom                40
  tom                45.75

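Here, any value that lies more than 1 SD from the mean of its user group is removed and then replaced with that group's mean (computed per unique user name). The mean for the amy group is 293.4, and the mean for the tom group is 45.75.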

dput:

structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(582L, 
27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)), class = "data.frame", row.names = c(NA, 
-9L))

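Here is what I tried, following a suggestion from another member. It works very well for removing the values, but I am not sure how to replace the removed values with the mean of each group: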

library(dplyr)

df %>%
  group_by(User) %>%
  filter(between(duration,
                 mean(duration) - 1 * sd(duration),
                 mean(duration) + 1 * sd(duration)))

Any suggestions are welcome.

Answer 1

We can use replace() inside a grouped mutate(), so that mean() and sd() are computed separately for each User:

library(dplyr)
df %>% 
    group_by(User) %>%
    mutate(duration = replace(duration,
        !between(duration, mean(duration) -  1 * sd(duration), 
                 mean(duration) +  1 * sd(duration)), mean(duration)))

# A tibble: 9 x 2
# Groups:   User [2]
#  User  duration
#  <fct>    <dbl>
#1 amy      293. 
#2 amy       27  
#3 amy      293. 
#4 amy       16  
#5 amy      250  
#6 tom       33  
#7 tom       10  
#8 tom       40  
#9 tom       45.8
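
To see why exactly these three rows change, it can help to print the per-group bounds next to the data. The following is only a sketch in base R, not part of the original answer; the names `dur`, `grp_mean`, `grp_sd`, and `check` are chosen here for illustration, and the data frame is rebuilt from the question's dput output.

```r
# Rebuild the example data from the question's dput output
df <- data.frame(
  User = factor(c(rep("amy", 5), rep("tom", 4))),
  duration = c(582L, 27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)
)

# Work on a double copy so ave() can store non-integer results
dur <- as.numeric(df$duration)

# Group mean and sample SD, repeated on every row of the group
grp_mean <- ave(dur, df$User, FUN = mean)
grp_sd   <- ave(dur, df$User, FUN = sd)

# Attach the 1-SD bounds and flag values outside the closed interval,
# which is the complement of dplyr::between()
check <- data.frame(df,
                    lower = grp_mean - grp_sd,
                    upper = grp_mean + grp_sd)
check$outside <- dur < check$lower | dur > check$upper

print(check)
```

With the example data, the flagged rows are 582 and 592 for amy and 100 for tom, which are the rows replaced in the output above.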

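Or with base R, computing the z-scores within each User group via ave() so that the result matches the dplyr output:

# flag values more than 1 SD from the mean of x (returns 1/0)
f1 <- function(x) as.numeric(abs(scale(x)) > 1)
# apply f1 per User; a plain f1(duration) would scale against the overall mean and SD
with(df, ifelse(ave(duration, User, FUN = f1) == 1, ave(duration, User), duration))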
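
If the rule is needed in several places, it could also be wrapped in a small function and applied per group with ave(). This is a minimal sketch under some assumptions that are not in the original answer: the helper name `replace_outliers` and its `k` argument are hypothetical, missing values are ignored when computing the mean and SD, and groups with fewer than two non-missing values are left unchanged because sd() returns NA for them.

```r
# Hypothetical helper (not from the original answer): replace values lying
# more than k sample SDs from the mean of x with that mean.
replace_outliers <- function(x, k = 1) {
  x <- as.numeric(x)
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  # sd() is NA with fewer than two non-missing values; leave x untouched then
  if (is.na(s)) return(x)
  # Strictly outside [m - k*s, m + k*s], i.e. the complement of between()
  out <- !is.na(x) & abs(x - m) > k * s
  replace(x, out, m)
}

# Rebuild the example data from the question's dput output
df <- data.frame(
  User = factor(c(rep("amy", 5), rep("tom", 4))),
  duration = c(582L, 27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)
)

# ave() runs the helper within each User group and preserves row order
df$duration <- ave(as.numeric(df$duration), df$User, FUN = replace_outliers)
df
```

Because ave() passes each group's values to the helper separately, the mean and SD are always those of the group, not of the whole column.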


r

lubridate

dplyr