Calculate the average for a sequence of strings, then remove anything greater than 2 SD from the average in R

I have a large dataset, df, with over 10,000 rows:

  User              duration

  amy                582         
  amy                27
  amy                592
  amy                16
  amy                250
  tom                33
  tom                10
  tom                40
  tom                100

Desired output:

User               duration

amy                 582
amy                 592
amy                 250
tom                 33
tom                 10
tom                 40

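Essentially, this would remove any 2 SD outliers relative to each unique user's mean. The code would take each unique user, determine that user's mean and standard deviation, and then remove the values that are more than 2 SD above the mean.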

dput:

structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(582L, 
27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)), class = "data.frame", row.names = c(NA, 
-9L))

Here is what I tried:

First, define the average and standard deviation:


      ave = ave(df$duration)
      sd =  sd(df$duration)

Then set some kind of condition for this:

     for i in df {
     remove all if > 2*sd}

I am not sure how to proceed and would appreciate any suggestions.
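One base R possibility that stays close to the attempt above: ave() accepts a grouping variable, so the mean and SD can be computed per user rather than over the whole column. This is only a sketch. The names grp_mean, grp_sd, and kept are illustrative, and it assumes "greater than 2 SD" means more than 2 SD away from the user's mean in either direction.

```r
# Rebuild the example data from the dput() output in the question
df <- data.frame(
  User = factor(c(rep("amy", 5), rep("tom", 4))),
  duration = c(582L, 27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)
)

# ave() returns a vector as long as its input, holding the statistic
# computed within each level of the grouping variable
grp_mean <- ave(df$duration, df$User, FUN = mean)
grp_sd   <- ave(df$duration, df$User, FUN = sd)

# Keep rows that lie within 2 SD of their own user's mean (both sides)
kept <- df[abs(df$duration - grp_mean) <= 2 * grp_sd, ]
kept
```

With the nine sample rows, no value lies more than 2 SD from its user's mean, so every row is kept.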

Here is a data.table approach, which may be faster when there are many rows.

library(data.table)
df <- structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(50000, 
582, 27, 592, 16, 250, 33, 10, 40, 100)), row.names = c(NA, -10L
), class = "data.frame")
df
   User duration
1   amy    50000
2   amy      582
3   amy       27
4   amy      592
5   amy       16
6   amy      250
7   tom       33
8   tom       10
9   tom       40
10  tom      100

Code

setDT(df)[,.SD[duration <= mean(duration) + (2 * sd(duration)) &
               duration >= mean(duration) - (2 * sd(duration)),]
          ,by=User]
   User duration
1:  amy      582
2:  amy       27
3:  amy      592
4:  amy       16
5:  amy      250
6:  tom       33
7:  tom       10
8:  tom       40
9:  tom      100
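To see why only the 50000 row is dropped, the bounds for amy can be recomputed by hand in base R. This is a small check rather than part of the answer above; the names amy, lower, and upper are illustrative.

```r
# amy's durations from the modified example, including the extreme value
amy <- c(50000, 582, 27, 592, 16, 250)

# The same bounds used inside .SD[...]: mean +/- 2 * SD within the group
lower <- mean(amy) - 2 * sd(amy)
upper <- mean(amy) + 2 * sd(amy)
c(lower = lower, upper = upper)

# Only values inside [lower, upper] survive the filter
amy[amy >= lower & amy <= upper]
```

Note that the extreme value itself inflates both the mean and the SD, so a 2 SD rule only catches points that are very far from the rest of the group.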

We can use dplyr; combined with between(), this becomes much more concise:

library(dplyr)
df %>% 
   group_by(User) %>%
   filter(between(duration, mean(duration) - 2 * sd(duration), 
                           mean(duration) + 2 * sd(duration)))

You can use scale() to compute z-scores and keep the rows whose absolute z-score is less than 2:

library(dplyr)

df %>%
  group_by(User) %>%
  filter(abs(scale(duration)) < 2)

# A tibble: 9 x 2
# Groups:   User [2]
  User  duration
  <fct>    <int>
1 amy        582
2 amy         27
3 amy        592
4 amy         16
5 amy        250
6 tom         33
7 tom         10
8 tom         40
9 tom        100
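As a quick check of what scale() does here: with its default arguments it subtracts the mean and divides by the sample standard deviation, so the result is the usual z-score. The sketch below uses amy's values only, and the names x, z_scale, and z_manual are illustrative.

```r
x <- c(582, 27, 592, 16, 250)  # amy's durations from the question

# scale() returns a one-column matrix; as.numeric() drops it to a vector
z_scale  <- as.numeric(scale(x))
z_manual <- (x - mean(x)) / sd(x)

all.equal(z_scale, z_manual)  # the two computations agree
abs(z_manual) < 2             # every value is within 2 SD of the mean
```

Because scale() returns a matrix, wrapping it as as.numeric(scale(duration)) inside filter() may be the safer form if your dplyr version complains about matrix input.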

We can try using the mutate and filter functions in dplyr:

library(dplyr)
df %>% group_by(User) %>% mutate(ave_plus2sd=ave(duration)+2*sd(duration)) %>% 
filter(duration < ave_plus2sd) 

This gives the following output, which lets you compare each entry with the average + 2*sd for its user.

# Groups:   User [2]
  User  duration ave_plus2sd
  <fct>    <int>       <dbl>
1 amy        582        861.
2 amy         27        861.
3 amy        592        861.
4 amy         16        861.
5 amy        250        861.
6 tom         33        122.
7 tom         10        122.
8 tom         40        122.
9 tom        100        122.

We can further add %>% select(User, duration) to keep only the User and duration columns of interest.
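Put together, the full pipeline might look like the sketch below. It rebuilds the question's data so that it runs on its own. The ungroup() call and the name result are additions for illustration, not part of the answer above.

```r
library(dplyr)

# Example data from the dput() output in the question
df <- data.frame(
  User = factor(c(rep("amy", 5), rep("tom", 4))),
  duration = c(582L, 27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)
)

result <- df %>%
  group_by(User) %>%
  # Per-user upper threshold: mean + 2 * SD
  mutate(ave_plus2sd = ave(duration) + 2 * sd(duration)) %>%
  filter(duration < ave_plus2sd) %>%
  ungroup() %>%
  # Drop the helper column again
  select(User, duration)

result
```

This version only trims values above the upper threshold; a lower bound would need a second condition in filter().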