Calculate average for a sequence of strings, then remove anything greater than 2SD of the average in R

Question

I have a large dataset, df, with over 10,000 rows:

  User              duration

  amy                582         
  amy                27
  amy                592
  amy                16
  amy                250
  tom                33
  tom                10
  tom                40
  tom                100

Desired output:

User               duration

amy                 582
amy                 592
amy                 250
tom                 33
tom                 10
tom                 40

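Essentially, this removes any value that is a 2SD outlier relative to each unique user's mean. For each unique user, the code should determine the mean and standard deviation of duration, then remove the values that lie more than 2 SD from that mean.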

dput:

structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(582L, 
27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)), class = "data.frame", row.names = c(NA, 
-9L))

This is what I have tried:

First, define the average and standard deviation:


      ave = ave(df$duration)
      sd =  sd(df$duration)

Then set some kind of condition based on these:

     for i in df {
     remove all if > 2*sd}

I am not sure how to proceed and would appreciate any suggestions.

Answer 1

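Here is a data.table approach, which may be faster when there are many rows. The sample data below adds an extreme value (50000) for amy so that the filter has something to remove.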

library(data.table)
df <- structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(50000, 
582, 27, 592, 16, 250, 33, 10, 40, 100)), row.names = c(NA, -10L
), class = "data.frame")
df
   User duration
1   amy    50000
2   amy      582
3   amy       27
4   amy      592
5   amy       16
6   amy      250
7   tom       33
8   tom       10
9   tom       40
10  tom      100

Code

setDT(df)[,.SD[duration <= mean(duration) + (2 * sd(duration)) &
               duration >= mean(duration) - (2 * sd(duration)),]
          ,by=User]
   User duration
1:  amy      582
2:  amy       27
3:  amy      592
4:  amy       16
5:  amy      250
6:  tom       33
7:  tom       10
8:  tom       40
9:  tom      100
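
Subsetting .SD inside each group can become slow when there are many groups. A common alternative idiom is to collect the matching row numbers with .I and subset once. The following is only a sketch of that variant on the same sample data; the names dt, keep_idx and result are illustrative and not part of the answer above.

```r
library(data.table)

# Same sample data as above, built directly as a data.table
dt <- data.table(
  User = c(rep("amy", 6), rep("tom", 4)),
  duration = c(50000, 582, 27, 592, 16, 250, 33, 10, 40, 100)
)

# Per user, collect the row numbers (.I) of values within 2 SD of that user's mean
keep_idx <- dt[, .I[abs(duration - mean(duration)) <= 2 * sd(duration)],
               by = User]$V1

# Subset the table once using the collected row numbers
result <- dt[keep_idx]
```

As with the .SD version, the 50000 row for amy is the only one dropped.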

Answer 2

We can use dplyr; combined with between, this becomes much more concise.

library(dplyr)
df %>% 
   group_by(User) %>%
   filter(between(duration, mean(duration) - 2 * sd(duration), 
                           mean(duration) + 2 * sd(duration)))
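
No output is shown for this answer, so here is a small self-contained sketch of a 2 SD version of the filter, run on the sample data from Answer 1 (which includes the extreme 50000 value for amy). The object name result and the trailing ungroup() are additions for illustration.

```r
library(dplyr)

# Sample data from Answer 1, including the extreme 50000 value for amy
df <- data.frame(
  User = c(rep("amy", 6), rep("tom", 4)),
  duration = c(50000, 582, 27, 592, 16, 250, 33, 10, 40, 100)
)

result <- df %>%
  group_by(User) %>%
  # Keep rows whose duration lies within mean +/- 2 SD of that user's values
  filter(between(duration,
                 mean(duration) - 2 * sd(duration),
                 mean(duration) + 2 * sd(duration))) %>%
  ungroup()
```

On this data only the 50000 row is removed, leaving nine rows.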

Answer 3

You can use scale() to compute z-scores within each group and keep the rows whose absolute z-score is less than 2:

library(dplyr)

df %>%
  group_by(User) %>%
  filter(abs(scale(duration)) < 2)

# A tibble: 9 x 2
# Groups:   User [2]
  User  duration
  <fct>    <int>
1 amy        582
2 amy         27
3 amy        592
4 amy         16
5 amy        250
6 tom         33
7 tom         10
8 tom         40
9 tom        100
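
scale() returns a one-column matrix rather than a plain vector, and some newer dplyr versions may warn when a matrix is passed to filter(). A hedged variant is to convert the z-score to a numeric vector first. The sketch below does this on the sample data from Answer 1 (with the 50000 value); the column name z and the object name result are illustrative.

```r
library(dplyr)

# Sample data from Answer 1, including the extreme 50000 value for amy
df <- data.frame(
  User = c(rep("amy", 6), rep("tom", 4)),
  duration = c(50000, 582, 27, 592, 16, 250, 33, 10, 40, 100)
)

result <- df %>%
  group_by(User) %>%
  # z = (duration - group mean) / group SD, stored as a plain numeric vector
  mutate(z = as.numeric(scale(duration))) %>%
  filter(abs(z) < 2) %>%
  ungroup()
```

Here the 50000 row has a z-score just above 2 within amy's group and is dropped; the other nine rows are kept.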

Answer 4

We can try using the mutate and filter functions in dplyr:

library(dplyr)
df %>% group_by(User) %>% mutate(ave_plus2sd=ave(duration)+2*sd(duration)) %>% 
filter(duration < ave_plus2sd)

This gives the following output, which allows each entry to be compared with the user's average + 2*sd.

# Groups:   User [2]
  User  duration ave_plus2sd
  <fct>    <int>       <dbl>
1 amy        582        861.
2 amy         27        861.
3 amy        592        861.
4 amy         16        861.
5 amy        250        861.
6 tom         33        122.
7 tom         10        122.
8 tom         40        122.
9 tom        100        122.

We can further append %>% select(User, duration) to keep only the User and duration columns of interest.
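
Inside a grouped mutate, ave(duration) returns the group mean. In base R, ave() can also do the grouping itself, which is close to what the question attempted. The following is a minimal base R sketch under that assumption, filtering on both sides of the mean; grp_mean, grp_sd and result are illustrative names.

```r
# Sample data from the question
df <- data.frame(
  User = c(rep("amy", 5), rep("tom", 4)),
  duration = c(582, 27, 592, 16, 250, 33, 10, 40, 100)
)

# ave() returns, for every row, the statistic computed within that row's User group
grp_mean <- ave(df$duration, df$User, FUN = mean)
grp_sd   <- ave(df$duration, df$User, FUN = sd)

# Keep rows within 2 SD of their own user's mean
result <- df[abs(df$duration - grp_mean) <= 2 * grp_sd, ]
```

On the question's nine rows no value is more than 2 SD from its user's mean, so every row is kept.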
