计算一系列字符串的平均值,然后删除 R 中大于平均值 2SD 的任何内容
Calculate average for a sequence of strings, then remove anything greater than 2SD of the average in R
我有一个超过 10,000 行的大型数据集:df:
User duration
amy 582
amy 27
amy 592
amy 16
amy 250
tom 33
tom 10
tom 40
tom 100
期望输出:
User duration
amy 582
amy 592
amy 250
tom 33
tom 10
tom 40
本质上,这将从每个唯一用户均值中删除任何 2SD 异常值。
该代码将采用每个唯一用户的平均值,确定其平均值和标准差,然后删除大于平均值 2SD 的值。
dput:
structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(582L,
27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)), class = "data.frame", row.names = c(NA,
-9L))
这是我试过的:
first define average and standard deviation
ave = ave(df$duration)
sd = sd(df$duration)
然后为此设置某种参数:
for i in df {
remove all if > 2*sd}
我不确定,希望得到一些建议。
这里有一个 data.table 方法,对于许多行来说可能更快。
library(data.table)
df <- structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(50000,
582, 27, 592, 16, 250, 33, 10, 40, 100)), row.names = c(NA, -10L
), class = "data.frame")
df
User duration
1 amy 50000
2 amy 582
3 amy 27
4 amy 592
5 amy 16
6 amy 250
7 tom 33
8 tom 10
9 tom 40
10 tom 100
代码
setDT(df)[,.SD[duration <= mean(duration) + (2 * sd(duration)) &
duration >= mean(duration) - (2 * sd(duration)),]
,by=User]
User duration
1: amy 582
2: amy 27
3: amy 592
4: amy 16
5: amy 250
6: tom 33
7: tom 10
8: tom 40
9: tom 100
我们可以用dplyr
,和between
一起使用会简洁很多
library(dplyr)
df %>%
group_by(User) %>%
filter(between(duration, mean(duration) - sd(duration),
mean(duration) + sd(duration)))
您可以使用 scale()
找到 z 分数并保持绝对值小于 2:
library(dplyr)
df %>%
group_by(User) %>%
filter(abs(scale(duration)) < 2)
# A tibble: 9 x 2
# Groups: User [2]
User duration
<fct> <int>
1 amy 582
2 amy 27
3 amy 592
4 amy 16
5 amy 250
6 tom 33
7 tom 10
8 tom 40
9 tom 100
我们可以尝试在dplyr
中使用mutate
和filter
函数
library(dplyr)
df %>% group_by(User) %>% mutate(ave_plus2sd=ave(duration)+2*sd(duration)) %>%
filter(duration < ave_plus2sd)
This will give you the following output which allows comparison of each entry with average + 2*sd for the user.
# Groups: User [2]
User duration ave_plus2sd
<fct> <int> <dbl>
1 amy 582 861.
2 amy 27 861.
3 amy 592 861.
4 amy 16 861.
5 amy 250 861.
6 tom 33 122.
7 tom 10 122.
8 tom 40 122.
9 tom 100 122.
我们可以进一步将 %>% select (User,duration)
添加到 select 感兴趣的用户和持续时间列。
我有一个超过 10,000 行的大型数据集:df:
User duration
amy 582
amy 27
amy 592
amy 16
amy 250
tom 33
tom 10
tom 40
tom 100
期望输出:
User duration
amy 582
amy 592
amy 250
tom 33
tom 10
tom 40
本质上,这将从每个唯一用户均值中删除任何 2SD 异常值。 该代码将采用每个唯一用户的平均值,确定其平均值和标准差,然后删除大于平均值 2SD 的值。
dput:
structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(582L,
27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)), class = "data.frame", row.names = c(NA,
-9L))
这是我试过的:
first define average and standard deviation
ave = ave(df$duration)
sd = sd(df$duration)
然后为此设置某种参数:
for i in df {
remove all if > 2*sd}
我不确定,希望得到一些建议。
这里有一个 data.table 方法,对于许多行来说可能更快。
library(data.table)
df <- structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(50000,
582, 27, 592, 16, 250, 33, 10, 40, 100)), row.names = c(NA, -10L
), class = "data.frame")
df
User duration
1 amy 50000
2 amy 582
3 amy 27
4 amy 592
5 amy 16
6 amy 250
7 tom 33
8 tom 10
9 tom 40
10 tom 100
代码
setDT(df)[,.SD[duration <= mean(duration) + (2 * sd(duration)) &
duration >= mean(duration) - (2 * sd(duration)),]
,by=User]
User duration
1: amy 582
2: amy 27
3: amy 592
4: amy 16
5: amy 250
6: tom 33
7: tom 10
8: tom 40
9: tom 100
我们可以用dplyr
,和between
library(dplyr)
df %>%
group_by(User) %>%
filter(between(duration, mean(duration) - sd(duration),
mean(duration) + sd(duration)))
您可以使用 scale()
找到 z 分数并保持绝对值小于 2:
library(dplyr)
df %>%
group_by(User) %>%
filter(abs(scale(duration)) < 2)
# A tibble: 9 x 2
# Groups: User [2]
User duration
<fct> <int>
1 amy 582
2 amy 27
3 amy 592
4 amy 16
5 amy 250
6 tom 33
7 tom 10
8 tom 40
9 tom 100
我们可以尝试在dplyr
mutate
和filter
函数
library(dplyr)
df %>% group_by(User) %>% mutate(ave_plus2sd=ave(duration)+2*sd(duration)) %>%
filter(duration < ave_plus2sd)
This will give you the following output which allows comparison of each entry with average + 2*sd for the user.
# Groups: User [2]
User duration ave_plus2sd
<fct> <int> <dbl>
1 amy 582 861.
2 amy 27 861.
3 amy 592 861.
4 amy 16 861.
5 amy 250 861.
6 tom 33 122.
7 tom 10 122.
8 tom 40 122.
9 tom 100 122.
我们可以进一步将 %>% select (User,duration)
添加到 select 感兴趣的用户和持续时间列。