Calculate the average for each group of strings, remove anything 1 SD or more from it, then replace the removed values with the mean
I have a large dataset, df, with over 10,000 rows:
User duration
amy 582
amy 27
amy 592
amy 16
amy 250
tom 33
tom 10
tom 40
tom 100
Desired output:
User duration
amy 293.4
amy 27
amy 293.4
amy 16
amy 250
tom 33
tom 10
tom 40
tom 45.75
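Here, any value more than 1 SD away from the mean of its user group is removed and then replaced with that group's mean (one mean per unique user name).
The mean for the amy group is 293.4.
The mean for the tom group is 45.75.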
dput:
structure(list(User = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("amy", "tom"), class = "factor"), duration = c(582L,
27L, 592L, 16L, 250L, 33L, 10L, 40L, 100L)), class = "data.frame", row.names = c(NA,
-9L))
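This is what I tried, following a suggestion from one of the members. It works very well for removing the values, but I am not sure how to replace the removed values with each group's mean:
library(dplyr)
df %>%
  group_by(User) %>%
  filter(between(duration, mean(duration) - 1 * sd(duration),
                 mean(duration) + 1 * sd(duration)))
Any suggestions are welcome.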
We can use replace:
library(dplyr)
df %>%
  group_by(User) %>%
  # replace values outside mean +/- 1 SD (per User) with the group mean
  mutate(duration = replace(duration,
                            !between(duration, mean(duration) - 1 * sd(duration),
                                     mean(duration) + 1 * sd(duration)),
                            mean(duration)))
# A tibble: 9 x 2
# Groups: User [2]
# User duration
# <fct> <dbl>
#1 amy 293.
#2 amy 27
#3 amy 293.
#4 amy 16
#5 amy 250
#6 tom 33
#7 tom 10
#8 tom 40
#9 tom 45.8
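To see why exactly those three rows are replaced, it can help to print each group's interval. This is only a sketch on the sample data: the column names grp_mean, grp_sd, lower and upper and the object names bounds and flagged are made up for illustration, and df is rebuilt with User as a plain character column rather than a factor.

```r
library(dplyr)

# Sample data from the question (User kept as character for simplicity)
df <- data.frame(
  User = c(rep("amy", 5), rep("tom", 4)),
  duration = c(582, 27, 592, 16, 250, 33, 10, 40, 100)
)

# Per-user mean, SD, and the mean +/- 1 SD interval
bounds <- df %>%
  group_by(User) %>%
  summarise(
    grp_mean = mean(duration),
    grp_sd   = sd(duration),
    lower    = grp_mean - grp_sd,
    upper    = grp_mean + grp_sd
  )
bounds

# Rows that fall outside their own group's interval
flagged <- df %>%
  group_by(User) %>%
  filter(!between(duration,
                  mean(duration) - sd(duration),
                  mean(duration) + sd(duration))) %>%
  ungroup()
flagged
```

On this data the interval is roughly 9.6 to 577.2 for amy and 7.4 to 84.1 for tom, so 582, 592 and 100 are the only values outside it. The replacement mean is computed from the whole group, outliers included, which is what the desired output in the question uses.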
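Or with base R, flagging outliers within each User group:
# returns 1 where a value is more than 1 SD from the mean of x, otherwise 0
f1 <- function(x) as.numeric(abs(scale(x)) > 1)
# ave() applies f1 per User; ave(duration, User) gives the per-User mean
with(df, ifelse(ave(duration, User, FUN = f1) == 1, ave(duration, User), duration))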
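If the threshold may change later, the same rule could be wrapped in a small function and applied per group with ave(). This is a sketch, not part of the original answer: replace_outliers and its argument k are hypothetical names, and leaving single-value groups untouched (where sd() is NA) is my assumption about the desired behaviour.

```r
# Hypothetical helper: replace values more than k SDs from the mean with the mean
replace_outliers <- function(x, k = 1) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  # A group with a single value has sd() == NA; leave such groups unchanged
  if (is.na(s)) return(as.numeric(x))
  out <- !is.na(x) & abs(x - m) > k * s
  replace(as.numeric(x), out, m)
}

# Sample data from the question
df <- data.frame(
  User = c(rep("amy", 5), rep("tom", 4)),
  duration = c(582, 27, 592, 16, 250, 33, 10, 40, 100)
)

# ave() splits duration by User, applies the helper, and reassembles the vector
df$duration <- ave(df$duration, df$User, FUN = replace_outliers)
df
```

With the default k = 1 this gives the desired output from the question (293.4 for the two large amy values and 45.75 for tom's 100). A different threshold can be passed through an anonymous function, for example FUN = function(x) replace_outliers(x, k = 2).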