How to drop observations with values in the top and bottom one percent by year in R

Question

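I am trying to drop observations whose price is in the top or bottom one percent, by year. I have been using dplyr's group_by function to group by year_sold, then mutate() to create a variable to_drop whose value depends on whether price lies between the 1st and 99th percentiles. This is what I have so far: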

df <- df %>%  dplyr::group_by(year_sold) %>%
  mutate(to_drop = ifelse(price <= quantile(price,0.01) | price >= quantile(price,0.99),1,0))

However, I am not getting quantiles of price computed within each year_sold group. Removing dplyr::group_by(year_sold) %>% does not seem to change my results.

I am looking for an alternative to Stata's very useful bysort command. Here is how I would do it in Stata:

gen to_drop = 0
foreach y in year_sold { 
    quietly bysort `y': summarize price, detail // To get r(p1) and r(p99)
    bysort `y': replace to_drop = 1 if ! inrange(price, r(p1), r(p99))
}

Can someone help me figure out why group_by is not working as I expect, or suggest another way to accomplish this in R?

Answer 1

You can get the desired result with the base R functions split and lapply.

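library(magrittr)

# Generate example data: 20 random prices for each of two years
df <- data.frame(year = rep(c(2001, 2002), each = 20), price = runif(40, 40, 100))

# Keep only the rows whose price lies strictly between the 1st and 99th percentiles
filterf <- function(df) {
  q <- quantile(df$price, c(.01, .99))
  df[df$price > q[1] & df$price < q[2], ]
}

# Split by year, trim each year separately, then stack the pieces back together
split(df, df$year) %>% lapply(., FUN = filterf) %>% Reduce(rbind, .)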
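
For readers who would rather stay in dplyr, here is a minimal sketch of the same per-year trimming. It rests on an assumption the question does not confirm: that the unqualified mutate() was being masked by another package (plyr::mutate() is a common culprit, and it ignores dplyr groups), which would explain why removing group_by changed nothing. The example data and the helper columns p01 and p99 are hypothetical; only year_sold, price and to_drop come from the question.

```r
library(dplyr)

# Hypothetical example data; the two years use different price ranges
# so that per-year thresholds differ visibly from global ones.
set.seed(1)
df <- data.frame(
  year_sold = rep(c(2001, 2002), each = 200),
  price     = c(runif(200, 40, 100), runif(200, 400, 1000))
)

trimmed <- df %>%
  dplyr::group_by(year_sold) %>%
  # Namespace-qualify the verbs so a masking package cannot replace them
  # (assumed cause of the problem, not confirmed by the question).
  dplyr::mutate(
    p01     = quantile(price, 0.01, names = FALSE),  # per-year 1st percentile
    p99     = quantile(price, 0.99, names = FALSE),  # per-year 99th percentile
    to_drop = as.integer(price <= p01 | price >= p99)
  ) %>%
  dplyr::filter(to_drop == 0) %>%
  dplyr::ungroup()
```

Running conflicts() or typing mutate at the console shows which package's mutate() is currently in use, which is a quick way to check whether the masking assumption applies to your session.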

r

stata

dplyr