Random sample with conditions in R with dbplyr

Question

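I have a data frame from which I need to select 5 individuals who each have at least 3 rows. My idea is to take a random sample of the ids that have 3 or more rows. The data looks like this: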

head(df)

   id year sex education no_kids health_org satisf_org health_std  satisf_std
1 312 2004   1        NA       1          4          7  0.5670103 -0.09090909
2 399 2000   0        12       1          4          8  0.5670103  0.47727272
3 399 2001   0        12       1          4          9  0.5670103  1.04545450
4 457 2000   0        18       0          3          8 -0.4639175  0.47727272
5 457 2002   0        18       0          3          7 -0.4639175 -0.09090909
6 457 2004   0        18       0          2          4 -1.4948454 -1.79545450

I have already written a way to do this, but the result is not good enough. It looks like this:

library(dplyr) 
sample_n_groups = function(grouped_df, size, replace = FALSE, weight=NULL) {
  grp_var <- grouped_df %>% 
    groups %>%
    unlist %>% 
    as.character
  random_grp <- grouped_df %>% 
    summarise() %>% 
    sample_n(size, replace, weight) %>% 
    mutate(unique_id = 1:NROW(.))
  grouped_df %>% 
    right_join(random_grp, by=grp_var) %>% 
    group_by_(grp_var) 
}
df_sample <- df %>%  filter(n() >= 3) %>% group_by(id) %>% sample_n_groups(5)

Does anyone have a different approach?

Answer 1

You can try this:

library(dplyr)

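df_sample <- df %>%
  group_by(id) %>%
  filter(n() >= 3) %>%    # keep only ids with at least 3 rows
  ungroup() %>%           # drop grouping so sampling is across ids, not within each id
  distinct(id) %>%
  slice_sample(n = 5) %>% # for dplyr versions before 1.0.0, use sample_n(5)
  left_join(df, by = "id")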

Here we first keep only the rows of ids that have at least 3 rows, take the distinct ids from them, and sample 5 of those ids. Using left_join, we then select all rows for those 5 ids.
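
To check the idea on something small, here is a minimal sketch on made-up data. The `toy` data frame and the names `eligible`, `picked` and `toy_sample` are hypothetical and only for illustration. It uses a slightly different route from the pipeline above, `count()` plus `semi_join()`, which should give the same kind of result: all rows for 5 randomly chosen ids that each have at least 3 rows.

```r
library(dplyr)

set.seed(42)  # only so the illustration is repeatable

# Hypothetical toy data: ids 1-4 have one or two rows, ids 5-12 have three or more
toy <- tibble(
  id   = rep(1:12, times = c(1, 2, 1, 2, 3, 4, 3, 5, 3, 4, 3, 3)),
  year = 2000 + seq_along(id)
)

# One row per id with its row count, keeping only ids with at least 3 rows
eligible <- toy %>%
  count(id) %>%
  filter(n >= 3)

# Draw 5 of the eligible ids without replacement
picked <- eligible %>%
  slice_sample(n = 5) %>%
  select(id)

# Keep every row of the original data that belongs to a picked id
toy_sample <- toy %>%
  semi_join(picked, by = "id")

# Each sampled id should appear with 3 or more rows
toy_sample %>% count(id)
```

Because `count()` returns an ungrouped data frame, `slice_sample(n = 5)` samples across ids rather than within each id, and `semi_join()` filters the original rows without adding any columns.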
