Randomly take an equal number of elements from two groups: create two sub-dataframes from one dataframe with an equal number of elements

I have a dataset like this:

data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"), Group = paste0("Group",c(1,1,1,1,1,2,2,2,2,2,1,2)))

     ID  Group
1    A1 Group1
2    A6 Group1
3    A3 Group1
4   A55 Group1
5    BC Group1
6    J5 Group2
7    Ca Group2
8   KQF Group2
9    FK Group2
10 AAAA Group2
11 ABBd Group1
12  XXF Group2

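How can I create two sub-dataframes from the data above so that there are no duplicates and each sub-dataframe contains exactly the same number of elements from Group1 and Group2? Combined, the two sub-dataframes should always be identical to the original dataframe.

ID is always unique.

Example result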

subDF1
     ID  Group
1    A1 Group1
4   A55 Group1
11 ABBd Group1
6    J5 Group2
8   KQF Group2
9    FK Group2

subDF2
     ID  Group
2    A6 Group1
3    A3 Group1
5    BC Group1
7    Ca Group2
10 AAAA Group2
12  XXF Group2

You can use sample_n after applying distinct:

df1 %>% 
  distinct %>% 
  group_by(Group) %>% 
  sample_n(2)

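I wrote my version of the solution on the following assumption: you need two sub-dataframes that not only contain the same number of elements from each group, but also consist of completely different rows of the main dataframe: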

# This function returns a list with the two required sub-dataframes
split_df <- function(df, n) {
  # First, check that every group has enough rows for two disjoint samples of size n
  if (any(table(df$Group) < n * 2)) {
    stop("n is too big for the number of elements in some group(s)")
  }
  # Sample n row indices from each group for the first sub-dataframe
  # (indexing with sample.int() avoids sample()'s special case for a single number)
  sub1 <- unlist(tapply(seq_len(nrow(df)), df$Group, function(x) {
    x[sample.int(length(x), n)]
  }))
  # Make a new dataframe without the rows selected in the previous step
  df_2 <- df[-sub1, ]
  # Sample a second time, now from the remaining rows only
  sub2 <- unlist(tapply(seq_len(nrow(df_2)), df_2$Group, function(x) {
    x[sample.int(length(x), n)]
  }))
  # Return the list with the resulting sub-dataframes
  list(
    df[sub1, ],
    df_2[sub2, ]
  )
}
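
For the exact case in the question, where the two pieces together must cover the whole dataframe, n is simply half of each group's size. The following is a minimal base-R sketch of that special case, not a replacement for the function above. The helper name split_in_half and its group_col argument are hypothetical, and the sketch assumes every group has an even number of rows (with an odd group, the extra row ends up in subDF2).

```r
# Hypothetical helper: split df into two disjoint halves, group by group.
split_in_half <- function(df, group_col = "Group") {
  # Row indices of df, collected per group
  idx_by_group <- split(seq_len(nrow(df)), df[[group_col]])
  # For each group, shuffle its row indices and keep the first half
  first <- unlist(lapply(idx_by_group, function(idx) {
    shuffled <- idx[sample.int(length(idx))]
    shuffled[seq_len(length(idx) %/% 2)]
  }), use.names = FALSE)
  # Rows chosen above go to subDF1, all remaining rows go to subDF2
  in_first <- seq_len(nrow(df)) %in% first
  list(subDF1 = df[in_first, ], subDF2 = df[!in_first, ])
}

df <- data.frame(
  ID = c("A1", "A6", "A3", "A55", "BC", "J5", "Ca", "KQF", "FK", "AAAA", "ABBd", "XXF"),
  Group = paste0("Group", c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2)),
  stringsAsFactors = FALSE
)

set.seed(42)  # only to make the random split reproducible
parts <- split_in_half(df)
table(parts$subDF1$Group)
table(parts$subDF2$Group)
```

Because subDF2 is defined as "everything not in subDF1", the two pieces cannot overlap and always add up to the original rows.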

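OK. I believe this is the right approach. It also runs when one group (or even both) has an odd number of elements, although in that case the labels are drawn with replacement, so that group's split is not guaranteed to be balanced.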

library(dplyr)  # provides %>%, filter() and select() used below

x <- data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"), 
            Group = paste0("Group",c(1,1,1,1,1,2,2,2,2,2,1,2)))

x$SubDF <- NA
x[which(x$Group == "Group1"),]$SubDF <- sample(rep(c("SubDF1", "SubDF2"), each = table(x$Group)["Group1"]/2), 
                                               size = length(which(x$Group == "Group1")), replace = ifelse(test = table(x$Group)["Group1"] %% 2 != 0, yes = TRUE, FALSE))
x[which(x$Group == "Group2"),]$SubDF <- sample(rep(c("SubDF1", "SubDF2"), each = table(x$Group)["Group2"]/2), 
                                               size = length(which(x$Group == "Group2")), replace = ifelse(test = table(x$Group)["Group2"] %% 2 != 0, yes = TRUE, FALSE))

subDF1 <- x %>% dplyr::filter(SubDF == "SubDF1") %>% dplyr::select(-SubDF)
subDF2 <- x %>% dplyr::filter(SubDF == "SubDF2") %>% dplyr::select(-SubDF)
> subDF1
    ID  Group
1   A3 Group1
2   BC Group1
3   J5 Group2
4   FK Group2
5 AAAA Group2
6 ABBd Group1

> subDF2
   ID  Group
1  A1 Group1
2  A6 Group1
3 A55 Group1
4  Ca Group2
5 KQF Group2
6 XXF Group2
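
The labelling idea above can also be written without sampling with replacement, which keeps the counts under control when a group has an odd size. This is only a sketch in base R, assuming the same example data: within each group, a label vector of exactly the group's length is built with rep(..., length.out = ) and then shuffled, so the two labels differ in count by at most one per group.

```r
x <- data.frame(
  ID = c("A1", "A6", "A3", "A55", "BC", "J5", "Ca", "KQF", "FK", "AAAA", "ABBd", "XXF"),
  Group = paste0("Group", c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2)),
  stringsAsFactors = FALSE
)

set.seed(1)  # only to make the random labels reproducible

# ave() applies FUN within each Group and returns a vector aligned with the rows of x.
# Inside each group: build labels 1, 2, 1, 2, ... of the group's length, then shuffle them.
x$SubDF <- ave(seq_len(nrow(x)), x$Group, FUN = function(i) {
  sample(rep(1:2, length.out = length(i)))
})

subDF1 <- x[x$SubDF == 1, c("ID", "Group")]
subDF2 <- x[x$SubDF == 2, c("ID", "Group")]

# Cross-tabulate groups against labels to inspect the split
table(x$Group, x$SubDF)
```

Every row receives exactly one label, so the two sub-dataframes are disjoint and together contain all rows of x.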

I am actually not sure whether this is enough, but here it is:

library(dplyr)

df %>% 
 mutate(new = rep(seq(n() / 2), 2)) %>% 
 arrange_at(vars(3:2)) %>% 
 mutate(new1 = rep(seq(2), each = max(new))) %>% 
 split(.$new1)

This gives:

$`1`
   ID  Group new new1
1  A1 Group1   1    1
2  Ca Group2   1    1
3  A6 Group1   2    1
4 KQF Group2   2    1
5  A3 Group1   3    1
6  FK Group2   3    1

$`2`
     ID  Group new new1
7   A55 Group1   4    2
8  AAAA Group2   4    2
9    BC Group1   5    2
10 ABBd Group1   5    2
11   J5 Group2   6    2
12  XXF Group2   6    2
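
Since several approaches are proposed here, it can help to check a candidate split against the requirements in the question. Below is a small base-R sketch. The function name check_split and its arguments are hypothetical, and "balanced" is interpreted here as each group contributing the same number of rows to both sub-dataframes.

```r
# Hypothetical helper: test whether part1 and part2 form a valid split of original.
check_split <- function(original, part1, part2, id_col = "ID", group_col = "Group") {
  ids1 <- as.character(part1[[id_col]])
  ids2 <- as.character(part2[[id_col]])
  all_ids <- as.character(original[[id_col]])
  groups <- unique(as.character(original[[group_col]]))
  # Count rows per group in each part, using the same group levels for both
  counts1 <- table(factor(as.character(part1[[group_col]]), levels = groups))
  counts2 <- table(factor(as.character(part2[[group_col]]), levels = groups))
  c(
    disjoint = length(intersect(ids1, ids2)) == 0,           # no ID appears in both parts
    complete = setequal(c(ids1, ids2), all_ids) &&
      length(ids1) + length(ids2) == length(all_ids),        # together they cover every ID once
    balanced = all(counts1 == counts2)                       # same count per group in both parts
  )
}

df <- data.frame(
  ID = c("A1", "A6", "A3", "A55", "BC", "J5", "Ca", "KQF", "FK", "AAAA", "ABBd", "XXF"),
  Group = paste0("Group", c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2)),
  stringsAsFactors = FALSE
)

# The example result from the question, rebuilt by row number
subDF1 <- df[c(1, 4, 11, 6, 8, 9), ]
subDF2 <- df[c(2, 3, 5, 7, 10, 12), ]

check_split(df, subDF1, subDF2)
```

For the example result in the question, all three checks should come back TRUE. The same call can be applied to the output of any of the answers above.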