Randomly take equal number of elements from two groups -- create two sub-dataframes from one dataframe with equal number of elements
I have a dataset like this:
data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"), Group = paste0("Group",c(1,1,1,1,1,2,2,2,2,2,1,2)))
ID Group
1 A1 Group1
2 A6 Group1
3 A3 Group1
4 A55 Group1
5 BC Group1
6 J5 Group2
7 Ca Group2
8 KQF Group2
9 FK Group2
10 AAAA Group2
11 ABBd Group1
12 XXF Group2
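How can I create two sub-dataframes from the above data so that there is no repetition and each sub-dataframe contains exactly the same number of elements from Group1 and Group2? The two sub-dataframes combined are always identical to the original dataframe.
ID is always unique.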
Example result:
subDF1
ID Group
1 A1 Group1
4 A55 Group1
11 ABBd Group1
6 J5 Group2
8 KQF Group2
9 FK Group2
subDF2
ID Group
2 A6 Group1
3 A3 Group1
5 BC Group1
7 Ca Group2
10 AAAA Group2
12 XXF Group2
- subDF1 and subDF2 contain an equal number of elements
- the proportion of elements from Group1 and Group2 is equal
- elements in subDF1 must not appear in subDF2, and vice versa
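The three requirements above can be turned into a small check that any proposed split can be run through. This is only a sketch: check_split is a hypothetical helper name (it does not come from the question), it assumes the columns are called ID and Group as in the example, and it reads "equal proportion" as "the same number of rows from every group in both parts".

```r
# Hypothetical helper: TRUE when sub1 and sub2 satisfy the requirements
check_split <- function(df, sub1, sub2) {
  # Same number of rows in both parts
  same_size <- nrow(sub1) == nrow(sub2)
  # Same number of rows from every group in both parts
  groups <- unique(df$Group)
  same_props <- all(table(factor(sub1$Group, groups)) ==
                    table(factor(sub2$Group, groups)))
  # No ID shared between the parts, and together they cover the original IDs
  disjoint <- length(intersect(sub1$ID, sub2$ID)) == 0
  complete <- setequal(c(sub1$ID, sub2$ID), df$ID)
  same_size && same_props && disjoint && complete
}

df <- data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"),
                 Group = paste0("Group", c(1,1,1,1,1,2,2,2,2,2,1,2)),
                 stringsAsFactors = FALSE)

# The example result from the question, expressed as row positions
subDF1 <- df[c(1, 4, 11, 6, 8, 9), ]
subDF2 <- df[c(2, 3, 5, 7, 10, 12), ]
check_split(df, subDF1, subDF2)
```

Splitting the data by position instead (rows 1 to 6 and rows 7 to 12) fails the check, because the groups are then unevenly represented.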
After applying distinct, we can use sample_n:
library(dplyr)
# df1 is the data frame from the question
df1 %>%
  distinct %>%
  group_by(Group) %>%
  sample_n(2)
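Sampling once per group yields only one subset. One way to get the second, non-overlapping sub-dataframe is to take the rows that were not sampled. The following is a minimal base R sketch of that idea rather than part of the answer above; it assumes the data frame is called df, has the columns ID and Group, and that every group has an even number of rows.

```r
set.seed(42)  # only to make the illustration reproducible

df <- data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"),
                 Group = paste0("Group", c(1,1,1,1,1,2,2,2,2,2,1,2)),
                 stringsAsFactors = FALSE)

# Row positions belonging to each group
rows_by_group <- split(seq_len(nrow(df)), df$Group)

# Randomly pick half of the positions within each group
picked <- unlist(lapply(rows_by_group, function(i) {
  i[sample.int(length(i), length(i) %/% 2)]
}), use.names = FALSE)

subDF1 <- df[picked, ]
# Everything that was not picked forms the second sub-dataframe
subDF2 <- df[-picked, ]
```

Because subDF2 is defined as the complement of subDF1, the two parts cannot share an ID and always add up to the original data.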
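So I made my version of the solution based on the following assumption: you need two sub-dataframes that not only have the same number of elements from each group, but also consist of completely different rows of the main dataframe: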
# This function returns a list with the two required sub-dataframes
split_df <- function(df, n){
  # Each group must hold at least 2 * n rows: n for each sub-dataframe
  if (any(table(df$Group) < n * 2)){
    stop("n is too big for the number of elements in some group(s)")
  }
  # Draw n positions from a vector of row positions; indexing through
  # sample.int() avoids sample()'s special case for a single number
  pick <- function(x) x[sample.int(length(x), n)]
  # Sample n rows from each group for the first sub-dataframe
  sub1 <- unlist(tapply(seq_len(nrow(df)), df$Group, pick))
  # Make a new dataframe without the rows taken in the previous step
  df_2 <- df[-sub1, ]
  # Sample n rows from each group of what is left
  sub2 <- unlist(tapply(seq_len(nrow(df_2)), df_2$Group, pick))
  # Return the list with the resulting sub-dataframes
  list(df[sub1, ], df_2[sub2, ])
}
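One detail worth keeping in mind when sampling row positions per group, as the function above does: if a group is ever reduced to a single remaining row, sample() receives one number k and draws from 1:k rather than returning k itself. The short sketch below illustrates this documented behaviour of base R; the variable names are made up for the illustration.

```r
set.seed(1)  # only to make the illustration reproducible

remaining <- 10L  # a single row position left in some group

# sample() sees one number and draws from 1:10 instead of returning 10
risky <- replicate(200, sample(remaining, 1))

# Indexing through sample.int() always returns an element of the vector itself
safe <- replicate(200, remaining[sample.int(length(remaining), 1)])

range(risky)
unique(safe)  # 10
```

Indexing with sample.int(length(x), n) behaves the same for vectors of any length, which is why it is the safer pattern inside helper functions.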
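OK. I believe this is the right approach. It also runs when one group (or even both) has an odd number of elements, although in that case the labels are drawn with replacement, so that group is not guaranteed to be divided as evenly as possible.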
library(dplyr)

x <- data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"),
                Group = paste0("Group",c(1,1,1,1,1,2,2,2,2,2,1,2)))
x$SubDF <- NA
# Within each group, shuffle an equal number of "SubDF1" and "SubDF2" labels
# (labels are drawn with replacement only when the group size is odd)
x[which(x$Group == "Group1"),]$SubDF <- sample(rep(c("SubDF1", "SubDF2"), each = table(x$Group)["Group1"]/2),
    size = length(which(x$Group == "Group1")), replace = ifelse(test = table(x$Group)["Group1"] %% 2 != 0, yes = TRUE, FALSE))
x[which(x$Group == "Group2"),]$SubDF <- sample(rep(c("SubDF1", "SubDF2"), each = table(x$Group)["Group2"]/2),
    size = length(which(x$Group == "Group2")), replace = ifelse(test = table(x$Group)["Group2"] %% 2 != 0, yes = TRUE, FALSE))
# Split on the label and drop the helper column
subDF1 <- x %>% dplyr::filter(SubDF == "SubDF1") %>% dplyr::select(-SubDF)
subDF2 <- x %>% dplyr::filter(SubDF == "SubDF2") %>% dplyr::select(-SubDF)
> subDF1
ID Group
1 A3 Group1
2 BC Group1
3 J5 Group2
4 FK Group2
5 AAAA Group2
6 ABBd Group1
> subDF2
ID Group
1 A1 Group1
2 A6 Group1
3 A55 Group1
4 Ca Group2
5 KQF Group2
6 XXF Group2
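When a group has an odd number of rows, drawing the labels with replacement, as the code above does in that case, no longer guarantees a split that is as even as possible. One possible alternative, shown here only as a sketch, is to build the alternating labels 1, 2, 1, 2, ... for each group and shuffle them without replacement. The data below is made up for the illustration (one group of 5 rows and one of 6), and the column name SubDF is reused from the answer.

```r
set.seed(1)  # only to make the illustration reproducible

# Made-up data: Group1 has an odd number of rows
x <- data.frame(ID = paste0("id", 1:11),
                Group = rep(c("Group1", "Group2"), times = c(5, 6)),
                stringsAsFactors = FALSE)

# Within each group, shuffle the labels 1, 2, 1, 2, ... so that the two
# label counts differ by at most one
x$SubDF <- ave(seq_len(nrow(x)), x$Group,
               FUN = function(i) sample(rep_len(1:2, length(i))))

# List of two data frames, one per label, without the helper column
parts <- split(x[c("ID", "Group")], x$SubDF)

table(x$Group, x$SubDF)
```

With this construction the extra row of an odd-sized group always goes to the first part, so the two parts can differ in size by at most one row per group.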
I'm actually not sure whether this is sufficient, but here it is:
library(dplyr)
df %>%
  mutate(new = rep(seq(n() / 2), 2)) %>%
  arrange_at(vars(3:2)) %>%
  mutate(new1 = rep(seq(2), each = max(new))) %>%
  split(.$new1)
which gives,
$`1`
ID Group new new1
1 A1 Group1 1 1
2 Ca Group2 1 1
3 A6 Group1 2 1
4 KQF Group2 2 1
5 A3 Group1 3 1
6 FK Group2 3 1
$`2`
ID Group new new1
7 A55 Group1 4 2
8 AAAA Group2 4 2
9 BC Group1 5 2
10 ABBd Group1 5 2
11 J5 Group2 6 2
12 XXF Group2 6 2