Sample with replacement but constrain the max frequency of each member to be drawn
Is it possible to extend the sample function in R so that it does not return more than, say, 2 of the same element when replace = TRUE?
Suppose I have a list:
l = c(1,1,2,3,4,5)
To sample 3 elements with replacement, I would do:
sample(l, 3, replace = TRUE)
Is there a way to constrain its output so that at most 2 of the same element can be returned? So (1,1,2) or (1,3,3) is allowed, but (1,1,1) or (3,3,3) is excluded?
set.seed(0)
The basic idea is to convert sampling with replacement into sampling without replacement.
ll <- unique(l) ## unique values
#[1] 1 2 3 4 5
pool <- rep.int(ll, 2) ## replicate each unique so they each appear twice
#[1] 1 2 3 4 5 1 2 3 4 5
sample(pool, 3) ## draw 3 samples without replacement
#[1] 4 3 5
## replicate it a few times
## each column is one sample, after `replicate` simplifies the output to a matrix
replicate(5, sample(pool, 3))
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    1    4    2    2    3
#[2,]    4    5    1    2    5
#[3,]    2    1    2    4    1
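The steps above can be wrapped into a small helper. This is only a sketch: the name sample_capped and its cap argument are made up for illustration and are not part of base R. It indexes the pool with sample.int so that a pool of length one is not misread by sample as "1:n".

```r
## Hypothetical helper (not in base R): draw `size` values from the unique
## elements of `x`, with no value appearing more than `cap` times.
sample_capped <- function(x, size, cap = 2L) {
  ux   <- unique(x)
  cap  <- rep_len(cap, length(ux))  # one cap per unique value; a scalar is recycled
  pool <- rep.int(ux, cap)          # each value appears exactly `cap` times
  if (size > length(pool)) stop("`size` exceeds the total allowed by the caps")
  pool[sample.int(length(pool), size)]  # sampling without replacement from the pool
}

set.seed(1)
draws <- replicate(1000, sample_capped(c(1, 1, 2, 3, 4, 5), 3))
## largest count of any single value in any of the 1000 samples
max(apply(draws, 2L, function(col) max(tabulate(col, nbins = 5))))
```

Because every draw comes from a pool that holds each value at most cap times, the cap cannot be exceeded, whatever the seed.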
If you want different values to have different maximum counts, we can do this:
pool <- rep.int(ll, c(2, 3, 3, 4, 1))
#[1] 1 1 2 2 2 3 3 3 4 4 4 4 5
## draw 9 samples; replicate 5 times
oo <- replicate(5, sample(pool, 9))
#      [,1] [,2] [,3] [,4] [,5]
# [1,]    5    1    4    3    2
# [2,]    2    2    4    4    1
# [3,]    4    4    1    1    1
# [4,]    4    2    3    2    5
# [5,]    1    4    2    5    2
# [6,]    3    4    3    3    3
# [7,]    1    4    2    2    2
# [8,]    4    1    4    3    3
# [9,]    3    3    2    2    4
We can call tabulate on each column to count the frequency of 1, 2, 3, 4, 5:
## set `nbins` in `tabulate` so frequency table of each column has the same length
apply(oo, 2L, tabulate, nbins = 5)
#     [,1] [,2] [,3] [,4] [,5]
#[1,]    2    2    1    1    2
#[2,]    1    2    3    3    3
#[3,]    2    1    2    3    2
#[4,]    3    4    3    1    1
#[5,]    1    0    0    1    1
The counts in every column satisfy the frequency caps c(2, 3, 3, 4, 1) that we set.
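Rather than reading the table by eye, the same check can be done programmatically. A minimal sketch, assuming the caps are stored in a vector called caps (a name introduced here for illustration): comparing the frequency matrix with caps recycles the caps down each column, so row i is compared with the cap for value i.

```r
set.seed(0)
ll   <- unique(c(1, 1, 2, 3, 4, 5))
caps <- c(2, 3, 3, 4, 1)              # assumed name for the per-value caps
pool <- rep.int(ll, caps)
oo   <- replicate(5, sample(pool, 9))
freq <- apply(oo, 2L, tabulate, nbins = length(ll))
all(freq <= caps)                     # TRUE by construction
```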
Would you explain the difference between rep and rep.int?
rep.int is not an "integer" method of rep. It is just a faster, simplified function with fewer features than rep. You can find more details on rep, rep.int and rep_len on the documentation page ?rep.
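A few of the differences are easy to see at the console. This is a small illustration based on the behaviour described in ?rep, not an exhaustive comparison; the vector x is made up for the example.

```r
x <- c(a = 1, b = 2)

rep(x, times = 2)      # keeps the names: a b a b
rep.int(x, times = 2)  # same values, but attributes such as names are dropped
rep(x, each = 2)       # `each` is available only in rep
rep_len(x, 5)          # recycles x to exactly length 5, also without names
```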