r dplyr sample_frac using seed in data

Question

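I have a grouped data frame in which the grouping variable is SEED. I want to take the groups defined by the values of SEED, set the seed for each group to that group's value of SEED, and then shuffle the rows of each group using dplyr::sample_frac. However, I cannot replicate my results, which suggests the seed is not being set correctly.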

To do this in a dplyr-ish way, I wrote the following function:

> library(dplyr)
> ss_sampleseed <- function(df, seed.){
>   set.seed(df$seed.)
>   sample_frac(df, 1)
> }

I then use this function on my data:

> dg <- structure(list(Gene = c("CAMK1", "ARPC4", "CIDEC", "CAMK1", "ARPC4", 
> "CIDEC"), GENESEED = c(1, 1, 1, 2, 2, 2)), class = c("tbl_df", 
> "tbl", "data.frame"), row.names = c(NA, -6L), .Names = c("Gene", 
> "GENESEED"))

> dg2 <- dg %>%
>   group_by(GENESEED) %>%
>   ss_sampleseed(GENESEED)

> dg2
Source: local data frame [6 x 2]
Groups: GENESEED

   Gene GENESEED
1 ARPC4        1
2 CIDEC        1
3 CAMK1        1
4 CIDEC        2
5 ARPC4        2
6 CAMK1        2

However, when I repeat the code above, I cannot replicate my results.

> dg2
Source: local data frame [6 x 2]
Groups: GENESEED

   Gene GENESEED
1 ARPC4        1
2 CAMK1        1
3 CIDEC        1
4 CAMK1        2
5 ARPC4        2
6 CIDEC        2

Answer 1

The problem here is that the dollar sign operator does not substitute the argument you pass in. See this minimal example:

df <- data.frame(x = "x", GENESEED = "GENESEED")
h <- function(df,x){
  df$x
}
h(df, GENESEED)
[1] x
Levels: x

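Notice that h returns x even though you asked for GENESEED. So your function is actually trying to get df$seed., which does not exist, and therefore returns NULL. Calling set.seed(NULL) re-initializes the random number generator rather than fixing it, which is why the shuffles differ between runs.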
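A minimal base R sketch of the point above: `$` ignores the value of the function argument and looks for a column with that literal name. The function name `get_seed` and the toy data frame are made up for illustration.

```r
# Toy data frame mirroring the question's columns (illustrative only).
df <- data.frame(Gene = c("CAMK1", "ARPC4", "CIDEC"),
                 GENESEED = c(1, 1, 1))

# `$` looks for a column literally named "seed."; the argument is never used.
get_seed <- function(df, seed.) {
  df$seed.
}

seed_value <- get_seed(df, GENESEED)
print(is.null(seed_value))  # TRUE: no column is called "seed."

# set.seed(NULL) re-initializes the generator instead of fixing it,
# so the two shuffles below are not tied to any value stored in the data.
set.seed(seed_value)
first_draw <- sample(df$Gene)
set.seed(seed_value)
second_draw <- sample(df$Gene)
print(first_draw)
print(second_draw)  # may or may not match first_draw
```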

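But there is another problem. Even if you correct this and pass the seed directly, it still does not seem to work as you want, because, if you look at the code of sample_frac, dplyr eventually runs the following lines: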

sampled <- lapply(index, sample_group, frac = TRUE, tbl = tbl, 
        size = size, replace = replace, weight = weight, .env = .env)

Note that it runs a single lapply after you have set the seed, so you would not get a different seed defined for each group, as you wanted.
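The difference between seeding once before an lapply and seeding inside it can be sketched in base R. This does not reproduce dplyr's internals; the names `groups`, `seeds`, `seeded_once`, and `seeded_per_group` are made up for illustration.

```r
# Two toy groups and the seed we would like each one to use.
groups <- list(g1 = c("CAMK1", "ARPC4", "CIDEC"),
               g2 = c("CAMK1", "ARPC4", "CIDEC"))
seeds <- c(g1 = 1, g2 = 2)

# Seed set once: the second group just continues the stream started by seed 1.
set.seed(1)
seeded_once <- lapply(groups, sample)

# Seed set inside the loop: each group starts from its own seed.
seeded_per_group <- lapply(names(groups), function(g) {
  set.seed(seeds[[g]])
  sample(groups[[g]])
})
names(seeded_per_group) <- names(groups)

# Reference shuffle for group g2 under its own seed.
set.seed(2)
expected_g2 <- sample(groups$g2)
print(identical(seeded_per_group$g2, expected_g2))  # TRUE
```

Only the per-group version guarantees that a group's shuffle depends on its own seed and not on how many random numbers earlier groups consumed.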

With that in mind, I came up with this solution, using sample.int and do:

ss_sampleseed <- function(x){
  # x holds the rows of one group, so GENESEED has a single unique value
  set.seed(unique(x$GENESEED))
  x[sample.int(nrow(x)), ]
}

dg %>%
  group_by(GENESEED) %>%
  do(ss_sampleseed(.))

This seems to do what you want.
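To check reproducibility without depending on dplyr, the same per-group idea can be sketched with base R's split and lapply. This is an illustrative equivalent, not the dplyr pipeline itself; `shuffle_group` and `shuffle_all` are hypothetical helper names.

```r
# Data with the same shape as the question's dg (plain data.frame here).
dg <- data.frame(
  Gene = rep(c("CAMK1", "ARPC4", "CIDEC"), times = 2),
  GENESEED = rep(c(1, 2), each = 3),
  stringsAsFactors = FALSE
)

# Shuffle one group's rows after seeding with that group's GENESEED value.
shuffle_group <- function(x) {
  set.seed(unique(x$GENESEED))
  x[sample.int(nrow(x)), ]
}

# Split by GENESEED, shuffle each piece, and stack the pieces again.
shuffle_all <- function(df) {
  pieces <- lapply(split(df, df$GENESEED), shuffle_group)
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

run1 <- shuffle_all(dg)
run2 <- shuffle_all(dg)
print(identical(run1, run2))  # TRUE
```

Note that this leaves the global random number generator seeded with the last group's value, which may matter if more random draws follow.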

Answer 2

I think the main issue here is coding with $ inside a function, as you did. I certainly had to learn this the hard way. See also:

library(fortunes)
fortune(312)
fortune(343)

Take the simple function from @Carlos Cinelli and try using it outside of any dplyr function.

h = function(df, seed.){
    df$seed.
}

h(dg, GENESEED)
NULL

It's those darn dollar signs. Now change the function to use [[ instead.

h2 = function(df, seed.){
    df[[seed.]]
}

h2(dg, "GENESEED")
[1] 1 1 1 2 2 2

That's more like it, although you do have to put quotes around the variable name when calling the function.

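So where does that leave your original function? You can go two ways. First, you can change to [[ and put quotes around the variable name when calling the function.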

ss_sampleseed = function(df, seed.){
       set.seed(df[[seed.]])
       sample_frac(df, 1)
}

dg %>%
       group_by(GENESEED) %>%
       ss_sampleseed("GENESEED")

Source: local data frame [6 x 2]
Groups: GENESEED

   Gene GENESEED
1 CAMK1        1
2 CIDEC        1
3 ARPC4        1
4 CIDEC        2
5 CAMK1        2
6 ARPC4        2

The other option is to use deparse(substitute(seed.)) inside the function to allow non-standard evaluation. You still need [[, though.

ss_sampleseed2 = function(df, seed.){
    set.seed(df[[deparse(substitute(seed.))]])
    sample_frac(df, 1)
}

dg %>%
    group_by(GENESEED) %>%
    ss_sampleseed2(GENESEED)

Source: local data frame [6 x 2]
Groups: GENESEED

   Gene GENESEED
1 CAMK1        1
2 CIDEC        1
3 ARPC4        1
4 CIDEC        2
5 CAMK1        2
6 ARPC4        2
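What deparse(substitute()) does here can be seen in isolation: it turns the unevaluated argument into a string, which [[ can then use as a column name. A minimal sketch, where `column_by_bare_name` is a made-up helper name:

```r
# Plain data.frame with the same columns as the question's dg.
dg <- data.frame(
  Gene = c("CAMK1", "ARPC4", "CIDEC", "CAMK1", "ARPC4", "CIDEC"),
  GENESEED = c(1, 1, 1, 2, 2, 2)
)

column_by_bare_name <- function(df, col) {
  # substitute() captures the expression typed by the caller;
  # deparse() converts it to the string "GENESEED".
  col_name <- deparse(substitute(col))
  df[[col_name]]
}

print(column_by_bare_name(dg, GENESEED))  # [1] 1 1 1 2 2 2
```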

I get reproducible results with either of these, although I did not check whether the seed is specifically being set to what you wanted.
