Replacing text string in dataframe with a probability sampled alternative

Question

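I am using R and have a large data frame with several million rows. I am interested in just one column, $path. From this data I have generated an index that identifies the entries I want to replace: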

replace.index <- which(df$path == 'First')

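Suppose this index identifies 50 rows.

Separately, I have worked out a probability table that I want to use to sample a replacement for each of these 'First' entries.

Suppose this second object is a series of named numbers called "casetable":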

      cum
alpha 18
beta  29
gamma 40
delta 50

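The final value, 50, matches the number of rows I want to replace.

I am trying to write some kind of replace operation that substitutes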

18 cases of "First" with "alpha > First"
11 cases of "First" with "beta > First"
11 cases of "First" with "gamma > First"
10 cases of "First" with "delta > First"

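and essentially overwrites the entry in each of the previously identified rows of the main table.

I believe I could do this with a for loop, but for speed I would like to use an apply function instead, and I can't work out how. I tried the following, but I am doing something wrong: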

#'Replacement function'
sampleprevious <- function(rndtbl,upperlimit,reattach) {
  return(paste0(names(rndtbl[max(which(rndtbl < runif(1, min=1, max=upperlimit)))])
  ,' > ', reattach))
}

df$path[replace.index] <-
    mapply(paste0, sampleprevious(casetable, 50, 'First'))

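This was a compromise attempt using random-number sampling, since I wasn't sure how else to repeat the draw, but all I get is a single sampled value filled into every row rather than 50 separate samples.

I would be happy with help generating 50 random samples, but equally happy with the derived split 18|11|11|10.

Addendum: I have solved the 'sampling' version with this: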

sampleprevious <- function(rndtbl,upperlimit,reattach) {
  return(paste0(names(rndtbl[min(which(rndtbl > runif(1, min=1, max=upperlimit-1)))])
  ,'>', reattach))
}

df$path[replace.index] <-
  replicate(50, sampleprevious(casetable, 50, 'First'))

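This gives me random proportions that are consistent with my case table. I would still somewhat prefer to generate the exact row counts from my case table.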

Answer 1

Reproducible data, where the tension variable stands in for your path:

data(warpbreaks)
warpbreaks$tension <- as.character(warpbreaks$tension)

casetable gives the replacement values and their weights.

casetable <- data.frame(replacement = letters[1:3], n = c(2, 4, 6),
                        stringsAsFactors = FALSE)
#   replacement n
# 1           a 2
# 2           b 4
# 3           c 6
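The casetable in the question holds cumulative counts rather than per-value weights. A minimal sketch of converting one form to the other, assuming the question's casetable is a named numeric vector as described (the names cum and weights are illustrative only, not from the original post):

```r
# Cumulative counts as listed in the question
# (assumed here to be a named numeric vector)
cum <- c(alpha = 18, beta = 29, gamma = 40, delta = 50)

# Successive differences recover the per-value counts;
# setNames() makes the labelling explicit
weights <- setNames(diff(c(0, cum)), names(cum))
weights  # 18, 11, 11, 10 for alpha, beta, gamma, delta
```

These counts can then be used as the weights, in the same role as the n column here.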

We need to know how many replacements to sample.

subset_n <- sum(warpbreaks$tension == "L")
# [1] 18

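Sample subset_n values from the replacement column of casetable, using its n column as the sampling weights, and use them to replace the existing values in the tension column of warpbreaks wherever tension is the specific value L. (This corresponds to First in your data.)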

warpbreaks[warpbreaks$tension == "L", "tension"] <- 
  sample(casetable$replacement, size = subset_n, replace = TRUE,
         prob = casetable$n)
warpbreaks
#    breaks wool tension
# 1      26    A       c
# 2      30    A       b
# 3      54    A       c
# 4      25    A       c
# 5      70    A       c
# 6      52    A       a
# 7      51    A       b
# 8      26    A       b
# 9      67    A       c
# 10     18    A       M
# 11     21    A       M
# 12     29    A       M
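If the exact split from the question is preferred over a random one, one possible approach is to repeat each replacement value its exact number of times with rep() and then shuffle the result so the values are not assigned in blocks. This is only a sketch on a small stand-in data frame: df, counts, and exact are hypothetical names, and the counts are the 18|11|11|10 split from the question.

```r
set.seed(1)  # only to make the shuffle reproducible

# Small stand-in for the large data frame: 50 'First' rows plus 20 others
df <- data.frame(path = rep(c("First", "Other"), times = c(50, 20)),
                 stringsAsFactors = FALSE)
replace.index <- which(df$path == "First")

# Per-value counts from the question; they must sum to the number of rows replaced
counts <- c(alpha = 18, beta = 11, gamma = 11, delta = 10)
stopifnot(sum(counts) == length(replace.index))

# Repeat each name exactly counts[i] times, then shuffle the order
exact <- sample(rep(names(counts), times = counts))

# Vectorised replacement, no loop or apply call needed
df$path[replace.index] <- paste0(exact, " > First")
table(df$path)
```

Because rep() and paste0() are vectorised, this should scale to a large index without an explicit loop. Dropping the sample() call would assign the values in contiguous blocks instead.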


performance

r

apply