How to create a new data table based on pairwise combinations of a subset of column names?

I am trying to define a function that takes a data frame or data.table as input, with a given number of ID columns (e.g., 2 or 3 ID columns), while the remaining columns NAME1, NAME2, ..., NAMEK are numeric. The output should be a data.table consisting of the same ID columns as before, plus one additional ID column that groups each unique pairwise combination of the column names (NAME1, NAME2, ...). In addition, the actual values of the numeric columns must be gathered into two new columns according to the ID columns. An example with two ID columns and three numeric columns:

library(data.table)

ID1 <- c("A","A","A","B","B","B")
ID2 <- c(1,2,3,1,2,3)
NAME1 <- c(10,11,9,22,25,22)
NAME2 <- c(7,9,8,20,22,21)
NAME3 <- c(10,12,11,15,19,30)
DT <- data.table(ID1,ID2,NAME1,NAME2,NAME3)

The desired output of the function, with DT as input, is:

ID.new <- c("NAME1 - NAME2","NAME1 - NAME2","NAME1 - NAME2", "NAME1 - NAME2",
 "NAME1 - NAME2","NAME1 - NAME2", "NAME1 - NAME3", "NAME1 - NAME3",
 "NAME1 - NAME3","NAME1 - NAME3","NAME1 - NAME3","NAME1 - NAME3",
 "NAME2 - NAME3","NAME2 - NAME3","NAME2 - NAME3","NAME2 - NAME3",
 "NAME2 - NAME3", "NAME2 - NAME3")
ID1 <- c("A","A","A","B","B","B","A","A","A","B","B","B","A","A","A","B","B","B")
ID2 <- c(1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3)
value.left <- c(10,11,9,22,25,22,10,11,9,22,25,22,7,9,8,20,22,21)
value.right <- c(7,9,8,20,22,21,10,12,11,15,19,30,10,12,11,15,19,30)
DT.output <- data.table(ID.new,ID1,ID2,value.left,value.right)

I have found that fun() (see below) does the job, but it is far too slow:

fun <- function(data, ID.cols){
  data <- data.table(data)
  # Which of the columns are ID columns
  ids <- which(colnames(data) %in% ID.cols)
  # Obtain all pairwise combinations of numeric columns into a list
  numeric.combs <- combn(x = data.table(data)[, !ids, with = FALSE], m = 2, simplify = FALSE)
  id.cols <- data[, ids, with = FALSE]
  # Bind the ID columns to each pairwise combination of numeric columns inside the list
  bind.columns.each.numeric.comb <- lapply(X = numeric.combs, FUN = function(x) cbind(id.cols, x))
  # Create generic names for the numeric columns so that rbindlist() may be applied.
  # In addition we make a new column that groups based on which columns we are considering.
  generalize <- suppressWarnings(lapply(X = bind.columns.each.numeric.comb, FUN = function(x)
    setattr(x = x[, ID.NEW := paste(colnames(x[, !ids, with = FALSE]), collapse = " - ")],
            name = 'names', value = c(ID.cols, "value.left", "value.right", "ID.NEW"))))
  return(rbindlist(l = generalize))
}

# Performance
library(microbenchmark)
print(microbenchmark(fun(DT, ID.cols = c("ID1","ID2")), times = 1000))

Is there a faster and more elegant way to do this?

Note:

Here is an inspiring idea which does not fully satisfy the OP's requirements (e.g., ID.new and row order), but I think it is worth recording here.

You can first convert DT to long format via melt. Then shift the values cyclically by -nrow(DT) steps to pair them up, i.e., NAME1 - NAME2, NAME2 - NAME3, NAME3 - NAME1.

ds = melt(DT,
          measure.vars = patterns("^NAME"),
          variable.name = c("ID.new"),
          value.name = c("value.left"))
group_len = nrow(DT)
ds[, ID.new := paste(ID.new,shift(ID.new, n = -group_len, type = c("cyclic")),sep = " - ")]
ds[, value.right := shift(value.left, n = -group_len, type = c("cyclic"))]

Output:

      ID1   ID2        ID.new value.left value.right
    <char> <num>        <char>      <num>       <num>
 1:      A     1 NAME1 - NAME2         10           7
 2:      A     2 NAME1 - NAME2         11           9
 3:      A     3 NAME1 - NAME2          9           8
 4:      B     1 NAME1 - NAME2         22          20
 5:      B     2 NAME1 - NAME2         25          22
 6:      B     3 NAME1 - NAME2         22          21
 7:      A     1 NAME2 - NAME3          7          10
 8:      A     2 NAME2 - NAME3          9          12
 9:      A     3 NAME2 - NAME3          8          11
10:      B     1 NAME2 - NAME3         20          15
11:      B     2 NAME2 - NAME3         22          19
12:      B     3 NAME2 - NAME3         21          30
13:      A     1 NAME3 - NAME1         10          10
14:      A     2 NAME3 - NAME1         12          11
15:      A     3 NAME3 - NAME1         11           9
16:      B     1 NAME3 - NAME1         15          22
17:      B     2 NAME3 - NAME1         19          25
18:      B     3 NAME3 - NAME1         30          22
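A quick sketch confirming the caveat above: rebuilding ds from scratch with the same steps and inspecting the pair labels shows the wrapped-around NAME3 - NAME1 combination (with left/right swapped relative to DT.output):

```r
library(data.table)

DT <- data.table(
  ID1 = c("A","A","A","B","B","B"),
  ID2 = c(1,2,3,1,2,3),
  NAME1 = c(10,11,9,22,25,22),
  NAME2 = c(7,9,8,20,22,21),
  NAME3 = c(10,12,11,15,19,30)
)

ds <- melt(DT,
           measure.vars = patterns("^NAME"),
           variable.name = "ID.new",
           value.name = "value.left")
group_len <- nrow(DT)
ds[, ID.new := paste(ID.new, shift(ID.new, n = -group_len, type = "cyclic"), sep = " - ")]
ds[, value.right := shift(value.left, n = -group_len, type = "cyclic")]

# The labels wrap around cyclically, so the third block is NAME3 - NAME1:
sort(unique(ds$ID.new))
#> [1] "NAME1 - NAME2" "NAME2 - NAME3" "NAME3 - NAME1"
```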

If you can work with a data frame, the following gives you, at the time of writing, the fastest and most memory-efficient approach (see the benchmarks below).

The approach using combn() seems reasonable to me, and I don't really think that it iterates 18 times over the combinations. Moreover, I personally find this easier to read than the data.table melt versions, but that is probably because I am not used to data.table syntax.

Note: using this on a data.table is clearly not efficient. If you really need a data.table, r2evans' solution is better.

fun2 <- function(data, ID.cols){
  ids <-  which(colnames(data) %in% ID.cols)
  ## you can loop over the combinations directly
  new_dat <- combn(data[-ids], 2, function(x) {
    new_x <- setNames(x, paste("value", c("left", "right"), sep = "."))
    ## use paste with collapse for the ID.new
    new_x$ID.new <- paste(names(x), collapse = " - ")
    new_x
  }, simplify = FALSE)

## bind it with the old ID columns, outside the loop (bit faster)
  cbind(do.call(rbind, new_dat), data[ids])
}

fun2(DT,ID.cols = c("ID1", "ID2"))
#>    value.left value.right        ID.new ID1 ID2
#> 1          10           7 NAME1 - NAME2   A   1
#> 2          11           9 NAME1 - NAME2   A   2
#> 3           9           8 NAME1 - NAME2   A   3
#> 4          22          20 NAME1 - NAME2   B   1
#> 5          25          22 NAME1 - NAME2   B   2
#> 6          22          21 NAME1 - NAME2   B   3
#> 7          10          10 NAME1 - NAME3   A   1
#> 8          11          12 NAME1 - NAME3   A   2
#> 9           9          11 NAME1 - NAME3   A   3
#> 10         22          15 NAME1 - NAME3   B   1
#> 11         25          19 NAME1 - NAME3   B   2
#> 12         22          30 NAME1 - NAME3   B   3
#> 13          7          10 NAME2 - NAME3   A   1
#> 14          9          12 NAME2 - NAME3   A   2
#> 15          8          11 NAME2 - NAME3   A   3
#> 16         20          15 NAME2 - NAME3   B   1
#> 17         22          19 NAME2 - NAME3   B   2
#> 18         21          30 NAME2 - NAME3   B   3

For benchmarks, see below.

A self-join option on the melted data:

library(data.table)
DTlong <- melt(DT, id.vars = c("ID1", "ID2"), variable.factor = FALSE)
out <- DTlong[DTlong, on = .(ID1, ID2), allow.cartesian = TRUE
  ][variable < i.variable,
  ][, .(ID.new = paste(variable, i.variable, sep = " - "),
        ID1, ID2, value.left = value, value.right = i.value)]
out
#            ID.new    ID1   ID2 value.left value.right
#            <char> <char> <num>      <num>       <num>
#  1: NAME1 - NAME2      A     1         10           7
#  2: NAME1 - NAME2      A     2         11           9
#  3: NAME1 - NAME2      A     3          9           8
#  4: NAME1 - NAME2      B     1         22          20
#  5: NAME1 - NAME2      B     2         25          22
#  6: NAME1 - NAME2      B     3         22          21
#  7: NAME1 - NAME3      A     1         10          10
#  8: NAME2 - NAME3      A     1          7          10
#  9: NAME1 - NAME3      A     2         11          12
# 10: NAME2 - NAME3      A     2          9          12
# 11: NAME1 - NAME3      A     3          9          11
# 12: NAME2 - NAME3      A     3          8          11
# 13: NAME1 - NAME3      B     1         22          15
# 14: NAME2 - NAME3      B     1         20          15
# 15: NAME1 - NAME3      B     2         25          19
# 16: NAME2 - NAME3      B     2         22          19
# 17: NAME1 - NAME3      B     3         22          30
# 18: NAME2 - NAME3      B     3         21          30

### validation
setorder(out, ID.new, ID1, ID2)
identical(DT.output, out)
# [1] TRUE

The combn approach is certainly a reasonable idea; its one inefficiency is that it iterates over the combinations one at a time. That is, the function passed to combn(..., FUN=) is called 18 times in this example; if your data is any larger, it will be called many more times. With the merge/join here, by contrast, everything is done in a vectorized fashion that R manages well: the merge is done efficiently, the filtering comes back as a single logical vector, and the paste(..) is one large vectorized operation as well.

The merge concept does have its own inefficiency, to be fair: due to the cartesian join, it initially produces 54 rows. This will lead to memory-exhaustion problems with much larger data. If you run into this, it might help to use a fuzzy (non-equi) join that includes variable < variable (LHS vs RHS) in the join condition, which should reduce (if not completely remove) the problem.
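To make the 54-row intermediate concrete, here is a small sketch on the example data, counting rows before and after the filter:

```r
library(data.table)

DT <- data.table(
  ID1 = c("A","A","A","B","B","B"),
  ID2 = c(1,2,3,1,2,3),
  NAME1 = c(10,11,9,22,25,22),
  NAME2 = c(7,9,8,20,22,21),
  NAME3 = c(10,12,11,15,19,30)
)

DTlong <- melt(DT, id.vars = c("ID1", "ID2"), variable.factor = FALSE)
nrow(DTlong)                         # 18 rows in long format

# Each (ID1, ID2) group has 3 rows on each side of the self-join,
# so it yields 3 * 3 * 6 = 54 rows before filtering ...
joined <- DTlong[DTlong, on = .(ID1, ID2), allow.cartesian = TRUE]
nrow(joined)                         # 54

# ... of which the non-redundant pairs survive the filter:
nrow(joined[variable < i.variable])  # 18
```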

That last suggestion can also be done in sqldf:

sqldf::sqldf("
  select t1.variable || ' - ' || t2.variable as [ID.new], t1.ID1, t1.ID2, 
    t1.value as [value.left], t2.value as [value.right]
  from DTlong t1
    join DTlong t2 on t1.ID1=t2.ID1 and t1.ID2=t2.ID2
      and t1.variable < t2.variable")
#           ID.new ID1 ID2 value.left value.right
# 1  NAME1 - NAME2   A   1         10           7
# 2  NAME1 - NAME3   A   1         10          10
# 3  NAME1 - NAME2   A   2         11           9
# 4  NAME1 - NAME3   A   2         11          12
# 5  NAME1 - NAME2   A   3          9           8
# 6  NAME1 - NAME3   A   3          9          11
# 7  NAME1 - NAME2   B   1         22          20
# 8  NAME1 - NAME3   B   1         22          15
# 9  NAME1 - NAME2   B   2         25          22
# 10 NAME1 - NAME3   B   2         25          19
# 11 NAME1 - NAME2   B   3         22          21
# 12 NAME1 - NAME3   B   3         22          30
# 13 NAME2 - NAME3   A   1          7          10
# 14 NAME2 - NAME3   A   2          9          12
# 15 NAME2 - NAME3   A   3          8          11
# 16 NAME2 - NAME3   B   1         20          15
# 17 NAME2 - NAME3   B   2         22          19
# 18 NAME2 - NAME3   B   3         21          30

Benchmarking:

bench::mark(
  pernkf  = fun(DT, c("ID1", "ID2")),
  tjebo   = fun2(DT, c("ID1", "ID2")),
  r2evans = fun3(DT, c("ID1", "ID2")),  # native data.table
  r2evans2 = fun4(),                    # sqldf
  check = FALSE)
# # A tibble: 4 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory        time     gc        
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>        <list>   <list>    
# 1 pernkf       5.38ms   6.06ms     161.      287KB    13.2     61     5      379ms <NULL> <Rprofmem[,3~ <bch:tm~ <tibble [~
# 2 tjebo        5.08ms   5.63ms     172.      230KB     8.83    78     4      453ms <NULL> <Rprofmem[,3~ <bch:tm~ <tibble [~
# 3 r2evans      2.97ms   3.48ms     280.      170KB    11.0    127     5      454ms <NULL> <Rprofmem[,3~ <bch:tm~ <tibble [~
# 4 r2evans2    17.19ms  18.91ms      52.0     145KB    13.0     20     5      384ms <NULL> <Rprofmem[,3~ <bch:tm~ <tibble [~

(sqldf clearly takes a performance hit in this example; I welcome improvements to the query :-)

Benchmarks, as a reprex. If you don't really need a data.table, base R seems to do the trick here.

Note that this compares r2evans' and pernkf's functions on a data.table with tjebo's and tarjae's functions on a data frame.

The approaches suggested by PeaceWang are currently not included, as they either do not scale to k columns or give incorrect results.

bench::mark(
  pernkf  = fun(DT, c("ID1", "ID2")),
  tjebo   = fun2(DF, c("ID1", "ID2")),
  r2evans = fun3(DT, c("ID1", "ID2")), 
  tarjae = fun4(DF, c("ID1", "ID2")),
  check = FALSE)

#> # A tibble: 4 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 pernkf       2.95ms    3.2ms     302.     2.29MB     6.33
#> 2 tjebo      359.33µs 373.85µs    2423.    18.65KB    10.5 
#> 3 r2evans      1.65ms   1.79ms     535.   756.16KB     6.30
#> 4 tarjae      26.49ms  27.74ms      34.3    4.75MB     7.35

m <- microbenchmark::microbenchmark(
  pernkf = fun(DT, ID.cols = c("ID1", "ID2")),
  r2evans = fun3(DT, ID.cols = c("ID1", "ID2")),
  tjebo = fun2(DF, ID.cols = c("ID1", "ID2")), 
  tarjae = fun4(DF, c("ID1", "ID2")),
  times = 1000
)
m
#> Unit: microseconds
#>     expr       min         lq       mean    median        uq       max neval
#>   pernkf  2885.714  3055.1450  3439.1257  3150.457  3298.404  95391.80  1000
#>  r2evans  1629.028  1739.5715  1949.8389  1829.696  1922.227  10843.33  1000
#>    tjebo   354.714   410.0975   469.1457   427.948   443.237   4344.00  1000
#>   tarjae 25854.416 26564.8420 29103.6948 27142.758 30982.328 118592.10  1000

ggplot2::autoplot(m)
#> Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Data and functions

library(tidyverse)
library(data.table)

ID1 <- c("A","A","A","B","B","B")
ID2 <- c(1,2,3,1,2,3)
NAME1 <- c(10,11,9,22,25,22)
NAME2 <- c(7,9,8,20,22,21)
NAME3 <- c(10,12,11,15,19,30)
DF <- data.frame(ID1,ID2,NAME1,NAME2,NAME3)
DT <- data.table(DF)

fun <- function(data, ID.cols){
  data <- data.table(data)
  ids <-  which(colnames(data) %in% ID.cols)
  numeric.combs <- combn(x = data.table(data)[,!ids, with = FALSE], m = 2, simplify = FALSE)
  id.cols <- data[,ids, with = FALSE]
  bind.columns.each.numeric.comb <- lapply(X = numeric.combs, FUN = function(x) cbind(id.cols,x)) 
  generalize <- suppressWarnings(lapply(X = bind.columns.each.numeric.comb, FUN = function(x) 
    setattr(x = x[,ID.NEW:=paste(colnames(x[,!ids,with=FALSE]),collapse=" - ")], name = 
              'names', value = c(ID.cols,"value.left","value.right","ID.NEW"))))
  return(rbindlist(l=generalize))
}

fun2 <- function(data, ID.cols){
  ids <-  which(colnames(data) %in% ID.cols)
  new_dat <- combn(data[-ids], 2, function(x) {
    new_x <- setNames(x, paste("value", c("left", "right"), sep = "."))
    new_x$ID.new <- paste(names(x), collapse = " - ")
    new_x
  }, simplify = FALSE)
  cbind(do.call(rbind, new_dat), data[ids])
}

fun3 <- function(data, ID.cols) {
  DTlong <- melt(data, id.vars = ID.cols, variable.factor = FALSE)
  out <- DTlong[DTlong, on = .(ID1, ID2), allow.cartesian = TRUE
  ][variable < i.variable,
  ][, .(ID.new = paste(variable, i.variable, sep = " - "),
        ID1, ID2, value.left = value, value.right = i.value)]
  out
}
fun4 <- function(x, id.cols){
  DT1 <- x %>%   # use the function argument, not the global DT
    pivot_longer(
      -all_of(id.cols)
    ) %>% 
  mutate(name1 = lead(name, default=last(name)),
         value1 = lead(value, default=last(value)))%>% 
  arrange(name, name1) %>% 
  group_by(name) %>% 
  mutate(n = n()) %>% 
  mutate(name_nr = parse_number(name)) %>% 
  ungroup()


DT1 %>% 
  mutate(name1 = lead(name, unique(n)*(max(name_nr)-min(name_nr)))) %>% 
  mutate(value1 = lead(value, unique(n)*(max(name_nr)-min(name_nr)))) %>% 
  slice(seq_len(first(n))) %>% 
  bind_rows(DT1 %>% 
              slice(1:(n() - unique(n))), .
  ) %>% 
  mutate(ID.new = paste(name, name1, sep = " - "), .before=1) %>% 
  select(ID.new, ID1, ID2, value.left=value, value.right = value1) %>% 
  arrange(ID.new)
}

Check that the solutions are identical:

## convert all to data frame
## column names and order need to be the same
## rows need to be sorted in the same way (caveat row names!)
preparetocompare <- function(x){
  x <- data.frame(x)
  names(x) <- tolower(names(x))
  x <- x[c("id1", "id2", "value.left", "value.right", "id.new")]
  x <- x[with(x, order(id.new, id1, id2)),]
  rownames(x) <- NULL
  x  # return the prepared data frame (the rownames assignment alone would return NULL)
}
compare_df <- function(...){
  # credit to 
  ls_df <- c(as.list(environment()), list(...))
  ls_compare <- lapply(ls_df, preparetocompare)
  # inspired by 
  all.identical <- function(l) mapply(all.equal, head(l, 1), tail(l, -1))
  all.identical(ls_compare)
}

compare_df(fun(DT, c("ID1", "ID2")), 
           fun2(DF, c("ID1", "ID2")), 
           fun3(DT, c("ID1", "ID2")),
           fun4(DF, c("ID1", "ID2"))
           )
#> [1] TRUE TRUE TRUE

UPDATE II (incorrect solution removed)

Now, after a lot of hard work and great support from the community (thanks to @akrun and @tjebo), I think I have a correct and scalable tidyverse solution (HURRAY) :-)

library(tidyverse)

DT1 <- DT %>% 
  pivot_longer(
    -c(ID1, ID2)
  ) %>% 
  mutate(name1 = lead(name, default=last(name)),
         value1 = lead(value, default=last(value)))%>% 
  arrange(name, name1) %>% 
  group_by(name) %>% 
  mutate(n = n()) %>% 
  mutate(name_nr = parse_number(name)) %>% 
  ungroup()

DT1 %>% 
  mutate(name1 = lead(name, unique(n)*(max(name_nr)-min(name_nr)))) %>% 
  mutate(value1 = lead(value, unique(n)*(max(name_nr)-min(name_nr)))) %>% 
  slice(seq_len(first(n))) %>% 
  bind_rows(DT1 %>% 
              slice(1:(n() - unique(n))), .
            ) %>% 
  mutate(ID.new = paste(name, name1, sep = " - "), .before=1) %>% 
  select(ID.new, ID1, ID2, value.left=value, value.right = value1) %>% 
  arrange(ID.new)
   ID.new        ID1     ID2 value.left value.right
   <chr>         <chr> <dbl>      <dbl>       <dbl>
 1 NAME1 - NAME2 A         1         10           7
 2 NAME1 - NAME2 A         2         11           9
 3 NAME1 - NAME2 A         3          9           8
 4 NAME1 - NAME2 B         1         22          20
 5 NAME1 - NAME2 B         2         25          22
 6 NAME1 - NAME2 B         3         22          21
 7 NAME1 - NAME3 A         1         10          10
 8 NAME1 - NAME3 A         2         11          12
 9 NAME1 - NAME3 A         3          9          11
10 NAME1 - NAME3 B         1         22          15
11 NAME1 - NAME3 B         2         25          19
12 NAME1 - NAME3 B         3         22          30
13 NAME2 - NAME3 A         1          7          10
14 NAME2 - NAME3 A         2          9          12
15 NAME2 - NAME3 A         3          8          11
16 NAME2 - NAME3 B         1         20          15
17 NAME2 - NAME3 B         2         22          19
18 NAME2 - NAME3 B         3         21          30

This post consists of 3 parts:

  1. Original answer (non-equi self-join with two ID columns)
  2. First edit (non-equi self-join with a variable number of ID columns)
  3. Second edit (benchmarks of different approaches for 6 problem sizes)

Original answer: non-equi self-join with two ID columns

For the sake of completeness, here is a solution which uses a non-equi self-join of the molten data (reshaped to long format):

library(data.table)
mdt <- melt(DT, id.vars = c("ID1", "ID2"))
res <- mdt[mdt, on = .(ID1, ID2, variable < variable), nomatch = NULL,
    .(ID.new = paste(x.variable, i.variable, sep = " - "), 
      ID1, ID2, value.left = x.value, value.right = i.value)]

all.equal(res, DT.output, ignore.row.order = TRUE)
[1] TRUE

This approach is similar to the self-join above but avoids the cartesian join. I have not shown benchmark results, as benchmarking with the sample dataset of 6 rows and 5 columns is of limited relevance.

Edit 1: Non-equi self-join with a variable number of ID columns

The OP has asked for the number of ID columns to vary (indeed, the names of the ID columns are passed as a parameter to the OP's own function).

The non-equi self-join can be enhanced to handle an arbitrary number of ID columns:

library(data.table)
id_cols <- c("ID1", "ID2")
mdt <- melt(DT, id.vars = id_cols)
res <- mdt[mdt, on = c(id_cols, "variable < variable"), nomatch = NULL,
           c(.(ID.new = paste(x.variable, i.variable, sep = " - "), 
               value.left = x.value, value.right = i.value), .SD), 
           .SDcols = id_cols]

all.equal(res, DT.output, ignore.col.order = TRUE, ignore.row.order = TRUE)
[1] TRUE

Note that it is safe to use .SD here, because .SDcols only picks those columns which are already used for joining (as specified by id_cols).
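As a quick sketch of the scalability claim, the same call runs unchanged with three id columns (ID3 here is a hypothetical constant column added to the example data only for illustration):

```r
library(data.table)

DT3 <- data.table(
  ID1 = c("A","A","A","B","B","B"),
  ID2 = c(1,2,3,1,2,3),
  ID3 = "x",  # hypothetical third id column, constant for simplicity
  NAME1 = c(10,11,9,22,25,22),
  NAME2 = c(7,9,8,20,22,21),
  NAME3 = c(10,12,11,15,19,30)
)

id_cols <- c("ID1", "ID2", "ID3")
mdt <- melt(DT3, id.vars = id_cols)
res <- mdt[mdt, on = c(id_cols, "variable < variable"), nomatch = NULL,
           c(.(ID.new = paste(x.variable, i.variable, sep = " - "),
               value.left = x.value, value.right = i.value), .SD),
           .SDcols = id_cols]
dim(res)  # 18 rows, 6 columns (ID.new, value.left, value.right, ID1, ID2, ID3)
```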

Edit 2: Benchmarks for different problem sizes

The benchmarks provided so far only use the original dataset with 2 id columns, 3 numeric columns, and 6 rows. Due to the small problem size, those benchmarks compare overhead; they are not representative of performance for larger problem sizes.

There are 3 parameters which describe the problem size:

  1. the number of rows nr of the sample dataset DT,
  2. the number of numeric columns nc from which the pairwise rows are created, and
  3. the number of id columns ni

The final result consists of nc * (nc - 1) / 2 * nr rows and ni + 3 columns.
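Plugging in the example dimensions as a trivial sanity check:

```r
# Example data: nr = 6 rows, nc = 3 numeric columns, ni = 2 id columns
nr <- 6; nc <- 3; ni <- 2
nc * (nc - 1) / 2 * nr  # 18 rows, as in DT.output
ni + 3                  # 5 columns: the id columns plus ID.new, value.left, value.right
```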

By using the press() function from the bench package, we can easily perform a set of benchmark runs with varying problem sizes.

6 different approaches are included in the benchmark runs:

  • pernkf(): the OP's function using combn(),
  • r2evans(): the cartesian self-join, modified to use an arbitrary number of id columns,
  • tjebo(): combn() with a data.frame,
  • nej(): the non-equi self-join of the molten data, similar to the above but avoiding the cartesian join,
  • dtc(): a data.table version of tjebo's combn() approach,
  • mvl(): an implementation calling melt() with a constructed list of measure.vars.

All approaches are implemented as functions which are called with 2 parameters: the dataset DT or DF, respectively, and a character vector with the names of an arbitrary number of id columns.

pernkf <- function(data, ID.cols){
  data <- data.table(data)
  # Which of the columns are ID columns
  ids <-  which(colnames(data) %in% ID.cols)
  # Obtain all pairwise combinations of numeric columns into a list
  numeric.combs <- combn(x = data.table(data)[,!ids, with = FALSE], m = 2, simplify = FALSE)
  id.cols <- data[,ids, with = FALSE]
  # bind the ID columns to each pairwise combination of numeric columns inside the list
  bind.columns.each.numeric.comb <- lapply(X = numeric.combs, FUN = function(x) cbind(id.cols,x)) 
  # Create generic names for the numeric columns so that rbindlist() may be applied. In addition we make a new column that groups based on which columns we are considering
  generalize <- suppressWarnings(lapply(X = bind.columns.each.numeric.comb, FUN = function(x) 
    setattr(x = x[,ID.new:=paste(colnames(x[,!ids,with=FALSE]),collapse=" - ")], name = 
              'names', value = c(ID.cols,"value.left","value.right","ID.new"))))
  return(rbindlist(l=generalize))
}

r2evans = \(DT, id_cols) {
  DTlong <- melt(DT, id.vars = id_cols, variable.factor = FALSE)
  DTlong[DTlong, on = c(id_cols), allow.cartesian = TRUE
  ][variable < i.variable,
  ][, .(ID.new = paste(variable, i.variable, sep = " - "), setnames(.SD, id_cols), 
        value.left = value, value.right = i.value), .SDcols = id_cols
  ]
}

tjebo <- \(data, ID.cols) {
  ids <-  which(colnames(data) %in% ID.cols)
  ## you can loop over the combinations directly
  new_dat <- combn(data[-ids], 2, function(x) {
    new_x <- setNames(x, paste("value", c("left", "right"), sep = "."))
    ## use paste with collapse for the ID.new
    new_x$ID.new <- paste(names(x), collapse = " - ")
    new_x
  }, simplify = FALSE)
  ## bind it with the old ID columns, outside the loop (bit faster)
  cbind(do.call(rbind, new_dat), data[ids])
}

nej <- \(DT, id_cols) {
  mdt <- melt(DT, id.vars = id_cols)
  mdt[mdt, on = c(id_cols, "variable < variable"), nomatch = NULL,
      .(setnames(.SD, id_cols), ID.new = paste(x.variable, i.variable, sep = " - "), 
        value.left = x.value, value.right = i.value), 
      .SDcols = id_cols]
}

dtc <- \(DT, id_cols) {
  combn(setdiff(colnames(DT), id_cols), 2, 
        \(x) DT[, ..x][, ID.new := paste(x, collapse = " - ")], 
        simplify = FALSE) |>
    rbindlist(use.names = FALSE) |>
    setnames(1:2, c("value.left", "value.right")) |>
    cbind(DT[, ..id_cols])
}

mvl <- \(DT, id_cols) {
  num_cols <- setdiff(colnames(DT), id_cols)
  combos <- combn(num_cols, 2L, simplify = TRUE)
  id_new_levels <- apply(combos, 2, paste, collapse = " - ") 
  melt(DT, measure.vars = list(combos[1L, ],combos[2L, ]), 
       value.name = c("value.left", "value.right"), variable.name = "ID.new")[
         , ID.new := as.character(`levels<-`(ID.new, id_new_levels))]
}

Two other suggested approaches have been omitted, as I was not able to turn them into scalable functions.

In the call to press(), the number of rows nr is varied from 10 to 100'000, and the number of numeric columns nc from 3 to 10. Accordingly, the resulting datasets have between 30 and 4.5 million rows. All runs use 3 id columns to verify that the code scales with ni (and is not limited to 2).

The check function is set to ignore different orders of rows and/or columns, as these may vary between approaches.

library(bench)
bm <- press(
  nr = c(10L, 1000L, 100000L),
  nc = c(3L, 5L, 10L),
  {
    ni <- 3L
    DT <- data.table()
    id_cols <- sprintf("ID%01i", seq(ni))
    # append id cols
    for (id in id_cols) set(DT, , id, seq(nr))
    # append data cols
    for (j in seq(nc)) {
      col_name <- sprintf("NAME%04i", j)
      set(DT, , col_name, seq(nr) + (j / 1000))
    }
    DF <- as.data.frame(DT)
    mark(
      pernkf(DT, id_cols),
      r2evans(DT, id_cols),
      tjebo(DF, id_cols),
      nej(DT, id_cols),
      dtc(DT, id_cols),
      mvl(DT, id_cols),
      check = \(x,y) all.equal(x, setDT(y), ignore.row.order = TRUE, ignore.col.order = TRUE),
      min_iterations = 3L
    )
  }
)

The benchmark timings are visualized by
ggplot2::autoplot(bm)

(Note the logarithmic time scale.)

Almost always, mvl() is the fastest approach. Only for the smallest problem sizes with 3 numeric columns and up to 1000 rows is tjebo() somewhat faster. For the large problem sizes with 100'000 rows, dtc() and pernkf() take second and third place, respectively.

The next chart shows how performance varies with the number of numeric columns nc:

library(ggplot2)
ggplot(bm) +
  aes(nc, median, colour = attr(expression, "description")) +
  geom_point() + 
  geom_line() +
  scale_x_log10() +
  labs(colour = "expression") +
  facet_wrap(~nr, scales = "free_y") +
  ggtitle("Median run time")

(Note the log scale and the independent time scales of the facets.)

tjebo()'s run times grow more steeply with nc than those of any other approach. For some use cases, mvl() is an order of magnitude faster than any other approach.

One aspect that is often neglected is memory consumption. The next chart shows how memory allocation varies with problem size:

ggplot(bm) +
  aes(nc, mem_alloc, colour = attr(expression, "description")) +
  geom_point() + 
  geom_line() +
  scale_x_log10() +
  labs(colour = "expression") +
  facet_wrap(~nr, scales = "free_y") +
  ggtitle("Memory allocation")

(Note the log-log scales and the independent scales on the y-axes.)

The differences in memory allocation between the best and the worst approach for each use case are remarkable, roughly a factor of 7 to 8. Again, tjebo() shows the steepest growth of memory allocation with nc. For the large problem sizes, mvl() allocates less memory than any other approach, followed by dtc() and pernkf().

I think that, since the data is well structured, one might use the following code (it is scalable, but for simplicity I give the simple variant):

melt(DT, measure.vars=list(c(3,3,4), c(4,5,5)))
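Here column positions 3, 3, 4 (NAME1, NAME1, NAME2) are paired with positions 4, 5, 5 (NAME2, NAME3, NAME3). A sketch of the scalable variant, building the two vectors with combn() (essentially what mvl() above does):

```r
library(data.table)

DT <- data.table(
  ID1 = c("A","A","A","B","B","B"),
  ID2 = c(1,2,3,1,2,3),
  NAME1 = c(10,11,9,22,25,22),
  NAME2 = c(7,9,8,20,22,21),
  NAME3 = c(10,12,11,15,19,30)
)

# All pairs of numeric column names; row 1 holds the left-hand columns,
# row 2 the right-hand ones.
num_cols <- setdiff(colnames(DT), c("ID1", "ID2"))
combos <- combn(num_cols, 2)
res <- melt(DT,
            measure.vars = list(combos[1, ], combos[2, ]),
            value.name = c("value.left", "value.right"),
            variable.name = "ID.new")
nrow(res)  # 18 pairwise rows; ID.new is a factor with one level per pair
```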