子集 data.table 的速度以奇怪的方式取决于特定的键值？

Question

有人可以解释以下输出吗？除非我遗漏了某些东西（我可能遗漏了什么），否则 data.table 的子集化速度似乎取决于存储在其中一列中的特定值，即使它们具有相同的 class 和除了价值之外没有明显的差异。

这怎么可能？

> dim(otherTest)
[1] 3572069       2
> dim(test)
[1] 3572069       2
> length(unique(test$keys))
[1] 28741
> length(unique(otherTest$keys))
[1] 28742
> sapply(test,class)
 thingy        keys 
"character" "character" 
> sapply(otherTest,class)
 thingy        keys 
"character" "character" 
> class(test)
[1] "data.table" "data.frame"
> class(otherTest)
[1] "data.table" "data.frame"
> start = Sys.time()
>   newTest = otherTest[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.5438871 secs
> start = Sys.time()
>   newTest = test[keys%in%partition]
>   end  = Sys.time() 
>   print(end - start)
Time difference of 42.78009 secs

摘要编辑：因此速度差异与 data.table 大小不同无关，也与唯一值的不同数量无关。正如您在我上面修改过的示例中看到的那样，即使在生成键后它们具有相同数量的唯一值（并且在相同的一般运行ge 中并且共享至少 1 个值，但通常是不同的），我得到了相同的性能差异。

关于共享数据，很遗憾我不能共享测试table，但我可以共享其他测试。整个想法是我试图尽可能接近地复制测试 table（相同的大小、相同的 classes/types、相同的键、NA 值的数量等）以便我可以 post 到 SO -- 但后来 st运行gely 我编造的 data.table 表现得非常不同，我不明白为什么！

此外，我还要补充一点，我怀疑问题出在 data.table 的唯一原因是几周前我运行遇到了类似的子集问题 data.table 原来是新 data.table 版本中的一个实际错误（我最终删除了这个问题，因为它是重复的）。该错误还涉及使用 %in% 函数对 data.table 进行子集化——如果在 %in% 的正确参数中存在重复条目，则为 returning 重复输出。所以如果 x = c(1,2,3) 和 y = c(1,1,2,2)，x%in% y 将 return 一个长度为 8 的向量。我重新安装了 data.table 包，所以我不认为它可能是同一个错误——但也许相关？

编辑（重新 Dean MacGregor 的评论）

> sapply(test,class)
 thingy        keys 
"character" "character" 
> sapply(otherTest,class)
 thingy        keys 
"character" "character" 


# benchmarking the original test table
>   test2 =data.table(sapply(test ,as.numeric))
>   otherTest2 =data.table(sapply(otherTest ,as.numeric))
>   start = Sys.time()
>   newTest = test[keys%in%partition])
>   end  = Sys.time()
>   print(end - start)
Time difference of 52.68567 secs
> start = Sys.time()
>   newTest = otherTest[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.3503151 secs

#benchmarking after converting to numeric
> partition = as.numeric(partition)
> start = Sys.time()
>   newTest = otherTest2[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.7240109 secs
> start = Sys.time()
>    newTest = test2[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 42.18522 secs

#benchmarking again after converting back to character
> partition = as.character(partition)
> otherTest2 =data.table(sapply(otherTest2 ,as.character))
> test2 =data.table(sapply(test2 ,as.character))
> start = Sys.time()
>   newTest =test2[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 48.39109 secs
> start = Sys.time()
>   newTest = data.table(otherTest2[keys%in%partition])
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.1846113 secs

所以减速不依赖于class。

编辑：问题显然来自 data.table，因为我可以转换为矩阵，问题消失，然后转换回 data.table，问题又来了。

编辑：我注意到问题与 data.table 函数如何处理重复项有关，这听起来是正确的，因为它类似于我上周在数据 table 中发现的错误我上面描述的 1.9.4。

>   newTest =test[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 39.19983 secs
> start = Sys.time()
>   newTest =otherTest[keys%in%partition]
>   end  = Sys.time()
 >   print(end - start)
 Time difference of 0.3776946 secs
> sum(duplicated(test))/length(duplicated(test))
[1] 0.991954
> sum(duplicated(otherTest))/length(duplicated(otherTest))
[1] 0.9918879
> otherTest[duplicated(otherTest)] =NA
 > test[duplicated(test)]= NA
> start = Sys.time()
>   newTest =otherTest[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.2272599 secs
> start = Sys.time()
>   newTest =test[keys%in%partition]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.2041721 secs

因此，即使它们具有相同数量的重复项，两个 data.tables（或更具体地说，data.table 中的 %in% 函数）显然以不同方式处理重复项。另一个与重复项相关的有趣观察是这个（注意我再次从原始的 tables 开始）：

> start = Sys.time()
>   newTest =test[keys%in%unique(partition)]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.6649222 secs
> start = Sys.time()
>   newTest =otherTest[keys%in%unique(partition)]
>   end  = Sys.time()
>   print(end - start)
Time difference of 0.205637 secs

因此，从 %in% 的正确参数中删除重复项也可以解决问题。

所以有了这个新信息，问题仍然存在：为什么这两个 data.table 以不同的方式处理重复值？

Answer 1

我希望操作时间与 thing 和 otherThing 的大小成正比，但我在您的输出中看不到它们的大小，因此很难确切知道期待什么。

但是 otherthing$keys 中的唯一值比 thing$keys 中的唯一值多得多（多 124.28 倍），所以您不希望操作花费更长的时间吗？它必须检查 table 中找到的每个唯一值的值（自从打印出该值后，您似乎就意识到了这一点）。

注意时间比率约为 60.8。

Answer 2

当它是 match 时，您正在关注 data.table（%in% 由 match 操作定义）以及您应该关注的向量的大小.一个可重现的例子：

library(microbenchmark)

set.seed(1492)

# sprintf to keep the same type and nchar of your values

keys_big <- sprintf("%014d", sample(5000, 4000000, replace=TRUE))
keys_small <- sprintf("%014d", sample(5000, 30000, replace=TRUE))

partition <- sample(keys_big, 250)

microbenchmark(
  "big"=keys_big %in% partition,
  "small"=keys_small %in% partition
)

## Unit: milliseconds
##   expr        min         lq       mean     median         uq        max neval cld
##    big 167.544213 184.222290 205.588121 195.137671 205.043641 376.422571   100   b
##  small   1.129849   1.269537   1.450186   1.360829   1.506126   2.848666   100  a

来自文档：

match returns a vector of the positions of (first) matches of its first argument in its second.

这本质上意味着它将取决于向量的大小以及如何找到（或找不到）"near to the top" 个匹配项。

但是，您可以使用 data.table 中的 %chin% 来加速整个过程，因为您使用的是字符向量：

library(data.table)

microbenchmark(
  "big"=keys_big %chin% partition,
  "small"=keys_small %chin% partition
)
## Unit: microseconds
##   expr       min         lq       mean     median        uq        max neval cld
##    big 36312.570 40744.2355 47884.3085 44814.3610 48790.988 119651.803   100   b
##  small   241.045   264.8095   336.1641   283.9305   324.031   1207.864   100  a

您也可以使用 fastmatch 包（但是您已经加载了 data.table 并且正在使用字符向量，所以 6/1|0.5*12):

library(fastmatch)

# gives us similar syntax & functionality as %in% and %chin%

"%fmin%" <- function(x, table) fmatch(x, table, nomatch = 0) > 0

microbenchmark(
  "big"=keys_big %fmin% partition,
  "small"=keys_small %fmin% partition
)

## Unit: microseconds
##   expr       min         lq       mean     median        uq        max neval cld
##    big 75189.818 79447.5130 82508.8968 81460.6745 84012.374 124988.567   100   b
##  small   443.014   471.7925   525.2719   498.0755   559.947    850.353   100  a

无论如何，任一向量的大小将最终决定 fast/slow 操作的方式。但是后两个选项至少能让你更快地得到结果。以下是小型和大型向量的所有三者之间的比较：

library(ggplot2)
library(gridExtra)

microbenchmark(
  "small_in"=keys_small %in% partition,
  "small_ch"=keys_small %chin% partition,
  "small_fm"=keys_small %fmin% partition,
  unit="us"
) -> small

microbenchmark(
  "big_in"=keys_big %in% partition,
  "big_ch"=keys_big %chin% partition,
  "big_fm"=keys_big %fmin% partition,
  unit="us"
) -> big

grid.arrange(autoplot(small), autoplot(big))

更新

根据 OP 评论，这是考虑有无 data.table 子集的另一个基准：

dat_big <- data.table(keys=keys_big)

microbenchmark(

  "dt"        = dat_big[keys %in% partition],
  "not_dt"    = dat_big$keys %in% partition,

  "dt_ch"     = dat_big[keys %chin% partition],
  "not_dt_ch" = dat_big$keys %chin% partition,

  "dt_fm"     = dat_big[keys %fmin% partition],
  "not_dt_fm" = dat_big$keys %fmin% partition

)

## Unit: milliseconds
##       expr       min        lq      mean    median        uq      max neval    cld
##         dt  11.74225  13.79678  15.90132  14.60797  15.66586 129.2547   100 a     
##     not_dt 160.61295 174.55960 197.98885 184.51628 194.66653 305.9615   100      f
##      dt_ch  46.98662  53.96668  66.40719  58.13418  63.28052 201.3181   100   c   
##  not_dt_ch  37.83380  42.22255  50.53423  45.42392  49.01761 147.5198   100  b    
##      dt_fm  78.63839  92.55691 127.33819 102.07481 174.38285 374.0968   100     e 
##  not_dt_fm  67.96827  77.14590  99.94541  88.75399  95.47591 205.1925   100    d

Answer 3

如果您的数据处理速度比您消耗它们的速度慢，您可以考虑在每次加载后对您的数据设置键，以便使用聚簇键和索引进行任何进一步的查询。
由于排序算法的精确和现代实现，设置密钥相对便宜。

library(data.table)
library(microbenchmark)

set.seed(1492)

keys_big <- sprintf("%014d", sample(5000, 4000000, replace=TRUE))
keys_small <- sprintf("%014d", sample(5000, 30000, replace=TRUE))

partition <- sample(keys_big, 250)

dat_big <- data.table(keys=keys_big, key = "keys")

microbenchmark(

    "dt"        = dat_big[keys %in% partition],
    "not_dt"    = dat_big$keys %in% partition,

    "dt_ch"     = dat_big[keys %chin% partition],
    "not_dt_ch" = dat_big$keys %chin% partition,

    "dt_key"    = dat_big[partition]

)
# Unit: milliseconds
#       expr        min         lq       mean     median         uq       max neval
#         dt   5.810935   6.100602   6.830618   6.493006   6.825171  20.47223   100
#     not_dt 237.730092 246.318824 266.484226 257.507188 272.433109 461.17918   100
#      dt_ch  62.822514  66.169728  71.522330  69.865380  75.056333 103.45799   100
#  not_dt_ch  51.292627  52.551307  58.236860  54.920637  59.762000 215.65466   100
#     dt_key   5.941748   6.210253   7.251318   6.568678   7.004453  23.45361   100

设置密钥时间

dat_big <- data.table(keys=keys_big)
system.time(setkey(dat_big, keys))
#    user  system elapsed 
#   0.230   0.008   0.238

是最近的1.9.5.

子集 data.table 的速度以奇怪的方式取决于特定的键值？

Speed of subsetting data.table depends on the particular key values in strange way?

r

data.table