Dangers of mixing [tidyverse] and [data.table] syntax in R?

Question

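I'm seeing some very strange behavior when mixing tidyverse and data.table syntax. For context, I often write tidyverse syntax and then pipe back into data.table when I need speed more than code readability. I know Hadley is working on a new package that offers tidyverse syntax with data.table speed, but as far as I can tell it is still in its early stages, so I haven't been using it.

Would anyone care to explain what is going on here? This is very scary to me, because I have probably done this thousands of times without thinking.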

library(dplyr); library(data.table)
DT <-
  fread(
    "iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
"
  )

codes <- c("ALB", "ZMB")

# now, what happens if I use a tidyverse function (distinct) and then
# convert back to data.table?
DT <- distinct(DT) %>% as.data.table()

# this works like normal
DT[iso3c %in% codes]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC

# now, what happens if I use a different tidyverse function (arrange) 
# and then convert back to data.table?
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wack: (!!!!!!!!!!!!)
DT[iso3c %in% codes]
# iso3c country income
# 1:   ALB Albania   UMIC

# but these work:
DT[(iso3c %in% codes), ]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC
DT[DT$iso3c %in% codes, ]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC
DT[DT$iso3c %in% codes]
# iso3c country income
# 1:   ZMB  Zambia   LMIC
# 2:   ALB Albania   UMIC

Answer 1

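I have run into the same problem on several occasions, which led me to avoid mixing dplyr with data.table syntax, since I never took the time to find out the cause. So thank you for providing an MRE.

It looks like dplyr::arrange is interfering with data.table auto-indexing: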

index will be used when subsetting dataset with == or %in% on a single variable

by default if index for a variable is not present on filtering, it is automatically created and used

indexes are lost if you change the order of data

you can check if you are using index with options(datatable.verbose=TRUE)
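To make the quoted points concrete, here is a minimal sketch of how an index can be inspected, created, and removed by hand. The table contents are illustrative and simply mirror the question's data; `setindex()` is used as an explicit stand-in for what auto-indexing does implicitly on the first `==` or `%in%` subset.

```r
library(data.table)

# Illustrative table mirroring the question's data (after de-duplication)
DT <- data.table(
  iso3c   = c("MOZ", "ZMB", "ALB"),
  country = c("Mozambique", "Zambia", "Albania"),
  income  = c("LIC", "LMIC", "UMIC")
)

# A freshly created table carries no secondary index
idx_before <- indices(DT)

# Create an index on iso3c explicitly
setindex(DT, iso3c)
idx_after <- indices(DT)

# Remove every index from the table again
setindex(DT, NULL)
idx_cleared <- indices(DT)

print(idx_before)
print(idx_after)
print(idx_cleared)
```

Calling `indices(DT)` before and after a dplyr step is a quick way to see whether an index has been carried along.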

If we explicitly enable auto-indexing:

library(dplyr); 
library(data.table)

DT <- fread(
"iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")

options(datatable.auto.index = TRUE)

DT <- distinct(DT) %>%   as.data.table()

# Index creation because %in% is used for the first time
DT[iso3c %in% codes,verbose=T]
#> Creating new index 'iso3c'
#> Creating index iso3c done in ... forder.c received 3 rows and 3 columns
#> forder took 0 sec
#> 0.060s elapsed (0.060s cpu) 
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
#> Reordering 2 rows after bmerge done in ... forder.c received a vector type 'integer' length 2
#> 0 secs
#>    iso3c country income
#> 1:   ZMB  Zambia   LMIC
#> 2:   ALB Albania   UMIC

# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()

# this is wack because data.table possibly still uses the old index whereas row/references were rearranged:
DT[iso3c %in% codes,verbose=T]
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu) 
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#>    iso3c country income
#> 1:   ALB Albania   UMIC

# this works because wrapping the condition in (...) keeps data.table from applying the auto-index optimization
DT[(iso3c %in% codes)]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC
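For comparison, a sketch of the same reorder done with data.table's own `setorder()` instead of `dplyr::arrange`. The assumption here is that data.table keeps its index bookkeeping consistent when it performs the reorder itself, in line with the quoted statement that indexes are lost when the order of the data changes. The table is illustrative, and `setindex()` stands in for the auto-created index.

```r
library(data.table)

# Illustrative table mirroring the question's data
DT <- data.table(
  iso3c   = c("MOZ", "ZMB", "ALB"),
  country = c("Mozambique", "Zambia", "Albania"),
  income  = c("LIC", "LMIC", "UMIC")
)
codes <- c("ALB", "ZMB")

# Build an index on iso3c explicitly, standing in for the auto-index
setindex(DT, iso3c)

# Reorder by reference with data.table's own function
setorder(DT, iso3c)

# Inspect which indices, if any, are still attached after the reorder
print(indices(DT))

# Subset with the same expression that misbehaved after dplyr::arrange
res <- DT[iso3c %in% codes]
print(res)
```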

To avoid this issue, you can disable auto-indexing:

library(dplyr); 
library(data.table)

DT <- fread(
"iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")

options(datatable.auto.index = FALSE) # Disabled

DT <- distinct(DT) %>%   as.data.table()

# No automatic index creation
DT[iso3c %in% codes,verbose=T]
#>    iso3c country income
#> 1:   ZMB  Zambia   LMIC
#> 2:   ALB Albania   UMIC

DT <- DT %>% arrange(iso3c) %>% as.data.table()

# This now works because auto-indexing is off:
DT[iso3c %in% codes,verbose=T]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC
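A possible alternative, offered only as a sketch: keep auto-indexing on, but drop any index right after the dplyr step with `setindex(DT, NULL)` so that the next subset cannot rely on one built for the old row order. This is an assumption about a workable pattern, not something taken from the issue reports, and whether it is needed at all may depend on the installed data.table and dplyr versions.

```r
library(dplyr)
library(data.table)

# Illustrative table mirroring the question's data
DT <- data.table(
  iso3c   = c("MOZ", "ZMB", "ALB"),
  country = c("Mozambique", "Zambia", "Albania"),
  income  = c("LIC", "LMIC", "UMIC")
)
codes <- c("ALB", "ZMB")

# First subset: with auto-indexing enabled this may create an index on iso3c
DT[iso3c %in% codes]

# Reorder with dplyr, convert back, then discard any index that came along
DT <- DT %>% arrange(iso3c) %>% as.data.table()
setindex(DT, NULL)

# The subset now has no pre-existing index to rely on
res <- DT[iso3c %in% codes]
print(res)
```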

I reported this issue at data.table/issues/5042 and at dtplyr/issues/259: integrated in 1.4.11 milestone.

Answer 2

Using the tidytable package, this doesn't happen (see below). It's now available on CRAN. tidytable lets you use tidyverse syntax with minimal changes (distinct., arrange.) while getting the speed of data.table, which seems to be what the OP wants overall (who doesn't!).

library(data.table)
library(tidytable)



DT <-
  fread(
    "iso3c  country income
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
MOZ Mozambique  LIC
ZMB Zambia  LMIC
ALB Albania UMIC
"
  )

codes <- c("ALB", "ZMB")

DT <- distinct.(DT) %>% as.data.table()

# this works like normal
DT[iso3c %in% codes]
#>    iso3c country income
#> 1:   ZMB  Zambia   LMIC
#> 2:   ALB Albania   UMIC

DT <- DT %>% arrange.(iso3c) %>% as.data.table()

# this is no longer wack
DT[iso3c %in% codes]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

# and these work as normal:
DT[(iso3c %in% codes), ]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

DT[DT$iso3c %in% codes, ]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC

DT[DT$iso3c %in% codes]
#>    iso3c country income
#> 1:   ALB Albania   UMIC
#> 2:   ZMB  Zambia   LMIC
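As a hedged footnote to the example above: more recent tidytable releases also export the verbs without the trailing dot. The sketch below assumes such a version is installed and calls them with an explicit `tidytable::` prefix so they cannot be confused with the dplyr functions of the same name. The table is illustrative.

```r
library(data.table)

# Illustrative table mirroring the question's data, duplicates included
DT <- data.table(
  iso3c   = rep(c("MOZ", "ZMB", "ALB"), 2),
  country = rep(c("Mozambique", "Zambia", "Albania"), 2),
  income  = rep(c("LIC", "LMIC", "UMIC"), 2)
)
codes <- c("ALB", "ZMB")

# De-duplicate with tidytable, then convert back as in the answer
DT <- as.data.table(tidytable::distinct(DT))
first <- DT[iso3c %in% codes]

# Reorder with tidytable, then convert back
DT <- as.data.table(tidytable::arrange(DT, iso3c))
res <- DT[iso3c %in% codes]
print(res)
```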

r

dplyr

data.table

tidyverse