Dangers of mixing [tidyverse] and [data.table] syntax in R?
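I'm getting some very weird behavior when mixing tidyverse and data.table syntax.
For context, I often find myself using tidyverse syntax and then piping back to data.table, depending on whether I need speed or code readability. I know Hadley is working on a new package that uses tidyverse syntax with data.table speed, but from what I can see it's still in its early stages, so I haven't been using it.
Anyone care to explain what's going on here? This is very scary for me, as I've probably done this thousands of times without thinking.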
library(dplyr); library(data.table)
DT <-
fread(
"iso3c country income
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
"
)
codes <- c("ALB", "ZMB")
# now, what happens if I use a tidyverse function (distinct) and then
# convert back to data.table?
DT <- distinct(DT) %>% as.data.table()
# this works like normal
DT[iso3c %in% codes]
# iso3c country income
# 1: ZMB Zambia LMIC
# 2: ALB Albania UMIC
# now, what happens if I use a different tidyverse function (arrange)
# and then convert back to data.table?
DT <- DT %>% arrange(iso3c) %>% as.data.table()
# this is wack: (!!!!!!!!!!!!)
DT[iso3c %in% codes]
# iso3c country income
# 1: ALB Albania UMIC
# but these work (rows come back in the arranged order):
DT[(iso3c %in% codes), ]
# iso3c country income
# 1: ALB Albania UMIC
# 2: ZMB Zambia LMIC
DT[DT$iso3c %in% codes, ]
# iso3c country income
# 1: ALB Albania UMIC
# 2: ZMB Zambia LMIC
DT[DT$iso3c %in% codes]
# iso3c country income
# 1: ALB Albania UMIC
# 2: ZMB Zambia LMIC
I've run into the same problem several times, which led me to avoid mixing dplyr with data.table syntax, since I never took the time to find out the reason. So thanks for providing an MRE.
It looks like dplyr::arrange is interfering with data.table auto-indexing:
- an index will be used when subsetting a dataset with == or %in% on a single variable
- by default, if an index for a variable is not present on filtering, it is automatically created and used
- indexes are lost if you change the order of the data
- you can check whether you are using an index with options(datatable.verbose=TRUE)
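As a quick way to see the points above without reading verbose logs, here is a minimal sketch that inspects the index attribute directly with data.table's indices(). The toy table is an assumption made for illustration, and whether an index actually gets created depends on the data.table options in effect, so the result is printed rather than asserted.

```r
library(data.table)

# Toy data mirroring the question's example (illustrative only)
DT <- data.table(
  iso3c  = c("MOZ", "ZMB", "ALB"),
  income = c("LIC", "LMIC", "UMIC")
)

# These are the documented defaults; set explicitly for clarity
options(datatable.auto.index = TRUE, datatable.use.index = TRUE)

before <- indices(DT)                    # no secondary index yet
res    <- DT[iso3c %in% c("ALB", "ZMB")] # single-column %in% subset
after  <- indices(DT)                    # index names attached to DT, if any

print(before)
print(after)
```

If auto-indexing kicked in, the second print should list iso3c, which is the index that a later reordering outside data.table can leave stale.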
If we explicitly turn auto-indexing on:
library(dplyr);
library(data.table)
DT <- fread(
"iso3c country income
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")
options(datatable.auto.index = TRUE)
DT <- distinct(DT) %>% as.data.table()
# Index creation because %in% is used for the first time
DT[iso3c %in% codes,verbose=T]
#> Creating new index 'iso3c'
#> Creating index iso3c done in ... forder.c received 3 rows and 3 columns
#> forder took 0 sec
#> 0.060s elapsed (0.060s cpu)
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu)
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#> Reordering 2 rows after bmerge done in ... forder.c received a vector type 'integer' length 2
#> 0 secs
#> iso3c country income
#> 1: ZMB Zambia LMIC
#> 2: ALB Albania UMIC
# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>% as.data.table()
# this is wack because data.table possibly still uses the old index whereas row/references were rearranged:
DT[iso3c %in% codes,verbose=T]
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu)
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#> iso3c country income
#> 1: ALB Albania UMIC
# this works because wrapping the condition in (...) prevents data.table from using the auto-index
DT[(iso3c %in% codes)]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC
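If the stale index is indeed the culprit, one possible workaround is to discard all secondary indices when converting back to data.table, so the next subset has to rebuild them from the current row order. The sketch below is an assumption-laden illustration: as_dt_no_index is a hypothetical helper name, and I have not verified it against the affected dplyr/data.table versions. It relies only on setindex(DT, NULL), which data.table documents as the way to remove all indices.

```r
library(data.table)

# Hypothetical helper (not part of data.table or dplyr):
# convert to data.table and drop any secondary indices.
as_dt_no_index <- function(x) {
  x <- as.data.table(x)  # returns a copy when x is already a data.table
  setindex(x, NULL)      # removes all secondary indices by reference
  x
}

DT <- data.table(
  iso3c  = c("MOZ", "ZMB", "ALB"),
  income = c("LIC", "LMIC", "UMIC")
)
setindex(DT, iso3c)          # build an index explicitly for the demo

clean <- as_dt_no_index(DT)
print(indices(DT))           # the original still carries its index
print(indices(clean))        # the converted copy carries none
```

In a pipeline this would replace the final as.data.table() call, e.g. DT %>% arrange(iso3c) %>% as_dt_no_index().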
To avoid this problem, you can disable auto-indexing:
library(dplyr);
library(data.table)
DT <- fread(
"iso3c country income
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")
options(datatable.auto.index = FALSE) # Disabled
DT <- distinct(DT) %>% as.data.table()
# No automatic index creation
DT[iso3c %in% codes,verbose=T]
#> iso3c country income
#> 1: ZMB Zambia LMIC
#> 2: ALB Albania UMIC
DT <- DT %>% arrange(iso3c) %>% as.data.table()
# This now works because auto-indexing is off:
DT[iso3c %in% codes,verbose=T]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC
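An alternative to switching the option off globally is to do the reordering inside data.table, so that data.table itself is aware the row order changed. The following is a minimal sketch using setorder() in place of dplyr::arrange(); the toy table is an assumption for illustration, and this swaps in a different function rather than fixing the dplyr pipeline.

```r
library(data.table)

DT <- data.table(
  iso3c  = c("MOZ", "ZMB", "ALB"),
  income = c("LIC", "LMIC", "UMIC")
)
codes <- c("ALB", "ZMB")

first <- DT[iso3c %in% codes]   # may create an auto-index on iso3c
setorder(DT, iso3c)             # reorder by reference, within data.table
second <- DT[iso3c %in% codes]  # subset again after the reorder

print(first)
print(second)
```

Both subsets return the two matching rows; only the row order differs, following the order of DT at the time of each call.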
I reported this problem on data.table/issues/5042 and on dtplyr/issues/259: integrated in the 1.4.11 milestone.
With the tidytable package this doesn't happen (see below). It's now available on CRAN. tidytable lets you use tidyverse syntax with minimal changes (distinct., arrange.) while getting the speed of data.table, which seems to be what the OP wants in general (and who doesn't!).
library(data.table)
library(tidytable)
DT <-
fread(
"iso3c country income
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
MOZ Mozambique LIC
ZMB Zambia LMIC
ALB Albania UMIC
"
)
codes <- c("ALB", "ZMB")
DT <- distinct.(DT) %>% as.data.table()
# this works like normal
DT[iso3c %in% codes]
#> iso3c country income
#> 1: ZMB Zambia LMIC
#> 2: ALB Albania UMIC
DT <- DT %>% arrange.(iso3c) %>% as.data.table()
# this is no longer wack
DT[iso3c %in% codes]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC
# and these work as normal:
DT[(iso3c %in% codes), ]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC
DT[DT$iso3c %in% codes, ]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC
DT[DT$iso3c %in% codes]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC