Can data.table assign-by-reference (mutate) preserve vector names?

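I have a named vector of colors, and I use it with := to assign a new column based on the values of an existing column. If I use a dplyr mutate, the result differs from the data.table-style mutate: with dplyr the vector names are preserved, whereas with data.table the names are lost.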

Let me walk you through what I have found so far.

# first I make a named vector of colors
movieColors <- c("Aladdin" = "steelblue1",
                 "Beauty" = "gold1",
                 "Brave" = "darkorange1")
# lets create some dummy data
dt <- data.table::data.table(movie = c("Aladdin", "Beauty", "Brave"), 
                             movieNum = 1:3)

# I want a new column that tells me the color of each movie for each row
# a dplyr mutate
dt1 <- dplyr::mutate(.data = dt, 
                     movColor = movieColors[movie])
# a data.table mutate
dt2 <- dt[, movColor := movieColors[movie]]

# check the results and they look the same
dt1
dt2

# check that they are the same
dplyr::all_equal(dt1, dt2)
# they're not the same?

# the dplyr mutate is preserving the named vector
dt1$movColor
# the data.table mutate does not preserve the named vector
dt2$movColor

If you run it, you can see that dt1$movColor, the dplyr version, prints:

      Aladdin        Beauty         Brave 
 "steelblue1"       "gold1" "darkorange1" 

whereas dt2$movColor, the data.table version, prints:

[1] "steelblue1"   "gold1"        "darkorange1"

Why doesn't data.table preserve the named vector? Is there a way to force it to?

As @dww pointed out, there is no particular need to keep the names on the column. I don't know why {dplyr} supports this and {data.table} doesn't, but you can use data.table::setattr() to get the same result.

# first I make a named vector of colors
movieColors <- c("Aladdin" = "steelblue1",
                 "Beauty" = "gold1",
                 "Brave" = "darkorange1")
# lets create some dummy data
dt <- data.table::data.table(movie = c("Aladdin", "Beauty", "Brave"), 
                             movieNum = 1:3)

# I want a new column that tells me the color of each movie for each row
# a dplyr mutate
dt1 <- dplyr::mutate(.data = dt, 
                     movColor = movieColors[movie])
#> Registered S3 methods overwritten by 'tibble':
#>   method     from  
#>   format.tbl pillar
#>   print.tbl  pillar
# a data.table mutate
dt2 <- dt[, movColor := movieColors[movie]]



# size prior to adding names
object.size(dt2)
#> 2008 bytes

# add names to the movColor column in place, taken from the movie column
data.table::setattr(dt2$movColor, "names", dt2$movie)

# size after adding names
object.size(dt2)
#> 2368 bytes



# check the results and they look the same
dt1
#>      movie movieNum    movColor
#> 1: Aladdin        1  steelblue1
#> 2:  Beauty        2       gold1
#> 3:   Brave        3 darkorange1
dt2
#>      movie movieNum    movColor
#> 1: Aladdin        1  steelblue1
#> 2:  Beauty        2       gold1
#> 3:   Brave        3 darkorange1

# check that they are the same
dplyr::all_equal(dt1, dt2)
#> [1] TRUE

# the dplyr mutate is preserving the named vector
dt1$movColor
#>       Aladdin        Beauty         Brave 
#>  "steelblue1"       "gold1" "darkorange1"
# the data.table mutate does now preserve the named vector
dt2$movColor
#>       Aladdin        Beauty         Brave 
#>  "steelblue1"       "gold1" "darkorange1"

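As you can see, the names only repeat information that is already in the data.table (the movie column), yet the object grows from 2008 to 2368 bytes. That extra size may also be why data.table strips the names automatically.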
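If the names are only needed occasionally, one alternative is to leave the column unnamed and rebuild a named copy on demand. The following is a minimal sketch of that idea, not something the answer above proposes: it assumes the same example data as above, and the variable names stored_names and namedColors are made up for illustration. It uses base R's setNames(), which returns a named copy and leaves the data.table itself untouched.

```r
library(data.table)

# same named lookup vector and dummy data as in the question
movieColors <- c(Aladdin = "steelblue1", Beauty = "gold1", Brave = "darkorange1")
dt <- data.table(movie = c("Aladdin", "Beauty", "Brave"), movieNum = 1:3)

# := stores the looked-up values without their names, as observed above
dt[, movColor := movieColors[movie]]
stored_names <- names(dt$movColor)

# build a named copy only when it is needed (hypothetical helper variable);
# setNames() works on a copy, so the column inside dt stays unnamed
namedColors <- setNames(dt$movColor, dt$movie)

print(stored_names)
print(namedColors)
```

Because the names are derived from the movie column at the point of use, the table does not grow, and the named vector can still be passed to anything that expects one.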