Can data.table assign-by-reference (mutate) preserve vector names?
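I have a named vector of colors that I use to assign a new column with := based on the values of another column. If I use dplyr's mutate, the result differs from a data.table-style mutate: with dplyr the vector names are preserved, whereas with data.table the names are lost.
Let me walk you through what I have so far.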
# first I make a named vector of colors
movieColors <- c("Aladdin" = "steelblue1",
                 "Beauty" = "gold1",
                 "Brave" = "darkorange1")
# let's create some dummy data
dt <- data.table::data.table(movie = c("Aladdin", "Beauty", "Brave"),
                             movieNum = 1:3)
# I want a new column that tells me the color of each movie for each row
# a dplyr mutate
dt1 <- dplyr::mutate(.data = dt,
                     movColor = movieColors[movie])
# a data.table mutate (note that := also modifies dt itself by reference)
dt2 <- dt[, movColor := movieColors[movie]]
# check the results and they look the same
dt1
dt2
# check that they are the same
dplyr::all_equal(dt1, dt2)
# they're not the same?
# the dplyr mutate is preserving the named vector
dt1$movColor
# the data.table mutate does not preserve the named vector
dt2$movColor
If you run it, you can see that dt1$movColor, the dplyr version, prints:
Aladdin Beauty Brave
"steelblue1" "gold1" "darkorange1"
whereas dt2$movColor, the data.table version, prints:
[1] "steelblue1" "gold1" "darkorange1"
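Why doesn't data.table preserve the named vector? Is there a way to force it to?
As @dww pointed out, there is no particular need to keep the names in the column. I don't know why {dplyr} supports this while {data.table} does not, but you can use data.table::setattr() to get the same result.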
# first I make a named vector of colors
movieColors <- c("Aladdin" = "steelblue1",
                 "Beauty" = "gold1",
                 "Brave" = "darkorange1")
# let's create some dummy data
dt <- data.table::data.table(movie = c("Aladdin", "Beauty", "Brave"),
                             movieNum = 1:3)
# I want a new column that tells me the color of each movie for each row
# a dplyr mutate
dt1 <- dplyr::mutate(.data = dt,
                     movColor = movieColors[movie])
#> Registered S3 methods overwritten by 'tibble':
#> method from
#> format.tbl pillar
#> print.tbl pillar
# a data.table mutate
dt2 <- dt[, movColor := movieColors[movie]]
# size prior to adding names
object.size(dt2)
#> 2008 bytes
# add names to the movColor column in place, reusing the values of the movie column
data.table::setattr(dt2$movColor, "names", dt2$movie)
# size after adding names
object.size(dt2)
#> 2368 bytes
# check the results and they look the same
dt1
#> movie movieNum movColor
#> 1: Aladdin 1 steelblue1
#> 2: Beauty 2 gold1
#> 3: Brave 3 darkorange1
dt2
#> movie movieNum movColor
#> 1: Aladdin 1 steelblue1
#> 2: Beauty 2 gold1
#> 3: Brave 3 darkorange1
# check that they are the same
dplyr::all_equal(dt1, dt2)
#> [1] TRUE
# the dplyr mutate is preserving the named vector
dt1$movColor
#> Aladdin Beauty Brave
#> "steelblue1" "gold1" "darkorange1"
# the data.table mutate does now preserve the named vector
dt2$movColor
#> Aladdin Beauty Brave
#> "steelblue1" "gold1" "darkorange1"
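As you can see, I am only using information that is already in the data.table: the names duplicate the movie column. As a result, the size of the object increases (from 2008 to 2368 bytes here). This may also be why data.table strips names automatically.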
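If the names are not actually needed inside the table, two alternatives avoid storing them at all. The following is only a minimal sketch reusing the objects from the question; the variable names looked_up and named_colors are made up for illustration. It relies on base R's unname() and setNames(), and on the fact that subsetting a named vector by a character index returns a named vector, which is where the names in the dplyr result come from.

```r
library(data.table)

movieColors <- c("Aladdin" = "steelblue1",
                 "Beauty" = "gold1",
                 "Brave" = "darkorange1")
dt <- data.table(movie = c("Aladdin", "Beauty", "Brave"),
                 movieNum = 1:3)

# Subsetting a named vector by character index keeps the names;
# this is the value both dplyr and data.table receive on the right-hand side.
looked_up <- movieColors[dt$movie]

# Option 1: drop the names explicitly at assignment time, so the stored
# column is a plain character vector regardless of which tool creates it.
dt[, movColor := unname(movieColors[movie])]

# Option 2: rebuild the named vector on demand from two existing columns,
# instead of storing the names as an attribute of the column.
named_colors <- setNames(dt$movColor, dt$movie)
```

With this approach the table stays at its smaller size, and a named vector such as named_colors can still be created wherever one is required, for example for a plotting function that expects named colors.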