How to get unique() to work on data.tables with character columns?

Question

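If I create an R data.table containing a string column without setting stringsAsFactors=TRUE, and then call unique() to get the unique rows of the table, the strings are dropped from the resulting table, even though they are taken into account when determining which rows are unique.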

> dt <- data.table(x=c('a', 'a', 'b', 'c'), y=c(1, 1, 2, 2), stringsAsFactors=FALSE)
> unique(dt)
   x y
1:   1
2:   2
3:   2
> dt <- data.table(x=c('a', 'a', 'b', 'c'), y=c(1, 1, 2, 2), stringsAsFactors=TRUE)
> unique(dt)
   x y
1: a 1
2: b 2
3: c 2

Is this the correct behavior? I am on Cygwin, and I have previously run into some obscure Cygwin-specific problems inside R. Here is the output of sessionInfo():

R version 3.4.0 (2017-04-21)
Platform: x86_64-unknown-cygwin (64-bit)
Running under: CYGWIN_NT-6.1 INT-3A02 2.8.1(0.312/5/3) 2017-07-03 14:11 x86_64 Cygwin

Matrix products: default
LAPACK: /usr/lib/R/modules/lapack.dll

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] bit_1.1-12     compiler_3.4.0 bit64_0.9-7
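
One way to confirm that the character values really are lost, and are not just printed as blanks, is to compare the result against base R's unique() on an equivalent data.frame. This is only a diagnostic sketch: the variable names res and expected are illustrative, and on an installation that behaves normally both checks should come back TRUE.

```r
library(data.table)

# Same data as in the question; data.table() keeps strings as character by default.
dt <- data.table(x = c("a", "a", "b", "c"), y = c(1, 1, 2, 2))

# Reference result computed with base R on a plain data.frame.
expected <- unique(data.frame(x = c("a", "a", "b", "c"),
                              y = c(1, 1, 2, 2),
                              stringsAsFactors = FALSE))

res <- unique(dt)

# Compare the underlying column values rather than the printed output.
x_ok <- identical(res$x, expected$x)
y_ok <- identical(res$y, expected$y)
print(c(x_ok = x_ok, y_ok = y_ok))
```

If x_ok is FALSE while y_ok is TRUE, the problem is in the column data itself rather than in how the table is printed.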

Answer 1

The duplicated() function may provide a workaround. dt[!duplicated(dt), ] returns the same result as unique(dt) in both cases on my system (Ubuntu Linux, R version 3.13.0-121-generic).

> library(data.table)
> dt <- data.table(x=factor(c('a', 'a', 'b', 'c')), y=c(1, 1, 2, 2))
> all.equal(unique(dt), dt[!duplicated(dt), ])
[1] TRUE

> dt <- data.table(x=c('a', 'a', 'b', 'c'), y=c(1, 1, 2, 2))
> all.equal(unique(dt), dt[!duplicated(dt), ])
[1] TRUE
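
The workaround above can be wrapped in a small helper so that calling code does not depend on unique() directly. This is a minimal sketch: unique_rows is a hypothetical name, not a data.table function, and whether it avoids the problem on Cygwin is an assumption that has not been verified here.

```r
library(data.table)

# Hypothetical helper (not part of data.table): keep each row the first
# time it appears by filtering on duplicated() instead of calling unique().
unique_rows <- function(tbl) {
  stopifnot(is.data.table(tbl))
  tbl[!duplicated(tbl)]
}

dt <- data.table(x = c("a", "a", "b", "c"), y = c(1, 1, 2, 2))
res <- unique_rows(dt)

# The character column should keep its values after de-duplication.
stopifnot(is.character(res$x))
print(res)
```

Checking is.character(res$x) right after the call makes a failure show up immediately instead of surfacing later as blank values in printed output.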

Related post: Finding ALL duplicate rows, including "elements with smaller subscripts"

cygwin

r

data.table