One-hot encode each column in an Int matrix in R

Question

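I'm having trouble converting a matrix to a one-hot encoding in R. I have implemented this in Matlab, but I'm struggling with how R handles objects. Here I have an object of type 'matrix'.

I want to apply one-hot encoding to this matrix. My problem is with the column names.

Here is an example: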

> set.seed(4)
> t <- matrix(floor(runif(10, 1,9)),5,5)
> t
      [,1] [,2] [,3] [,4] [,5]
[1,]    5    3    5    3    5
[2,]    1    6    1    6    1
[3,]    3    8    3    8    3
[4,]    3    8    3    8    3
[5,]    7    1    7    1    7
> class(t)
[1] "matrix"

Expected:

      1_1 1_3 1_5 1_7  2_1 2_3 2_6 2_8 ...
[1,]   0   0   1   0    0   1   0   0  ...
[2,]   1   0   0   0    0   0   1   0  ...
[3,]   0   1   0   0    0   0   0   1  ...
[4,]   0   1   0   0    0   0   0   1  ...   
[5,]   0   0   0   1    1   0   0   0  ...

I tried the following, but the matrix stays the same.

library(data.table)
library(mltools)
test_table <- one_hot(as.data.table(t))

Any suggestions would be greatly appreciated.

Answer 1

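Your data.table must contain some columns (variables) of class "factor"; one_hot only encodes factor columns, which is why the integer columns came back unchanged. Try this: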

> t <- data.table(t)
> t[,V1:=factor(V1)]
> one_hot(t)
   V1_1 V1_3 V1_5 V1_7 V2 V3 V4 V5
1:    0    0    1    0  3  5  3  5
2:    1    0    0    0  6  1  6  1
3:    0    1    0    0  8  3  8  3
4:    0    1    0    0  8  3  8  3
5:    0    0    0    1  1  7  1  7

But I understood from here that the dummyVars function in the caret package is faster if the matrix is big.
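
For reference, a minimal sketch of the dummyVars route, assuming the caret package is installed. The names m, df, dv and encoded are illustrative only, and no timing comparison is made here. The columns are converted to factors first so that each distinct value gets its own indicator column.

```r
library(caret)

set.seed(4)
m <- matrix(floor(runif(10, 1, 9)), 5, 5)

# dummyVars works on data frames; turn every column into a factor
df <- as.data.frame(m)
df[] <- lapply(df, factor)

# Build the encoder from a formula covering all columns, then apply it
dv <- dummyVars(~ ., data = df)
encoded <- predict(dv, newdata = df)

dim(encoded)
```

The result is a plain numeric matrix, so it can be used directly wherever the original matrix was.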

Edit: Forgot to set the seed. :P

And a quick way to convert all variables in a data.table to factors:

t.f <- t[, lapply(.SD, as.factor)]
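
Putting the two steps together, the whole matrix can be encoded in one pass. This is only a sketch, assuming data.table and mltools are installed; m, dt_f and encoded are illustrative names, not part of the original answer.

```r
library(data.table)
library(mltools)

set.seed(4)
m <- matrix(floor(runif(10, 1, 9)), 5, 5)

# as.data.table() names the columns V1..V5; convert each one to a factor
dt_f <- as.data.table(m)[, lapply(.SD, as.factor)]

# one_hot() replaces every factor column with one indicator column per level
encoded <- one_hot(dt_f)

# Convert back to a plain matrix if a matrix is needed downstream
encoded_matrix <- as.matrix(encoded)
dim(encoded_matrix)
```

Note that the column names come out as V1_1, V1_3 and so on rather than 1_1, 1_3, so they would need renaming to match the format asked for in the question.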

Answer 2

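There may be more concise ways to do this, but this should work (and is at least easy to read and understand ;)

Suggested solution using base R and a double loop: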

set.seed(4)  
t <- matrix(floor(runif(10, 1,9)),5,5)

# initialize result object
#
t_hot <- NULL

# for each column in original matrix
#
for (col in seq_along(t[1,])) {
  # for each unique value in this column (sorted so the resulting
  # columns appear in order)
  #
  for (val in sort(unique(t[, col]))) {
    t_hot <- cbind(t_hot, ifelse(t[, col] == val, 1, 0))
    # make name for this column
    #
    colnames(t_hot)[ncol(t_hot)] <- paste0(col, "_", val)
  }
}

This returns:

     1_1 1_3 1_5 1_7 2_1 2_3 2_6 2_8 3_1 3_3 3_5 3_7 4_1 4_3 4_6 4_8 5_1 5_3 5_5 5_7
[1,]   0   0   1   0   0   1   0   0   0   0   1   0   0   1   0   0   0   0   1   0
[2,]   1   0   0   0   0   0   1   0   1   0   0   0   0   0   1   0   1   0   0   0
[3,]   0   1   0   0   0   0   0   1   0   1   0   0   0   0   0   1   0   1   0   0
[4,]   0   1   0   0   0   0   0   1   0   1   0   0   0   0   0   1   0   1   0   0
[5,]   0   0   0   1   1   0   0   0   0   0   0   1   1   0   0   0   0   0   0   1
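
For larger matrices, growing t_hot with cbind inside a loop can get slow. A possible variant, still in base R, builds one block per column with outer() and binds them once at the end. The helper name one_hot_matrix is hypothetical, and this is a sketch of the same idea rather than a tuned implementation.

```r
# Hypothetical helper: one-hot encode every column of a numeric matrix
one_hot_matrix <- function(m) {
  blocks <- lapply(seq_len(ncol(m)), function(col) {
    vals <- sort(unique(m[, col]))
    # outer() compares each row value with each unique value;
    # multiplying by 1 turns TRUE/FALSE into 1/0
    block <- outer(m[, col], vals, "==") * 1
    colnames(block) <- paste0(col, "_", vals)
    block
  })
  # Bind all per-column blocks side by side in one step
  do.call(cbind, blocks)
}

set.seed(4)
t <- matrix(floor(runif(10, 1, 9)), 5, 5)
t_hot2 <- one_hot_matrix(t)
```

With the same seed this should give the same column names and values as the loop version, since both sort the unique values of each column before encoding.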


r

one-hot-encoding