How to define an input tensor that has thousands of possible categorical values?

Question

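I know that if I have a categorical input with a handful of possible values (for example, country or color), I can use a one-hot tensor (all 0s except for a single 1).

I also understand that if the variable has many possible values (for example, thousands of possible postcodes or school IDs), a one-hot tensor may be inefficient and we should use some other representation (hash-based?). But I have not found documentation or examples on how to do this with the JavaScript version of TensorFlow.

Any hints?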

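Update: @edkeveked gave me the right suggestion about using embeddings, but now I need some help on how to actually use embeddings in tensorflowjs.

Let me give a concrete example:

Suppose I have records of people, each with an age (integer), a state (integer from 0 to 49), and a risk (0 or 1).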

const data = [
  {age: 20, state: 0, risk: 0},
  {age: 30, state: 35, risk: 0},
  {age: 60, state: 35, risk: 1},
  {age: 75, state: 17, risk: 1},
  ...
]

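When I want to create a classifier model with tensorflowjs, I encode the state as a one-hot tensor, encode the risk (the label) as a one-hot tensor (risk: 01, no risk: 10), and build a model with dense layers like this: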

const inputTensorAge = tf.tensor(data.map(d => d.age),[data.length,1])
const inputTensorState =  tf.oneHot(data.map(d => d.state),50)
const labelTensor = tf.oneHot(data.map(d => d.risk),2)

const inputDims = 51;
const model = tf.sequential({
  layers: [
    tf.layers.dense({units: 8, inputDim:inputDims, activation: 'relu'}),
    tf.layers.dense({units: 2, activation: 'softmax'}),
  ]
});

model.compile({loss: 'categoricalCrossentropy', "optimizer": "Adam", metrics:["accuracy"]});

model.fit(tf.concat([inputTensorState, inputTensorAge],1), labelTensor, {epochs:10})

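(By the way, I am new to tensorflow, so there may be better approaches, but this works for me.)

Now, my challenge. Suppose I want a similar model, but now I have a postcode instead of a state (say the postcode has 10000 possible values):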

const data = [
  {age: 20, postcode: 0, risk: 0},
  {age: 30, postcode: 11, risk: 0},
  {age: 60, postcode: 11, risk: 1},
  {age: 75, postcode: 9876, risk: 1},
  ...
]

If I want to use embeddings to represent the postcode, I understand that I should use an embedding layer such as:

tf.layers.embedding({inputDim:10000, outputDim: 20})

So, if I use only the postcode as input and ignore the age, the model would be:

const model = tf.sequential({
  layers: [
    tf.layers.embedding({inputDim:10000, outputDim: 20}),
    tf.layers.dense({units: 2, activation: 'softmax'}),
  ]
});

If I create the input tensor as

const inputTensorPostcode = tf.tensor(data.map(d => d.postcode));

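and then try model.fit(inputTensorPostcode, labelTensor, {epochs:10})

it does not work, so I am obviously doing something wrong.

Any hints on how to create the model and run model.fit with an embedding?

Also, if I want to combine multiple inputs (say postcode and age), how should I do it?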

Answer 1

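For categorical data, one-hot encoding can be used to solve the problem. The drawback of one-hot encoding is that it often produces sparse data with many zeros.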
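To make the sparsity concrete, here is a minimal plain-JavaScript sketch (no TensorFlow.js involved). The helper name `oneHot` and the sample indices are illustrative assumptions, chosen to mirror the 50 states and 10000 postcodes in the question.

```javascript
// Build a one-hot vector of length `numClasses` with a single 1 at `index`.
// Hypothetical helper for illustration; tf.oneHot does the equivalent on tensors.
function oneHot(index, numClasses) {
  if (!Number.isInteger(index) || index < 0 || index >= numClasses) {
    throw new RangeError(`index ${index} is outside [0, ${numClasses})`);
  }
  const vec = new Array(numClasses).fill(0);
  vec[index] = 1;
  return vec;
}

const stateVec = oneHot(35, 50);          // 50 entries, one of them is 1
const postcodeVec = oneHot(9876, 10000);  // 10000 entries, one of them is 1

// Count how many entries carry no information for this sample.
const zeroCount = postcodeVec.filter(v => v === 0).length;
console.log(stateVec.length, postcodeVec.length, zeroCount); // 50 10000 9999
```

Every sample pays for all 10000 slots even though only one of them is non-zero, which is the inefficiency the question refers to.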

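Another way to handle categorical data is to reduce the dimensionality of the input data. This technique is known as embeddings. For creating models involving categorical data, the Js API provides an embedding layer that one might use.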

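Edit: the data here is not really categorical. It could be framed that way, but there is no reason to do so. A classic example of categorical data, taken from recommender systems, is data recording which movies a user has or has not watched. The data would look like this: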

       ___________________________________________________
       | movie 1  | movie 2  | movie 3 | ---  | movie n  |
       |__________|__________|_________|______|__________|
user 1 |    0     |    1     |    1    | ---  |    0     |
user 2 |    0     |    0     |    1    | ---  |    0     |
user 3 |    0     |    1     |    0    | ---  |    0     |
  .    |    .     |    .     |    .    | ---  |    .     |
  .    |    .     |    .     |    .    | ---  |    .     |
  .    |    .     |    .     |    .    | ---  |    .     |

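Here the input dimension is the number of movies, n. Such data can be very sparse, with many zeros, because the database may contain hundreds of thousands of movies while an average user has hardly watched more than a thousand. In that case about a thousand fields are 1 and all the rest are 0. Data of this kind needs to be aggregated with embeddings in order to reduce the dimension from n to a much smaller value.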
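The aggregation described above can be sketched in plain JavaScript. Everything in this toy example is an assumption made for illustration: n = 6 movies, k = 2 embedding dimensions, and hand-picked integer weights standing in for values that a real model would learn.

```javascript
// Hypothetical embedding table: one row of k = 2 weights per movie (n = 6 movies).
const movieEmbeddings = [
  [1, 0],
  [0, 1],
  [1, 1],
  [2, 0],
  [0, 2],
  [1, 2],
];

// Reduce a watched/not-watched row of length n to a dense vector of length k
// by summing the embedding rows of the movies the user has watched.
function embedUser(watchedRow, embeddings) {
  if (watchedRow.length !== embeddings.length) {
    throw new Error('row length must equal the number of movies');
  }
  const k = embeddings[0].length;
  const dense = new Array(k).fill(0);
  watchedRow.forEach((watched, movie) => {
    if (watched === 1) {
      for (let d = 0; d < k; d++) {
        dense[d] += embeddings[movie][d];
      }
    }
  });
  return dense;
}

// A user who watched movies 2 and 3 (indices 1 and 2).
const user1Row = [0, 1, 1, 0, 0, 0];
const user1Dense = embedUser(user1Row, movieEmbeddings);
console.log(user1Dense); // [ 1, 2 ]
```

The row of length n is replaced by a vector of length k, and only the watched movies contribute to it, so the zeros cost nothing.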

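That is not the case here. The input data has only 2 features, age and postcode, so the input dimension is 2. The output (label) is always one-dimensional (here the label is the risk attribute), but since there are two classes, the output layer has size 2. The range of the postcode values does not affect our classification.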

const data = [
  {age: 20, state: 0, risk: 0},
  {age: 30, state: 35, risk: 0},
  {age: 60, state: 35, risk: 1},
  {age: 75, state: 17, risk: 1}
]

const model = tf.sequential()
model.add(tf.layers.dense({inputShape: [2], units: 10, activation: 'relu'}))
model.add(tf.layers.dense({activation: 'softmax', units: 2}))
const x = tf.tensor2d(data.map(e => [e.age, e.state]), [data.length, 2])
const y = tf.oneHot(tf.tensor1d(data.map(e => e.risk), "int32"), 2)

model.compile({optimizer: 'adam', loss: 'categoricalCrossentropy' })
model.fit(x, y, {epochs: 10}).then(() => {
  // each prediction looks like [p, 1-p] with 0 <= p <= 1
  // predictions [p, 1-p] with p > 0.5 belong to the first class
  // predictions [p, 1-p] with 1-p > 0.5 belong to the second class
  // the predictions for [30, 35] and [0, 20] should land in the same class
  // (both with p > 0.5 or both with p < 0.5)
  // the prediction for [75, 17] should land in the other class
  model.predict(tf.tensor2d([[30, 35], [0, 20], [75, 17]])).print()
})

<html>
  <head>
    <!-- Load TensorFlow.js -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.13.0"> </script>
  </head>

  <body>
  </body>
</html>

tensorflow.js