How to correctly give inputs to Embedding, LSTM and Linear layers in PyTorch?

Question

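I need some clarity on how to correctly prepare inputs for batch training using the different components of the torch.nn module. Specifically, I want to create an encoder-decoder network for a seq2seq model.

Suppose I have a module with these three layers, in this order: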

nn.Embedding
nn.LSTM
nn.Linear

`nn.Embedding`

Input: batch_size * seq_length
Output: batch_size * seq_length * embedding_dimension

I don't have any problems here; I just want to be explicit about the expected shapes of the input and output.

`nn.LSTM`

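Input: seq_length * batch_size * input_size (embedding_dimension in this case)
Output: seq_length * batch_size * hidden_size
last_hidden_state: batch_size * hidden_size
last_cell_state: batch_size * hidden_size

To use the output of the Embedding layer as input for the LSTM layer, I need to transpose axis 1 and axis 2.

Many examples I've found online do something like x = embeds.view(len(sentence), self.batch_size, -1), which confuses me. How does this view ensure that elements of the same batch remain in the same batch? What happens when len(sentence) and self.batch_size are the same?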

`nn.Linear`

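Input: batch_size x input_size (hidden_size of the LSTM in this case, or ??)
Output: batch_size x output_size

If I only need the last_hidden_state of the LSTM, then I can give it as input to nn.Linear.

But if I want to make use of Output (which also contains all the intermediate hidden states), then I need to change nn.Linear's input size to seq_length * hidden_size. To use Output as input to the Linear module, I need to transpose axis 1 and axis 2 of Output, and then I can view it with Output_transposed(batch_size, -1).

Is my understanding correct? How do I carry out these transpose operations on tensors (tensor.transpose(0, 1))?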

Answer 1

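Your understanding of most of the concepts is accurate, but there are some missing points here and there.

Interfacing Embedding to LSTM (or any other recurrent unit)

You have the embedding output in the shape (batch_size, seq_len, embedding_size). There are several ways to pass this to the LSTM.
* You can pass it directly to the LSTM if the LSTM accepts batch-first input. To do so, pass the argument batch_first=True when creating the LSTM.
* Or, you can pass the input in the shape (seq_len, batch_size, embedding_size). To convert the embedding output to this shape, you need to transpose the first and second dimensions using torch.transpose(tensor_name, 0, 1), as you mentioned.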
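The two options above can be sketched as follows. This is a minimal illustration, not a reference implementation; the sizes and the names lstm_bf and lstm_sf are made up for the example.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration
batch_size, seq_len, vocab_size = 4, 7, 50
embedding_size, hidden_size = 8, 16

embedding = nn.Embedding(vocab_size, embedding_size)
tokens = torch.randint(0, vocab_size, (batch_size, seq_len))
embeds = embedding(tokens)  # (batch_size, seq_len, embedding_size)

# Option 1: tell the LSTM that the batch dimension comes first
lstm_bf = nn.LSTM(embedding_size, hidden_size, batch_first=True)
out_bf, (h_bf, c_bf) = lstm_bf(embeds)  # out_bf: (batch_size, seq_len, hidden_size)

# Option 2: keep the default layout and transpose the first two dimensions
lstm_sf = nn.LSTM(embedding_size, hidden_size)
out_sf, (h_sf, c_sf) = lstm_sf(torch.transpose(embeds, 0, 1))  # out_sf: (seq_len, batch_size, hidden_size)
```

Note that batch_first only affects the input and output tensors; the hidden and cell states keep the shape (num_layers * num_directions, batch_size, hidden_size) in both cases.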

Q. I see many examples online which do something like x = embeds.view(len(sentence), self.batch_size , -1) which confuses me.
A. This is wrong. It will mix up batches and you will be trying to learn a hopeless learning task. Wherever you see this, you can tell the author to change this statement and use transpose instead.
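To see why view mixes up the batch, here is a small sketch with a toy tensor (the values are hypothetical and chosen so that each sample is easy to recognize). Sample 0 holds 0, 1, 2 and sample 1 holds 10, 11, 12.

```python
import torch

batch_size, seq_len = 2, 3
# Shape (batch_size, seq_len, 1): a stand-in for an embedding output with embedding_size 1
embeds = torch.tensor([[0., 1., 2.], [10., 11., 12.]]).unsqueeze(-1)

wrong = embeds.view(seq_len, batch_size, -1)  # only reinterprets memory, does not move data
right = embeds.transpose(0, 1)                # actually swaps the batch and time axes

# Read back the time series that the LSTM would see for batch element 0
wrong_sample0 = wrong[:, 0, 0].tolist()  # [0.0, 2.0, 11.0], values from both samples
right_sample0 = right[:, 0, 0].tolist()  # [0.0, 1.0, 2.0], the original sample 0
```

When len(sentence) equals the batch size, view returns the tensor unchanged, so no error is raised and the batch axis is silently treated as the time axis.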

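There is an argument in favor of not using batch_first: it states that the underlying API provided by Nvidia CUDA runs considerably faster with the batch as the second dimension.

Using context size

You are feeding the embedding output directly to the LSTM, which fixes the LSTM's input to a context size of 1. This means that if your inputs are words, you will always be giving the LSTM one word at a time. This is not always what we want, so you may need to expand the context size. This can be done as follows: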

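# Assuming that embeds is the batch-first embedding output, shape (batch_size, seq_len, embedding_size),
# and context_size is a defined variable
embeds = embeds.unfold(1, context_size, 1)  # Slide a window over the sequence axis, keeping the step size at 1
# unfold returns a non-contiguous tensor, so use reshape (view raises an error when context_size > 1)
embeds = embeds.reshape(embeds.size(0), embeds.size(1), -1)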

Unfold documentation
Now you can feed this to the LSTM as described above. Just remember that seq_len is now seq_len - context_size + 1, and embedding_size (which is the input size of the LSTM) is now context_size * embedding_size.
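As a rough check of the shapes described above, the following sketch applies unfold to a random tensor. The sizes are hypothetical, and the word_major variant is an optional extra that is not part of the original answer.

```python
import torch

# Hypothetical sizes chosen only for illustration
batch_size, seq_len, embedding_size, context_size = 2, 6, 4, 3
embeds = torch.randn(batch_size, seq_len, embedding_size)

# windows[b, i, e, k] == embeds[b, i + k, e]
windows = embeds.unfold(1, context_size, 1)  # (2, 4, 4, 3)

# Flatten the last two axes into one feature axis of size context_size * embedding_size
lstm_ready = windows.reshape(windows.size(0), windows.size(1), -1)  # (2, 4, 12)

# Optional: put the window axis before the embedding axis first, so that each
# feature vector is the plain concatenation of the word vectors in the window
word_major = windows.transpose(2, 3).reshape(batch_size, windows.size(1), -1)
```

Flattening windows directly interleaves the window positions within each embedding dimension. The feature count is the same either way, so the LSTM input size does not change.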

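Using variable sequence length

The input size of different instances in a batch will not always be the same. For example, some of your sentences might be 10 words long, some 15, and some 1000. So you will definitely want to feed variable-length sequences to your recurrent unit. To do this, some additional steps are needed before you can feed your input to the network:
1. Sort your batch from the longest sequence to the shortest.
2. Create a seq_lengths array that defines the length of each sequence in the batch. (This can be a simple Python list.)
3. Pad all the sequences to the length of the longest sequence.
4. Create a LongTensor from this batch.
5. After passing this tensor through the embedding and creating the proper context-size input, pack your sequence as follows: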

# Assuming embeds to be the proper input to the LSTM
lstm_input = nn.utils.rnn.pack_padded_sequence(embeds, [x - context_size + 1 for x in seq_lengths], batch_first=False)
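The five steps can be put together roughly as below. This is a sketch under stated assumptions: the token ids are a made-up toy batch, index 0 is assumed to be reserved for padding, and no context windowing is applied (context_size is 1), so the lengths are passed unchanged.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

# Hypothetical toy batch of token-id sequences; id 0 is assumed to mean "padding"
sequences = [[4, 9, 2], [7, 3, 8, 1, 5], [6, 2]]

# 1. Sort from the longest sequence to the shortest
sequences = sorted(sequences, key=len, reverse=True)
# 2. Record the length of each sequence
seq_lengths = [len(s) for s in sequences]
# 3. Pad every sequence to the length of the longest one
max_len = seq_lengths[0]
padded = [s + [0] * (max_len - len(s)) for s in sequences]
# 4. Build a LongTensor of shape (batch_size, max_len)
batch = torch.tensor(padded, dtype=torch.long)
# 5. Embed, move the time axis first, then pack
embedding = nn.Embedding(10, 4, padding_idx=0)
embeds = embedding(batch).transpose(0, 1)  # (max_len, batch_size, 4)
lstm_input = pack_padded_sequence(embeds, seq_lengths, batch_first=False)
```

Recent PyTorch versions also accept enforce_sorted=False in pack_padded_sequence, which sorts internally and makes step 1 unnecessary.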

Understanding the output of LSTM

Now, once you have prepared lstm_input according to your needs, you can call the lstm as

lstm_outs, (h_t, h_c) = lstm(lstm_input, (h_t, h_c))

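Here, (h_t, h_c) is passed in as the initial hidden state (if you omit it, PyTorch initializes both to zeros), and the call returns the final hidden state. You can see why packing variable-length sequences is required: otherwise the LSTM would also run over the padded positions, which are not needed.
Now, lstm_outs will be a packed sequence containing the output of the lstm at every step, and (h_t, h_c) are the final outputs and the final cell state respectively. h_t and h_c have the shape (num_layers * num_directions, batch_size, lstm_size), which is (1, batch_size, lstm_size) for a single-layer unidirectional LSTM. You can use these directly as further input, but if you also want to use the intermediate outputs, you need to unpack lstm_outs first, as below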

lstm_outs, _ = nn.utils.rnn.pad_packed_sequence(lstm_outs)

Now your lstm_outs will have the shape (max_seq_len - context_size + 1, batch_size, lstm_size). You can now extract the intermediate outputs of the lstm according to your needs.

Remember that the unpacked output will have 0s after the end of each sequence, which is just padding to match the length of the longest sequence (which is always the first one, as we sorted the input from longest to shortest).

Also note that, for a single-layer unidirectional LSTM, h_t will always be equal to the output at the last valid (non-padded) step of each sequence in the batch.
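Both remarks can be checked with a short sketch. It assumes a single-layer unidirectional LSTM, uses random data in place of real embeddings, and omits the initial state so that PyTorch falls back to zeros. All sizes are hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
seq_lengths = [5, 3, 2]  # already sorted from longest to shortest
max_len, batch_size, input_size, lstm_size = 5, 3, 4, 6
embeds = torch.randn(max_len, batch_size, input_size)  # stand-in for the embedded batch
lstm = nn.LSTM(input_size, lstm_size)

packed = pack_padded_sequence(embeds, seq_lengths)
packed_outs, (h_t, c_t) = lstm(packed)
lstm_outs, out_lengths = pad_packed_sequence(packed_outs)  # (max_len, batch_size, lstm_size)

# Output at the last valid step of each sequence
last_valid = torch.stack([lstm_outs[length - 1, i] for i, length in enumerate(seq_lengths)])
matches = torch.allclose(last_valid, h_t[0])

# Sequence 1 has length 3, so steps 3 and 4 are padding in the unpacked output
padding_is_zero = bool((lstm_outs[3:, 1] == 0).all())
```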

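Interfacing LSTM to Linear

Now, if you want to use just the final output of the lstm, you can feed h_t directly to your linear layer and it will work. But if you want to use the intermediate outputs as well, you need to figure out how you are going to feed them to the linear layer (through some attention network or some pooling). You do not want to feed the complete sequence to the linear layer, because different sequences have different lengths and you cannot fix the input size of the linear layer. And yes, you need to transpose the output of the lstm for further use (again, you cannot use view here).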
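One possible way to do this is sketched below: the first option feeds the final hidden state to the linear layer, and the second uses masked mean pooling over the valid steps as an example of "some pooling". Mean pooling is my choice for illustration, not something the answer prescribes, and the tensors are random stand-ins for real LSTM results.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical sizes chosen only for illustration
max_len, batch_size, lstm_size, num_classes = 5, 3, 6, 2
seq_lengths = torch.tensor([5, 3, 2])

# Random stand-ins for the unpacked LSTM outputs and the final hidden state
lstm_outs = torch.randn(max_len, batch_size, lstm_size)
h_t = torch.randn(1, batch_size, lstm_size)

linear = nn.Linear(lstm_size, num_classes)

# Option 1: use only the final hidden state of the last layer
logits_last = linear(h_t[-1])  # (batch_size, num_classes)

# Option 2: average the outputs over the valid steps of each sequence
mask = torch.arange(max_len).unsqueeze(1) < seq_lengths.unsqueeze(0)  # (max_len, batch_size)
summed = (lstm_outs * mask.unsqueeze(-1).to(lstm_outs.dtype)).sum(dim=0)  # (batch_size, lstm_size)
pooled = summed / seq_lengths.unsqueeze(1)
logits_pooled = linear(pooled)  # (batch_size, num_classes)
```

Because pooling collapses the time axis, the input size of the linear layer stays at lstm_size regardless of sequence length. If a later layer expects batch-first data, lstm_outs.transpose(0, 1) gives (batch_size, max_len, lstm_size).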

Ending Note: I have purposefully left out some points, such as using bidirectional recurrent cells, using a step size in unfold, and interfacing attention, as they can get quite cumbersome and are out of the scope of this answer.