PyTorch: Relation between Dynamic Computational Graphs - Padding - DataLoader

Question

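As far as I understand, the strength of PyTorch is supposed to be that it works with dynamic computational graphs. In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length. But if I want to use the PyTorch DataLoader, I need to pad my sequences anyway, because the DataLoader only takes tensors, given that I, as a total beginner, do not want to build a customized collate_fn.

Now this makes me wonder: doesn't this wash away the whole advantage of dynamic computational graphs in this context? Also, if I pad my sequences to feed them into the DataLoader as a tensor with many zeros as padding tokens at the end (in the case of word ids), will it have any negative effect on my training, since PyTorch may not be optimized for computations with padded sequences (the whole premise being that it can work with variable sequence lengths in dynamic graphs)? Or does it simply not make any difference?

I will also post this question in the PyTorch forum...

Thanks!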

Answer 1

In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length.

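This means that you don't need to pad sequences unless you are doing data batching, which is currently the only way to add parallelism in PyTorch. DyNet has a method called autobatching (which is described in detail in this paper) that batches the graph operations instead of the data, so this might be what you want to look into.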
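To illustrate the point about not needing padding without batching, here is a minimal sketch that feeds each variable-length sequence through an LSTM on its own. The layer sizes and variable names are hypothetical, chosen only for illustration, and the loop gives up the parallelism that batching would provide.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for illustration.
vocab_size, embed_dim, hidden_dim = 10, 4, 6
embedding = nn.Embedding(vocab_size, embed_dim)
rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

# Two sequences of word ids with different lengths, no padding applied.
sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6, 7, 8])]

final_states = []
for seq in sequences:
    # Add a batch dimension of size 1: shape (1, seq_len).
    embedded = embedding(seq.unsqueeze(0))
    # The graph is built for exactly seq_len time steps on each call.
    output, (h_n, c_n) = rnn(embedded)
    final_states.append(h_n[-1])  # last layer's final state, shape (1, hidden_dim)

# One final hidden state per sequence: shape (2, hidden_dim).
final_states = torch.cat(final_states, dim=0)
print(final_states.shape)
```

Each sequence is processed at its true length, at the cost of one forward call per sequence.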

But, if I want to use PyTorch DataLoader, I need to pad my sequences anyway because the DataLoader only takes tensors - given that me as a total beginner does not want to build some customized collate_fn.

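You can use the DataLoader, provided you write your own Dataset class and use batch_size=1. The trick is to use numpy arrays for your variable-length sequences (otherwise default_collate will give you a hard time):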

import numpy as np
from torch.utils.data import DataLoader, Dataset

class FooDataset(Dataset):
    def __init__(self, data, target):
        assert len(data) == len(target)
        self.data = data
        self.target = target
    def __getitem__(self, index):
        return self.data[index], self.target[index]
    def __len__(self):
        return len(self.data)

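# Two sequences of different lengths, stored as numpy arrays so that
# default_collate converts each one into a tensor.
data = [[1, 2, 3], [4, 5, 6, 7, 8]]
data = [np.array(n) for n in data]
targets = ['a', 'b']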

ds = FooDataset(data, targets)
dl = DataLoader(ds, batch_size=1)

print(list(enumerate(dl)))
# Output as printed by an older PyTorch release (newer releases show tensor(...) reprs):
# [(0, [
#  1  2  3
# [torch.LongTensor of size 1x3]
# , ('a',)]), (1, [
#  4  5  6  7  8
# [torch.LongTensor of size 1x5]
# , ('b',)])]
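For batch sizes larger than 1, the customized collate_fn mentioned in the question can be fairly small. The following is only a sketch under stated assumptions: the names SeqDataset and pad_collate are hypothetical, the targets are integer labels, and id 0 is assumed to be reserved for padding. It pads each batch only up to that batch's longest sequence, using torch.nn.utils.rnn.pad_sequence.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset


class SeqDataset(Dataset):
    """Hypothetical dataset of variable-length id sequences with integer labels."""

    def __init__(self, data, targets):
        assert len(data) == len(targets)
        self.data = data
        self.targets = targets

    def __getitem__(self, index):
        return torch.tensor(self.data[index]), self.targets[index]

    def __len__(self):
        return len(self.data)


def pad_collate(batch):
    """Pad the sequences of one batch to that batch's longest length."""
    seqs, targets = zip(*batch)
    # Keep the true lengths so the padding can be ignored later.
    lengths = torch.tensor([len(s) for s in seqs])
    # Assumption: id 0 is reserved for padding.
    padded = pad_sequence(list(seqs), batch_first=True, padding_value=0)
    return padded, lengths, torch.tensor(targets)


ds = SeqDataset([[1, 2, 3], [4, 5, 6, 7, 8], [9, 1]], [0, 1, 0])
dl = DataLoader(ds, batch_size=2, shuffle=False, collate_fn=pad_collate)

for padded, lengths, targets in dl:
    print(padded.shape, lengths.tolist(), targets.tolist())
```

Returning the lengths alongside the padded tensor makes it possible to mask or pack the padding further down the pipeline.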

Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context?

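Fair point, but the main strength of dynamic computational graphs is (at least currently) the possibility of using debugging tools like pdb, which can greatly reduce your development time. Debugging with static computation graphs is much harder. There is also no reason why PyTorch could not implement further just-in-time optimizations, or a concept similar to DyNet's autobatching, in the future.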

Also, if I pad my sequences to feed it into the DataLoader as a tensor with many zeros as padding tokens at the end [...], will it have any negative effect on my training [...]?

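Yes, both in runtime and in the gradients. The RNN will iterate over the padding just like normal data, which means that you have to deal with it in some way. PyTorch supplies you with tools for dealing with padded sequences and RNNs, namely pack_padded_sequence and pad_packed_sequence (both in torch.nn.utils.rnn). These let you ignore the padded elements during RNN execution, but beware: this does not work with RNNs that you implement yourself (or at least not unless you add support for it manually).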
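As a rough sketch of how these two functions fit around a built-in RNN, the example below packs a padded batch, runs an LSTM on it, and unpacks the result. The tensor values and layer sizes are made up for illustration, and id 0 is assumed to be the padding id.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Padded batch of word ids (assumption: 0 is the padding id) and the true lengths.
padded_ids = torch.tensor([[4, 5, 6, 7, 8],
                           [1, 2, 3, 0, 0]])
lengths = torch.tensor([5, 3])  # kept on the CPU

embedding = nn.Embedding(10, 4, padding_idx=0)
rnn = nn.LSTM(4, 6, batch_first=True)

embedded = embedding(padded_ids)  # shape (2, 5, 4)

# Pack so that the LSTM skips the padded time steps.
packed = pack_padded_sequence(
    embedded, lengths, batch_first=True, enforce_sorted=False
)
packed_out, (h_n, c_n) = rnn(packed)

# Unpack back into a padded tensor of shape (batch, max_len, hidden_dim).
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(output.shape)
print(out_lengths.tolist())
# h_n holds the state at each sequence's last real step, not at the padding.
print(torch.allclose(h_n[-1][1], output[1, 2]))
```

With packing, the final hidden state of the shorter sequence comes from its third (last real) time step, and the output positions beyond a sequence's true length are filled with zeros when unpacking.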

nlp

padding

deep-learning

pytorch