How can I implement model parallelism on a Graphcore IPU?

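I have successfully ported a version of my TensorFlow model to a Graphcore IPU and run it with data parallelism. However, the full-size model does not fit on a single IPU, so I am looking for strategies to implement model parallelism.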

So far, the only information I have found about model parallelism approaches is the Targeting the IPU from TensorFlow guide (https://www.graphcore.ai/docs/targeting-the-ipu-from-tensorflow#sharding-a-graph), which introduces the concept of sharding.

Is sharding the recommended approach for splitting my model across multiple IPUs? Are there other resources I can refer to?

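Sharding consists of partitioning the model across multiple IPUs so that each IPU device computes part of the graph. However, this approach is generally recommended only for niche use cases that involve multiple models in a single graph, such as ensembles.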
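To make the idea of "each IPU computes part of the graph" concrete, here is a minimal, hardware-free sketch in plain Python. It does not use any Graphcore API: the function name `shard_layers`, the per-layer cost numbers, and the midpoint heuristic are all assumptions made for illustration. It only shows one simple way to cut an ordered list of layers into contiguous shards of roughly equal cost.

```python
from typing import List


def shard_layers(layer_costs: List[float], num_shards: int) -> List[int]:
    """Return a shard index for each layer, keeping shards contiguous.

    Hypothetical heuristic: a layer goes to the shard in which the midpoint
    of its cumulative cost falls, where every shard has a budget of
    total_cost / num_shards. Some shards may stay empty for skewed costs.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    total = sum(layer_costs)
    if total <= 0:
        # No cost information: put everything on the first shard.
        return [0] * len(layer_costs)

    budget = total / num_shards
    assignment = []
    cumulative = 0.0
    for cost in layer_costs:
        midpoint = cumulative + cost / 2
        # Clamp so rounding can never produce an out-of-range shard index.
        assignment.append(min(num_shards - 1, int(midpoint / budget)))
        cumulative += cost
    return assignment


# Illustrative costs (e.g. memory per layer, arbitrary units).
costs = [4, 2, 2, 4, 4]
placement = shard_layers(costs, num_shards=2)
```

With these made-up costs the first three layers land on shard 0 and the last two on shard 1, so both shards carry the same total cost. On a real IPU system the placement is expressed through Graphcore's TensorFlow extensions, as described in the guide linked in the question.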

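A different approach to implementing model parallelism across multiple IPUs is pipelining. The model is still split into multiple computational stages on multiple IPUs, but the stages execute in parallel, and the outputs of one stage are the inputs to the next. Compared with sharding, pipelining improves hardware utilisation during execution, which leads to better efficiency and performance in terms of throughput and latency.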
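The utilisation difference can be illustrated with a toy schedule model in plain Python. This is a sketch under simplifying assumptions, not a description of how the IPU actually schedules work: every stage takes one time step, only the forward pass is modelled, communication is free, and the sharded case is assumed to process one batch at a time so that only one device is busy per step. All function names are hypothetical.

```python
from typing import List, Tuple

Step = List[Tuple[int, int]]  # (stage index, batch index) pairs active in one time step


def sequential_schedule(num_stages: int, num_batches: int) -> List[Step]:
    """One batch at a time: exactly one stage is busy in every step."""
    return [[(s, b)] for b in range(num_batches) for s in range(num_stages)]


def pipelined_schedule(num_stages: int, num_batches: int) -> List[Step]:
    """Stage s works on batch t - s at step t, so stages overlap in time."""
    steps = []
    for t in range(num_stages + num_batches - 1):
        active = [(s, t - s) for s in range(num_stages) if 0 <= t - s < num_batches]
        steps.append(active)
    return steps


def utilisation(schedule: List[Step], num_stages: int) -> float:
    """Fraction of (step, stage) slots in which a stage is doing work."""
    busy = sum(len(step) for step in schedule)
    return busy / (len(schedule) * num_stages)


stages, batches = 4, 16
seq = sequential_schedule(stages, batches)
pipe = pipelined_schedule(stages, batches)
seq_util = utilisation(seq, stages)
pipe_util = utilisation(pipe, stages)
```

In this idealised model, 4 stages and 16 batches take 64 steps at 25% utilisation when run one batch at a time, versus 19 steps at 16/19 (about 84%) utilisation when pipelined. The remaining idle slots are the ramp-up and ramp-down phases, which shrink in relative terms as the number of batches in flight grows.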

Therefore, pipelining is the recommended method for parallelising a model across multiple IPUs.

You can find more details on pipelined training in this section of the Targeting the IPU from TensorFlow guide.

A more comprehensive review of these two model parallelism approaches is provided in this dedicated guide.

You may also consider using IPUPipelineEstimator: it is a variant of IPUEstimator that automatically handles most aspects of running a (pipelined) program on an IPU. Here you can find a code example showing how to use the IPUPipelineEstimator to train a simple CNN on the CIFAR-10 dataset.