How to run TensorFlow on multiple nodes with several CPUs each

Question

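I would like to run a linear regression with TensorFlow on a very large dataset. I have a cluster with 9 nodes and 36 CPUs per node. What is the best way to distribute the computation across all available resources?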

According to this course, https://www.coursera.org/learn/intro-tensorflow, the best way to use TensorFlow in a distributed setting is to use Estimators. So I wrote my code as suggested there and followed the instructions at https://www.tensorflow.org/deploy/distributed to parallelize it. I then tried to run my script my_code.py (testing the code on a "small" dataset with 120 million data points and 2 feature columns) as follows:

python my_code.py \
--ps_hosts=node1:2222 \
--worker_hosts=node2:2222,node3:2222 \
--job_name=worker \
--task_index="i-2"

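where i is the number of the node (2 or 3); on node 1 I do the same thing, but with --job_name=ps and --task_index=0. However, this way only one CPU per node seems to be used. Do I need to specify each CPU individually?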

Thanks in advance.

Answer 1

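As far as I know, the best approach is to use all the CPUs on a node together as a single worker, so that they can take advantage of shared memory. In the case above, for example, you would manually specify only 9 workers and make sure that each worker corresponds to one node and uses all 36 of its CPUs. The command for doing this depends on the particular cluster being used.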
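To illustrate the one-worker-per-node layout, here is a minimal sketch that only builds the launch commands, reusing the flags from the question. The host names node1 to node9, the helper build_launch_commands, and the choice to co-locate the parameter server on node1 at port 2223 are assumptions made for illustration; they are not something the question or TensorFlow prescribes.

```python
# Sketch: one TensorFlow worker task per node, using the flags from the question.
# Host names (node1..node9) and ports are assumed, not taken from a real cluster.

def build_launch_commands(num_nodes=9, port=2222, script="my_code.py"):
    """Return {host: [command, ...]} with exactly one worker task per node.

    Assumption: the parameter server shares the first node with a worker
    and listens on port + 1. Any other placement would work the same way.
    """
    hosts = ["node%d" % i for i in range(1, num_nodes + 1)]
    ps_hosts = "%s:%d" % (hosts[0], port + 1)
    worker_hosts = ",".join("%s:%d" % (h, port) for h in hosts)

    def command(job_name, task_index):
        return ("python %s --ps_hosts=%s --worker_hosts=%s "
                "--job_name=%s --task_index=%d"
                % (script, ps_hosts, worker_hosts, job_name, task_index))

    # Worker i runs on the i-th node; task indices start at 0.
    commands = {h: [command("worker", i)] for i, h in enumerate(hosts)}
    # The single parameter server task is added to the first node.
    commands[hosts[0]].append(command("ps", 0))
    return commands


if __name__ == "__main__":
    for host, cmds in build_launch_commands().items():
        for cmd in cmds:
            print("%s: %s" % (host, cmd))
```

Each printed command would be run on the host it is listed under. Within a task, TensorFlow 1.x sizes its thread pools from tf.ConfigProto(intra_op_parallelism_threads=..., inter_op_parallelism_threads=...), which an Estimator accepts through tf.estimator.RunConfig(session_config=...); leaving both at 0 lets TensorFlow pick the values itself. If only one CPU is busy, the cluster scheduler may be restricting the process to a single core, which is where the cluster-specific command comes in.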

distributed-computing

tensorflow