What is Device interconnect StreamExecutor with strength 1 edge matrix?
I have four NVIDIA GTX 1080 graphics cards, and when I initialize the session I see the following console output:
Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
0 1 2 3
0: N Y N N
1: Y N N N
2: N N N Y
3: N N Y N
I also have two NVIDIA Tesla M60 cards, and their initialization looks like this:
Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
0 1 2 3
0: N N N N
1: N N N N
2: N N N N
3: N N N N
I have also noticed that this output changed for the 1080 GPUs after my last update from 1.6 to 1.8. It looked roughly like this (from memory, I may be misremembering):
Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
   0 1 2 3          0 1 2 3
0: Y N N N       0: N N Y N
1: N Y N N   or  1: N N N Y
2: N N Y N       2: Y N N N
3: N N N Y       3: N Y N N
My questions are:
- What is this device interconnect?
- What influence does it have on computation power?
- Why does it differ between GPUs?
- Can it change over time due to hardware reasons (failures, driver inconsistencies, ...)?
TL;DR
what is this Device interconnect?
As Almog David says in the comments, this tells you whether one GPU has direct memory access to the other.
what influence it has on computation power?
The only influence this has is on multi-GPU training: data transfer is faster if the two GPUs have a device interconnect.
why it differ for different GPUs?
This depends on the topology of your hardware setup. A motherboard only has so many PCI-e slots, which are connected by the same bus. (Check the topology with nvidia-smi topo -m.)
can it change over time due to hardware reasons (failures, drivers inconsistency...)?
I don't think the order changes over time unless NVIDIA changes the default enumeration scheme. There is a bit more detail here.
Explanation
This message is generated in the BaseGPUDeviceFactory::CreateDevices function. It iterates over each pair of devices in the given order and calls cuDeviceCanAccessPeer. As Almog David mentions in the comments, this merely indicates whether you can perform DMA between the devices.
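If you want to reproduce this query outside of TensorFlow, here is a minimal sketch (an illustration only, not TensorFlow's code; it assumes Linux with the NVIDIA driver installed so that libcuda.so.1 can be loaded) that calls cuDeviceCanAccessPeer for every pair of devices through ctypes and prints a Y/N matrix analogous to the one in the log:
#peer_check.py (hypothetical helper, see assumptions above)
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")              # CUDA driver API library (Linux)
assert cuda.cuInit(0) == 0                      # CUDA_SUCCESS == 0

count = ctypes.c_int()
assert cuda.cuDeviceGetCount(ctypes.byref(count)) == 0

# CUdevice handles are plain integers
devices = []
for ordinal in range(count.value):
    dev = ctypes.c_int()
    assert cuda.cuDeviceGet(ctypes.byref(dev), ordinal) == 0
    devices.append(dev)

# Pairwise peer-access query, mirroring the "strength 1 edge matrix" loop
print("   " + " ".join(str(i) for i in range(count.value)))
for i, a in enumerate(devices):
    row = []
    for b in devices:
        can = ctypes.c_int()
        cuda.cuDeviceCanAccessPeer(ctypes.byref(can), a, b)
        row.append("Y" if can.value else "N")
    print("%d:" % i, " ".join(row))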
You can run a small test to check whether the order matters. Consider the following snippet:
#test.py
import tensorflow as tf

# Allow growth so the session only grabs the GPU memory it needs
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Creating the session enumerates the devices and prints the interconnect matrix
sess = tf.Session(config=config)
Now let's check the output for different device orders in CUDA_VISIBLE_DEVICES:
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python3 test.py
...
2019-03-26 15:26:16.111423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-26 15:26:18.635894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 15:26:18.635965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-26 15:26:18.635974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y N N
2019-03-26 15:26:18.635982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N N N
2019-03-26 15:26:18.635987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N N N Y
2019-03-26 15:26:18.636010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: N N Y N
...
$ CUDA_VISIBLE_DEVICES=2,0,1,3 python3 test.py
...
2019-03-26 15:26:30.090493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-26 15:26:32.758272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 15:26:32.758349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-26 15:26:32.758358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N N Y
2019-03-26 15:26:32.758364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N Y N
2019-03-26 15:26:32.758389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N Y N N
2019-03-26 15:26:32.758412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y N N N
...
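You can also pin the device order from inside the script. A small sketch of the same experiment (the order "2,0,1,3" is just an example): set CUDA_VISIBLE_DEVICES in the environment before TensorFlow initializes CUDA, which to be safe means before the first tensorflow import:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,0,1,3"  # must be set before importing tensorflow

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)                # the interconnect matrix is logged here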
You can get a more detailed description of the connections by running nvidia-smi topo -m. For example:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SYS     SYS     0-7,16-23
GPU1    PHB      X      SYS     SYS     0-7,16-23
GPU2    SYS     SYS      X      PHB     8-15,24-31
GPU3    SYS     SYS     PHB      X      8-15,24-31
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
I believe the further down that legend an entry appears, the faster the transfer.
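If you want a rough feel for what this means in practice, one way (only a sketch, assuming TF 1.x and at least two visible GPUs; absolute numbers depend entirely on your topology, and the measurement also includes the cost of generating the random tensor) is to time a device-to-device copy:
import time
import tensorflow as tf

# Build a ~256 MB float32 tensor on GPU 0 and force a copy to GPU 1
with tf.device("/gpu:0"):
    src = tf.random_uniform([8192, 8192])
with tf.device("/gpu:1"):
    dst = tf.identity(src)

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    sess.run(dst.op)                            # warm-up (context creation etc.)
    start = time.time()
    sess.run(dst.op)                            # executes the copy without fetching the result
    print("GPU0 -> GPU1 copy took %.3f s" % (time.time() - start))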