What is Device interconnect StreamExecutor with strength 1 edge matrix?
I have four NVIDIA GTX 1080 graphics cards, and when I initialize the session I see the following console output:
Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
0 1 2 3
0: N Y N N
1: Y N N N
2: N N N Y
3: N N Y N
I also have two NVIDIA Tesla M60 cards, and their initialization looks like this:
Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
0 1 2 3
0: N N N N
1: N N N N
2: N N N N
3: N N N N
I have also noticed that this output changed for the 1080 GPUs after my last update from 1.6 to 1.8. It looked roughly like this (from memory, I may be misremembering):
Adding visible gpu devices: 0, 1, 2, 3
Device interconnect StreamExecutor with strength 1 edge matrix:
   0 1 2 3          0 1 2 3
0: Y N N N       0: N N Y N
1: N Y N N   or  1: N N N Y
2: N N Y N       2: Y N N N
3: N N N Y       3: N Y N N
My questions are:
- What is this device interconnect?
- What influence does it have on computation power?
- Why does it differ between GPUs?
- Can it change over time due to hardware reasons (failures, driver inconsistencies, ...)?
TL;DR
what is this Device interconnect?
As Almog David says in the comments, this tells you whether one GPU has direct memory access to the other.
what influence it has on computation power?
The only influence this has is on multi-GPU training: data transfer is faster if the two GPUs have a device interconnect.
why it differ for different GPUs?
This depends on the topology of your hardware setup. A motherboard only has so many PCI-e slots, which are connected by the same bus. (Check the topology with nvidia-smi topo -m.)
can it change over time due to hardware reasons (failures, drivers inconsistency...)?
I don't think the order changes over time unless NVIDIA changes the default enumeration scheme. There is a bit more detail here.
Explanation
This message is generated in the BaseGPUDeviceFactory::CreateDevices function. It iterates over each pair of devices in the given order and calls cuDeviceCanAccessPeer. As Almog David mentions in the comments, this merely indicates whether you can perform DMA between the devices.
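If you want to reproduce this query outside of TensorFlow, here is a minimal sketch (an illustration only, not TensorFlow's code; it assumes Linux with the NVIDIA driver installed so that libcuda.so.1 can be loaded) that calls cuDeviceCanAccessPeer for every pair of devices through ctypes and prints a Y/N matrix analogous to the one in the log:
#peer_check.py (hypothetical helper, see assumptions above)
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")              # CUDA driver API library (Linux)
assert cuda.cuInit(0) == 0                      # CUDA_SUCCESS == 0

count = ctypes.c_int()
assert cuda.cuDeviceGetCount(ctypes.byref(count)) == 0

# CUdevice handles are plain integers
devices = []
for ordinal in range(count.value):
    dev = ctypes.c_int()
    assert cuda.cuDeviceGet(ctypes.byref(dev), ordinal) == 0
    devices.append(dev)

# Pairwise peer-access query, mirroring the "strength 1 edge matrix" loop
print("   " + " ".join(str(i) for i in range(count.value)))
for i, a in enumerate(devices):
    row = []
    for b in devices:
        can = ctypes.c_int()
        cuda.cuDeviceCanAccessPeer(ctypes.byref(can), a, b)
        row.append("Y" if can.value else "N")
    print("%d:" % i, " ".join(row))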
You can run a small test to check whether the order matters. Consider the following snippet:
#test.py
import tensorflow as tf

# Allow growth so the session only grabs the GPU memory it needs
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Creating the session enumerates the devices and prints the interconnect matrix
sess = tf.Session(config=config)
Now let's check the output for different device orders in CUDA_VISIBLE_DEVICES:
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python3 test.py
...
2019-03-26 15:26:16.111423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-26 15:26:18.635894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 15:26:18.635965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-26 15:26:18.635974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y N N
2019-03-26 15:26:18.635982: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N N N
2019-03-26 15:26:18.635987: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N N N Y
2019-03-26 15:26:18.636010: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: N N Y N
...
$ CUDA_VISIBLE_DEVICES=2,0,1,3 python3 test.py
...
2019-03-26 15:26:30.090493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2019-03-26 15:26:32.758272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-26 15:26:32.758349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2019-03-26 15:26:32.758358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N N N Y
2019-03-26 15:26:32.758364: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: N N Y N
2019-03-26 15:26:32.758389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: N Y N N
2019-03-26 15:26:32.758412: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y N N N
...
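You can also pin the device order from inside the script. A small sketch of the same experiment (the order "2,0,1,3" is just an example): set CUDA_VISIBLE_DEVICES in the environment before TensorFlow initializes CUDA, which to be safe means before the first tensorflow import:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,0,1,3"  # must be set before importing tensorflow

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)                # the interconnect matrix is logged here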
You can get a more detailed description of the connections by running nvidia-smi topo -m. For example:
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      PHB     SYS     SYS     0-7,16-23
GPU1    PHB      X      SYS     SYS     0-7,16-23
GPU2    SYS     SYS      X      PHB     8-15,24-31
GPU3    SYS     SYS     PHB      X      8-15,24-31
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
I believe the further down that legend an entry appears, the faster the transfer.
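If you want a rough feel for what this means in practice, one way (only a sketch, assuming TF 1.x and at least two visible GPUs; absolute numbers depend entirely on your topology, and the measurement also includes the cost of generating the random tensor) is to time a device-to-device copy:
import time
import tensorflow as tf

# Build a ~256 MB float32 tensor on GPU 0 and force a copy to GPU 1
with tf.device("/gpu:0"):
    src = tf.random_uniform([8192, 8192])
with tf.device("/gpu:1"):
    dst = tf.identity(src)

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
    sess.run(dst.op)                            # warm-up (context creation etc.)
    start = time.time()
    sess.run(dst.op)                            # executes the copy without fetching the result
    print("GPU0 -> GPU1 copy took %.3f s" % (time.time() - start))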