Cannot run dask-mpi with Python 3.7 -- timeout when connecting client to dask-mpi scheduler
I am trying to run the Dask-MPI "Getting Started" (http://mpi.dask.org/en/latest/) example in a fresh Anaconda environment.
I set up the environment with
conda create -n dask-mpi -c conda-forge python=3.7 dask-mpi
conda activate dask-mpi
Inside the environment, I run
mpirun -np 4 dask-mpi --scheduler-file ./scheduler.json
Then, from a Python interpreter on the same machine (and in the same folder), I run
from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')
This results in the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 712, in __init__
    self.start(timeout=timeout)
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 858, in start
    sync(self.loop, self._start, **kwargs)
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/utils.py", line 331, in sync
    six.reraise(*error[0])
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/utils.py", line 316, in f
    result[0] = yield future
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info) # type: ignore
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 954, in _start
    yield self._ensure_connected(timeout=timeout)
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info) # type: ignore
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 1015, in _ensure_connected
    timedelta(seconds=timeout), self._update_scheduler_info()
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
tornado.util.TimeoutError: Timeout
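The terminal where I ran dask-mpi prints nothing to suggest that a connection is being attempted. I have confirmed that the relevant port, 8786, is open. I have also verified with a debugger that the client is picking up the correct address from the scheduler file.
I have tried this in many different environments and on several different machines, including a fresh Ubuntu 18.04 docker container. I have no idea which step I might be missing.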
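For anyone who wants to repeat the two checks described above without a debugger, here is a minimal sketch. It assumes the scheduler file is JSON with an "address" entry of the form tcp://host:port; the helper names read_scheduler_address and can_connect are made up for illustration and are not part of dask.

```python
import json
import socket
from urllib.parse import urlparse


def read_scheduler_address(scheduler_file):
    """Return (host, port) parsed from the "address" entry of a scheduler file."""
    with open(scheduler_file) as f:
        info = json.load(f)
    # Assumed format: "tcp://10.0.0.5:8786"
    parsed = urlparse(info["address"])
    return parsed.hostname, parsed.port


def can_connect(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


if __name__ == "__main__":
    host, port = read_scheduler_address("./scheduler.json")
    print(host, port, can_connect(host, port))
```

A successful TCP connection only shows that something is listening on the port. It says nothing about whether the scheduler answers the client's requests, which matches what I saw: the port was open, yet the client still timed out.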
It turns out this was caused by a bug in a newer version of dask.distributed (1.25.3) that broke dask-mpi's behavior. It appears to have been fixed as of dask-mpi 1.0.3 (https://github.com/dask/dask-mpi/releases/tag/1.0.3).
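To see whether an environment already has the release with the fix, something like the following sketch can help. The helper names (parse_version, has_fix, installed_version) are hypothetical, and the comparison only looks at the leading numeric parts of the version string, so pre-release suffixes are ignored.

```python
import re


def parse_version(text):
    """Turn a dotted version such as "1.0.3" into a tuple of ints, (1, 0, 3)."""
    numbers = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        numbers.append(int(match.group()))
    return tuple(numbers)


def has_fix(installed, fixed_in="1.0.3"):
    """Return True if the installed dask-mpi version is at least fixed_in."""
    return parse_version(installed) >= parse_version(fixed_in)


def installed_version(package):
    """Return the installed version string of package, or None if it is absent."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:  # Python 3.7 has no importlib.metadata
        import pkg_resources
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("dask-mpi", "distributed"):
        print(name, installed_version(name))
    current = installed_version("dask-mpi")
    if current is not None:
        print("dask-mpi has the fix:", has_fix(current))
```

If the reported dask-mpi version is older than 1.0.3, updating the package in the conda environment is the first thing I would try.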