Cannot run dask-mpi with Python 3.7 -- timeout when connecting client to dask-mpi scheduler

I am trying to run the Dask-MPI "Getting Started" example (http://mpi.dask.org/en/latest/) in a fresh Anaconda environment.

I set up the environment with

conda create -n dask-mpi -c conda-forge python=3.7 dask-mpi
conda activate dask-mpi

Inside the environment, I run

mpirun -np 4 dask-mpi --scheduler-file ./scheduler.json

Then, from a Python interpreter on the same machine (and in the same folder), I run

from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')

This results in the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 712, in __init__
    self.start(timeout=timeout)
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 858, in start
    sync(self.loop, self._start, **kwargs)
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/utils.py", line 331, in sync
    six.reraise(*error[0])
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/utils.py", line 316, in f
    result[0] = yield future
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 954, in _start
    yield self._ensure_connected(timeout=timeout)
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
    yielded = self.gen.throw(*exc_info)  # type: ignore
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 1015, in _ensure_connected
    timedelta(seconds=timeout), self._update_scheduler_info()
  File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
    value = future.result()
tornado.util.TimeoutError: Timeout

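The terminal from which I ran dask-mpi shows no output indicating that a connection is being attempted. I have confirmed that the relevant port, 8786, is open. I have also verified with a debugger that the client is picking up the correct address from the scheduler file.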
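
For anyone who wants to repeat the address and port checks without a debugger, here is a minimal sketch. It assumes the scheduler file is JSON with an "address" entry of the form tcp://host:port; the helper names scheduler_address and port_is_reachable are made up for illustration and are not part of dask or dask-mpi.

```python
import json
import socket
from urllib.parse import urlparse


def scheduler_address(scheduler_file):
    """Return (host, port) read from the scheduler file.

    Assumes the file holds a JSON object with an "address" key
    such as "tcp://192.168.1.5:8786".
    """
    with open(scheduler_file) as f:
        info = json.load(f)
    parsed = urlparse(info["address"])
    return parsed.hostname, parsed.port


def port_is_reachable(host, port, timeout=3.0):
    """Attempt a plain TCP connection; True if the connection is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling port_is_reachable(*scheduler_address("scheduler.json")) only tells you whether something accepts TCP connections at that address. In my case the port was reachable and the client still timed out, so a True result here does not rule out the problem described in this question.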

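I have tried this in many different environments and on several different machines, including a fresh Ubuntu 18.04 Docker container. I am at a complete loss as to which step I might be missing.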

It turns out this was caused by a bug in a newer version of dask.distributed (1.25.3) that broke dask-mpi. It appears to be fixed as of dask-mpi 1.0.3 (https://github.com/dask/dask-mpi/releases/tag/1.0.3).
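
To see whether an environment is affected, it may help to print the installed versions and compare dask-mpi against 1.0.3. The sketch below is only an illustration: version_tuple, has_fix and report_versions are hypothetical helpers, and it assumes both packages expose a __version__ attribute (it falls back to "unknown" if they do not).

```python
def version_tuple(version):
    """Turn '1.25.3' into (1, 25, 3).

    Parsing stops at the first component that is not purely numeric,
    so a string like '2.0.0+4.gabc' yields (2, 0).
    """
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_fix(dask_mpi_version):
    """True if the given dask-mpi version is 1.0.3 or later."""
    return version_tuple(dask_mpi_version) >= (1, 0, 3)


def report_versions():
    """Print the versions of distributed and dask_mpi, if importable."""
    for name in ("distributed", "dask_mpi"):
        try:
            module = __import__(name)
        except ImportError:
            print(name, "is not installed")
            continue
        # __version__ is assumed to exist; fall back if it does not.
        print(name, getattr(module, "__version__", "unknown"))
```

Run report_versions() inside the activated environment. If has_fix reports False for the dask-mpi version, upgrading dask-mpi from conda-forge is the first thing to try.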