How to run a Ray task that depends on project code

I have a large Python project with many folders:

- model
- utils
- compute

My Ray remote code is a set of functions in the compute folder, and those remote tasks need to run code from model and utils.

Currently I get errors saying there is no such module for the different project folders.

import ray

from utils.osops import run_command
from model.model_desc import ModelInsance
from compute.ray_remote import 

@ray.remote
def run_eval_remote(cmd_data, model_json):
    # Rebuild the model description from its JSON form on the worker
    model_ins = ModelInsance.read_from_json(model_json)
    run_command(model_ins.bash_cmd)
    # do some more stuff
    return some_value

What is the correct way to do this?

Here is the stack trace:

  "/Users/me/proj/compute/evaluator_ray.py", line 178, in <listcomp>
ray_res = [self.eval_instance(instance, eval_metric) for instance in mutations_for_search]
  File "/Users/me/proj/compute/evaluator_ray.py", line 175, in eval_instance
return run_eval_remote.remote(cmd_data, instance_json)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/remote_function.py", line 114, in _remote_proxy
return self._remote(args=args, kwargs=kwargs)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 292, in _invocation_remote_span
return method(self, args, kwargs, *_args, **_kwargs)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/remote_function.py", line 202, in _remote
return client_mode_convert_function(
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 133, in client_mode_convert_function
return client_func._remote(in_args, in_kwargs, **kwargs)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/common.py", line 98, in _remote
return self.options(**option_args).remote(*args, **kwargs)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/common.py", line 296, in remote
 return return_refs(ray.call_remote(self, *args, **kwargs))
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/api.py", line 103, in call_remote
return self.worker.call_remote(instance, *args, **kwargs)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/worker.py", line 322, in call_remote
task = instance._prepare_client_task()
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/common.py", line 302, in _prepare_client_task
task = self.remote_stub._prepare_client_task()
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/common.py", line 119, in _prepare_client_task
self._ensure_ref()
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/common.py", line 115, in _ensure_ref
self._ref = ray.put(
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/api.py", line 52, in put
return self.worker.put(*args, **kwargs)
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/worker.py", line 260, in put
out = [self._put(x, client_ref_id=client_ref_id) for x in to_put]
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/worker.py", line 260, in <listcomp>
out = [self._put(x, client_ref_id=client_ref_id) for x in to_put]
  File "/Users/me/miniconda3/envs/proj/lib/python3.8/site-packages/ray/util/client/worker.py", line 280, in _put
raise cloudpickle.loads(resp.error)
ModuleNotFoundError: No module named 'compute'

I ran into a similar problem and solved it as follows:

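  • Distribute the code to every node (I simply git cloned it on each node)
  • Make sure the code version, branch, etc. are the same on every node
  • Set up a virtual environment on each node and install ray (and the other project dependencies) in it
  • Start ray from that virtualenv and join the cluster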
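For the second point, one way to compare checkouts without relying on git is to hash the source tree on each node and compare the results. This is only a minimal sketch: `project_fingerprint` is a hypothetical helper of my own, not part of Ray, and it assumes the virtualenv lives outside the project folder (otherwise its files get hashed too).

```python
import hashlib
from pathlib import Path


def project_fingerprint(root: str) -> str:
    """Return a SHA-256 hex digest over every .py file under root.

    Hypothetical helper (not a Ray API). Files are visited in sorted
    order, so two nodes holding the same source tree get the same digest.
    """
    root_path = Path(root)
    digest = hashlib.sha256()
    for path in sorted(root_path.rglob("*.py")):
        # Hash the relative path as well, so renames change the fingerprint.
        digest.update(path.relative_to(root_path).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

Calling `project_fingerprint("/Users/me/proj")` on each node (with that node's own checkout path) and comparing the printed digests should reveal a node that is on a different version.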

Now, when you launch a job from the head node (or from outside the cluster), the dependencies are present and the job runs fine.
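If a node still reports `ModuleNotFoundError`, a quick check run inside that node's virtualenv can show which top-level packages Python cannot locate. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper, and it assumes the project root is what needs to be on the import path (how it gets there for your Ray workers depends on your setup).

```python
import importlib
import importlib.util
import sys


def missing_packages(names, project_root=None):
    """Return the top-level package names that cannot be located for import.

    Hypothetical helper (not a Ray API). If project_root is given, it is
    added to sys.path first; note that this changes sys.path for the
    current process.
    """
    if project_root is not None and project_root not in sys.path:
        sys.path.insert(0, project_root)
        importlib.invalidate_caches()
    # find_spec returns None when a top-level module cannot be found.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

For the project in the question, `missing_packages(["model", "utils", "compute"], project_root="/Users/me/proj")` should return an empty list on a node that is set up correctly.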

Of course, a cleaner way to distribute the code is with containers, but for my purposes this approach worked well.