Start two scripts in parallel and stop one based on the other's return
I want to start two different Python scripts (the TensorFlow object detection train.py and eval.py) in parallel on different GPUs, and kill eval.py once train.py finishes.
I have the following code to start the two subprocesses in parallel (How to terminate a python subprocess launched with shell=True). However, both subprocesses end up on the same device (I can guess why; I just don't know how to start them on different devices).
start_train = "CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 train.py ..."
start_eval = "CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=1 eval.py ..."
commands = [start_train, start_eval]
procs = [subprocess.Popen(i, shell=True, stdout=subprocess.PIPE, preexec_fn=os.setsid) for i in commands]
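For reference, a minimal sketch of an alternative way to pass the CUDA variables, using Popen's env argument instead of the shell prefix (the command strings are still the elided ones from above, and whether each script respects the visible device is up to the script itself):

import os
import subprocess

commands = [("train.py ...", "0"), ("eval.py ...", "1")]
procs = []
for cmd, gpu in commands:
    # each child only sees the GPU listed in its CUDA_VISIBLE_DEVICES
    env = dict(os.environ, CUDA_DEVICE_ORDER="PCI_BUS_ID", CUDA_VISIBLE_DEVICES=gpu)
    procs.append(subprocess.Popen(cmd, shell=True,
                                  stdout=subprocess.PIPE,
                                  preexec_fn=os.setsid,
                                  env=env))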
At this point I don't know how to proceed. Do I need something like the block below? Should I use p.communicate() to avoid deadlocks? Or is it enough to call wait() or communicate() only for train.py, since its completion is all I need?
for p in procs:
    p.wait()  # I assume this command won't affect the parallel running
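As a minimal sketch of the difference (using the procs list above): wait() only blocks until the process exits and can deadlock if the child fills the stdout pipe created by stdout=subprocess.PIPE, while communicate() drains the pipes while waiting; both set returncode once the process has terminated.

# Blocking on the training process (procs[0]) only:
# - wait() just waits for the exit status; with stdout=subprocess.PIPE it can
#   hang if train.py writes more output than the OS pipe buffer holds.
# - communicate() reads stdout/stderr to the end while waiting, so it cannot
#   deadlock on a full pipe.
out, _ = procs[0].communicate()      # drains stdout and blocks until exit
train_return = procs[0].returncode   # set once the process has terminated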
Then I need to use something like the following. I don't need the return value from train.py, only the return code of the subprocess. From the Popen.returncode documentation it looks like wait() and communicate() need the return code to be set, and I don't understand how to set this. I would prefer:
if train is done without any error:
    os.killpg(os.getpgid(procs[1].pid), signal.SIGTERM)
else:
    write the error to the console, or to a file (but how?)
Or?
train_return = procs[0].wait()
if train_return == 0:
    os.killpg(os.getpgid(procs[1].pid), signal.SIGTERM)
UPDATE after the problem was solved:
This is my main:
if __name__ == "__main__":
    exp = 1
    go = True
    while go:
        create_dir(os.path.join(MAIN_PATH, 'kitti', str(exp), 'train'))
        create_dir(os.path.join(MAIN_PATH, 'kitti', str(exp), 'eval'))
        copy_tree(os.path.join(MAIN_PATH, "kitti/eval_after_COCO"),
                  os.path.join(MAIN_PATH, "kitti", str(exp), "eval"))
        copy_tree(os.path.join(MAIN_PATH, "kitti/train_after_COCO"),
                  os.path.join(MAIN_PATH, "kitti", str(exp), "train"))
        err_log = open('./kitti/' + str(exp) + '/error_log' + str(exp) + '.txt', 'w')
        train_command = CUDA_COMMAND_PREFIX + "0 python3 " + str(MAIN_PATH) + "legacy/train.py \
            --logtostderr --train_dir " + str(MAIN_PATH) + "kitti/" \
            + str(exp) + "/train/ --pipeline_config_path " + str(MAIN_PATH) \
            + "kitti/faster_rcnn_resnet101_coco.config"
        eval_command = CUDA_COMMAND_PREFIX + "1 python3 " + str(MAIN_PATH) + "legacy/eval.py \
            --logtostderr --eval_dir " + str(MAIN_PATH) + "kitti/" \
            + str(exp) + "/eval/ --pipeline_config_path " + str(MAIN_PATH) \
            + "kitti/faster_rcnn_resnet101_coco.config --checkpoint_dir " + \
            str(MAIN_PATH) + "kitti/" + str(exp) + "/train/"
        os.system("python3 dataset_tools/random_sampler_with_replacement.py --random_set_id " + str(exp))
        time.sleep(20)
        update_train_set(exp)
        train_proc = subprocess.Popen(train_command,
                                      stdout=subprocess.PIPE,
                                      stderr=err_log,  # write errors to a file
                                      shell=True)
        time.sleep(20)
        eval_proc = subprocess.Popen(eval_command,
                                     stdout=subprocess.PIPE,
                                     preexec_fn=os.setsid,  # own process group, so killpg() only hits the eval tree
                                     shell=True)
        time.sleep(20)
        if train_proc.wait() == 0:  # successful termination
            os.killpg(os.getpgid(eval_proc.pid), subprocess.signal.SIGTERM)
        clean_train_set(exp)
        time.sleep(20)
        exp += 1
        if exp == 51:
            go = False
By default, TensorFlow allocates operations to "/gpu:0" (or "/cpu:0") even when you have multiple GPUs. The only way around it is to use the context manager and manually assign each operation to the second GPU in one of your scripts:
with tf.device("/gpu:1"):
    # your ops here
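A slightly fuller sketch of the same idea, with made-up ops, using the TF1-style graph/session API that the object detection scripts are built on:

import tensorflow as tf

# Everything created inside the context manager is pinned to the second GPU.
with tf.device("/gpu:1"):
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    c = tf.matmul(a, b)

# allow_soft_placement lets TF fall back to another device if an op has no
# kernel for "/gpu:1"; log_device_placement prints where each op actually ran.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(c))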
UPDATE
If I understood you correctly, this is what you need:
import subprocess
import os

err_log = open('error_log.txt', 'w')
train_proc = subprocess.Popen(start_train,
                              stdout=subprocess.PIPE,
                              stderr=err_log,  # write errors to a file
                              shell=True)
eval_proc = subprocess.Popen(start_eval,
                             stdout=subprocess.PIPE,
                             preexec_fn=os.setsid,  # own process group, so killpg() only hits eval
                             shell=True)

if train_proc.wait() == 0:  # successful termination
    os.killpg(os.getpgid(eval_proc.pid), subprocess.signal.SIGTERM)
# else, errors will be written to the 'error_log.txt' file
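To also get train.py's error on the console when it fails (the "but how?" part of the question), one option is the following variant of the wait() block above; it is only a sketch and reuses train_proc, eval_proc and err_log from the snippet above:

ret = train_proc.wait()
os.killpg(os.getpgid(eval_proc.pid), subprocess.signal.SIGTERM)  # eval.py is no longer needed
if ret != 0:
    err_log.close()
    with open('error_log.txt') as f:
        print("train.py failed with return code", ret)
        print(f.read())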