
torch.distributed.launch, torchrun, and torch.distributed.run are not compatible with nohup

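Symptom:

After starting a PyTorch distributed training job with nohup, the SSH connection to the server dropped and the training run failed with the following error: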

WARNING:torch.distributed.elastic.agent.server.api:Received 1 death signal, shutting down workers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3971878 closing signal SIGHUP
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3971879 closing signal SIGHUP
Traceback (most recent call last):
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 850, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/pinefield/anaconda3/envs/leo_py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 62, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3971841 got signal: 1

The command that was run:

nohup ./my_train.sh > log.log 2>&1 &

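The likely cause is that torch.distributed.launch, torchrun, and torch.distributed.run are not compatible with nohup. When the SSH connection drops and the terminal window is closed, the job receives SIGHUP (signal 1, as shown in the final SignalException line). torch.distributed takes over handling of that signal through its own handler (the _terminate_process_handler frame in the traceback), so the "ignore SIGHUP" setting that nohup applies no longer has any effect.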

ref: https://discuss.pytorch.org/t/ddp-error-torch-distributed-elastic-agent-server-api-received-1-death-signal-shutting-down-workers/135720/6

