Learning DDP

You do not need to set RANK, WORLD_SIZE, or LOCAL_RANK by hand.
Launching the script with torchrun or python -m torch.distributed.launch sets them automatically.

import os
print('rank', os.environ["RANK"])
print('world_size', os.environ["WORLD_SIZE"])
print('local_rank', os.environ["LOCAL_RANK"])

Running

python test.py

gives

Traceback (most recent call last):
  File "/root/test.py", line 2, in <module>
    print('rank', os.environ["RANK"])
  File "/root/anaconda3/envs/moba_ai/lib/python3.10/os.py", line 680, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'
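
The KeyError appears because plain python does not set these variables; only the launcher does. A minimal sketch, assuming you want the same script to run both with and without a launcher: fall back to single-process defaults when the variables are missing. The helper name get_dist_env and the default values are illustrative assumptions, not part of PyTorch.

```python
import os


def get_dist_env(environ=None):
    """Return (rank, world_size, local_rank) as integers.

    Falls back to single-process values when the launcher variables are
    absent. The defaults here are an assumption made for illustration.
    """
    env = os.environ if environ is None else environ
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    local_rank = int(env.get("LOCAL_RANK", 0))
    return rank, world_size, local_rank


if __name__ == "__main__":
    rank, world_size, local_rank = get_dist_env()
    print("rank", rank)
    print("world_size", world_size)
    print("local_rank", local_rank)
```

With this version, python test.py no longer raises, and under torchrun the values come from the environment as before.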

Running

torchrun --master_port 29501  test.py

gives

rank 0
world_size 1
local_rank 0

Running

torchrun --master_port 29501 --nproc_per_node 4 test.py

gives

WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
rank 1
world_size 4
local_rank 1
rank 2
world_size 4
local_rank 2
rank 0
world_size 4
local_rank 0
rank 3
world_size 4
local_rank 3
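
The four workers run concurrently, so the order in which they print (1, 2, 0, 3 above) is not fixed and can change between runs. A small sketch, assuming you want output that stays readable: tag each line with the worker's rank, and print some messages from rank 0 only. The helper names format_rank_tag and is_main_process are hypothetical.

```python
import os


def format_rank_tag(environ=None):
    """Build a short prefix identifying the current worker."""
    env = os.environ if environ is None else environ
    return "[rank {}/{} local_rank {}]".format(
        env.get("RANK", "0"),
        env.get("WORLD_SIZE", "1"),
        env.get("LOCAL_RANK", "0"),
    )


def is_main_process(environ=None):
    """True only for the worker whose global rank is 0."""
    env = os.environ if environ is None else environ
    return int(env.get("RANK", 0)) == 0


if __name__ == "__main__":
    # Every worker prints one tagged line.
    print(format_rank_tag(), "hello")
    # Only one worker prints this line, however many were launched.
    if is_main_process():
        print("printed once, by rank 0")
```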

Terminology

Node - A physical instance or a container; maps to the unit that the job manager works with. In short: one machine.

Worker - A worker in the context of distributed training. In short: one process.

WorkerGroup - The set of workers that execute the same function (e.g. trainers). In short: the process group.

LocalWorkerGroup - A subset of the workers in the worker group running on the same node. In short: the processes on one machine.

RANK - The rank of the worker within a worker group. In short: the index of the current process among all processes.

WORLD_SIZE - The total number of workers in a worker group. In short: the total number of processes, summed over all machines.

LOCAL_RANK - The rank of the worker within a local worker group. In short: the index of the process within its own machine.

LOCAL_WORLD_SIZE - The size of the local worker group. In short: the number of processes on one machine.
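
To make the relation between these four values concrete, here is a worked sketch. It assumes every machine runs the same number of processes and that global ranks are handed out in contiguous blocks per machine. That layout is an assumption for illustration; the launcher does not promise any particular ordering of machines. The node_index parameter and the function names are hypothetical.

```python
def global_rank(node_index, local_rank, local_world_size):
    """Global rank under the assumed contiguous-per-node layout."""
    if not 0 <= local_rank < local_world_size:
        raise ValueError("local_rank must be in [0, local_world_size)")
    return node_index * local_world_size + local_rank


def world_size(num_nodes, local_world_size):
    """Total number of processes when every node runs the same count."""
    return num_nodes * local_world_size


if __name__ == "__main__":
    # Example: 2 machines with 4 processes each.
    num_nodes, per_node = 2, 4
    for node in range(num_nodes):
        for local in range(per_node):
            print(
                "node", node,
                "local_rank", local,
                "rank", global_rank(node, local, per_node),
                "world_size", world_size(num_nodes, per_node),
            )
```

On a single machine the node index is 0, so RANK equals LOCAL_RANK and WORLD_SIZE equals LOCAL_WORLD_SIZE, which matches the 4-process output shown earlier.
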
I do not fully understand the terms below.

rdzv_id - A user-defined id that uniquely identifies the worker group for a job. This id is used by each node to join as a member of a particular worker group.

rdzv_backend - The backend of the rendezvous (e.g. c10d). This is typically a strongly consistent key-value store.

rdzv_endpoint - The rendezvous backend endpoint; usually in form <host>:<port>.
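
As a rough illustration of where these three settings go, the sketch below assembles a torchrun command line in which each of them is passed as a flag of the same name, with the endpoint written as host:port. The job id, host name, port, and the helper build_torchrun_command are all made-up values for illustration, not settings from this post.

```python
import shlex


def build_torchrun_command(script, nnodes, nproc_per_node,
                           rdzv_id, host, port, rdzv_backend="c10d"):
    """Assemble a torchrun argument list (nothing is executed here)."""
    return [
        "torchrun",
        "--nnodes", str(nnodes),
        "--nproc_per_node", str(nproc_per_node),
        "--rdzv_id", str(rdzv_id),             # same id on every node of the job
        "--rdzv_backend", rdzv_backend,        # e.g. c10d
        "--rdzv_endpoint", f"{host}:{port}",   # host:port of the rendezvous
        script,
    ]


if __name__ == "__main__":
    # Hypothetical values: 2 machines, 4 processes each.
    cmd = build_torchrun_command(
        "test.py", nnodes=2, nproc_per_node=4,
        rdzv_id="job42", host="node0.example.com", port=29400,
    )
    print(shlex.join(cmd))
```

The same command would be run on every machine taking part in the job, so that they all meet at the same endpoint under the same id.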

Errors

RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).

The default port 29500 is already taken by another process. Pass --master_port 29501 to switch to a different port.
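
Rather than guessing a port number, you can check a port or ask the operating system for a free one. A minimal sketch using only the standard library; the helper names find_free_port and port_in_use are hypothetical, and the check is only a hint, because another process can still grab the port between the check and the launch.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding to (host, port) fails."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return True
        return False


if __name__ == "__main__":
    print("29500 in use:", port_in_use(29500))
    print("free port:", find_free_port())
```

The printed port can then be passed to torchrun through --master_port.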

VS Code

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "test1",
            "type": "debugpy",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": ["--device", "2"]
        },
        {
            // torchrun --nproc_per_node 4 train_multi_gpu.py
            "name": "torchrun",
            "type": "debugpy",
            "request": "launch",
            "program": "/root/anaconda3/envs/moba_ai/bin/torchrun", // "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": [
                // "--nnodes", "1",
                "--nproc_per_node", "4",
                "${file}"
            ]
        },
        {
            // CUDA_VISIBLE_DEVICES=0,2 torchrun --nproc_per_node 2 train_multi_gpu.py
            "name": "torchrun_v2",
            "type": "debugpy",
            "request": "launch",
            "program": "/root/anaconda3/envs/moba_ai/bin/torchrun", // "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": [
                // "--nnodes", "1",
                "--nproc_per_node", "2",
                "${file}"
            ],
            "env": {"CUDA_VISIBLE_DEVICES": "0,2"}
        },
        {
            // python -m torch.distributed.launch --nproc_per_node 4 --use_env train_multi_gpu.py
            "name": "DDP",
            "type": "debugpy",
            "request": "launch",
            "module": "torch.distributed.launch",
            // "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": [
                // "--nnodes", "1",
                "--nproc_per_node", "4",
                "--use_env",
                "${file}"
            ]
        }
    ]
}

https://pytorch.org/docs/stable/elastic/run.html
https://www.bilibili.com/video/BV1b84y1R75V/?spm_id_from=333.337.search-card.all.click
https://zhuanlan.zhihu.com/p/681694092
https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py
