DDP notes
There is no need to set RANK, WORLD_SIZE, or LOCAL_RANK manually:
launching with torchrun (or python -m torch.distributed.launch) sets them automatically.
import os
print('rank', os.environ["RANK"])
print('world_size', os.environ["WORLD_SIZE"])
print('local_rank', os.environ["LOCAL_RANK"])
Running
python test.py
gives
Traceback (most recent call last):
  File "/root/test.py", line 2, in <module>
    print('rank', os.environ["RANK"])
  File "/root/anaconda3/envs/moba_ai/lib/python3.10/os.py", line 680, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'
Running
torchrun --master_port 29501 test.py
gives
rank 0
world_size 1
local_rank 0
Running
torchrun --master_port 29501 --nproc_per_node 4 test.py
gives
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
rank 1
world_size 4
local_rank 1
rank 2
world_size 4
local_rank 2
rank 0
world_size 4
local_rank 0
rank 3
world_size 4
local_rank 3
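A minimal sketch of how a training script typically consumes these variables. The os.environ.get defaults are an assumption added here so the same file also runs without torchrun; the variable names are the ones printed above:

```python
import os

# torchrun injects these; fall back to single-process values when launched plainly
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# each process usually binds to the GPU matching its LOCAL_RANK
device = f"cuda:{local_rank}"
print(rank, world_size, local_rank, device)
```

Run plainly (without torchrun) this prints the single-process defaults.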
Terminology
Node - A physical instance or a container; maps to the unit that the job manager works with. (one machine)
Worker - A worker in the context of distributed training. (one process)
WorkerGroup - The set of workers that execute the same function (e.g. trainers). (the process group)
LocalWorkerGroup - A subset of the workers in the worker group running on the same node. (the processes on one machine)
RANK - The rank of the worker within a worker group. (this process's index across all machines)
WORLD_SIZE - The total number of workers in a worker group. (total processes, summed over all machines)
LOCAL_RANK - The rank of the worker within a local worker group. (this process's index on its own machine)
LOCAL_WORLD_SIZE - The size of the local worker group. (number of processes on one machine)
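How these quantities relate, using made-up numbers (2 nodes x 4 workers per node; all numbers here are assumptions for illustration):

```python
# Hypothetical job: 2 nodes, 4 workers per node
nnodes = 2
local_world_size = 4

world_size = nnodes * local_world_size          # total workers across all nodes
ranks = [
    node_rank * local_world_size + local_rank   # global RANK is unique across nodes
    for node_rank in range(nnodes)
    for local_rank in range(local_world_size)
]
# LOCAL_RANK repeats on every node (0..3), while RANK does not (0..7)
print(world_size, ranks)
```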
I don't fully understand the terms below yet.
rdzv_id - A user-defined id that uniquely identifies the worker group for a job. This id is used by each node to join as a member of a particular worker group.
rdzv_backend - The backend of the rendezvous (e.g. c10d). This is typically a strongly consistent key-value store.
rdzv_endpoint - The rendezvous backend endpoint; usually in form host:port.
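The rdzv_* flags only matter for multi-node (elastic) jobs, where every node must agree on the same rendezvous. An illustrative two-node launch; the hostname, job id, and script name are placeholders, not from anything above:

```shell
# run the same command on each of the 2 nodes; node0.example.com hosts the c10d store
torchrun \
  --nnodes 2 \
  --nproc_per_node 4 \
  --rdzv_id my_job_42 \
  --rdzv_backend c10d \
  --rdzv_endpoint node0.example.com:29400 \
  train_multi_gpu.py
```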
Common error
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
The default port 29500 is already in use; pass --master_port 29501
to switch to a free port.
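Instead of guessing a free port, you can let the OS pick one (a stdlib sketch; note there is a small race window between finding the port and torchrun binding it):

```python
import socket

def find_free_port() -> int:
    # binding to port 0 asks the OS for any currently unused port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = find_free_port()  # e.g. pass this value as --master_port
```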
VS Code launch.json
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "test1",
            "type": "debugpy",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": ["--device", "2"]
        },
        { //torchrun --nproc_per_node 4 train_multi_gpu.py
            "name": "torchrun",
            "type": "debugpy",
            "request": "launch",
            "program": "/root/anaconda3/envs/moba_ai/bin/torchrun", //"${file}"
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": [
                //"--nnodes", "1",
                "--nproc_per_node", "4",
                "${file}"
            ]
        },
        { //CUDA_VISIBLE_DEVICES=0,2 torchrun --nproc_per_node 2 train_multi_gpu.py
            "name": "torchrun_v2",
            "type": "debugpy",
            "request": "launch",
            "program": "/root/anaconda3/envs/moba_ai/bin/torchrun", //"${file}"
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": [
                //"--nnodes", "1",
                "--nproc_per_node", "2",
                "${file}"
            ],
            "env": {"CUDA_VISIBLE_DEVICES": "0,2"}
        },
        { //python -m torch.distributed.launch --nproc_per_node 4 --use_env train_multi_gpu.py
            "name": "DDP",
            "type": "debugpy",
            "request": "launch",
            "module": "torch.distributed.launch",
            // "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": [
                //"--nnodes", "1",
                "--nproc_per_node", "4",
                "--use_env",
                "${file}"
            ]
        }
    ]
}
https://pytorch.org/docs/stable/elastic/run.html
https://www.bilibili.com/video/BV1b84y1R75V/?spm_id_from=333.337.search-card.all.click
https://zhuanlan.zhihu.com/p/681694092
https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py