DDP notes
There is no need to set RANK, WORLD_SIZE, or LOCAL_RANK manually:
launching with torchrun (or python -m torch.distributed.launch) sets them automatically.
import os
print('rank', os.environ["RANK"])
print('world_size', os.environ["WORLD_SIZE"])
print('local_rank', os.environ["LOCAL_RANK"])
Running
python test.py
gives
Traceback (most recent call last):
  File "/root/test.py", line 2, in <module>
    print('rank', os.environ["RANK"])
  File "/root/anaconda3/envs/moba_ai/lib/python3.10/os.py", line 680, in __getitem__
    raise KeyError(key) from None
KeyError: 'RANK'
Running
torchrun --master_port 29501 test.py
gives
rank 0
world_size 1
local_rank 0
Running
torchrun --master_port 29501 --nproc_per_node 4 test.py
gives
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
rank 1
world_size 4
local_rank 1
rank 2
world_size 4
local_rank 2
rank 0
world_size 4
local_rank 0
rank 3
world_size 4
local_rank 3
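A minimal sketch of how a training script typically consumes these variables. The os.environ.get defaults are an assumption added here so the same file also runs without torchrun; the variable names are the ones printed above:

```python
import os

# torchrun injects these; fall back to single-process values when launched plainly
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# each process usually binds to the GPU matching its LOCAL_RANK
device = f"cuda:{local_rank}"
print(rank, world_size, local_rank, device)
```

Run plainly (without torchrun) this prints the single-process defaults.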
Terminology
Node - A physical instance or a container; maps to the unit that the job manager works with. (one machine)
Worker - A worker in the context of distributed training. (one process)
WorkerGroup - The set of workers that execute the same function (e.g. trainers). (the process group)
LocalWorkerGroup - A subset of the workers in the worker group running on the same node. (the processes on one machine)
RANK - The rank of the worker within a worker group. (this process's index across all machines)
WORLD_SIZE - The total number of workers in a worker group. (total processes, summed over all machines)
LOCAL_RANK - The rank of the worker within a local worker group. (this process's index on its own machine)
LOCAL_WORLD_SIZE - The size of the local worker group. (number of processes on one machine)
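How these quantities relate, using made-up numbers (2 nodes x 4 workers per node; all numbers here are assumptions for illustration):

```python
# Hypothetical job: 2 nodes, 4 workers per node
nnodes = 2
local_world_size = 4

world_size = nnodes * local_world_size          # total workers across all nodes
ranks = [
    node_rank * local_world_size + local_rank   # global RANK is unique across nodes
    for node_rank in range(nnodes)
    for local_rank in range(local_world_size)
]
# LOCAL_RANK repeats on every node (0..3), while RANK does not (0..7)
print(world_size, ranks)
```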
I don't fully understand the terms below yet.
rdzv_id - A user-defined id that uniquely identifies the worker group for a job. This id is used by each node to join as a member of a particular worker group.
rdzv_backend - The backend of the rendezvous (e.g. c10d). This is typically a strongly consistent key-value store.
rdzv_endpoint - The rendezvous backend endpoint; usually in form host:port.
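The rdzv_* flags only matter for multi-node (elastic) jobs, where every node must agree on the same rendezvous. An illustrative two-node launch; the hostname, job id, and script name are placeholders, not from anything above:

```shell
# run the same command on each of the 2 nodes; node0.example.com hosts the c10d store
torchrun \
  --nnodes 2 \
  --nproc_per_node 4 \
  --rdzv_id my_job_42 \
  --rdzv_backend c10d \
  --rdzv_endpoint node0.example.com:29400 \
  train_multi_gpu.py
```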
Common error
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:29500 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:29500 (errno: 98 - Address already in use).
The default port 29500 is already in use; pass --master_port 29501
to switch to a free port.
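Instead of guessing a free port, you can let the OS pick one (a stdlib sketch; note there is a small race window between finding the port and torchrun binding it):

```python
import socket

def find_free_port() -> int:
    # binding to port 0 asks the OS for any currently unused port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = find_free_port()  # e.g. pass this value as --master_port
```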
VS Code launch.json
{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "test1",
            "type": "debugpy",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": ["--device", "2"]
        },
        { //torchrun --nproc_per_node 4 train_multi_gpu.py
            "name": "torchrun",
            "type": "debugpy",
            "request": "launch",
            "program": "/root/anaconda3/envs/moba_ai/bin/torchrun", //"${file}"
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": [
                //"--nnodes", "1",
                "--nproc_per_node", "4",
                "${file}"
            ]
        },
        { //CUDA_VISIBLE_DEVICES=0,2 torchrun --nproc_per_node 2 train_multi_gpu.py
            "name": "torchrun_v2",
            "type": "debugpy",
            "request": "launch",
            "program": "/root/anaconda3/envs/moba_ai/bin/torchrun", //"${file}"
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": [
                //"--nnodes", "1",
                "--nproc_per_node", "2",
                "${file}"
            ],
            "env": {"CUDA_VISIBLE_DEVICES": "0,2"}
        },
        { //python -m torch.distributed.launch --nproc_per_node 4 --use_env train_multi_gpu.py
            "name": "DDP",
            "type": "debugpy",
            "request": "launch",
            "module": "torch.distributed.launch",
            // "program": "${file}",
            "console": "integratedTerminal",
            "justMyCode": false,
            "args": [
                //"--nnodes", "1",
                "--nproc_per_node", "4",
                "--use_env",
                "${file}"
            ]
        }
    ]
}
https://pytorch.org/docs/stable/elastic/run.html
https://www.bilibili.com/video/BV1b84y1R75V/?spm_id_from=333.337.search-card.all.click
https://zhuanlan.zhihu.com/p/681694092
https://github.com/pytorch/pytorch/blob/main/torch/distributed/run.py