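PyTorch fails on an H20 machine equipped with NVLink
Background
On an H20 machine with NVLink, PyTorch reports an error even after the driver and CUDA have been installed and configured: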
(pytorch2.4) root@xx-dev-H20:~# python
Python 3.12.0 | packaged by Anaconda, Inc. | (main, Oct 2 2023, 17:29:18) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/root/anaconda3/envs/pytorch2.4/lib/python3.12/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at /opt/conda/conda-bld/pytorch_1724789220573/work/c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
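Solution
Error 802 ("system not yet initialized") on this kind of machine typically means that the Fabric Manager service, which NVSwitch-based NVLink systems need before CUDA can initialize, is not running. On the NVIDIA Fabric Manager download site, find the Fabric Manager version that matches the driver version installed on the H20 machine, install it, and start it. After that, CUDA is available: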
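The version-matching step can be sketched in Python. This is a minimal sketch, not the exact procedure used here: it reads the driver version with nvidia-smi and builds the download URL for the matching archive. The function names are hypothetical, and the file name pattern (fabricmanager-linux-x86_64-<version>-archive.tar.xz) is an assumption inferred from the directory name in the session shown in this section, so check the directory listing before downloading.

```python
import shutil
import subprocess

# Download directory listed in the references of this note.
BASE_URL = (
    "https://developer.download.nvidia.cn/compute/nvidia-driver/"
    "redist/fabricmanager/linux-x86_64/"
)


def driver_version():
    """Return the installed NVIDIA driver version, or None if it cannot be read."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    # nvidia-smi prints one line per GPU; they share one driver, so use the first.
    return lines[0].strip() if lines else None


def fabricmanager_archive_url(version):
    """Build the archive URL for a driver version (assumed naming pattern)."""
    return f"{BASE_URL}fabricmanager-linux-x86_64-{version}-archive.tar.xz"


if __name__ == "__main__":
    version = driver_version()
    if version is None:
        print("Could not read the driver version from nvidia-smi")
    else:
        print(fabricmanager_archive_url(version))
```

The key point is that the Fabric Manager version must equal the driver version exactly (550.163.01 on this machine), which is why the URL is derived from nvidia-smi rather than typed by hand.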
(pytorch2.4) root@xx-dev-H20:/opt/fabricmanager-linux-x86_64-550.163.01-archive# python
Python 3.12.0 | packaged by Anaconda, Inc. | (main, Oct 2 2023, 17:29:18) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
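Beyond is_available(), it can be useful to confirm that every GPU is visible and that the GPUs can reach each other. The following is a minimal sketch using standard torch.cuda calls; the function name cuda_summary is hypothetical. Note that peer access only shows that direct GPU-to-GPU access is possible, it does not by itself prove that traffic goes over NVLink.

```python
def cuda_summary():
    """Summarize CUDA visibility; returns None if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return None

    if not torch.cuda.is_available():
        return {"available": False, "device_count": 0, "peer_access": {}}

    count = torch.cuda.device_count()
    # For every ordered pair of distinct GPUs, ask whether i can access j directly.
    peer_access = {
        (i, j): torch.cuda.can_device_access_peer(i, j)
        for i in range(count)
        for j in range(count)
        if i != j
    }
    return {"available": True, "device_count": count, "peer_access": peer_access}


if __name__ == "__main__":
    print(cuda_summary())
```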
Checking NVLink throughput
nvidia-smi nvlink --getthroughput d
watch -n 1 'nvidia-smi nvlink -gt d'
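To turn the counters into a rate, two samples can be taken and subtracted. The sketch below does this in Python under two assumptions that should be checked against the actual output on your driver version: that each counter line looks like "Link 0: Data Tx: 1234 KiB" under a "GPU 0: ..." header line, and that the counters are cumulative. All function names are hypothetical.

```python
import re
import subprocess
import time

# Assumed output format (verify on your machine):
#   GPU 0: <name> (UUID: ...)
#        Link 0: Data Tx: 1234 KiB
#        Link 0: Data Rx: 5678 KiB
GPU_RE = re.compile(r"^GPU\s+(\d+):")
LINK_RE = re.compile(r"Link\s+(\d+):\s+Data\s+(Tx|Rx):\s+(\d+)\s+KiB")


def parse_nvlink_counters(text):
    """Map (gpu, link, direction) to the counter value in KiB."""
    counters = {}
    gpu = None
    for raw in text.splitlines():
        line = raw.strip()
        gpu_match = GPU_RE.match(line)
        if gpu_match:
            gpu = int(gpu_match.group(1))
            continue
        link_match = LINK_RE.search(line)
        if link_match and gpu is not None:
            key = (gpu, int(link_match.group(1)), link_match.group(2))
            counters[key] = int(link_match.group(3))
    return counters


def rates_kib_per_s(before, after, interval_s):
    """Per-link rate in KiB/s, assuming the counters are cumulative."""
    return {
        key: (after[key] - before[key]) / interval_s
        for key in after
        if key in before
    }


def sample():
    """Run the same command as above and parse its output."""
    out = subprocess.run(
        ["nvidia-smi", "nvlink", "-gt", "d"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_nvlink_counters(out)


def measure(interval_s=1.0):
    """Take two samples interval_s apart and return the per-link rates."""
    first = sample()
    time.sleep(interval_s)
    return rates_kib_per_s(first, sample(), interval_s)
```

Calling measure() while a multi-GPU job is running should show non-zero rates on the links in use; if every rate stays at zero, the traffic is probably not going over NVLink.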
References
Fabric Manager download location
https://developer.download.nvidia.cn/compute/nvidia-driver/redist/fabricmanager/linux-x86_64/
NCCL communication over NVLink (environment variables)
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html