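Getting H100 GPUs to work properly inside a Docker container

These are the settings needed to use Docker on an H100 GPU machine.

I applied every setting I already knew, but it still did not work.

The problem seems to come from something I have not dealt with before, such as the H100 architecture or NVLink.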

1. nvidia-smi and nvcc work fine.

# nvidia-smi
Thu Apr 10 04:09:12 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:19:00.0 Off |                    0 |
| N/A   35C    P0             73W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0             72W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:4C:00.0 Off |                    0 |
| N/A   31C    P0            112W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:5D:00.0 Off |                    0 |
| N/A   35C    P0            114W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:9B:00.0 Off |                    0 |
| N/A   37C    P0            113W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:BB:00.0 Off |                    0 |
| N/A   33C    P0            111W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:CB:00.0 Off |                    0 |
| N/A   35C    P0            112W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:DB:00.0 Off |                    0 |
| N/A   31C    P0            111W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

What made this so hard is that I had always assumed everything would work as long as these two commands were fine ...
and yet CUDA still fails to work here.

python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"
/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
False 8

I installed the graphics driver that matches CUDA 12.4,
and I pulled and ran the matching pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel Docker image,

but inside the container the output still says that CUDA is unavailable.

Failed ...
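
The number in that warning is worth reading before trying anything else. Below is a minimal sketch that pulls the error code out of the PyTorch warning text and maps it to the CUDA runtime enum name. The function name parse_cuda_init_error and the small lookup table are my own illustration, not part of PyTorch; the enum names are the ones listed for cudaError in the CUDA Runtime API documentation.

```python
import re

# Assumption: a hand-picked subset of CUDA runtime error codes, named as in
# the cudaError enum of the CUDA Runtime API documentation.
KNOWN_CUDA_ERRORS = {
    100: "cudaErrorNoDevice",
    802: "cudaErrorSystemNotReady",
    803: "cudaErrorSystemDriverMismatch",
    804: "cudaErrorCompatNotSupportedOnDevice",
}

# Matches "Error <code>: <message> (" as it appears in the PyTorch warning.
_ERROR_RE = re.compile(r"Error (\d+): ([^(]+?)\s*\(")


def parse_cuda_init_error(warning_text):
    """Return (code, message, enum_name) from a CUDA init warning, or None."""
    match = _ERROR_RE.search(warning_text)
    if match is None:
        return None
    code = int(match.group(1))
    return code, match.group(2), KNOWN_CUDA_ERRORS.get(code, "unknown")


if __name__ == "__main__":
    warning = (
        "UserWarning: CUDA initialization: Unexpected error from "
        "cudaGetDeviceCount(). Error 802: system not yet initialized "
        "(Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)"
    )
    print(parse_cuda_init_error(warning))
```

Error 802 is cudaErrorSystemNotReady, which the CUDA documentation describes as the system not being ready to start CUDA work, with a hint to check that all required driver daemons are running. That points away from the container setup and toward a host-side service.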

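2. Is it an nvidia-docker problem?

In the past, using the GPU properly inside Docker required
a toolkit called nvidia-docker.

According to the official documentation, it now seems to be provided under the name NVIDIA Container Toolkit.
(https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
Other people have reported similar problems elsewhere.

I applied the method from the documentation above exactly as written.

Failed ...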

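3. Is it a Docker runtime problem?

If the wrong runtime is specified when running Docker, the container cannot see the GPUs properly.

Of course, nvidia-smi would not run properly in that case either ... but let's verify anyway.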

cat /etc/docker/daemon.json

{
    "default-runtime": "nvidia",
    "dns": [
        "8.8.8.8",
        "8.8.4.4"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

That's not it either.
Both default-runtime and runtimes are set correctly.
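
The same check can be done mechanically. This is a small sketch, assuming the file has the layout shown above; check_nvidia_runtime is a hypothetical helper name, and it only looks at the two keys discussed here.

```python
import json


def check_nvidia_runtime(daemon_json_text):
    """Return a list of problems found in the text of a Docker daemon.json."""
    config = json.loads(daemon_json_text)
    problems = []

    runtimes = config.get("runtimes", {})
    if "nvidia" not in runtimes:
        problems.append("runtimes.nvidia is not registered")
    elif not runtimes["nvidia"].get("path"):
        problems.append("runtimes.nvidia.path is empty")

    # A misspelled key (for example "default-ruitime") lands here as well,
    # because the correctly spelled key is simply absent.
    if config.get("default-runtime") != "nvidia":
        problems.append("default-runtime is not nvidia")

    return problems


if __name__ == "__main__":
    with open("/etc/docker/daemon.json") as handle:
        print(check_nvidia_runtime(handle.read()) or "runtime config looks fine")
```

An empty list means both settings are in place, which is what the file above gives.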

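4. Did I pass --gpus and --runtime when running the container?

I may have left them out by mistake.

Of course, since the default runtime is already nvidia, no extra setting should be needed.

Still, let's check just in case.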

sudo docker run --rm --gpus all --runtime=nvidia pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel bash -c "python -c 'import torch; print(\"CUDA available:\", torch.cuda.is_available()); print(\"CUDA version:\", torch.version.cuda)'"

/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
CUDA available: False
CUDA version: 12.4

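Right, that's not it.

No matter how much I think about it I can't see the answer, but let's go through everything else one by one ...

5. Are the driver devices failing to be passed from the host into the container?

There is no reason they should be, but ...
it is possible that the NVIDIA driver components are not being passed through.

Let's take a look ...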

# HOST
ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Apr  9 07:46 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Apr  9 07:46 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Apr  9 07:46 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Apr  9 07:46 /dev/nvidia3
crw-rw-rw- 1 root root 195,   4 Apr  9 07:46 /dev/nvidia4
crw-rw-rw- 1 root root 195,   5 Apr  9 07:46 /dev/nvidia5
crw-rw-rw- 1 root root 195,   6 Apr  9 07:46 /dev/nvidia6
crw-rw-rw- 1 root root 195,   7 Apr  9 07:46 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Apr  9 07:46 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Apr  9 09:25 /dev/nvidia-modeset
crw-rw-rw- 1 root root 502,   0 Apr  9 07:46 /dev/nvidia-uvm
crw-rw-rw- 1 root root 502,   1 Apr  9 07:46 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
cr-------- 1 root root 505, 1 Apr  9 07:46 nvidia-cap1
cr--r--r-- 1 root root 505, 2 Apr  9 07:46 nvidia-cap2

# Container
ls -l /dev/nvidia*
crw-rw-rw- 1 root root 502,   0 Apr  9 07:46 /dev/nvidia-uvm
crw-rw-rw- 1 root root 502,   1 Apr  9 07:46 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Apr  9 07:46 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Apr  9 07:46 /dev/nvidia1
crw-rw-rw- 1 root root 195,   2 Apr  9 07:46 /dev/nvidia2
crw-rw-rw- 1 root root 195,   3 Apr  9 07:46 /dev/nvidia3
crw-rw-rw- 1 root root 195,   4 Apr  9 07:46 /dev/nvidia4
crw-rw-rw- 1 root root 195,   5 Apr  9 07:46 /dev/nvidia5
crw-rw-rw- 1 root root 195,   6 Apr  9 07:46 /dev/nvidia6
crw-rw-rw- 1 root root 195,   7 Apr  9 07:46 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Apr  9 07:46 /dev/nvidiactl

That's not it either.

If the devices had not been passed through correctly, nvidia-smi obviously would not have shown the driver version either.
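
Comparing the two listings by eye is error-prone, so here is a minimal sketch that diffs them. It assumes the `ls -l /dev/nvidia*` output format shown above; device_nodes and missing_in_container are hypothetical helper names.

```python
def device_nodes(ls_output):
    """Collect /dev/nvidia* character-device paths from `ls -l` output."""
    nodes = set()
    for line in ls_output.splitlines():
        fields = line.split()
        # Character devices have a mode string starting with 'c';
        # the path is the last whitespace-separated field.
        if fields and fields[0].startswith("c") and fields[-1].startswith("/dev/nvidia"):
            nodes.add(fields[-1])
    return nodes


def missing_in_container(host_ls, container_ls):
    """Return device paths that exist on the host but not in the container."""
    return sorted(device_nodes(host_ls) - device_nodes(container_ls))


if __name__ == "__main__":
    host = """\
crw-rw-rw- 1 root root 195,   0 Apr  9 07:46 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Apr  9 07:46 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Apr  9 09:25 /dev/nvidia-modeset
crw-rw-rw- 1 root root 502,   0 Apr  9 07:46 /dev/nvidia-uvm
"""
    container = """\
crw-rw-rw- 1 root root 502,   0 Apr  9 07:46 /dev/nvidia-uvm
crw-rw-rw- 1 root root 195,   0 Apr  9 07:46 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Apr  9 07:46 /dev/nvidiactl
"""
    print(missing_in_container(host, container))
```

On the listings above, the only /dev/nvidia* node missing from the container is /dev/nvidia-modeset (the nvidia-caps entries are also absent). All eight GPU nodes, nvidiactl, and the nvidia-uvm nodes are present, and as far as I know the modeset node is not something compute workloads depend on.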

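6. Then could the driver version be incompatible with CUDA?

No. The documentation clearly states that it works with 550.54.14 or later.

I don't know why, but version 550.54.15, which satisfies this, is what is installed.

In any case, going by what is documented, there is no problem.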
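
Driver versions should be compared component by component, not as strings. A minimal sketch follows; parse_version and driver_satisfies are hypothetical helper names, and the minimum is the 550.54.14 figure quoted above.

```python
def parse_version(text):
    """Turn a dotted version such as '550.54.15' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def driver_satisfies(installed, minimum):
    """Return True when the installed driver is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    print(driver_satisfies("550.54.15", "550.54.14"))
```

The tuple comparison matters because these version strings do not all have the same number of components: as plain strings, "550.54.15" sorts after "550.120", while numerically 54 is smaller than 120.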

7. Then could a permission problem inside the container be blocking access to the driver?

sudo docker run --rm --gpus all --runtime=nvidia --privileged  pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel   bash -c "python -c 'import torch; print(\"CUDA available:\", torch.cuda.is_available()); print(\"CUDA version:\", torch.version.cuda)'"

/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
CUDA available: False
CUDA version: 12.4

Just in case, I added --privileged, but it still does not work.

What on earth is it?

8. Could some other environment variables be unset?

sudo docker run --rm -it \
  --gpus all \
  --runtime=nvidia \
  --privileged \
  -e LD_LIBRARY_PATH=/usr/local/cuda/lib64:/lib/x86_64-linux-gnu \
  -v /usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu \
  pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel \
  bash -c "ldconfig && nvidia-smi && python -c 'import torch; print(\"CUDA available:\", torch.cuda.is_available())'"

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:19:00.0 Off |                    0 |
| N/A   36C    P0            115W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:3B:00.0 Off |                    0 |
| N/A   33C    P0            114W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
...
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:DB:00.0 Off |                    0 |
| N/A   30C    P0            111W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
CUDA available: False

It doesn't work.

I have now tried everything one would normally try.

9. Last resort. Let's lower the versions.

First, let's downgrade PyTorch.

Downgrade from 2.6.0 to 2.5.1.

sudo docker run --rm --gpus all --runtime=nvidia --privileged pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel bash -c "python -c 'import torch; print(\"CUDA available:\", torch.cuda.is_available()); print(\"CUDA version:\", torch.version.cuda)'"

/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
CUDA available: False
CUDA version: 12.4

Failed.

Next, let's lower the CUDA version.

sudo docker run --rm --gpus all --runtime=nvidia --privileged pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel bash -c "python -c 'import torch; print(\"CUDA available:\", torch.cuda.is_available()); print(\"CUDA version:\", torch.version.cuda)'"

/opt/conda/lib/python3.11/site-packages/torch/cuda/__init__.py:129: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 802: system not yet initialized (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
CUDA available: False
CUDA version: 12.1

This doesn't work either.
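
Since the same probe gets run against several images, and the nested shell quoting is easy to break, here is a sketch that builds the command as an argument list instead. docker_torch_check is a hypothetical helper; the flags and image tags are the ones used in the commands above, and the probe skips the intermediate bash -c layer.

```python
import shlex
import subprocess

# The Python one-liner executed inside the container.
PROBE = (
    'import torch; '
    'print("CUDA available:", torch.cuda.is_available()); '
    'print("CUDA version:", torch.version.cuda)'
)


def docker_torch_check(image, privileged=False):
    """Build the docker argv that runs the torch CUDA probe in the given image."""
    cmd = ["docker", "run", "--rm", "--gpus", "all", "--runtime=nvidia"]
    if privileged:
        cmd.append("--privileged")
    # Passing the probe as its own argv element avoids nested shell quoting.
    cmd += [image, "python", "-c", PROBE]
    return cmd


if __name__ == "__main__":
    images = [
        "pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
        "pytorch/pytorch:2.5.1-cuda12.4-cudnn9-devel",
        "pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel",
    ]
    for image in images:
        cmd = docker_torch_check(image, privileged=True)
        print(shlex.join(cmd))
        subprocess.run(cmd, check=False)
```

All three images failing with the same error 802 suggests that the PyTorch and CUDA versions inside the image are not the variable that matters.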

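10. I don't know. Let's keep digging.

I skipped it at first because it was about an older CUDA version, but
the post I found says that error 802 is probably related to the Fabric Manager.

That sounds like a plausible problem on high-end GPUs such as the A100 or H100.

Let's look into it.

NVIDIA Fabric Manager appears to be a component related to NVLink.

I have never touched the configuration of an 8-way server, so I had never come across it before.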

dpkg -l | grep nvidia-fabricmanager
sudo systemctl status nvidia-fabricmanager

Neither command returns anything.
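
The dpkg check can also be scripted. This is a minimal sketch that parses `dpkg -l` output and keeps only fully installed packages (status "ii"); installed_packages is a hypothetical helper name.

```python
import subprocess


def installed_packages(dpkg_output, needle):
    """Return {package: version} for installed packages whose name contains needle."""
    found = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Rows of `dpkg -l` look like: "ii  <name>  <version>  <arch>  <description>".
        if len(fields) >= 3 and fields[0] == "ii" and needle in fields[1]:
            found[fields[1]] = fields[2]
    return found


if __name__ == "__main__":
    listing = subprocess.run(
        ["dpkg", "-l"], capture_output=True, text=True, check=False
    ).stdout
    print(installed_packages(listing, "nvidia-fabricmanager") or "not installed")
```

An empty result here corresponds to the empty grep output above: no Fabric Manager package is installed on the host.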

This is probably the culprit.

I need to install the Fabric Manager build that matches the current NVIDIA driver version.

Simply installing it through apt kept failing with errors ....

https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/

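You have to go to the developer.download.nvidia.com index above and look through it entry by entry ...

Find the Fabric Manager package that matches the driver version.
One thing to be careful about, though:

it is not "cuda-drivers-fabricmanager_550.54.15-1_amd64.deb".

Further down the listing you will find

"nvidia-fabricmanager-550_550.54.15-1_amd64.deb".
Be sure to install the nvidia-fabricmanager package.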

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager-550_550.54.15-1_amd64.deb
sudo dpkg -i nvidia-fabricmanager-550_550.54.15-1_amd64.deb

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-kernel-source-550_550.54.15-0ubuntu1_amd64.deb
sudo dpkg -i nvidia-kernel-source-550_550.54.15-0ubuntu1_amd64.deb
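
The package name follows directly from the driver version, so it can be derived instead of searched for by hand. A minimal sketch, assuming the naming pattern seen in the repository index above (branch number, full driver version, and a "-1" package revision); fabricmanager_deb is a hypothetical helper and other driver branches may use a different revision suffix.

```python
REPO = "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/"


def fabricmanager_deb(driver_version, revision="1", arch="amd64"):
    """Build the nvidia-fabricmanager .deb file name for a given driver version."""
    # The driver branch is the first component, e.g. "550" for "550.54.15".
    branch = driver_version.split(".")[0]
    return f"nvidia-fabricmanager-{branch}_{driver_version}-{revision}_{arch}.deb"


if __name__ == "__main__":
    print(REPO + fabricmanager_deb("550.54.15"))
```

The important part is that the Fabric Manager version is taken from the exact driver version reported by nvidia-smi, not just from the branch number.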

>>> torch.cuda.is_available()
True

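Finally, success!
This was the first time I learned that a multi-GPU server whose GPUs are linked through NVSwitch, which an 8-way H100 box like this one presumably is, needs Fabric Manager installed. (I rarely get to set one up myself ...)
If torch still reports False after the install, it is worth checking with "sudo systemctl status nvidia-fabricmanager" that the service is actually running.

11. An NVML version mismatch appears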

root@dell:/data# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
NVML library version: 550.120

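The issue above came up.
When you ask apt for the 550 driver, it installs some other version by default, not the 550.54.15 we want.

The problem is that the dependencies of the 550.54.15 packages simply request the generic 550 version (....)
Because of this, running a command such as "sudo apt-get install -f" suddenly triggers the version mismatch.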

wget https://us.download.nvidia.com/tesla/550.54.15/NVIDIA-Linux-x86_64-550.54.15.run
chmod +x NVIDIA-Linux-x86_64-550.54.15.run
sudo ./NVIDIA-Linux-x86_64-550.54.15.run

I removed the installed driver, then downloaded 550.54.15 again and ran the installer.
(For some reason, 550.120 does not go away ...)
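
To see which side of the mismatch is stale, the loaded kernel module version can be compared with the library version that nvidia-smi prints. The sketch below assumes that /proc/driver/nvidia/version contains a line of the form "NVRM version: NVIDIA UNIX x86_64 Kernel Module  <version>  <build date>"; that format and the helper names are assumptions on my side, so adjust the pattern if the file looks different on your machine.

```python
import re

_VERSION = r"(\d+(?:\.\d+)+)"


def kernel_module_version(proc_text):
    """Extract the loaded kernel module version from /proc/driver/nvidia/version text."""
    match = re.search(r"Kernel Module(?: for \S+)?\s+" + _VERSION, proc_text)
    return match.group(1) if match else None


def nvml_library_version(smi_text):
    """Extract the user-space NVML version from the nvidia-smi mismatch message."""
    match = re.search(r"NVML library version:\s*" + _VERSION, smi_text)
    return match.group(1) if match else None


def diagnose(proc_text, smi_text):
    """Summarize whether the kernel module and the NVML library agree."""
    kernel = kernel_module_version(proc_text)
    library = nvml_library_version(smi_text)
    if kernel is None or library is None:
        return "could not read both versions"
    if kernel == library:
        return f"kernel module and library agree on {kernel}"
    return f"kernel module {kernel} vs user-space library {library}"


if __name__ == "__main__":
    with open("/proc/driver/nvidia/version") as handle:
        proc_text = handle.read()
    smi_text = (
        "Failed to initialize NVML: Driver/library version mismatch\n"
        "NVML library version: 550.120\n"
    )
    print(diagnose(proc_text, smi_text))
```

If the kernel module still reports 550.54.15 while the library reports 550.120, the leftover 550.120 user-space packages are the part that needs to be cleaned up.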