Serving LLMs with vLLM on Gaudi

Gaudi2 can serve models through vLLM, so this post walks through how to set that up.

https://github.com/HabanaAI/vllm-fork

Official Habana documentation:

https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/vLLM_Inference.html

Now let's go through how to run inference with vLLM. First, clone vllm-fork.

git clone https://github.com/HabanaAI/vllm-fork

cd vllm-fork

Next, build a Docker image for vLLM. The vllm-fork folder already contains a Dockerfile.hpu, so we use that.

There is something you need to modify at this point. The Dockerfile looks like this:

FROM vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest

COPY ./ /workspace/vllm

WORKDIR /workspace/vllm

RUN pip install --upgrade pip && \
    pip install -v -r requirements-hpu.txt

ENV no_proxy=localhost,127.0.0.1
ENV PT_HPU_ENABLE_LAZY_COLLECTIVES=true

RUN VLLM_TARGET_DEVICE=hpu python3 setup.py install

# install development dependencies (for testing)
RUN python3 -m pip install -e tests/vllm_test_utils

WORKDIR /workspace/

RUN ln -s /workspace/vllm/tests && ln -s /workspace/vllm/examples && ln -s /workspace/vllm/benchmarks

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

Inside the Dockerfile, requirements-hpu.txt is installed, and it looks like this:

# Common dependencies
-r requirements-common.txt

# Dependencies for HPU code
ray
triton==3.1.0
pandas
tabulate
setuptools>=61
setuptools-scm>=8
vllm-hpu-extension @ git+https://github.com/HabanaAI/vllm-hpu-extension.git@21284c9

This is how requirements-common.txt gets installed. However, installing it as-is produces the error below.

debug: pynvml.NVMLError_LibraryNotFound: NVML Shared Library Not Found

This happens because the latest version of the Hugging Face transformers library is not currently available for HPU. So requirements-common.txt needs to be modified, with the transformers line pinned as shown here:

psutil
sentencepiece  # Required for LLaMA tokenizer.
numpy < 2.0.0
requests >= 2.26.0
tqdm
blake3
py-cpuinfo
transformers >= 4.48.2, < 4.49  # Required for Bamba model and Transformers backend. Pin the version like this.
tokenizers >= 0.19.1  # Required for Llama 3.
protobuf # Required by LlamaTokenizer.
fastapi[standard] >= 0.107.0, < 0.113.0; python_version < '3.9'
fastapi[standard]  >= 0.107.0, != 0.113.*, != 0.114.0; python_version >= '3.9'
aiohttp
openai >= 1.52.0 # Ensure modern openai package (ensure types module present and max_completion_tokens field support)
pydantic >= 2.9
prometheus_client >= 0.18.0
pillow  # Required for image processing
prometheus-fastapi-instrumentator >= 7.0.0
tiktoken >= 0.6.0  # Required for DBRX tokenizer
lm-format-enforcer >= 0.10.9, < 0.11
outlines == 0.1.11
lark == 1.2.2 
xgrammar == 0.1.11; platform_machine == "x86_64"
typing_extensions >= 4.10
filelock >= 3.16.1 # need to contain https://github.com/tox-dev/filelock/pull/317
partial-json-parser # used for parsing partial JSON outputs
pyzmq
msgspec
gguf == 0.10.0
importlib_metadata
mistral_common[opencv] >= 1.5.0
pyyaml
six>=1.16.0; python_version > '3.11' # transitive dependency of pandas that needs to be the latest version for python 3.12
setuptools>=74.1.1; python_version > '3.11' # Setuptools is used by triton, we need to ensure a modern version is installed for 3.12+ so that it does not try to import distutils, which was removed in 3.12
einops # Required for Qwen2-VL.
compressed-tensors == 0.9.1 # required for compressed-tensors
depyf==0.18.0 # required for profiling and debugging with compilation config
cloudpickle # allows pickling lambda functions in model_executor/models/registry.py

* This issue appears to have been fixed in a recent commit.

After making this change, build the Docker image as follows.
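If you want to confirm that the image ended up with a transformers version inside the pinned range, a small check like the one below can be run inside the container. This is only a sketch: the helper names `parse_version` and `in_pinned_range` are made up for illustration, and the version parsing is deliberately simple (it keeps the leading numeric components and drops suffixes such as `.dev0`) instead of implementing full PEP 440 rules.

```python
from importlib import metadata


def parse_version(text):
    """Turn '4.48.3' into (4, 48, 3), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break
        parts.append(int(digits))
    return tuple(parts)


def in_pinned_range(version, lower=(4, 48, 2), upper=(4, 49)):
    """Check the pin from requirements-common.txt: >= 4.48.2 and < 4.49."""
    return lower <= parse_version(version) < upper


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "inside" if in_pinned_range(installed) else "outside"
        print(f"transformers {installed} is {status} the pinned range")
```

If the check reports a version outside the range, the pin in requirements-common.txt was probably not picked up when the image was built.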

docker build -f Dockerfile.hpu -t vllm-fork:hpu .

Once the build succeeds, start the container:

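# Replace <your port> with the port you want to expose.
# Note: with --net=host the container shares the host network, so Docker ignores -p.
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
  --cap-add=sys_nice --net=host --ipc=host \
  --entrypoint /bin/bash \
  -p <your port>:<your port> \
  vllm-fork:hpu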

Running it this way drops you into a shell inside the Docker container.

From there, run the following command to start serving with vLLM.

python -m vllm.entrypoints.openai.api_server --model=Qwen/QwQ-32B --tensor-parallel-size 8 --port 12251
# Set tensor-parallel-size to at most the number of HPUs currently available.
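Once the server is up, it can be queried over HTTP because it exposes an OpenAI-compatible API. The sketch below sends a request to the `/v1/completions` endpoint using only the Python standard library. It assumes the server is reachable on `localhost` at port 12251 and serves `Qwen/QwQ-32B`, as in the command above. The function names `build_completion_request` and `complete` are just illustrative helpers, not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(prompt, model="Qwen/QwQ-32B",
                             host="localhost", port=12251, max_tokens=64):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    url = f"http://{host}:{port}/v1/completions"
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Requires the vLLM server from the command above to be running.
    print(complete("Explain what an HPU is in one sentence."))
```

The host, port, and model defaults are assumptions taken from the launch command, so change them to match your own setup. The `model` value has to match the name passed to `--model` when the server was started.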
