[Hands-on] Trial and Error While Setting Up a Model Fine-tuning Environment (Linux Ubuntu)

Topic. This post shares the trial and error encountered while setting up an environment for fine-tuning.

* Deep learning environment setup: trial and error

1. Case1

UnboundLocalError: local variable 'sentencepiece_model_pb2' referenced before assignment

If the error above occurs, install the protobuf package.

$ conda install protobuf
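
Before rerunning the training script, it can help to confirm that the package is now visible to the interpreter. The following is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical, and the list of modules checked is an assumption based on the error message (google.protobuf is the import name of the protobuf package).

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # A missing parent package (e.g. `google`) raises instead of returning None.
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed list of modules the tokenizer needs; adjust to your setup.
    print(missing_modules(["google.protobuf", "sentencepiece"]))
```

An empty list means both modules can be found; any name that is printed still needs to be installed.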

2. Case2

NotImplementedError: Loading a dataset cached in a LocalFileSystem is not supported.

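This error occurs when the conda env was installed under the root account. It is also commonly reported when the installed datasets and fsspec versions are incompatible.

Reinstall datasets so that it can be used from the user's environment. Note that -U is short for --upgrade: it upgrades the package to the latest version and does not by itself select a per-user install (that is --user).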

$ pip install -U datasets
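
To see whether the environment really is owned by another account, a quick check like the one below may be useful. It is only a sketch: the function name env_write_report is hypothetical, and it looks at the interpreter's site-packages directory, which is an assumption about where datasets was installed.

```python
import os
import sysconfig


def env_write_report():
    """Describe who is running Python and whether site-packages is writable."""
    site_packages = sysconfig.get_paths()["purelib"]
    geteuid = getattr(os, "geteuid", None)  # not available on Windows
    return {
        "is_root": geteuid is not None and geteuid() == 0,
        "site_packages": site_packages,
        "writable": os.access(site_packages, os.W_OK),
    }


if __name__ == "__main__":
    for key, value in env_write_report().items():
        print(f"{key}: {value}")
```

If writable is False for a normal user, packages in that environment were most likely installed by a different account such as root.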

3. Case3

If you do not pass safe_serialization=False when saving the model,

checkpoints saved during training and the final save at the end of training

produce adapter_model.safetensors instead of adapter_model.bin.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_pretrained_name,
    load_in_4bit=True,
    device_map="auto",
    use_safetensors=False,  # <-- load the .bin weights instead of .safetensors
)

args=transformers.TrainingArguments(
    per_device_train_batch_size=micro_batch_size,
    ...
    learning_rate=lr,
    ...
    save_safetensors=False,  # <-- write checkpoints as .bin instead of .safetensors
    ...
)

model.save_pretrained(output_dir, safe_serialization=False)
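
To confirm which format a run actually wrote, you can list the adapter weight files in the output directory. This is a small sketch; the helper name adapter_files_present is hypothetical, and the two file names are the ones mentioned above.

```python
from pathlib import Path

# File names taken from the text above.
ADAPTER_FILES = ("adapter_model.bin", "adapter_model.safetensors")


def adapter_files_present(output_dir):
    """Return the adapter weight file names that exist directly under output_dir."""
    root = Path(output_dir)
    return [name for name in ADAPTER_FILES if (root / name).is_file()]


if __name__ == "__main__":
    # "output" is a placeholder; pass the same directory you gave to save_pretrained.
    print(adapter_files_present("output"))
```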

# Related issue

Sometimes the safetensors file cannot be loaded properly.

This happens because the adapter_model file is corrupted (only 443 bytes).

Refer to the post below to resolve it.

https://x2bee.tistory.com/393

[Troubleshooting] LoRA fine-tuning > adapter_model.bin is smaller than 1 KB

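
A corrupted adapter file of this kind can be spotted by its size before you try to load it. The sketch below assumes a threshold of 1024 bytes, chosen only because the cases above describe files of 443 bytes and "under 1 KB"; the function name looks_truncated is hypothetical.

```python
from pathlib import Path


def looks_truncated(path, min_bytes=1024):
    """Return True if the file at `path` is smaller than `min_bytes`."""
    return Path(path).stat().st_size < min_bytes


if __name__ == "__main__":
    # Placeholder path; point it at your own adapter file.
    candidate = Path("output/adapter_model.safetensors")
    if candidate.is_file():
        print(candidate, "looks truncated:", looks_truncated(candidate))
```

A real LoRA adapter is normally far larger than this threshold, so a True result is a strong hint that the weights were not written.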

4. Case4

Exception: cublasLt ran into an error

This occurs when there is not enough memory to load the model and train it in 8-bit.

model = LlamaForCausalLM.from_pretrained(
    model_pretrained_name,
    load_in_4bit=True, # 8bit -> 4bit
    device_map="auto",
    use_safetensors=use_safetensors
)

Change the code to load the model in 4-bit, as shown above.
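
The effect of switching precision can be estimated with simple arithmetic. The sketch below counts the weights only; activations, optimizer state, and quantization overhead are ignored, and the 7B parameter count is a hypothetical example rather than a figure from this post.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


if __name__ == "__main__":
    n = 7_000_000_000  # hypothetical 7B-parameter model
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit: {weight_memory_gib(n, bits):.2f} GiB")
```

For this hypothetical model the weights take roughly 6.5 GiB in 8-bit and about half of that in 4-bit, which is why the 4-bit load can fit where the 8-bit load fails.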

5. AutoTokenizer import error

If an error like the one below occurs while loading the tokenizer, reinstall transformers from git and try again.

ValueError: Tokenizer class CodeLlamaTokenizer does not exist or is not currently imported.

$ pip install git+https://github.com/huggingface/transformers.git@main
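
After reinstalling, you can check whether the installed transformers build exposes the tokenizer class named in the error. This is a minimal sketch; the helper name has_transformers_class is hypothetical, and it simply returns False when transformers is missing or the lookup fails.

```python
import importlib


def has_transformers_class(class_name):
    """Return True if the installed transformers package exposes `class_name`."""
    try:
        transformers = importlib.import_module("transformers")
        return hasattr(transformers, class_name)
    except Exception:
        # transformers is not installed, or resolving the class failed.
        return False


if __name__ == "__main__":
    print(has_transformers_class("CodeLlamaTokenizer"))
```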

6. MissingCUDAException

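If the error below occurs while installing DeepSpeed with pip

or while running DeepSpeed training, the cause is that cuda-compiler (the package that provides nvcc) is not installed.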

MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
deepspeed.ops.op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)

Install it with conda.

$ conda install -c nvidia cuda-compiler
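
To verify that a CUDA toolkit can now be located, you can approximate the lookup yourself. The sketch below checks the CUDA_HOME and CUDA_PATH environment variables and then the location of nvcc; it is an approximation written for this post, not DeepSpeed's actual implementation, and the function name find_cuda_home is hypothetical.

```python
import os
import shutil


def find_cuda_home(env=None, which=shutil.which):
    """Guess the CUDA toolkit root: environment variables first, then the nvcc location."""
    env = os.environ if env is None else env
    for var in ("CUDA_HOME", "CUDA_PATH"):
        if env.get(var):
            return env[var]
    nvcc = which("nvcc")
    if nvcc:
        # nvcc normally lives in <cuda_home>/bin/nvcc.
        return os.path.dirname(os.path.dirname(nvcc))
    return None


if __name__ == "__main__":
    print(find_cuda_home() or "no CUDA toolkit found")
```

The env and which parameters exist only so the function can be exercised without a real CUDA install.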

7. triton installation error

If an error like the one below occurs, it is because triton was installed as root.

RuntimeError: Failed to import transformers.integrations.bitsandbytes because of the following error (look up to see its traceback):
cannot import name 'get_env_vars' from 'triton._C.libtriton.triton' (unknown location)

Reinstall it with the -U (--upgrade) option.

$ pip install -U triton
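
One way to check whether a package was installed by root is to look at the owner of its files. The sketch below is standard library only; the function name package_owner_uid is hypothetical, and a uid of 0 is assumed to mean root, as on Linux.

```python
import importlib.util
import os


def package_owner_uid(module_name):
    """Return the uid that owns the module's file, or None if it cannot be located."""
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        spec = None
    if spec is None or not spec.origin or not os.path.exists(spec.origin):
        return None
    return os.stat(spec.origin).st_uid


if __name__ == "__main__":
    uid = package_owner_uid("triton")
    print("triton not found" if uid is None else f"owner uid: {uid} (0 means root)")
```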

8. LoRA model merge Error

If the following error occurs when merging the model after DeepSpeed training has finished:

Error an illegal memory access was encountered at line 88 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/ops.c

open the DeepSpeed config file (ds_config.json)

and lower the stage under zero_optimization (e.g. 3 -> 2).
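
The change described above can also be applied programmatically. The following is a minimal sketch that works on the config as a dict; the function name with_zero_stage is hypothetical, and the one-key config in the example is an assumption, since a real ds_config.json holds many more settings.

```python
import copy
import json


def with_zero_stage(config, stage):
    """Return a copy of a DeepSpeed config dict with zero_optimization.stage set to `stage`."""
    updated = copy.deepcopy(config)
    updated.setdefault("zero_optimization", {})["stage"] = stage
    return updated


if __name__ == "__main__":
    # Hypothetical minimal config, used only for illustration.
    config = {"zero_optimization": {"stage": 3}}
    print(json.dumps(with_zero_stage(config, 2), indent=2))
```

In practice you would load the file with json.load, pass the dict through this function, and write the result back with json.dump.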

--- This post will be updated as further trial-and-error cases are found.
