Understanding the Structure of the Hugging Face Trainer for Custom Model Training

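Hugging Face's Trainer is an extremely convenient tool, but its code runs to more than 5,000 lines. That is too large to dig through again every time, so I wrote this post as my own summary.

Hugging Face provides several very convenient tools.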

1. AutoModel Class

The model architectures themselves are bundled inside the Transformers library.

Calling AutoModel.from_pretrained(repo_id)
reads the model.safetensors and config.json files and loads the parameters into the matching model class.

model = ModernBertModel.from_pretrained(save_dir)

In this way, if a matching model class exists, the appropriate weights are fetched and mapped onto it.
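As a rough mental model of that lookup, here is a toy sketch. The registry, the Toy* classes, and toy_from_pretrained are all made up for illustration and are not the actual Transformers internals. The only idea taken from the text is that the config decides which class receives the saved weights.

```python
import json


# Toy stand-ins for library model classes (hypothetical, not transformers code).
class ToyBertModel:
    def __init__(self, config):
        self.config = config
        self.weights = {}


class ToyModernBertModel(ToyBertModel):
    pass


# Hypothetical registry: the "model_type" field of config.json -> model class.
MODEL_REGISTRY = {"bert": ToyBertModel, "modernbert": ToyModernBertModel}


def toy_from_pretrained(config_json: str, state_dict: dict):
    """Pick the class named by the config, then map the saved weights onto it."""
    config = json.loads(config_json)
    model_cls = MODEL_REGISTRY[config["model_type"]]
    model = model_cls(config)
    model.weights.update(state_dict)  # stands in for reading model.safetensors
    return model


model = toy_from_pretrained(
    '{"model_type": "modernbert", "hidden_size": 768}',
    {"embeddings.weight": [0.1, 0.2]},
)
print(type(model).__name__)
```

If the config names a type that is not in the registry, the lookup fails, which is the toy version of the problem discussed next: the library can only load what it has a class for.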

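The trouble starts when we want to build, on our own terms, a model architecture that the library does not provide.

Implementing the model from scratch would be ideal, but doing it properly makes the code far too large.

That is why we use AutoModel and reuse only the shared structure (the Transformer embedding and attention layers).
(In practice, most Hugging Face models design their task-specific models this way.)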

Let's take an example.
In ModernBERT, the ModernBertModel class returns the last hidden state.

Suppose the tokenizer splits the word "안녕하세요"
into **inputs** with a total length of 5.

Passing this through ModernBertModel's forward returns a tensor of shape [batch, length, hidden_size].
If we assume a batch of one,

a [1, 5, 768] tensor is returned.

This is the output of the huge Transformer model,
and tuning the model so that this output comes out properly is exactly what pre-training is.

For a language model, everything up to this output is the same (although further training can adjust it slightly),
and the specific task is determined by how this output is transformed afterwards.

This is exactly what we want to do.

        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            sliding_window_mask=sliding_window_mask,
            position_ids=position_ids,
            indices=indices,
            cu_seqlens=cu_seqlens,
            max_seqlen=max_seqlen,
            batch_size=batch_size,
            seq_len=seq_len,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        last_hidden_state = outputs[0]

        if self.config.classifier_pooling == "cls":
            last_hidden_state = last_hidden_state[:, 0]
        elif self.config.classifier_pooling == "mean":
            last_hidden_state = (last_hidden_state * attention_mask.unsqueeze(-1)).sum(dim=1) / attention_mask.sum(
                dim=1, keepdim=True
            )

        pooled_output = self.head(last_hidden_state)
        pooled_output = self.drop(pooled_output)
        logits = self.classifier(pooled_output)

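Take the code above as an example.

It comes from the ModernBertForSequenceClassification class,
which takes the ModernBertModel output internally and performs a new forward.
(In the code above, self.model calls the forward function of ModernBertModel.)

Using the hidden-state output we just saw, this model first performs pooling,

then applies the head layer (a simple dense network) and the drop layer (dropout) to produce a 768-dimensional sentence embedding.

The input sentence is embedded first, and the embedded value is then passed through the classifier layer.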
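To make the pooling step concrete, here is a minimal sketch of the two pooling modes on a tiny tensor. The pool function and the toy values are my own illustration, not the library code; it assumes PyTorch is installed.

```python
import torch


def pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    """Reduce [batch, length, hidden] to [batch, hidden]."""
    if mode == "cls":
        # Take the vector of the first token of each sequence.
        return last_hidden_state[:, 0]
    # Mean over real tokens only: padded positions are zeroed out by the mask.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [batch, length, 1]
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts


# One sequence of 3 tokens with hidden size 4; the last token is padding.
hidden = torch.arange(12, dtype=torch.float32).reshape(1, 3, 4)
mask = torch.tensor([[1, 1, 0]])

cls_vec = pool(hidden, mask, mode="cls")
mean_vec = pool(hidden, mask, mode="mean")
print(cls_vec.shape, mean_vec.shape)
```

Both modes collapse the length dimension, so whatever comes next (head, dropout, classifier) only ever sees one vector per sentence.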

        loss = None
        if labels is not None:
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(logits, labels)

        if not return_dict:
            output = (logits,)
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutput(
            loss=loss,
            logits=logits,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

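In the end, the forward function built this way returns the loss and the custom output (logits).
If forward does not return a loss, you have to rely on the compute_loss_func that the Trainer calls internally,
but implementing it that way is rather inconvenient, so it is much better to handle the loss inside the model's forward.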
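As far as I understand it, when no custom loss function is supplied, the Trainer simply picks the loss out of whatever forward returned. The sketch below mimics that idea in plain Python. extract_loss is a hypothetical helper written for this post, not a Trainer method, and the real code path has more branches.

```python
def extract_loss(outputs):
    """Return the loss from a dict-like or tuple-like forward() result."""
    if isinstance(outputs, dict):
        if "loss" not in outputs:
            raise ValueError("forward() did not return a loss")
        return outputs["loss"]
    # Tuple-style outputs are expected to carry the loss in the first position.
    return outputs[0]


dict_style = {"loss": 0.25, "logits": [0.1, 0.9]}
tuple_style = (0.5, [0.1, 0.9])

print(extract_loss(dict_style), extract_loss(tuple_style))
```

This is why returning the loss as the "loss" field (or as the first tuple element) is the path of least resistance.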

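So the takeaway is simple!

The shared part of the Transformers library becomes the core of the custom module!
Take the output of that part and rewrite the forward function so that it computes the final loss!
The new class defined this way is written as a wrapper that subclasses PreTrainedModel.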

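import torch.nn as nn
from torch import Tensor
from transformers import AutoConfig, ModernBertModel, PreTrainedModel
from transformers.modeling_outputs import SequenceClassifierOutput

class ModernBERTSimCSE(PreTrainedModel):
    def __init__(self, modernbert, config: AutoConfig):
        super().__init__(config)
        self.modernbert = modernbert
        # Projection head applied after mean pooling
        self.pooler = nn.Sequential(
            nn.Linear(config.hidden_size, 768, bias=True),
            nn.Tanh()
        )
        # Loss is the author's own loss class; its definition is not shown in this post
        self.loss_fn = Loss()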

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs):
        # Load the config
        config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
        config.architectures = ["ModernBertModel"]
        # Load the base model
        base_model = ModernBertModel.from_pretrained(pretrained_model_name_or_path, config=config, **kwargs)
        # Initialize ModernBERTSimCSE
        return cls(modernbert=base_model, config=config)

    def forward(self, 
                anchor_input_ids=None, 
                anchor_attention_mask=None, 
                positive_input_ids=None, 
                positive_attention_mask=None, 
                negative_input_ids=None, 
                negative_attention_mask=None, 
                sentence_1_input_ids=None, 
                sentence_1_attention_mask=None, 
                sentence_2_input_ids=None, 
                sentence_2_attention_mask=None, 
                labels=None):
        try:
            if anchor_input_ids is not None:
                # Handle NLI (training) inputs
                anchor_outputs = self.modernbert(
                    input_ids=anchor_input_ids,
                    attention_mask=anchor_attention_mask,
                    return_dict=True,
                )
                positive_outputs = self.modernbert(
                    input_ids=positive_input_ids,
                    attention_mask=positive_attention_mask,
                    return_dict=True,
                )
                negative_outputs = self.modernbert(
                    input_ids=negative_input_ids,
                    attention_mask=negative_attention_mask,
                    return_dict=True,
                )

                # Apply pooling
                anchor_pooled = self.mean_pooling(anchor_outputs.last_hidden_state, anchor_attention_mask)
                anchor_pooled = self.pooler(anchor_pooled)
                positive_pooled = self.mean_pooling(positive_outputs.last_hidden_state, positive_attention_mask)
                positive_pooled = self.pooler(positive_pooled)
                negative_pooled = self.mean_pooling(negative_outputs.last_hidden_state, negative_attention_mask)
                negative_pooled = self.pooler(negative_pooled)

                # Compute the loss
                loss = self.loss_fn.compute_loss(anchor_pooled, positive_pooled, negative_pooled)

                return SequenceClassifierOutput(
                    loss=loss,
                    logits=None,
                    hidden_states=(anchor_pooled, positive_pooled, negative_pooled),
                    attentions=None 
                )

            elif sentence_1_input_ids is not None and sentence_2_input_ids is not None:
                # Handle STS (evaluation) inputs
                sentence_1_outputs = self.modernbert(
                    input_ids=sentence_1_input_ids,
                    attention_mask=sentence_1_attention_mask,
                    return_dict=True,
                )
                sentence_2_outputs = self.modernbert(
                    input_ids=sentence_2_input_ids,
                    attention_mask=sentence_2_attention_mask,
                    return_dict=True,
                )

                # Apply pooling
                sentence_1_pooled = self.mean_pooling(sentence_1_outputs.last_hidden_state, sentence_1_attention_mask)
                sentence_1_pooled = self.pooler(sentence_1_pooled)
                sentence_2_pooled = self.mean_pooling(sentence_2_outputs.last_hidden_state, sentence_2_attention_mask)
                sentence_2_pooled = self.pooler(sentence_2_pooled)

                if labels is not None:
                    # Compute the loss
                    cosine_similarity = nn.CosineSimilarity(dim=-1)
                    scores = cosine_similarity(sentence_1_pooled, sentence_2_pooled)
                    mse_loss = nn.MSELoss()

                    loss = mse_loss(scores, labels)

                    return SequenceClassifierOutput(
                        loss=loss,  # Loss value
                        logits=(sentence_1_pooled, sentence_2_pooled),
                        attentions=None 
                    )

                return None, (sentence_1_pooled, sentence_2_pooled)

            else:
                raise ValueError("Invalid input configuration for forward method.")
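        except Exception as e:
            # Print the error and re-raise it so the caller never receives a silent None
            print(e)
            raise

    def mean_pooling(self, last_hidden_state: Tensor, attention_mask: Tensor) -> Tensor:
        """
        Mean pooling that takes the attention mask into account
        """
        # Cast the mask to the hidden-state dtype so the division stays in floating point
        mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
        weighted_sum = (last_hidden_state * mask).sum(dim=1)
        mask_sum = mask.sum(dim=1)
        return weighted_sum / mask_sum.clamp(min=1e-9)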

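This class was implemented to follow the methodology of SimCSE and NLI-STS training.

As explained earlier, we define a custom class that wraps the base model.

Any added layer that needs gradient tracking must be defined in the __init__ of the wrapper class.

The output may be a tuple,

but a dict-style output is what is expected by default, and the ModelOutput class exists to define that form.

Many output classes inherit from it, so pick one whose fields fit the values you want to return.

The forward function of this class is designed to behave differently depending on which input variables are passed.

One tip: when labels come in as an input, the Trainer recognizes them and handles them on its own.

So they do not need to appear in the return value of the forward function.

With small modifications along these lines, you can build a new model with whatever extra work you want added.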

2. Data Loader

So how do we feed in the data?

We will assume that we use the Dataset class of the datasets library.

The input dataset needs to have the following form.

It must be a custom Dataset class that wraps the Dataset class.
The class must be able to return data in dict form.
A function defined as __getitem__(self, idx) must return the dict-form data prepared above.
A function defined as __len__(self) must report the length of the whole dataset.
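Stripped of everything else, the requirements above amount to the small protocol sketched below. ToyPairDataset and its rows are hypothetical and use hand-written token ids instead of a real tokenizer, so this is only the shape of the thing, not a usable loader.

```python
class ToyPairDataset:
    """Hypothetical dataset: each item is a dict of already-tokenized fields."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        # Total number of examples.
        return len(self.rows)

    def __getitem__(self, idx):
        # Each item is a plain dict keyed by the names forward() expects.
        return self.rows[idx]


rows = [
    {"sentence_1_input_ids": [101, 7, 102], "sentence_2_input_ids": [101, 9, 102], "labels": 0.8},
    {"sentence_1_input_ids": [101, 3, 102], "sentence_2_input_ids": [101, 5, 102], "labels": 0.2},
]
dataset = ToyPairDataset(rows)
print(len(dataset), sorted(dataset[0].keys()))
```

The key names are the part that matters: they have to line up with the keyword arguments of the custom forward, via the collator.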

예시를 보자.

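import os

import datasets
from huggingface_hub import login
# Assumption: the post does not show which Dataset base class is imported; torch.utils.data.Dataset is used here
from torch.utils.data import Dataset
from transformers import PreTrainedTokenizer

class NLIDataLoader(Dataset):
    def __init__(self, 
                 args: DataArguments, 
                 tokenizer: PreTrainedTokenizer):
        """
        Initialize the NLI data loader.

        Args:
            args: Data arguments (DataArguments is the author's own class, not shown in this post).
            tokenizer: Tokenizer used to encode the anchor, positive, and negative sentences.
        """
        try:
            login(token=args.hf_data_token)
        except Exception:
            print("Failed to log in to the Hugging Face Hub")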
        
        self.dataset = datasets.load_dataset(path=args.train_data, data_dir=args.train_data_dir, split=args.train_data_split)
        self.tokenizer = tokenizer
        self.max_seq_length = args.max_len
        
        if args.data_filtering:
            self.dataset = self.dataset.filter(self.is_valid_input)
        
        self.dataset = self.dataset.map(self.preprocess, batched=True, num_proc=os.cpu_count())
        
    def preprocess(self, examples):
        """
        Preprocess the NLI dataset by tokenizing anchor, positive, and negative samples.

        Args:
            examples (dict): A dictionary containing raw dataset examples.

        Returns:
            dict: Tokenized inputs for anchor, positive, and negative samples.
        """
        anchor = self.tokenizer(
            examples["anchor"],
            max_length=self.max_seq_length,
            padding="max_length",
            truncation=True,
        )
        positive = self.tokenizer(
            examples["positive"],
            max_length=self.max_seq_length,
            padding="max_length",
            truncation=True,
        )
        negative = self.tokenizer(
            examples["negative"],
            max_length=self.max_seq_length,
            padding="max_length",
            truncation=True,
        )
        return {
            "anchor_input_ids": anchor["input_ids"],
            "anchor_attention_mask": anchor["attention_mask"],
            "positive_input_ids": positive["input_ids"],
            "positive_attention_mask": positive["attention_mask"],
            "negative_input_ids": negative["input_ids"],
            "negative_attention_mask": negative["attention_mask"],
        }
        
    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        return self.dataset[idx]

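Defining it this way is all that is needed.

Of course, these fields are not the ones a transformer model normally expects as input.

But since we modified the forward function accordingly earlier, serving the data this way works fine.

We are not done yet, though.

The next thing to prepare is the collator.

The collator's job is to take items from the Dataset defined above and return them as a batch to feed to the model.

As mentioned, the input is not in the usual form, so the collator needs a small modification.

In the code below, there is no real need to subclass DataCollatorWithPadding.

Since __call__ is redefined, none of the inherited behavior is used anyway.

Here, the collator returns a differently shaped batch depending on the data format, as follows.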

class SimCSEDataCollator(DataCollatorWithPadding):
    def __call__(self, features):
        """
        Custom collator to handle both NLI and STS inputs.
        """
        # Check the type of input data
        
        if "anchor_input_ids" in features[0]:
            # Handle NLI data
            batch = {
                "anchor_input_ids": [f["anchor_input_ids"] for f in features],
                "anchor_attention_mask": [f["anchor_attention_mask"] for f in features],
                "positive_input_ids": [f["positive_input_ids"] for f in features],
                "positive_attention_mask": [f["positive_attention_mask"] for f in features],
                "negative_input_ids": [f["negative_input_ids"] for f in features],
                "negative_attention_mask": [f["negative_attention_mask"] for f in features],
            }
        elif "sentence_1_input_ids" in features[0]:
            # Handle STS data
            batch = {
                "sentence_1_input_ids": [f["sentence_1_input_ids"] for f in features],
                "sentence_1_attention_mask": [f["sentence_1_attention_mask"] for f in features],
                "sentence_2_input_ids": [f["sentence_2_input_ids"] for f in features],
                "sentence_2_attention_mask": [f["sentence_2_attention_mask"] for f in features],
                "labels": [f["labels"] for f in features],
            }
        else:
            raise ValueError("Features do not match NLI or STS format.")

        # Convert lists to tensors: labels become float, everything else long
        batch = {
            key: torch.tensor(value, dtype=torch.float if "labels" in key else torch.long)
            for key, value in batch.items()
        }
        
        return batch
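Since the inherited padding logic is never used, the collator can in principle be any callable that takes a list of dataset items and returns one batch. Below is a dependency-free sketch of that idea. collate_by_keys is a hypothetical name, and it keeps plain Python lists where the real collator above converts to tensors.

```python
def collate_by_keys(features):
    """Turn a list of per-example dicts into one dict of lists, keyed by field."""
    if not features:
        raise ValueError("empty batch")
    keys = features[0].keys()
    # Every example is assumed to carry the same keys as the first one.
    return {key: [f[key] for f in features] for key in keys}


features = [
    {"sentence_1_input_ids": [101, 7, 102], "labels": 0.8},
    {"sentence_1_input_ids": [101, 3, 102], "labels": 0.2},
]
batch = collate_by_keys(features)
print(batch["labels"])
```

This only works without padding because every example was already padded to max_length during preprocessing. With variable-length inputs, the collator would also have to pad.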

3. Trainer

The last piece is the Trainer.

trainer = Trainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    optimizers=(optimizer,scheduler),
    train_dataset=nli_loader,
    eval_dataset=sts_loader,
    compute_metrics=model.compute_metrics,
    data_collator=SimCSEDataCollator(tokenizer=tokenizer)
)

The following elements of the Trainer must be defined correctly.

model
processing_class(tokenizer)
optimizer
dataset
compute_metrics
data_collator

For model, pass the class that wraps PreTrainedModel, as defined earlier.

As explained above, this class must define a custom forward() function.

processing_class is not strictly required, but it is needed so that the tokenizer is pushed along with the model when push_to_hub is called.

For the optimizer, AdamW is applied by default.

Any object that inherits from the Optimizer class defined in torch.optim appears to be accepted.
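For reference, here is one way the (optimizer, scheduler) pair could be built with PyTorch alone. The stand-in model, the learning rate, and the warmup schedule are arbitrary choices for illustration, not the settings used in this post.

```python
import torch
from torch import nn

model = nn.Linear(4, 2)  # stand-in for the custom model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# Linear warmup over the first 10 steps, then a constant rate (illustrative schedule).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 10)
)

# This tuple is what would be passed as Trainer(optimizers=...).
optimizers = (optimizer, scheduler)
print(isinstance(optimizer, torch.optim.Optimizer))
```

Passing the pair explicitly means the Trainer no longer builds its own default optimizer and scheduler from the training arguments.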

For the dataset, anything in the form described earlier can be passed in.

(A form in which dataset[idx] returns an item of dict type.)

The collator, as described earlier, just needs to turn the items of that dataset into a batch (plus any processing).

The last thing to explain is the compute_metrics argument.

It expects a callback function, which takes the prediction data and returns metrics as a dict.

The flow is as follows.

The model's forward function is called.
The return value is packed into an EvalPrediction.
That EvalPrediction is passed as the input to the compute_metrics function given above.
A dict of metrics is returned.

class EvalPrediction:
    """
    Evaluation output (always contains labels), to be used to compute metrics.

    Parameters:
        predictions (`np.ndarray`): Predictions of the model.
        label_ids (`np.ndarray`): Targets to be matched.
        inputs (`np.ndarray`, *optional*): Input data passed to the model.
        losses (`np.ndarray`, *optional*): Loss values computed during evaluation.
    """

    def __init__(
        self,
        predictions: Union[np.ndarray, Tuple[np.ndarray]],
        label_ids: Union[np.ndarray, Tuple[np.ndarray]],
        inputs: Optional[Union[np.ndarray, Tuple[np.ndarray]]] = None,
        losses: Optional[Union[np.ndarray, Tuple[np.ndarray]]] = None,
    ):
        self.predictions = predictions
        self.label_ids = label_ids
        self.inputs = inputs
        self.losses = losses
        self.elements = (self.predictions, self.label_ids)
        if self.inputs is not None:
            self.elements += (self.inputs,)
        if self.losses is not None:
            self.elements += (self.losses,)

    def __iter__(self):
        return iter(self.elements)

    def __getitem__(self, idx):
        if idx < 0 or idx >= len(self.elements):
            raise IndexError("tuple index out of range")
        return self.elements[idx]

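    # Defined as a method of the custom model class and passed to the Trainer as model.compute_metrics.
    # Needs numpy (np), scipy.stats (pearsonr, spearmanr), and sklearn.metrics.pairwise
    # (paired_cosine_distances, paired_manhattan_distances, paired_euclidean_distances).
    def compute_metrics(self, eval_pred):
        predictions, labels = eval_pred

        # Split the two sentence embeddings
        sentence_1_embeddings, sentence_2_embeddings = predictions[0], predictions[1]

        # The predictions already arrive as NumPy arrays, so no conversion is needed
        embeddings1 = sentence_1_embeddings
        embeddings2 = sentence_2_embeddings
        labels = labels.flatten()

        # Compute distances and similarities
        cosine_scores = 1 - paired_cosine_distances(embeddings1, embeddings2)
        manhattan_distances = -paired_manhattan_distances(embeddings1, embeddings2)
        euclidean_distances = -paired_euclidean_distances(embeddings1, embeddings2)
        dot_products = [np.dot(emb1, emb2) for emb1, emb2 in zip(embeddings1, embeddings2)]

        # Compute Pearson and Spearman correlations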
        
        print(cosine_scores.shape)
        print(labels.shape)
            
        eval_pearson_cosine, _ = pearsonr(labels, cosine_scores)
        eval_spearman_cosine, _ = spearmanr(labels, cosine_scores)

        eval_pearson_manhattan, _ = pearsonr(labels, manhattan_distances)
        eval_spearman_manhattan, _ = spearmanr(labels, manhattan_distances)

        eval_pearson_euclidean, _ = pearsonr(labels, euclidean_distances)
        eval_spearman_euclidean, _ = spearmanr(labels, euclidean_distances)

        eval_pearson_dot, _ = pearsonr(labels, dot_products)
        eval_spearman_dot, _ = spearmanr(labels, dot_products)

        # Return the metrics dict
        return {
            "pearson_cosine": eval_pearson_cosine,
            "spearman_cosine": eval_spearman_cosine,
            "pearson_manhattan": eval_pearson_manhattan,
            "spearman_manhattan": eval_spearman_manhattan,
            "pearson_euclidean": eval_pearson_euclidean,
            "spearman_euclidean": eval_spearman_euclidean,
            "pearson_dot": eval_pearson_dot,
            "spearman_dot": eval_spearman_dot,
        }

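What we need to look at here are the arguments of the EvalPrediction class.

Basically, if the forward function returns a proper dict value, everything is handled automatically.

Here, the model returns a pair of logits (more precisely, the pooled embedding vectors),
packed into a tuple and returned as logits.

That value becomes the predictions field of the EvalPrediction class,

and compute_metrics receives it together with the labels and processes them.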
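The hand-off described above can be tried out without a Trainer at all. In the sketch below, a plain tuple stands in for EvalPrediction (which unpacks the same way), and toy_compute_metrics with its made-up embeddings is a cut-down illustration that only computes the Pearson correlation of the cosine scores. It assumes NumPy is installed.

```python
import numpy as np


def toy_compute_metrics(eval_pred):
    """Unpack (predictions, labels) and return a dict of metrics."""
    predictions, labels = eval_pred
    emb1, emb2 = predictions  # the two pooled embeddings returned as logits
    # Row-wise cosine similarity between the paired embeddings.
    norms = np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
    cosine = (emb1 * emb2).sum(axis=1) / norms
    pearson = float(np.corrcoef(labels.flatten(), cosine)[0, 1])
    return {"pearson_cosine": pearson}


emb1 = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
emb2 = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
labels = np.array([1.0, 0.7, 0.0])

metrics = toy_compute_metrics(((emb1, emb2), labels))
print(metrics)
```

Because the function only depends on the (predictions, labels) pair, it is easy to test like this before wiring it into the Trainer.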

The key points are as follows.

Write the custom model as a wrapper that subclasses the PreTrainedModel class.
Design the custom model's forward function to return its result, including the loss value, through an output class.
The custom loss should be computed there.
Subclass the Dataset class and prepare data that can be fetched as dataset[idx].
Each fetched value must be a dict.
(By default, the model expects values such as "input_ids", "attention_mask", and "labels".
If these are customized, the collator and the model's forward must handle them accordingly.)
In the evaluation step, the compute_metrics function passed to the Trainer is called.
It receives the labels together with the logits returned by the forward function.
Adjust these as needed.
