[Hands-on] 3-2 AutoTokenizer

* AutoTokenizer

Almost every NLP (natural language processing) task begins with a tokenizer.

A tokenizer converts input text into a numeric format that the model can process.
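To make the idea of "text in, numbers out" concrete, here is a minimal sketch of a toy tokenizer. The vocabulary `TOY_VOCAB`, its IDs, and the helper `toy_encode` are all made up for illustration; real tokenizers such as BERT's use much larger vocabularies and subword splitting.

```python
from typing import List

# Hypothetical toy vocabulary: these words and IDs are invented for this sketch.
TOY_VOCAB = {"[UNK]": 0, "in": 1, "a": 2, "hole": 3, "the": 4, "ground": 5}


def toy_encode(text: str) -> List[int]:
    """Lowercase the text, split on whitespace, and look up each word's ID."""
    words = text.lower().split()
    # Words missing from the vocabulary fall back to the [UNK] ID.
    return [TOY_VOCAB.get(word, TOY_VOCAB["[UNK]"]) for word in words]


ids = toy_encode("In a hole in the ground")
print(ids)  # [1, 2, 3, 1, 4, 5]
```

The model never sees the words themselves, only the list of integer IDs.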

You can load a tokenizer automatically with AutoTokenizer.from_pretrained(), which selects the tokenizer class that matches the given checkpoint name.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

After loading the tokenizer, tokenize (encode) the input as shown below.

sequence = "In a hole in the ground there lived a hobbit."
print(tokenizer(sequence))

The tokenization result is as follows.

{
 'input_ids': [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045, 2973, 1037, 7570, 10322, 4183, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
}
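As a quick sanity check on this result, the sketch below copies the printed values into a plain dictionary and confirms that the three lists line up position by position. It does not call the tokenizer; the variable names `encoding`, `lengths`, and `rows` are chosen only for this illustration.

```python
# Values copied from the printed tokenizer output.
encoding = {
    "input_ids": [101, 1999, 1037, 4920, 1999, 1996, 2598, 2045,
                  2973, 1037, 7570, 10322, 4183, 1012, 102],
    "token_type_ids": [0] * 15,
    "attention_mask": [1] * 15,
}

# Every field holds exactly one entry per token position.
lengths = {name: len(values) for name, values in encoding.items()}
print(lengths)  # {'input_ids': 15, 'token_type_ids': 15, 'attention_mask': 15}

# Walk the three lists in parallel, one row per token position.
rows = list(zip(encoding["input_ids"],
                encoding["token_type_ids"],
                encoding["attention_mask"]))
for position, (token_id, type_id, mask) in enumerate(rows):
    print(position, token_id, type_id, mask)
```

The sentence has fewer than 15 words, yet there are 15 positions: in the bert-base-uncased vocabulary, IDs 101 and 102 are the special [CLS] and [SEP] tokens added at the start and end, and a word may be split into several subword tokens.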

Each element is described below.