Self-supervised Pre-training Models
<GPT-1>
- It introduces special tokens, such as <S>, <E>, and $ (delimiter), to achieve effective transfer learning during fine-tuning
- For a downstream task, the output layer used for next-word prediction is removed, and the per-word encoding vectors from the remaining Transformer output are reused
- The newly added final layer starts from random initialization and must be trained sufficiently, while the pre-trained layers are kept from changing much (e.g., with a smaller learning rate), so that the knowledge acquired in pre-training can be fully exploited for the task (see the sketch below)
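A minimal PyTorch sketch of this recipe (an assumed setup, not GPT-1's actual code): a stand-in Transformer encoder plays the role of the pre-trained model, a freshly initialized linear head is added for a hypothetical binary classification task, and the pre-trained parameters get a much smaller learning rate than the new head.

```python
import torch
import torch.nn as nn
from torch.optim import AdamW

# Stand-in for a pre-trained Transformer backbone (hyperparameters are illustrative).
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)

# Newly added task head: randomly initialized, trained from scratch.
task_head = nn.Linear(768, 2)  # e.g., a binary classification task

# Pre-trained layers get a much smaller learning rate than the fresh head,
# so the knowledge from pre-training is preserved while the head adapts quickly.
optimizer = AdamW([
    {"params": backbone.parameters(), "lr": 1e-5},
    {"params": task_head.parameters(), "lr": 1e-3},
])

# One dummy training step over already-embedded tokens.
x = torch.randn(8, 32, 768)            # (batch, seq_len, hidden)
labels = torch.randint(0, 2, (8,))
encodings = backbone(x)                # per-word encoding vectors
logits = task_head(encodings[:, -1])   # use the last token's vector (e.g., the <E> token)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
optimizer.step()
```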
<BERT>
- Language models only use left context or right context, but language understanding is bidirectional
Masked Language Model (MLM)
- The [MASK] token is never seen during fine-tuning
- Solution: choose 15% of the words to predict, but don't replace them with [MASK] 100% of the time; instead, 80% get [MASK], 10% get a random word, and 10% keep the same word (see the sketch after this list)
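A small Python sketch of this masking scheme (the [MASK] id and vocabulary size are assumed, BERT-base-like values; this is not BERT's actual preprocessing code):

```python
import random

MASK_ID, VOCAB_SIZE = 103, 30522   # assumed [MASK] id and vocab size

def mask_tokens(token_ids, mask_prob=0.15):
    """Corrupt 15% of positions; labels are -100 where no prediction is required."""
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:       # select 15% of the words to predict
            labels[i] = tok                   # the model must recover the original token
            r = random.random()
            if r < 0.8:                       # 80%: replace with [MASK]
                corrupted[i] = MASK_ID
            elif r < 0.9:                     # 10%: replace with a random word
                corrupted[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the same word, easing the pre-train/fine-tune mismatch
    return corrupted, labels

print(mask_tokens([2023, 2003, 1037, 7099, 6251]))
```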
Next Sentence Prediction
- Position embeddings are added in sequential order, while segment embeddings encode the sentence-level index, i.e., which sentence each token belongs to (see the encoding example below)
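A short example with HuggingFace's BertTokenizer and BertForNextSentencePrediction (standard Hub checkpoint names) showing both inputs: token_type_ids carry the segment (sentence) index, while position embeddings are added internally in sequential order.

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

sent_a = "The man went to the store."
sent_b = "He bought a gallon of milk."

# token_type_ids: 0 for sentence A tokens, 1 for sentence B tokens (segment embedding input)
inputs = tokenizer(sent_a, sent_b, return_tensors="pt")
print(inputs["token_type_ids"])

with torch.no_grad():
    logits = model(**inputs).logits   # index 0: B follows A, index 1: B is random
print(logits.softmax(dim=-1))
```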
- Machine Reading Comprehension (MRC), Question Answering: reading-comprehension-based question answering (see the QA example below)
- Evaluated on SQuAD 1.1, SQuAD 2.0, and SWAG
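A minimal extractive QA example using the Transformers pipeline API with a SQuAD-fine-tuned checkpoint (the checkpoint name is an assumption, a commonly used Hub model): the model predicts a start/end answer span inside the given context.

```python
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = ("BERT is pre-trained with a masked language model objective and "
           "next sentence prediction, and is then fine-tuned on tasks such as SQuAD.")
question = "What objectives is BERT pre-trained with?"

# Returns the predicted span and its score, e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
print(qa(question=question, context=context))
```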
<GPT-2>
Just a really big transformer LM
Quite a bit of effort went into making sure the dataset is of good quality
The language model can perform downstream tasks in a zero-shot setting (see the sketch below)
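A tiny zero-shot sketch with the gpt2 checkpoint on the Hub: the task is specified only through the prompt (the "TL;DR:" summarization trick discussed in the GPT-2 paper), with no fine-tuning. The prompt text itself is made up for illustration.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# No task-specific training: the "TL;DR:" suffix alone asks the model to summarize.
prompt = ("The Transformer architecture replaced recurrence with self-attention, "
          "which made it practical to train much larger language models. TL;DR:")
out = generator(prompt, max_new_tokens=30, do_sample=False)
print(out[0]["generated_text"])
```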
Various other models
<GPT-3>
Tasks and examples are presented as part of the input text (the prompt) without modifying the model (see the few-shot prompt example below)
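An illustrative few-shot prompt in the style of the GPT-3 paper's translation example: the task description and demonstrations are just text, and the model is expected to continue the pattern with no parameter update.

```python
# In-context (few-shot) learning: the "training examples" live inside the prompt,
# and the model is only asked to complete the text; its weights are never updated.
prompt = """Translate English to French:

sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""

# A GPT-3-class model served behind an API would be asked to continue this prompt
# and should produce "fromage" by following the demonstrated pattern.
print(prompt)
```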
<ALBERT: A Lite BERT for Self-supervised Learning of Language Representations>
- Factorized Embedding Parameterization (see the parameter-count sketch after this list)
- Cross-layer Parameter Sharing
- (For Performance) Sentence Order Prediction
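A quick parameter-count sketch of factorized embedding parameterization, assuming BERT-like sizes (V = 30,000 vocabulary, H = 768 hidden) and a small embedding dimension E = 128: the V x H embedding matrix becomes V x E plus E x H.

```python
# Factorized embedding parameterization: V x H  ->  V x E  +  E x H
V, H, E = 30_000, 768, 128     # assumed vocabulary, hidden, and embedding sizes

bert_style = V * H             # one big V x H embedding matrix
albert_style = V * E + E * H   # low-rank factorization through dimension E

print(f"BERT-style embedding params  : {bert_style:,}")    # 23,040,000
print(f"ALBERT-style embedding params: {albert_style:,}")  # 3,938,304
```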
<ELECTRA>
<Light-weight Models>
<Fusing Knowledge Graph into Language Model>
<Hands-on Practice>
Using HuggingFace's Transformers library (see the minimal example below)
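A minimal starting point with the library (bert-base-uncased is a standard Hub checkpoint): load a tokenizer and model, then read out the per-token encoding vectors.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Self-supervised pre-training is powerful.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-token encoding vectors from the last layer: (batch, seq_len, hidden_size)
print(outputs.last_hidden_state.shape)
```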