- Why does it work? Transfer learning in computer vision shows that low-level features are general, task-agnostic, and shareable, while high-level features are task-dependent [1]
- History: statistical LM -> NNLM -> word2vec (fixed word embeddings, suffers from the polysemy problem) -> ELMo (dynamic word embeddings) -> ULMFiT -> GPT -> BERT. Refer to reviews 1, 2, 3, 4.1 and 4.2 for their differences
- Two stages: unsupervised pre-training on a large corpus, then supervised transfer to downstream tasks, either feature-based (e.g., ELMo, BERT) or by fine-tuning (e.g., GPT, BERT)
- The feature-based strategy uses a task-specific architecture that includes the pre-trained representations as additional features
- The fine-tuning strategy introduces minimal task-specific parameters and trains on downstream tasks by fine-tuning the pre-trained parameters
- Hugging Face Transformers keeps state-of-the-art pre-trained language models, such as Longformer and Reformer for long documents
- Industrial experience for Chinese
- ELECTRA is good for sequence labelling tasks (e.g., correction) but worse for text classification; XLNet is worse on all NLP tasks
- TextBrewer is a model distillation toolkit
- Model: 2 layer forward LSTM + 2 layer backward LSTM
- Pre-training objective: bidirectional LM (i.e., concatenation of independently trained left-to-right and right-to-left LMs)
- Features: a word's representation is a linear combination of all hidden states of the biLM
- ELMo is a deep contextualized word representation; it overcomes the polysemy problem of word2vec (which always gives the same fixed vector regardless of context)
- The lower biLSTM layer captures syntax (e.g., POS tagging), and the higher biLSTM layer captures semantics (e.g., word sense disambiguation)
The AllenNLP ELMo page gives a detailed explanation of ELMo, and the AllenNLP GitHub page describes how to use it:
- Get contextual representations using a trained model
- Train a new model based on ELMo
  - Class `allennlp.modules.elmo.Elmo` calculates the weighted representation
- Interactively (see the sketch below)
  - Class `allennlp.commands.elmo.ElmoEmbedder` returns the LSTM hidden states for each word
- With an existing AllenNLP model
  - Edit `bidaf.jsonnet` from `training_config/`
  - Run `allennlp train training_config/bidaf.jsonnet -s output_model_file_path`
  - See the BiDAF example
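A minimal usage sketch of the `ElmoEmbedder` option above, assuming the AllenNLP 0.x API (the class was removed in later releases):

```python
# Interactive ELMo usage: get per-layer contextual vectors for one sentence.
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()  # downloads the default pre-trained biLM weights

# Returns a numpy array of shape (3, num_tokens, 1024):
# one 1024-dim vector per token for each of the 3 biLM layers.
vectors = elmo.embed_sentence(["I", "ate", "an", "apple"])

# A downstream model would typically learn a (softmax-normalized) weighted
# combination of these 3 layers per token, as the notes above describe.
```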
- How to use it? Concatenate the ELMo vectors with context-independent word embeddings instead of replacing them
- Three steps
- step 1: general-domain LM pretraining
- step 2: target task LM fine-tuning
- Discriminative fine-tuning: tune each layer with a different learning rate, since different layers capture different types of information
- Gradual unfreezing: first unfreeze the last layer and fine-tune all unfrozen layers for one epoch, then unfreeze the next lower frozen layer and repeat, until all layers are fine-tuned to convergence at the last iteration
- Slanted triangular learning rates: a short increase followed by a long decay period for the learning rate
- step 3: target task classifier fine-tuning
`X = Concat(h_T, mean_pooling(H), max_pooling(H))`
`fc1_y = ReLU(Dropout(BatchNorm(fc1(X))))`
`output = Softmax(Dropout(BatchNorm(fc2(fc1_y))))`
(this classifier head is sketched in code below)
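A sketch of the step-3 classifier head in PyTorch, assuming `H` holds the LM hidden states with shape `(batch, seq_len, d)`; layer sizes and dropout rate are illustrative:

```python
import torch
import torch.nn as nn

class ConcatPoolClassifier(nn.Module):
    """Concat pooling (last, mean, max) followed by two FC blocks, as above."""
    def __init__(self, d: int, hidden: int, n_classes: int, p: float = 0.1):
        super().__init__()
        self.fc1 = nn.Linear(3 * d, hidden)   # input is [h_T; mean(H); max(H)]
        self.bn1 = nn.BatchNorm1d(hidden)
        self.fc2 = nn.Linear(hidden, n_classes)
        self.bn2 = nn.BatchNorm1d(n_classes)
        self.drop = nn.Dropout(p)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        h_T = H[:, -1]                                            # last time step
        x = torch.cat([h_T, H.mean(dim=1), H.max(dim=1).values], dim=1)
        x = torch.relu(self.drop(self.bn1(self.fc1(x))))          # ReLU(Dropout(BN(fc1(X))))
        return torch.softmax(self.drop(self.bn2(self.fc2(x))), dim=-1)
```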
- The official page provides code and models for GPT
- Model: multi-layer bidirectional Transformer encoder
- Pre-training objective
- Masked language model (MLM, inspired by the Cloze task, prevent each token 'see itself' in multi-layer bidirectional context), for one sentence
- Next sentence prediction (NSP), for two sentences
- Input: token embeddings + segment embeddings + position embeddings. The first token is `[CLS]`; in sentence pairs, the two sentences are separated by `[SEP]` (see the input sketch below)
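A sketch of how the three input embeddings are combined, with illustrative sizes and token ids (not taken from a real vocabulary):

```python
import torch
import torch.nn as nn

vocab_size, hidden, max_pos = 30522, 768, 512
tok_emb = nn.Embedding(vocab_size, hidden)
seg_emb = nn.Embedding(2, hidden)        # segment A / segment B
pos_emb = nn.Embedding(max_pos, hidden)  # learned positions (not sinusoidal)

# "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]" as made-up token ids
token_ids   = torch.tensor([[101, 2026, 3899, 2003, 10140, 102, 2002, 7777, 2652, 2075, 102]])
segment_ids = torch.tensor([[0,    0,    0,    0,    0,     0,   1,    1,    1,    1,    1]])
positions   = torch.arange(token_ids.size(1)).unsqueeze(0)

# BERT input = sum of the three embeddings, shape (1, 11, 768)
x = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)
```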
- MLM
  - Disadvantages
    - Since `[MASK]` never appears during fine-tuning, there is a mismatch between pre-training and fine-tuning
    - Since only 15% of the tokens are sampled, the convergence rate is slow
  - An example (15% of the tokens in the whole training data are sampled; see the masking sketch below)
    - 80%: replace `hairy` with `[MASK]`, e.g., `my dog is hairy -> my dog is [MASK]`
    - 10%: replace `hairy` with a random word, e.g., `my dog is hairy -> my dog is apple`
    - 10%: keep it unchanged, e.g., `my dog is hairy -> my dog is hairy`
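A sketch of the 80/10/10 masking rule, written over token strings rather than ids for readability:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Return (masked_tokens, labels); labels is None where no prediction is made."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:               # select ~15% of tokens
            labels.append(tok)                        # the model must predict the original
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")               # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))   # 10%: replace with a random word
            else:
                masked.append(tok)                    # 10%: keep unchanged
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

print(mask_tokens("my dog is hairy".split(), vocab=["apple", "dog", "run"]))
```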
- NSP
  - NSP captures the relationship between two sentences, which is not directly modelled by a language model
  - 50% of the time B is the actual next sentence of A; 50% of the time B is a random sentence
  - An example (see the pair-sampling sketch below)
    - Input = `[CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]`, Label = `IsNext`
    - Input = `[CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]`, Label = `NotNext`
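A sketch of how NSP training pairs could be sampled (50% actual next sentence, 50% random), assuming `docs` is a list of documents, each a list of tokenized sentences with at least two sentences:

```python
import random

def sample_nsp_pair(docs):
    doc = random.choice(docs)
    i = random.randrange(len(doc) - 1)              # pick sentence A (not the last one)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"        # actual next sentence
    else:
        other = random.choice(docs)                 # random sentence from the corpus
        sent_b, label = random.choice(other), "NotNext"
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    return tokens, label
```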
- Fine-tuning
  - Single sentence / sentence pair classification task (see the classification-head sketch below)
    - Input: the final hidden state of the Transformer encoder for `[CLS]`
    - New parameter: a classification matrix `W`
    - All of the parameters of BERT and `W` are fine-tuned jointly
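A sketch of the classification fine-tuning setup using the Hugging Face `transformers` `BertModel` to obtain the `[CLS]` hidden state; the new parameter `W` is a single linear layer:

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

num_labels = 2
W = nn.Linear(bert.config.hidden_size, num_labels)   # the new parameter W

enc = tokenizer("a sentence to classify", return_tensors="pt")
cls_hidden = bert(**enc).last_hidden_state[:, 0]      # [CLS] is the first token
logits = W(cls_hidden)                                # BERT and W are fine-tuned jointly
```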
  - Question answering (see the span-prediction sketch below)
    - Input: the final hidden states of the Transformer encoder for all tokens
    - New parameters: a `start vector` and an `end vector`
    - Predicted span: from `argmax_i softmax(dot(hidden state of token i, start vector))` to `argmax_j softmax(dot(hidden state of token j, end vector))`
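A sketch of the span prediction above; `H` is the matrix of final hidden states and `start_vec` / `end_vec` are the new learned parameters:

```python
import torch

def predict_span(H: torch.Tensor, start_vec: torch.Tensor, end_vec: torch.Tensor):
    """H: (seq_len, hidden); start_vec / end_vec: (hidden,)."""
    start_probs = torch.softmax(H @ start_vec, dim=0)   # distribution over start positions
    end_probs = torch.softmax(H @ end_vec, dim=0)       # distribution over end positions
    return int(start_probs.argmax()), int(end_probs.argmax())
```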
- The official page provides pre-trained BERT models
  - Preprocess
    - Typically, the `Uncased` model is better unless case information is important (e.g., for NER or POS tagging)
    - Variable-length problem of SQuAD context paragraphs, handled with `doc_stride` (see the sliding-window sketch below)
  - Tokenization
    - For Chinese, BERT uses character-based tokenization
    - For all other languages, BERT uses WordPiece tokenization
    - Code: `BasicTokenizer` in `tokenization.py`
  - Out-of-memory
    - Reduce `max_seq_length` and/or `train_batch_size`
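A sketch of the `doc_stride` sliding-window idea for over-long SQuAD paragraphs; window sizes are illustrative:

```python
def sliding_windows(tokens, max_len=384, doc_stride=128):
    """Split a long paragraph into overlapping windows of at most max_len tokens."""
    windows, start = [], 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += doc_stride                      # overlap keeps answers from being cut
    return windows

# e.g. a 1000-token paragraph -> overlapping chunks covering the whole paragraph
chunks = sliding_windows(list(range(1000)))
print([len(c) for c in chunks])
```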
- Naturali gives details about BERT fine-tuning
- QA [4]
  - Steps (see the pipeline skeleton below)
    - step 1: convert long documents into passages and build an inverted index
    - step 2: retrieve the top-k candidate passages with BM25 + RM3
    - step 3: feed (query, passage) into BERT; the output is either contains / does not contain the answer, or the answer start/end positions
  - Papers
    - End-to-End Open-Domain Question Answering with BERTserini
    - FAQ Retrieval using Query-Question Similarity and BERT-Based Query-Answer Relevance
    - A BERT Baseline for the Natural Questions
    - Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering
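A skeleton of the retrieve-then-read pipeline above; `retrieve_top_k` and `bert_reader` are hypothetical placeholders (a real system would use an inverted index with BM25 + RM3 and a fine-tuned BERT reader):

```python
def retrieve_top_k(query: str, passages: list[str], k: int = 10) -> list[str]:
    # placeholder lexical scorer: plain term overlap instead of real BM25 + RM3
    q_terms = set(query.lower().split())
    ranked = sorted(passages, key=lambda p: -len(q_terms & set(p.lower().split())))
    return ranked[:k]

def bert_reader(query: str, passage: str):
    # placeholder for the fine-tuned BERT of step 3; should return
    # (answerable_score, start_token_index, end_token_index)
    raise NotImplementedError

def answer(query: str, passages: list[str]):
    best_score, best_span = float("-inf"), None
    for p in retrieve_top_k(query, passages):        # step 2: candidate passages
        score, start, end = bert_reader(query, p)    # step 3: read with BERT
        if score > best_score:
            tokens = p.split()
            best_score, best_span = score, " ".join(tokens[start:end + 1])
    return best_span
```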
- IR [4]
- Papers
- Simple Applications of BERT for Ad Hoc Document Retrieval [4.1]
- Passage Re-ranking with BERT
- Investigating the Successes and Failures of BERT for Passage Re-Ranking
- Dealing with long documents
  - Replace the document with several of its sentences [4.1], under the assumption that the query is related to some sentences in the document
- Dialogue system [4]
- Conclusion [4]
- BERT is good at sentence matching (possibly because of the NSP task during pre-training) and at extracting deep semantic features (e.g., QA), but it does not excel at shallow-feature tasks such as classification and sequence labeling.
- BERT is not good at sequence generation.
- Two paradigms
- ELMo, GPT and XLNet are autoregressive (AR) LMs
- BERT is a denoising autoencoder (DAE), where `[MASK]` is the noise
  - The masked positions are predicted independently of each other
  - Pretrain-finetune discrepancy (no `[MASK]` during fine-tuning)
- XLNet highlights
- Permutation LM for bidirectional context in AR framework
- Two-Stream Self-Attention
- Transformer-XL (eXtra Long) for long documents
- Highlights
- Segment-level recurrence with state reuse
- Relative positional encoding
- Official codes
- Highlights
- More data
- Dynamic masking
  - Generate the masking pattern every time a sequence is fed to the model
- FULL-SENTENCES without NSP loss
  - Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may cross document boundaries: when the end of one document is reached, sampling continues from the next document, with an extra separator token between documents
- Larger mini-batches
  - 8K sequences per batch
- Larger byte-level BPE
  - 50K subword units
- Principle: each layer has two sub-layers (i.e., multi-head self-attention and a position-wise feed-forward network); the output of each sub-layer is `LayerNorm(x + Sublayer(x))` (sketched in the encoder block below). A positional encoding is added to the embeddings.
- Blocks
- Encoder block
- Multi-head self-attention
- A layer that helps the encoder look at other words in the input sentence as it encodes a specific word
- Embedding with time signal = Embeddings + Positional Encoding
- One-head self-attention steps (i.e., scaled dot-product attention); a sketch follows this block
  - Pack the `n` word embeddings into a matrix `X`
  - Multiply `X` by the weight matrices `W_Q`, `W_K`, `W_V` to generate `Q`, `K` and `V`
  - `Z = softmax(Q * K^T / sqrt(d_k)) * V`, where `d_k` is the number of columns of `W_Q` and `Z` has shape `(n, d_k)`
- Multi-head self-attention steps
- Why scaled dot-product attention?
  - Definition: `softmax(Q * K^T / sqrt(d_k)) * V`
  - For large `d_k`, the entries of `Q * K^T` can become large, pushing `softmax` towards outputs of 0 or 1 (and its gradients towards zero); scaling by `sqrt(d_k)` counteracts this (see the sketch below)
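A sketch of one-head scaled dot-product attention as described above, with made-up dimensions and no masking or multi-head split:

```python
import torch

n, d_model, d_k = 5, 16, 8                    # 5 tokens, embedding size 16, head size 8
X = torch.randn(n, d_model)                   # packed word embeddings (+ positional encoding)
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # (n, d_k) each
scores = Q @ K.T / d_k ** 0.5                 # scale by sqrt(d_k) to keep softmax well-behaved
Z = torch.softmax(scores, dim=-1) @ V         # (n, d_k): weighted sum of value vectors
```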
- Position-wise feed-forward
- The exact same feed-forward network is independently applied to each position
- Two linear transformations: `FFN(x) = max(0, x * W_1 + b_1) * W_2 + b_2`
- Residuals: the output of each sub-layer is `LayerNorm(x + Sublayer(x))` (see the sketch below)
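A sketch of the position-wise feed-forward sub-layer wrapped in the residual connection `LayerNorm(x + Sublayer(x))`; dimensions follow the base Transformer but are otherwise illustrative:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048

ffn = nn.Sequential(                 # applied independently at every position
    nn.Linear(d_model, d_ff),
    nn.ReLU(),                       # max(0, x W_1 + b_1)
    nn.Linear(d_ff, d_model),        # ... W_2 + b_2
)
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 10, d_model)      # (batch, seq_len, d_model)
out = norm(x + ffn(x))               # LayerNorm(x + Sublayer(x))
```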
- Decoder block
- Multi-head self-attention
- Each position is only allowed to attend to earlier positions in the output sequence; this is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation (see the causal-mask sketch after this block)
- Encoder-Decoder attention
- Helps the decoder focus on relevant parts of the input sentence
- This layer works just like multi-head self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack
- Position-wise feed-forward
- Altogether the Transformer uses three attention variants: encoder self-attention, masked decoder self-attention, and encoder-decoder attention
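A sketch of the future-position masking used in the decoder self-attention: upper-triangular positions of the score matrix are set to -inf before the softmax:

```python
import torch

n = 5
scores = torch.randn(n, n)                                   # Q K^T / sqrt(d_k) for one head
causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))      # hide future positions
weights = torch.softmax(scores, dim=-1)                      # each row sums to 1 over allowed positions
```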
- Linear and softmax layer
- Linear projects the decoder output to the logits of vocabulary size
- Softmax the logits and choose the index with largest probability