Introduction

  • Why does it work? Transfer learning in computer vision shows that low-level features are general, task-independent, and shareable, while high-level features are task-dependent [1]
  • History: Statistical LM -> NNLM -> Word2vec (fixed word embeddings, suffers from the polysemy problem) -> ELMo (dynamic word embeddings) -> ULMFiT -> GPT -> BERT. Refer to reviews 1, 2, 3, 4.1 and 4.2 for their differences
  • Two stages: unsupervised pre-training on a large corpus, then supervised feature-based (e.g., ELMo, BERT) or fine-tuning (e.g., GPT, BERT) transfer to downstream tasks
    • The feature-based strategy uses a task-specific architecture that includes the pre-trained representations as additional features
    • The fine-tuning strategy introduces minimal task-specific parameters and is trained on downstream tasks by fine-tuning the pre-trained parameters
  • Hugging Face Transformers keeps state-of-the-art pre-trained language models, such as Longformer and Reformer for long documents
  • Industrial experience for Chinese
    • ELECTRA is good for sequence labelling tasks (e.g., correction) but worse for text classification; XLNet is worse across all NLP tasks
  • TextBrewer is a model distillation toolkit

Models

ELMo (Embeddings from Language Models)

Principle

  • Model: 2 layer forward LSTM + 2 layer backward LSTM
  • Pre-training objective: bidirectional LM (i.e., concatenation of independently trained left-to-right and right-to-left LMs)
  • Features: the representation of a word is a linear combination of all hidden states of the biLM (see the formula after this list)
  • ELMo provides deep contextualized word representations and overcomes the polysemy problem of word2vec (which always gives a fixed vector regardless of context)
  • The lower biLSTM layer captures syntax (e.g., POS tagging), and the higher biLSTM layer captures semantics (e.g., word sense disambiguation)
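
The weighted-sum formulation from the ELMo paper, written out for reference (h_{k,j} is the biLM hidden state of token k at layer j, s_j are softmax-normalized layer weights, and gamma is a task-specific scalar):

```latex
% ELMo representation of token k for a downstream task
% (j = 0 denotes the context-independent token embedding layer):
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, \mathbf{h}_{k,j}^{LM}
```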

Implementation

The AllenNLP ELMo page gives a detailed explanation of ELMo, and the AllenNLP GitHub page describes how to use it:
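
A minimal sketch adapted from the AllenNLP documentation; the option/weight file URLs and the exact API depend on the installed allennlp version:

```python
# Minimal ELMo feature-extraction sketch with AllenNLP (API of allennlp 0.x/1.x).
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "https://allennlp.s3.amazonaws.com/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://allennlp.s3.amazonaws.com/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

# num_output_representations controls how many task-specific weighted sums of the biLM layers are returned.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

# Input is a batch of tokenized sentences; batch_to_ids converts tokens to character ids.
sentences = [["I", "ate", "an", "apple", "."], ["Apple", "makes", "phones", "."]]
character_ids = batch_to_ids(sentences)

embeddings = elmo(character_ids)
# embeddings["elmo_representations"][0]: tensor of shape (batch, max_len, 1024)
# embeddings["mask"]: mask marking real (non-padded) tokens
print(embeddings["elmo_representations"][0].shape)
```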

ULMFiT

Principle

  • Three steps
    • step 1: general-domain LM pretraining
    • step 2: target task LM fine-tuning
      • Discriminative fine-tuning: tune each layer with a different learning rate (since different layers capture different types of information)
      • Gradual unfreezing: first unfreeze the last layer and fine-tune all unfrozen layers for one epoch, then unfreeze the next lower frozen layer and repeat, until all layers are fine-tuned to convergence in the last iteration
      • Slanted triangular learning rates: a short linear increase followed by a long linear decay of the learning rate (see the sketch after this list)
    • step 3: target task classifier fine-tuning
      • fc1_y = ReLU(Dropout(BatchNorm(fc1(X)))) -> Softmax(Dropout(BatchNorm(fc2(fc1_y))))
      • X = Concat(h_T, mean_pooling(H), max_pooling(H)), where H is the sequence of hidden states and h_T is the last one
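
A sketch of the slanted triangular schedule as a plain function, using the default hyper-parameters reported in the ULMFiT paper (cut_frac = 0.1, ratio = 32):

```python
# Slanted triangular learning rate (STLR): short increase, long decay.
def slanted_triangular_lr(t, T, cut_frac=0.1, ratio=32, lr_max=0.01):
    """Learning rate at iteration t out of T total iterations."""
    cut = int(T * cut_frac)                                # iteration where the LR peaks
    if t < cut:
        p = t / cut                                        # linear warm-up
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))     # linear decay
    return lr_max * (1 + p * (ratio - 1)) / ratio

# Example: compute the schedule for 1000 iterations and feed it to the optimizer each step.
schedule = [slanted_triangular_lr(t, T=1000) for t in range(1000)]
```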

Implementation

GPT (Generative Pre-training)

Principle

  • Model: multi-layer left-to-right (left-context-only) Transformer decoder
  • Pre-training objective: LM
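
A minimal sketch of the left-to-right LM objective, assuming a placeholder `model` that applies causal masking and returns next-token logits of shape (batch, seq_len, vocab_size):

```python
# Left-to-right LM loss: position i is trained to predict token i + 1.
import torch.nn.functional as F

def causal_lm_loss(model, token_ids):
    # token_ids: LongTensor of shape (batch, seq_len)
    logits = model(token_ids)                  # (batch, seq_len, vocab_size)
    shift_logits = logits[:, :-1, :]           # predictions for positions 0..seq_len-2
    shift_labels = token_ids[:, 1:]            # the tokens those positions should predict
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```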

Implementation

BERT (Bidirectional Encoder Representations from Transformers)

Principle

  • Model: multi-layer bidirectional Transformer encoder
  • Pre-training objective
    • Masked language model (MLM, inspired by the Cloze task; it prevents each token from "seeing itself" in the multi-layer bidirectional context), for one sentence
    • Next sentence prediction (NSP), for two sentences
  • Input: token embeddings + segment embeddings + position embeddings. The first token is [CLS]; the two sentences of a sentence pair are separated by [SEP]
  • MLM
    • Disadvantages
      • Since [MASK] never appears during fine-tuning, there is a mismatch between pre-training and fine-tuning
      • Since only 15% of the tokens are sampled, convergence is slow
    • An example (15% of the tokens in the training data are sampled; see the masking sketch after this list)
      • my dog is hairy -> 80% replace hairy with [MASK], e.g., my dog is hairy -> my dog is [MASK]
      • my dog is hairy -> 10% replace hairy with a random word, e.g., my dog is hairy -> my dog is apple
      • my dog is hairy -> 10% keep unchanged, e.g., my dog is hairy -> my dog is hairy
  • NSP
    • Captures the relationship between two sentences, which is not directly modelled by a language model
    • 50% of the time B is the actual next sentence of A; 50% of the time B is a random sentence
    • An example
      • Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
      • Label = IsNext
      • Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
      • Label = NotNext
  • Fine-tuning
    • Single sentence / sentence pair classification task
      • Input: the final hidden state of the Transformer encoder for [CLS]
      • New parameter: W
      • All of the parameters of BERT and W are fine-tuned jointly
    • Question answering
      • Input: the final hidden states of the Transformer encoder for all tokens
      • New parameters: a start vector and an end vector
      • Predicted span is [argmax_i(softmax(dot_product(token_i hidden state, start vector))), argmax_j(softmax(dot_product(token_j hidden state, end vector)))]
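
A small sketch of the 80/10/10 masking rule described above; `vocab` and the special-token strings are illustrative placeholders, not BERT's actual preprocessing code:

```python
# BERT-style masking for one tokenized sentence: sample 15% of tokens, then
# replace 80% of them with [MASK], 10% with a random word, and keep 10% unchanged.
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    tokens = list(tokens)
    labels = [None] * len(tokens)              # None = position not used in the MLM loss
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue
        if random.random() < mask_prob:        # 15% of tokens are selected
            labels[i] = token                  # the original token is the prediction target
            r = random.random()
            if r < 0.8:
                tokens[i] = "[MASK]"               # 80%: my dog is hairy -> my dog is [MASK]
            elif r < 0.9:
                tokens[i] = random.choice(vocab)   # 10%: my dog is hairy -> my dog is apple
            # remaining 10%: keep the token unchanged
    return tokens, labels

masked, labels = mask_tokens(["my", "dog", "is", "hairy"], vocab=["apple", "book", "car"])
```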

Implementation

  • The official page provides pre-trained BERT models

    • Preprocess
      • Typically, the Uncased model is better unless case information is important for the task (e.g., NER or POS tagging)
      • Variable-length SQuAD context paragraphs are handled with a sliding window (doc_stride)
    • Tokenization
      • For Chinese, BERT uses character-based tokenization
      • For all other languages, BERT uses WordPiece tokenization
      • See BasicTokenizer in tokenization.py
    • Out-of-memory
      • max_seq_length
      • train_batch_size
  • Naturali gives details about BERT fine-tuning

  • bert-as-service

  • Illustrated bert

  • pytorch bert
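
A minimal feature-extraction sketch with Hugging Face Transformers (the model name and API assume a recent transformers release):

```python
# Load a pre-trained BERT and extract token / [CLS] representations.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("my dog is hairy", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state: (batch, seq_len, hidden) contextual token representations
# outputs.pooler_output:     (batch, hidden) representation derived from [CLS]
print(outputs.last_hidden_state.shape, outputs.pooler_output.shape)
```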

Application

  • QA [4]

    • Steps
      • step 1: split long documents into passages and build an inverted index
      • step 2: retrieve the top-k candidate passages with BM25 + RM3
      • step 3: feed (query, passage) pairs to BERT; the output is whether the passage contains the answer, or the answer start/end positions (see the retrieval-plus-reranking sketch after this list)
    • Papers
  • IR [4]

    • Papers
      • Simple Applications of BERT for Ad Hoc Document Retrieval [4.1]
      • Passage Re-ranking with BERT
      • Investigating the Successes and Failures of BERT for Passage Re-Ranking
    • Dealing with long documents
      • Replace the document with several of its sentences [4.1], under the ASSUMPTION that the query is related to some sentences in the document
  • Dialogue system [4]

  • Conclusion [4]

    • BERT is good at sentence matching (possibly thanks to the NSP task during pre-training) and deep semantic feature extraction (e.g., QA), but it does not excel at shallow-feature tasks such as classification and sequence labeling.
    • BERT is not good at sequence generation.
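
A sketch of the retrieve-then-rerank pipeline from the QA/IR items above. It assumes the third-party rank_bm25 package for step 2 (RM3 query expansion is omitted) and a fine-tuned BERT cross-encoder for step 3 whose checkpoint path is a placeholder:

```python
# Step 2: BM25 candidate retrieval; step 3: (query, passage) relevance scoring with BERT.
import torch
from rank_bm25 import BM25Okapi
from transformers import AutoTokenizer, AutoModelForSequenceClassification

passages = [
    "BERT is a bidirectional Transformer encoder pre-trained with MLM and NSP.",
    "ELMo concatenates a forward and a backward LSTM language model.",
    "Penguins are flightless birds.",
]
bm25 = BM25Okapi([p.lower().split() for p in passages])

query = "what kind of model is BERT"
candidates = bm25.get_top_n(query.lower().split(), passages, n=2)   # top-k by BM25

# Placeholder checkpoint: any BERT fine-tuned for binary (query, passage) relevance.
reranker = "path/to/finetuned-bert-reranker"
tokenizer = AutoTokenizer.from_pretrained(reranker)
model = AutoModelForSequenceClassification.from_pretrained(reranker)

inputs = tokenizer([query] * len(candidates), candidates,
                   padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    relevance = model(**inputs).logits[:, 1]        # logit for the "relevant" class
reranked = [p for _, p in sorted(zip(relevance.tolist(), candidates), reverse=True)]
print(reranked[0])
```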

XLNet

Principle

  • Two paradigms
    • ELMo, GPT and XLNet are autoregressive (AR) LMs
    • BERT is a denoising autoencoder (DAE); [MASK] is the noise
      • The masked positions are predicted independently of each other
      • Pre-train/fine-tune discrepancy ([MASK] never appears during fine-tuning)
  • XLNet highlights
    • Permutation LM for bidirectional context within the AR framework (see the mask sketch after this list)
      • Two-Stream Self-Attention
    • Transformer-XL (eXtra Long) for long documents
      • Highlights
        • Segment-level recurrence with state reuse
        • Relative positional encoding
      • Official codes
    • More data
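
A toy sketch of the core permutation-LM idea: sample a factorization order z and build an attention mask so that each position may attend only to the positions that precede it in z. This corresponds to the query-stream mask (the content stream additionally lets each position attend to itself); Two-Stream Self-Attention and Transformer-XL details are omitted.

```python
# Build a query-stream attention mask for one sampled factorization order.
import numpy as np

def permutation_attention_mask(perm):
    """mask[i, j] is True iff position i may attend to position j under order `perm`."""
    n = len(perm)
    rank = {pos: t for t, pos in enumerate(perm)}    # position -> index in the factorization order
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            mask[i, j] = rank[j] < rank[i]           # attend only to earlier positions in the order
    return mask

# Sequence of length 4 with sampled order 3 -> 2 -> 4 -> 1 (0-indexed: [2, 1, 3, 0]).
print(permutation_attention_mask([2, 1, 3, 0]).astype(int))
```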

Implementation

RoBERTa (Robustly optimized BERT approach)

Principle

  • Dynamic masking

    • Generate the masking pattern anew every time a sequence is fed to the model (see the sketch after this list)
  • FULL-SENTENCES without NSP loss

    • Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may cross document boundaries. When we reach the end of one document, we begin sampling sentences from the next document and add an extra separator token between documents
  • Larger mini-batches

    • 8K per batch
  • Larger byte-level BPE

    • 50K sub-word units
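
A tiny sketch of the difference from static masking: the 80/10/10 masking (e.g., the `mask_tokens` helper sketched in the BERT section) is re-applied on every pass over the data rather than once during preprocessing.

```python
# Dynamic masking: re-sample the mask pattern each epoch instead of fixing it in preprocessing.
def dynamically_masked_batches(dataset, vocab, num_epochs):
    for _ in range(num_epochs):
        for tokens in dataset:
            yield mask_tokens(tokens, vocab)   # new random 80/10/10 pattern every time
```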

Implementation

Transformer

Principle

  • Principle: each layer has two sub-layers (i.e., multi-head self-attention and a position-wise feed-forward network); the output of each sub-layer is LayerNorm(x + Sublayer(x)). A positional encoding is added to the embedding.
  • Blocks
    • Encoder block

      • Multi-head self-attention
        • A layer that helps the encoder look at other words in the input sentence as it encodes a specific word
        • Embedding with time signal = Embeddings + Positional Encoding
        • One-head self-attention steps (i.e., scaled dot-product attention; see the NumPy sketch after this list)
          • Pack the n word embeddings into a matrix X
          • Multiply X by the weight matrices W_Q, W_K, W_V to generate Q, K and V
          • Z = softmax(Q * K_T / sqrt(d_k)) * V, where d_k is the column number of W_Q; Z has shape (n, d_k)
        • Multi-head self-attention steps
          • First generate different Z_0, Z_1, ... from different Q, K and V obtained via separate W_Q, W_K, W_V
          • Then concatenate Z_0, Z_1, ... and multiply by the weight matrix W_O to produce Z
        • Why scaled dot-product attention?
          • Definition: softmax(Q * K_T / sqrt(d_k)) * V
          • For large d_k, the dot products in Q * K_T grow large in magnitude, pushing the softmax toward outputs of 0 or 1 where gradients vanish; scaling by sqrt(d_k) counteracts this
      • Position-wise feed-forward
        • The exact same feed-forward network is independently applied to each position
        • Two linear transformations: FFN(x) = max(0, x*W1 + b1)*W2 + b2
      • Residuals
        • Add & Layer Normalization, i.e., LayerNorm(X + Z)
    • Decoder block

      • Multi-head self-attention
        • It is only allowed to attend to earlier positions in the output sequence (this is done by masking future positions, i.e., setting them to -inf, before the softmax step in the self-attention calculation)
      • Encoder-Decoder attention
        • Helps the decoder focus on relevant parts of the input sentence
        • The layer works just like multi-head self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack
      • Position-wise feed-forward
      • In total the Transformer uses three kinds of attention: encoder self-attention, masked decoder self-attention, and encoder-decoder attention
    • Linear and softmax layer

      • The linear layer projects the decoder output to logits of vocabulary size
      • Softmax turns the logits into probabilities; choose the index with the largest probability
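
A plain-NumPy sketch of the one-head and multi-head attention steps listed above; shapes and weight-matrix names follow the notation in this section (W_O is the output projection):

```python
# Scaled dot-product attention and a simple multi-head wrapper.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # scaling keeps the softmax out of its saturated region
    return softmax(scores) @ V         # Z, shape (n, d_k)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: lists of per-head projection matrices; W_O: output projection.
    heads = [scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O    # concat Z_0, Z_1, ..., then project

# Toy example: n = 3 tokens, d_model = 8, h = 2 heads of size d_k = 4.
rng = np.random.default_rng(0)
n, d_model, h, d_k = 3, 8, 2, 4
X = rng.normal(size=(n, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)   # (3, 8)
```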

Implementation

BERT, GPT, ELMo comparison

References