Text Classification using ULMFiT and BERT. Challenge solved for the ML Fellowship program @Fellowship.ai.
- Language Model
  - AWD-LSTM - Architecture
  - 0.3 - Dropout
  - 1e-2 - Learning Rate (one-cycle policy with 2 epochs, to avoid overfitting)
  - 1e-3 - Learning Rate (unfreezing all layers)
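A minimal sketch of how these language-model settings map onto fastai calls (the fastai v1 API is assumed, matching the 2-element momentum tuples used later; data_lm is a hypothetical TextLMDataBunch, and the 0.3 dropout is taken to be the drop_mult multiplier):

```python
# Language-model fine-tuning sketch (fastai v1 API assumed; data_lm is hypothetical).
from fastai.text import *

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)  # 0.3 dropout multiplier
learn.fit_one_cycle(2, 1e-2)           # one-cycle policy, 2 epochs, only the head unfrozen
learn.unfreeze()                       # unfreeze all layers
learn.fit_one_cycle(2, 1e-3)           # 1e-3 after unfreezing (epoch count assumed here)
learn.save_encoder('fine_tuned_enc')   # the encoder is reused by the classifier below
```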
- Classification Model
  {To train the classifier we use a technique called gradual unfreezing: start by training the last few layers, then go backwards, unfreezing and training the earlier layers. The learner method learn.freeze_to(-2) unfreezes the last 2 layer groups. A code sketch of the full schedule follows the hyperparameters below.}
  - AWD-LSTM - Architecture
  - 0.5 - Dropout
  - 1e-2 - Learning Rate (one-cycle policy with 2 epochs, to avoid overfitting)
  - (5e-3, 2e-3) - Slice Learning Rate (unfreeze the last 2 layers)
  {Applying different learning rates to different layer groups is a technique called “discriminative layer training”.}
  - (0.8, 0.7) - Momentum
  {Momentum decreases from 0.8 to 0.7 during the warmup phase of the one-cycle schedule, then rises back to 0.8 during annealing.}
  - (2e-3/100, 2e-3) - Slice Learning Rate (unfreeze all layers)
  - (0.8, 0.7) - Momentum
  - 64 - Batch Size
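How the classifier schedule above (gradual unfreezing, discriminative learning rates, momentum) could be expressed in fastai; this is a sketch, not the repository's exact code. data_clas is a hypothetical classification DataBunch built with bs=64, and the per-stage epoch counts and low-to-high slice ordering are assumptions:

```python
# Classifier fine-tuning sketch with gradual unfreezing (fastai v1 API assumed).
from fastai.text import *

learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)  # 0.5 dropout multiplier
learn.load_encoder('fine_tuned_enc')              # encoder from the language model above

learn.fit_one_cycle(2, 1e-2)                      # head only, one-cycle policy, 2 epochs
learn.freeze_to(-2)                               # gradual unfreezing: last 2 layer groups
learn.fit_one_cycle(1, slice(2e-3, 5e-3),         # discriminative layer training: lower lr
                    moms=(0.8, 0.7))              # for earlier groups, higher for the head
learn.unfreeze()                                  # unfreeze all layers
learn.fit_one_cycle(1, slice(2e-3/100, 2e-3),     # lrs spread from 2e-3/100 to 2e-3
                    moms=(0.8, 0.7))              # momentum 0.8 -> 0.7 -> 0.8 over the cycle
```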
- Possible improvements for ULMFiT
  - Provide more data to the ULMFiT language model (requires more computational resources)
  - Hyperparameter tuning using Bayesian techniques, for both the language model and the classification model (see the sketch below)
  - Reduce dropout in the language model (may lead to overfitting, so this needs care)
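For the Bayesian tuning suggestion, a sketch using Optuna (an assumed library choice, not part of the original solution). The objective retrains the classifier with a sampled dropout multiplier and learning rate and returns validation accuracy:

```python
# Bayesian hyperparameter search sketch with Optuna (assumed library, hypothetical search space).
import optuna
from fastai.text import *

def objective(trial):
    drop_mult = trial.suggest_float('drop_mult', 0.3, 0.7)    # sample dropout multiplier
    lr = trial.suggest_float('lr', 1e-3, 1e-2, log=True)      # sample learning rate (log scale)
    learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=drop_mult, metrics=[accuracy])
    learn.load_encoder('fine_tuned_enc')
    learn.fit_one_cycle(2, lr)
    return float(learn.validate()[1])                         # validation accuracy

study = optuna.create_study(direction='maximize')             # TPE sampler by default
study.optimize(objective, n_trials=20)
print(study.best_params)
```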
- BERT Model
  - 64 - Max Length
  - 64 - Batch Size
  - 2e-5 - Learning Rate
  - 3 - Epochs
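The BERT settings above, expressed with the Hugging Face transformers Trainer API as one possible implementation (an assumption; the original notebook may use a different pipeline). bert-base-uncased, num_classes, and the raw_*_ds dataset objects are placeholders:

```python
# BERT fine-tuning sketch with the listed hyperparameters (Hugging Face Trainer API assumed).
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased',
                                                           num_labels=num_classes)

def tokenize(batch):
    # Max Length = 64: longer texts are truncated, shorter ones padded.
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=64)

train_ds = raw_train_ds.map(tokenize, batched=True)   # raw_*_ds: hypothetical `datasets` objects
valid_ds = raw_valid_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir='bert-out',
    learning_rate=2e-5,                # Learning Rate
    per_device_train_batch_size=64,    # Batch Size
    num_train_epochs=3,                # Epochs
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=valid_ds)
trainer.train()
```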
- Possible improvements for BERT
  - Increase the max length parameter (requires more computational resources)
  - Reduce the batch size (more execution time, may lead to overfitting)
  - Use regularization techniques such as L1 and L2 (see the sketch below)
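A sketch of how the regularization suggestion could be wired in (assumptions, not from the original notebooks): L2 through the optimizer's weight decay, and L1 as a manual penalty added to the training loss:

```python
# Regularization sketch (assumed approach): L2 via weight decay, L1 via a manual penalty.
from transformers import TrainingArguments

# L2-style regularization: decoupled weight decay in the AdamW optimizer used by Trainer.
args = TrainingArguments(output_dir='bert-out', learning_rate=2e-5,
                         per_device_train_batch_size=64, num_train_epochs=3,
                         weight_decay=0.01)

# L1 regularization: a hypothetical helper whose result is added to the training loss.
def l1_penalty(model, l1_lambda=1e-5):
    return l1_lambda * sum(p.abs().sum() for p in model.parameters())
```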
| #  | Model  | Accuracy | Loss     | Total Time |
|----|--------|----------|----------|------------|
| 01 | ULMFiT | 0.753679 | 0.702299 | 11:06      |
| 02 | BERT   | 0.8193   | 0.5698   | 13:06      |