The official implementation of CATT Arabic diacritization models.

abjadai/catt

CATT: Character-based Arabic Tashkeel Transformer

CC BY-NC 4.0

This is the official implementation of the paper CATT: Character-based Arabic Tashkeel Transformer.

How to Run?

First, download the models from the Releases section of this repo.
The best checkpoint for the Encoder-Decoder (ED) model is best_ed_mlm_ns_epoch_178.pt.
For the Encoder-Only (EO) model, the best checkpoint is best_eo_mlm_ns_epoch_193.pt.
Use the following bash script to download the models:

mkdir -p models/
wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_ed_mlm_ns_epoch_178.pt
wget -P models/ https://github.com/abjadai/catt/releases/download/v2/best_eo_mlm_ns_epoch_193.pt
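Where wget is not available, the same download can be sketched in Python using only the standard library. The URLs and file names are copied from the commands above; the helper names checkpoint_url and download_checkpoints are hypothetical and are not part of this repo.

```python
import os
import urllib.request

# Base URL and file names are taken from the wget commands above.
RELEASE_BASE = "https://github.com/abjadai/catt/releases/download/v2"
CHECKPOINTS = {
    "ed": "best_ed_mlm_ns_epoch_178.pt",  # Encoder-Decoder
    "eo": "best_eo_mlm_ns_epoch_193.pt",  # Encoder-Only
}


def checkpoint_url(model_type):
    """Return the release URL for model_type ('ed' or 'eo')."""
    return f"{RELEASE_BASE}/{CHECKPOINTS[model_type]}"


def download_checkpoints(dest="models"):
    """Download each checkpoint into dest, skipping files already present."""
    os.makedirs(dest, exist_ok=True)
    paths = []
    for model_type, name in CHECKPOINTS.items():
        path = os.path.join(dest, name)
        if not os.path.exists(path):
            urllib.request.urlretrieve(checkpoint_url(model_type), path)
        paths.append(path)
    return paths
```

Calling download_checkpoints() fetches both files into models/ and returns their local paths. Files that already exist are left untouched, so the call is safe to repeat.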

Two inference examples are provided: predict_ed.py for ED models and predict_eo.py for EO models.
Both support batch inference. Read the source code for the details.

python predict_ed.py
python predict_eo.py
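For readers unfamiliar with batch inference, the general idea is to split the input sentences into fixed-size groups and process one group per model call. The sketch below shows only that grouping step. The batched helper is a generic illustration and an assumption on our part, not the code used in predict_ed.py or predict_eo.py, which handle batching themselves.

```python
def batched(lines, batch_size):
    """Yield consecutive lists of at most batch_size items from lines."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]


# Placeholder inputs; in practice these would be undiacritized Arabic sentences.
sentences = ["first sentence", "second sentence", "third sentence"]
batches = list(batched(sentences, batch_size=2))
```

Here batches holds two groups: the first two sentences, then the remaining one. Each group would be passed to the model in a single call.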

EO models are recommended for faster inference.
ED models are recommended for better accuracy of the predicted diacritics.

How to Train?

To start training, download the dataset from the Releases section of this repo.

wget https://github.com/abjadai/catt/releases/download/v2/dataset.zip
unzip dataset.zip

Then, edit the script train_catt.py and adjust the default values:

# Model's Configs
model_type = 'ed' # 'eo' for Encoder-Only OR 'ed' for Encoder-Decoder
dl_num_workers = 32
batch_size = 32
max_seq_len = 1024
threshold = 0.6

# Pretrained Char-Based BERT
pretrained_mlm_pt = None # Use None if you want to initialize weights randomly OR the path to the char-based BERT
#pretrained_mlm_pt = 'char_bert_model_pretrained.pt'

Finally, run the training script.

python train_catt.py
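Before a long training run, it can help to sanity-check the edited values. The sketch below collects the settings shown above into a dataclass and validates them. TrainConfig and its validate method are hypothetical: train_catt.py defines these values as plain variables, and the checks are assumptions about what reasonable values look like. The threshold value is left unchecked because its valid range is not stated here.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainConfig:
    # Defaults mirror the values shown in train_catt.py above.
    model_type: str = "ed"  # 'eo' for Encoder-Only or 'ed' for Encoder-Decoder
    dl_num_workers: int = 32
    batch_size: int = 32
    max_seq_len: int = 1024
    threshold: float = 0.6
    pretrained_mlm_pt: Optional[str] = None  # None means random initialization

    def validate(self):
        """Raise an error for settings that cannot work; return None otherwise."""
        if self.model_type not in ("ed", "eo"):
            raise ValueError(f"model_type must be 'ed' or 'eo', got {self.model_type!r}")
        if self.dl_num_workers < 0:
            raise ValueError("dl_num_workers must not be negative")
        if self.batch_size < 1 or self.max_seq_len < 1:
            raise ValueError("batch_size and max_seq_len must be positive")
        if self.pretrained_mlm_pt is not None and not os.path.isfile(self.pretrained_mlm_pt):
            raise FileNotFoundError(self.pretrained_mlm_pt)
```

For example, TrainConfig(model_type="eo", batch_size=16).validate() passes silently, while an unknown model type or a missing pretrained checkpoint path raises an error before any training starts.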

Resources

ToDo

  • inference script
  • upload our pretrained models
  • upload CATT dataset
  • upload DER scripts
  • training script

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
