The First Pretrained Greek Sequence-to-Sequence Model
The first pretrained sequence-to-sequence model for the Greek language, based on the BART base model (https://github.com/facebookresearch/fairseq/tree/main/examples/bart).
GreekBART is pre-trained from scratch on a corpus of 76.9GB of Greek raw text to reconstruct corrupted input sentences.
As GreekBART is built upon the BART model, it is well suited to generative tasks.
Model | Architecture | #layers | #params | Link |
---|---|---|---|---|
GreekBART | BASE | 12 | 181M | Link |
We used fairseq (https://github.com/facebookresearch/fairseq) for our implementation.
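Since the checkpoint is a standard fairseq BART checkpoint, it can be loaded through fairseq's hub interface. The sketch below is only a minimal sanity check; the directory and file names (greekbart.base/, checkpoint_best.pt, sentence.bpe.model) are assumptions about how the released archive is laid out, so adapt them to your local paths.

```python
# Minimal sketch: load the GreekBART checkpoint through fairseq's BART hub interface.
# Paths below (greekbart.base/, checkpoint_best.pt, sentence.bpe.model) are placeholders
# for wherever the released archive was extracted.
from fairseq.models.bart import BARTModel

bart = BARTModel.from_pretrained(
    "greekbart.base",
    checkpoint_file="checkpoint_best.pt",
    bpe="sentencepiece",
    sentencepiece_model="greekbart.base/sentence.bpe.model",  # `sentencepiece_vocab` in older fairseq releases
)
bart.eval()  # disable dropout for inference

# Denoising sanity check: ask the model to fill in a masked token.
print(bart.fill_mask(["Η Αθήνα είναι η <mask> της Ελλάδας."], topk=3))
```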
In the src folder, you can find all the code and implementations we used.
- src
  - corpus
  - examples
    - classification-GreekSUM
    - classification-Macedonia
    - NLI
    - Sentimental Analysis
  - preprocess
    - data
      - Crawler
      - EU Parliament
      - OSCAR
      - Wikipedia
  - pretrain
  - summarization
    - abstract
    - title
The pretraining corpus is composed of the following datasets:
- the Greek part of Wikipedia
- the Greek part of the European Parliament Proceedings Parallel Corpus (EuroParl)
- the Greek part of OSCAR, a clean version of CommonCrawl
- a large web corpus crawled from about 20 million Greek-language URLs
The "src/preprocess/read_dataset.py" script can be used to directly download the Wikipedia dataset, while the other datasets need to be downloaded by the user. However, for obtaining the latest version of the Greek part of OSCAR, one needs to contact the OSCAR team. After the download of the aforementioned datasets, you run the "src/preprocess/read_dataset.py" in order to preprocess and clean the downloaded datasets.
Then, by running "src/preprocess/create_deduplication_script.py" and the bash script it generates, we deduplicate the downloaded datasets and concatenate them into one file, our raw corpus. The deduplication step requires the runiq package (https://github.com/whitfin/runiq).
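runiq performs exact line-level deduplication by hashing each line; a rough Python equivalent of that idea (with placeholder file names) is sketched below purely to make the step concrete.

```python
# Rough Python equivalent of exact line-level deduplication across corpus files.
# File names are placeholders; the real pipeline uses the generated bash script and runiq.
import hashlib

seen = set()
inputs = ["wikipedia.clean.txt", "europarl.clean.txt", "oscar_el.clean.txt", "crawl.clean.txt"]

with open("corpus.dedup.txt", "w", encoding="utf-8") as out:
    for path in inputs:
        with open(path, encoding="utf-8") as fin:
            for line in fin:
                digest = hashlib.blake2b(line.encode("utf-8"), digest_size=16).digest()
                if digest in seen:        # an identical line was already written
                    continue
                seen.add(digest)
                out.write(line)
```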
After forming the corpus, we execute "src/pretrain/all_in_one_script.sh", which performs several tasks: it splits the corpus into training and validation sets, tokenizes and binarizes the data, and launches the pretraining of the model.
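As an illustration of the first of those steps, a line-level train/validation split could look like the following; the 99.9/0.1 ratio and the file names are placeholders rather than the script's actual settings.

```python
# Illustrative train/validation split of the deduplicated corpus by line.
# The 99.9/0.1 ratio and the file names are placeholders, not the script's actual settings.
import random

random.seed(0)
with open("corpus.dedup.txt", encoding="utf-8") as fin, \
     open("train.txt", "w", encoding="utf-8") as train, \
     open("valid.txt", "w", encoding="utf-8") as valid:
    for line in fin:
        (valid if random.random() < 0.001 else train).write(line)
```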
For more details about the tokenizer we used, see LINK.
The "examples" subfolder contains the four discriminative tasks on which our model was evaluated.
Please follow the steps here to get GreekSUM.
For the GreekSUM classification task, we run the following scripts:
- "src/examples/classification-GreekSUM/sentencepiece_nli.sh" (tokenize datasets)
- "src/examples/classification-GreekSUM/binarization_nli.sh" (binarize datasets)
- "src/examples/classification-GreekSUM/train_NLI.sh" (fine-tune them model to this task)
For the NLI task (XNLI), we run the following scripts (see the sketch after the list for a quick look at the data):
- "src/examples/NLI/get-xnli.sh" (To download the dataset)
- "src/examples/NLI/process_nli.py" (To preprocess dataset and split it to training/validation/test sets)
- "src/examples/NLI/sentencepiece_nli.sh" (tokenize datasets)
- "src/examples/NLI/binarization_nli.sh" (binarize datasets)
- "src/examples/NLI/train_NLI.sh" (fine-tune them model to this task)
For the sentiment analysis task, download the dataset from LINK and then run the following scripts:
- "src/examples/Sentimental Analysis/process_sentimental.py" (To preprocess dataset and split it to training/validation/test sets)
- "src/examples/Sentimental Analysis/sentencepiece_sentimental.sh" (tokenize datasets)
- "src/examples/Sentimental Analysis/binarization_sentimental.sh" (binarize datasets)
- "src/examples/Sentimental Analysis/train_sentimental.sh" (fine-tune them model to this task)
For the Macedonia classification task, we run the following scripts:
- "src/examples/classification-Macedonia/get-classification.sh" (To download the dataset)
- "src/examples/classification-Macedonia/process_classification.py" (To preprocess dataset and split it to training/validation/test sets)
- "src/examples/classification-Macedonia/sentencepiece_classification.sh" (tokenize datasets)
- "src/examples/classification-Macedonia/binarization_classification.sh" (binarize datasets)
- "src/examples/classification-Macedonia/train_classification.sh" (fine-tune them model to this task)
We can use the scripts "src/examples/inference.py" and "src/examples/calculate_score.py" to evaluate the model's performance on the test set.
If you trained the model with multiple seeds, you can use the "src/examples/compute_mean_std.py" script to calculate the mean, median, and standard deviation of the scores. The validation score corresponds to the best validation score across the epochs, and the test score corresponds to the test score of the epoch with the best validation score.
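For illustration, that aggregation amounts to something like the following (the scores are made up):

```python
# Minimal sketch of aggregating test scores over several seeds (the values are made up).
import statistics

test_scores = [78.4, 79.1, 77.9]   # e.g. accuracy of the best-validation epoch for each seed

print(f"mean   = {statistics.mean(test_scores):.2f}")
print(f"median = {statistics.median(test_scores):.2f}")
print(f"std    = {statistics.stdev(test_scores):.2f}")   # sample standard deviation
```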
Thanks to its encoder-decoder structure, GreekBART can perform generative tasks such as summarization.
Please follow the steps here to get GreekSUM.
For abstract generation, we run the following scripts:
- "src/summarization/abstract/sentencepiece_summarization.sh" (tokenize datasets)
- "src/summarization/abstract/binarization_summarization.sh" (binarize datasets)
- "src/summarization/abstract/train_summarization.sh" (fine-tune them model to this task)
For title generation, we run the following scripts:
- "src/summarization/title/sentencepiece_summarization.sh" (tokenize datasets)
- "src/summarization/title/binarization_summarization.sh" (binarize datasets)
- "src/summarization/title/train_summarization.sh" (fine-tune them model to this task)
Use "src/summarization/generate_summary.py" and "src/summarization/calculate_score.py" to generate the summaries and to compute their ROUGE and BERTScore scores, respectively. No stemming is applied before evaluation.
Finally, we can execute the "src/summarization/models_statistics.r" script to compute statistics on the generated summaries, such as their length and the percentage of repetitions relative to their reference summaries.
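As a rough Python analogue (the R script remains the reference), one might compute the average summary length and a repeated-bigram rate for generated versus reference summaries; how exactly "models_statistics.r" defines the repetition percentage may differ, so treat this only as an illustration.

```python
# Illustration only: average length and repeated-bigram rate of generated vs. reference summaries.
# The exact statistics computed by models_statistics.r may be defined differently.
def repeated_bigram_rate(text):
    tokens = text.split()
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    return 1.0 - len(set(bigrams)) / len(bigrams)   # share of bigrams that are repeats

def report(name, summaries):
    avg_len = sum(len(s.split()) for s in summaries) / len(summaries)
    avg_rep = sum(repeated_bigram_rate(s) for s in summaries) / len(summaries)
    print(f"{name}: avg length = {avg_len:.1f} tokens, repeated bigrams = {100 * avg_rep:.2f}%")

generated = ["..."]   # summaries produced by generate_summary.py
references = ["..."]  # gold summaries from GreekSUM

report("generated", generated)
report("reference", references)
```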
You can find our demo HERE.
If you use the code or any of the models, please cite the following paper:
@misc{evdaimon2023greekbart,
title={GreekBART: The First Pretrained Greek Sequence-to-Sequence Model},
author={Iakovos Evdaimon and Hadi Abdine and Christos Xypolopoulos and Stamatis Outsios and Michalis Vazirgiannis and Giorgos Stamou},
year={2023},
eprint={2304.00869},
archivePrefix={arXiv},
primaryClass={cs.CL}
}