This repository contains experimental code and resources for training and evaluating language models on a hate speech multi-label classification task. The aim of the project is to investigate the performance of different models, mainly BERT-based, on the task of hate speech classification.
The dataset provided in this repository contains 90k samples from various sources and was created by combining multiple existing datasets (see `datasets/dataset-preprocessed.csv`). The dataset was preprocessed and cleaned to ensure consistency and quality of the data. It contains text samples and the corresponding hate speech labels.
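For a quick sanity check of the data, the preprocessed CSV can be inspected with pandas. This is a minimal sketch; the exact column names depend on the CSV header and are not listed here:

```python
import pandas as pd

# Load the preprocessed multi-label dataset referenced above.
df = pd.read_csv("datasets/dataset-preprocessed.csv")

# Basic inspection — expect on the order of ~90k rows; the label columns
# depend on the actual CSV header.
print(df.shape)
print(df.columns.tolist())
print(df.head())
```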
Original datasets used for creating the unified dataset:
Prerequisites:
- Python >= 3.8
- CUDA-capable GPU (NVIDIA) and the CUDA toolkit
- Ubuntu 20.04 - the project may also run on other distros with some minor tweaking of package versions
Installation:
Install the required packages from `requirements.txt` (installation may vary depending on the environment):
pip install -r requirements.txt
Install the following torch packages:
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
Add executable permission to the training script:
chmod +x train_in_bg.sh
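After installation, you can verify that the CUDA-enabled PyTorch build was picked up (a quick check, not part of the repository):

```python
import torch

# Should report 1.9.0+cu111 and True on a correctly configured machine.
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```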
We strongly recommend reading the PyTorch and PyTorch Lightning documentation for a better understanding of the project and file structure, as well as the built-in functions used in this project.
Project structure:
- `notebooks` - contains Jupyter notebooks used for data exploration, visualization, and processing. They describe the creation of the unified dataset by combining multiple sources.
- `benchmark` - contains the dataset and model checkpoints used for benchmarking. The benchmark can be run using the `benchmark.py` file in the root directory.
- `datasets` - contains datasets used for preprocessing and training:
  - `dataset_preprocessed.csv` - cleaned and preprocessed dataset used for training
  - `unified_dataset.csv` - raw dataset created from other datasets; the original dataset names are described in the `original_dataset` column
- `models` - model classes (containing model architecture and parameters) and training functions for the `baseline` and `bert_based` models
- `utils` - helper functions and utilities
- `constants.py` - configuration file containing all the hyperparameters and constants used in the project
- `preprocess.py` - preprocessing and logging (version control) of datasets
- `test_sample_text.py` - outputs a prediction for a given sentence using the given model checkpoints
- `train_in_bg.sh` - helper script which calls `main.py` and runs model training in the background (it also writes the process id to `save_pid.txt` and all logs to `training_output.log`)
Before training, you may want to change which model architecture is trained and adjust its parameters. To do so, import the given model in either `/models/baselines/train.py` or `/models/bert_based/train.py`, change the import in `main.py` to the desired train function, and tweak the parameters to your needs either directly in the model file (e.g. removing/adding layers in `Model_LSTM.py`) or through the available function arguments.
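For illustration, switching the training function in `main.py` might look like the sketch below; the exact import paths and function names are assumptions and may differ from the actual code:

```python
# main.py (illustrative sketch — actual import paths and function names may differ)

# Baseline training (default):
# from models.baselines.train import train

# Switch to the BERT-based training function instead:
from models.bert_based.train import train

if __name__ == "__main__":
    # Hyperparameters are read from constants.py and/or set in the model file.
    train()
```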
Example of running training with default settings (baseline `Model_LSTM.py`):
$ ./train_in_bg.sh
> Enter model name: example_model_name_123
Afterwards, training output and results will be written to `training_output.log`.
This project uses Weights & Biases (wandb) for experiment tracking. You can enable it by setting `WANDB` in `constants.py` to `True`.
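The relevant entry in `constants.py` looks roughly like this (only the `WANDB` flag referenced above is shown):

```python
# constants.py (excerpt) — only the WANDB flag is shown here.
WANDB = True  # set to False to disable Weights & Biases logging
```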