
Merge CalBERT-rewrite with Main #1

Merged
merged 43 commits into main from rewrite
Jul 14, 2022
Changes from all commits
43 commits
6a73b2c
Initial commit
aditeyabaral Oct 24, 2021
6248405
Initial commit
aditeyabaral Oct 24, 2021
defabc9
Update file paths
aditeyabaral Oct 24, 2021
133a8c0
Commented GPT2 models
abaksy Oct 24, 2021
e9f403f
Merge branch 'main' of https://github.com/aditeyabaral/CalBERT into main
abaksy Oct 24, 2021
d8041b8
Add argument parsing
aditeyabaral Oct 25, 2021
503cf70
Add argument parsing
aditeyabaral Oct 25, 2021
b54e5ef
Updated argument parsing description
aditeyabaral Oct 25, 2021
0cd132d
Add pretraining script
aditeyabaral Nov 18, 2021
701612e
Fixed bug in Siamese Pretrain
abaksy Nov 18, 2021
acc6a1a
Added paper resources
Nov 26, 2021
44524e0
removed paper
Nov 26, 2021
28568f0
Update README and clean scripts
aditeyabaral Dec 6, 2021
f85f236
Merge branch 'main' of https://github.com/aditeyabaral/calBERT
aditeyabaral Dec 6, 2021
d47d18c
Code cleanup and update env requirements
aditeyabaral Dec 6, 2021
0264102
Add min token frequency filter
aditeyabaral Dec 11, 2021
317acfc
Added simple GUI
anshsarkar Dec 12, 2021
043d2e5
Updated GUI
anshsarkar Dec 12, 2021
a781856
restructure repo
aditeyabaral Dec 12, 2021
eedadfe
Deploy app
aditeyabaral Dec 12, 2021
cd477ca
Add initial file
aditeyabaral Jun 17, 2022
87fdb27
Add initial file
aditeyabaral Jun 17, 2022
8eece4d
Added initial trainer and dataset files
aditeyabaral Jun 19, 2022
9e7a4be
Added initial trainer and dataset files
aditeyabaral Jun 19, 2022
ba35bb6
Add model training
aditeyabaral Jun 23, 2022
07ac276
Add model training
aditeyabaral Jun 23, 2022
3d53bb7
Add documentation for dataset
aditeyabaral Jun 24, 2022
e70c55c
Add documentation for dataset
aditeyabaral Jun 24, 2022
b9c57a7
Add initial packaging
aditeyabaral Jun 24, 2022
634d13a
Add initial packaging
aditeyabaral Jun 24, 2022
7395e0c
Optimised imports
aditeyabaral Jun 24, 2022
c4cd206
Optimised imports
aditeyabaral Jun 24, 2022
fb3285c
Minor restructure
aditeyabaral Jun 24, 2022
7367547
Minor restructure
aditeyabaral Jun 24, 2022
ed5be40
Created setup.py for packaging
abaksy Jun 27, 2022
7689835
Added License
abaksy Jun 27, 2022
4aa9330
Merge branch 'rewrite' of https://github.com/aditeyabaral/calBERT int…
aditeyabaral Jun 27, 2022
fefe11a
Changed requirements for setup.py
aditeyabaral Jun 28, 2022
bf83adc
Bug fix in.py
aditeyabaral Jun 28, 2022
cef3b8a
Modified content_type
abaksy Jul 12, 2022
ceac827
Updated README
aditeyabaral Jul 13, 2022
4b99145
Merge branch 'rewrite' of https://github.com/aditeyabaral/calBERT int…
aditeyabaral Jul 13, 2022
ad735f5
Add tensorboard logging and checkpointing
aditeyabaral Jul 14, 2022
7 changes: 5 additions & 2 deletions .gitignore
@@ -77,6 +77,9 @@ target/
# Jupyter Notebook
.ipynb_checkpoints

# Pycharm project files
.idea/

# IPython
profile_default/
ipython_config.py
@@ -128,8 +131,8 @@ dmypy.json
# Pyre type checker
.pyre/

# src subdirectories
src/tokenizer_data/
# calbert subdirectories
calbert/tokenizer_data/

# Dataset creator files
chromedriver*
8 changes: 8 additions & 0 deletions LICENSE
@@ -0,0 +1,8 @@
Copyright 2022 CalBERT Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

239 changes: 109 additions & 130 deletions README.md
@@ -1,155 +1,134 @@
# CalBERT
Code-mixed Adaptive Language representations using BERT
# CalBERT - Code-mixed Adaptive Language representations using BERT

CalBERT adapts existing Transformer representations for a language to another language by minimising the semantic space between equivalent sequences in those languages, thus allowing the Transformer to learn representations for the same tokens across two languages.
This repository contains the source code
for [CalBERT - Code-mixed Adaptive Language representations using BERT](http://ceur-ws.org/Vol-3121/short3.pdf),
published at AAAI-MAKE 2022, Stanford University.

CalBERT is language agnostic, and can be used to adapt any language to any other language. It is also task agnostic, and can be fine-tuned on any task.
CalBERT adapts existing Transformer language representations to another, similar language by minimising
the semantic space between equivalent sentences in those languages, thus allowing the Transformer to learn
representations for words across the two languages. It relies on a novel pre-training architecture named Siamese Pre-training to learn task-agnostic and language-agnostic
representations. For more information, please refer to the paper.

# How to use CalBERT
This framework allows you to perform CalBERT's Siamese Pre-training to learn representations for your own data and can be used to obtain dense vector representations for words, sentences or paragraphs. The base models used to
train CalBERT consist of BERT-based Transformer models such as BERT, RoBERTa, XLM, XLNet, DistilBERT, and so on.
CalBERT achieves state-of-the-art results on the SAIL and IIT-P Product Reviews datasets. It is also one of the
few models able to learn code-mixed language representations without traditional pre-training methods, and
is currently one of the few models available for Indian code-mixed languages such as Hinglish.

CalBERT is primarily meant to be used on an existing pre-trained Transformer model. It adapts the embeddings of the language the Transformer was pre-trained in (the base language) to a target language that builds on top of the base language.
# Installation

## Environment setup
We recommend `Python 3.9` or higher for CalBERT.

If you use `conda`, you can create an environment with the following command:
## PyTorch with CUDA

```sh
conda env create -f environment.yml
```

You can also use the `requirements.txt` file to create an environment with the following command:

```sh
conda create -n calbert --file requirements.txt
```

## Data Preparation

The following terms will be used extensively in the following sections:

1. **Base Language**: The single language the Transformer was pre-trained in.
2. **Target Language**: The code-mixed language for which the Transformer will be adapting representations. This language is a superset of the base language, since it builds on top of the base language.

Note that the script used across both languages has to be the same, i.e. Roman (English) for both languages, or French for both languages, and so on.

### Dataset Format

CalBERT requires code-mixed data in the following format.
If you want to use a GPU/CUDA, you must install PyTorch with a matching CUDA version. Follow
[PyTorch - Get Started](https://pytorch.org/get-started/locally/) for details on how to install PyTorch with CUDA.

1. The first column contains the sentence in the base language, such as English
## Install with pip
```bash
pip install calbert
```

2. The second column contains the original sentence in the target language -- the code-mixed language for which the Transformer is trying to adapt representations, such as Hinglish
## Install from source
You can also clone the current version from the [repository](https://github.com/aditeyabaral/calbert) and then directly
install the package.
```bash
pip install -e .
```

Examples of such data are given below:
# Getting Started

| Translation | Transliteration |
|--------------|------------------|
| I am going to the airport today | Main aaj airport jaa raha hoon |
| I really liked this movie | mujhe yeh movie bahut achhi lagi |
Detailed documentation coming soon.

An example of such a dataset is placed in the `data/` directory, named `dataset.csv`.
The following example shows you how to use CalBERT to obtain sentence embeddings.
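
As a rough illustration of that idea, the sketch below shows how sentence embeddings might be obtained once a model is loaded. The `embed_sentences` call is a hypothetical method name used purely for illustration and may not match the actual CalBERT API; the model name is taken from the example later in this README.

```python
from calbert import CalBERT

# Initialise CalBERT on top of a base Transformer (model name from the README example)
model = CalBERT('bert-base-uncased')

# Hypothetical embedding call -- the real method name and return type may differ
sentences = [
    "I am going to the airport today",
    "Main aaj airport jaa raha hoon"
]
embeddings = model.embed_sentences(sentences)  # assumed to return one dense vector per sentence
```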

### Dataset Creation
# Training

The `utils` folder contains scripts that can help you generate code-mixed datasets. The `create_dataset.py` script can be used to create a dataset in the format described above.
This framework also allows you to train your own CalBERT models on your own code-mixed data, so you can learn
embeddings for your custom code-mixed languages. There are various options to choose from in order to get the best
embeddings for your language.

The input to the script is a file containing newline-delimited sentences in either of the following formats:

1. In a code-mixed language (like Hinglish)
2. One of the constituent languages of the code-mixed language *except* the base language (like Hindi).

If your input data is code-mixed, pass `True` for the `--format` flag. Otherwise, pass `False`.

```sh
usage: python create_dataset.py [-h] --input INPUT --output OUTPUT --target TARGET --base BASE --format FORMAT

Create dataset from text file. Ensure that the text file contains newline delimited sentences either in the target language for adaptation, or one of the constituent languages of the code-mixed language
*except* the base language

optional arguments:
  -h, --help            show this help message and exit
  --input INPUT, -i INPUT
                        Input file
  --output OUTPUT, -o OUTPUT
                        Output file as CSV
  --target TARGET, -t TARGET
                        Language code of one of the constituent languages of the code-mixed language except the base language
  --base BASE, -b BASE  Base language code used to originally pre-train Transformer
  --format FORMAT, -f FORMAT
                        Input data format is code-mixed
```

First, initialise a model with the base Transformer
```python
from calbert import CalBERT
model = CalBERT('bert-base-uncased')
```

Example:

```bash
python create_dataset.py \
    --input data/input_code_mixed.txt \
    --output data/dataset.csv \
    --target hi \
    --base en \
    --format True
```

Create a CalBERTDataset using your sentences
```python
from calbert import CalBERTDataset
base_language_sentences = [
    "I am going to Delhi today via flight",
    "This movie is awesome!"
]
target_language_sentences = [
    "Main aaj flight lekar Delhi ja raha hoon.",
    "Mujhe yeh movie bahut awesome lagi!"
]
dataset = CalBERTDataset(base_language_sentences, target_language_sentences)
```


## Siamese Pre-Training

To perform CalBERT's siamese pre-training, you need to use the `siamese_pretraining.py` script inside `src/`. It takes in the following arguments, all of which are self-explanatory.

```bash
usage: python siamese_pretraining.py [-h] --model MODEL --dataset DATASET [--hub HUB] [--loss LOSS] [--batch_size BATCH_SIZE] [--evaluator EVALUATOR] [--evaluator_examples EVALUATOR_EXAMPLES] [--epochs EPOCHS]
[--sample_negative SAMPLE_NEGATIVE] [--sample_size SAMPLE_SIZE] [--username USERNAME] [--password PASSWORD] [--output OUTPUT] [--hub_name HUB_NAME]

Siamese pre-train an existing Transformer model

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL, -m MODEL
                        Transformer model name/path to siamese pre-train
  --dataset DATASET, -d DATASET
                        Path to dataset in required format
  --hub HUB, -hf HUB    Push model to HuggingFace Hub
  --loss LOSS, -l LOSS  Loss function to use -- cosine, contrastive or online_contrastive
  --batch_size BATCH_SIZE, -b BATCH_SIZE
                        Batch size
  --evaluator EVALUATOR, -v EVALUATOR
                        Evaluate as you train
  --evaluator_examples EVALUATOR_EXAMPLES, -ee EVALUATOR_EXAMPLES
                        Number of examples to evaluate
  --epochs EPOCHS, -e EPOCHS
                        Number of epochs
  --sample_negative SAMPLE_NEGATIVE, -s SAMPLE_NEGATIVE
                        Sample negative examples
  --sample_size SAMPLE_SIZE, -ss SAMPLE_SIZE
                        Number of negative examples to sample
  --username USERNAME, -u USERNAME
                        Username for HuggingFace Hub
  --password PASSWORD, -p PASSWORD
                        Password for HuggingFace Hub
  --output OUTPUT, -o OUTPUT
                        Output directory path
  --hub_name HUB_NAME, -hn HUB_NAME
                        Name of the model in the HuggingFace Hub
```

Then create a trainer and train the model
```python
from calbert import SiamesePreTrainer
trainer = SiamesePreTrainer(model, dataset)
trainer.train()
```
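
Once training finishes you will usually want to persist the adapted model for later fine-tuning. The snippet below is only a sketch: `save_model` and its path argument are assumed names for illustration and may not match the package's actual interface.

```python
# Hypothetical save call -- consult the CalBERT documentation for the real method name
trainer.save_model("saved_models/calbert-bert-base-uncased")
```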

Example:

```bash
python siamese_pretraining.py \
--model xlm-roberta-base \
--dataset data/dataset.csv \
--hub False \
--loss cosine \
--batch_size 32 \
--evaluator True \
--evaluator_examples 1000 \
--epochs 10 \
--sample_negative True \
--sample_size 2 \
--output saved_models/calbert-xlm-roberta-base
```

# Performance

Our models achieve state-of-the-art results on the SAIL and IIT-P Product Reviews datasets.

More information will be added soon.

# Application and Uses

This framework can be used for:

- Computing code-mixed as well as plain sentence embeddings
- Obtaining semantic similarities between any two sentences (see the sketch after this list)
- Other textual tasks such as clustering, text summarization, semantic search and many more.
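
As a small illustration of the semantic-similarity use case above, the sketch below compares two sentences with cosine similarity. The `embed_sentences` call is again a hypothetical placeholder for however CalBERT exposes embeddings; the PyTorch cosine-similarity call itself is standard.

```python
import torch
import torch.nn.functional as F

from calbert import CalBERT

model = CalBERT('bert-base-uncased')

# Hypothetical embedding call, as in the earlier sketch
embeddings = model.embed_sentences([
    "I really liked this movie",
    "mujhe yeh movie bahut achhi lagi"
])

# Cosine similarity between the two sentence vectors (assumes a 2 x hidden_size array)
vectors = torch.as_tensor(embeddings)
similarity = F.cosine_similarity(vectors[0], vectors[1], dim=0)
print(float(similarity))
```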

# Citing and Authors

If you find this repository useful, please cite our publication [CalBERT - Code-mixed Adaptive Language representations using BERT](http://ceur-ws.org/Vol-3121/short3.pdf).

```bibtex
@inproceedings{calbert-baral-et-al-2022,
  author    = {Aditeya Baral and
               Aronya Baksy and
               Ansh Sarkar and
               Deeksha D and
               Ashwini M. Joshi},
  editor    = {Andreas Martin and
               Knut Hinkelmann and
               Hans{-}Georg Fill and
               Aurona Gerber and
               Doug Lenat and
               Reinhard Stolle and
               Frank van Harmelen},
  title     = {CalBERT - Code-Mixed Adaptive Language Representations Using {BERT}},
  booktitle = {Proceedings of the {AAAI} 2022 Spring Symposium on Machine Learning
               and Knowledge Engineering for Hybrid Intelligence {(AAAI-MAKE} 2022),
               Stanford University, Palo Alto, California, USA, March 21-23, 2022},
  series    = {{CEUR} Workshop Proceedings},
  volume    = {3121},
  publisher = {CEUR-WS.org},
  year      = {2022},
  url       = {http://ceur-ws.org/Vol-3121/short3.pdf},
  timestamp = {Fri, 22 Apr 2022 14:55:37 +0200}
}
```

The Siamese pre-trained CalBERT model will be saved into the specified output directory as `[model_name]_TRANSFORMER`. This model can now be fine-tuned for any given task.
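
If the saved `[model_name]_TRANSFORMER` directory follows the standard Hugging Face format (an assumption, not something this README states), it can be loaded for downstream fine-tuning with the `transformers` library, for example:

```python
from transformers import AutoModel, AutoTokenizer

# Path is illustrative -- substitute the output directory produced by siamese_pretraining.py
model_path = "saved_models/calbert-xlm-roberta-base/xlm-roberta-base_TRANSFORMER"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# Encode a code-mixed sentence and inspect the token-level representations
inputs = tokenizer("Main aaj airport jaa raha hoon", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```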
# Contact

Please feel free to email us to report any issues or suggestions, or if you have any further questions.

Contact: [Aditeya Baral](https://aditeyabaral.github.io/), [[email protected]](mailto:[email protected])

## Pre-Training from Scratch
You can also contact the other maintainers listed below.

If you would like to pre-train your own Transformer from scratch on Masked-Language-Modelling before performing the Siamese Pre-training, you can use the scripts provided in `src/pretrain-transformer/`.
- [Aronya Baksy](mailto:[email protected])
- [Ansh Sarkar](mailto:[email protected])
- [Deeksha D](mailto:[email protected])
77 changes: 0 additions & 77 deletions app/app.py

This file was deleted.
