GitHub - malllabiisc/DiPS: NAACL 2019: Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

Source code for NAACL 2019 paper: Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation

Overview of DiPS during decoding to generate k paraphrases. At each time step, a set of N sequences V^(t) is used to determine k < N sequences (X^∗) via submodular maximization. The figure illustrates the motivation behind each submodular component; see Section 4 of the paper for details.
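To make the selection step concrete, here is a minimal sketch of greedy subset selection, the standard way to approximately maximize a monotone submodular function under a size constraint. It is not the repository's submodopt implementation: the function names and the bigram-coverage objective are assumptions chosen for illustration, and the actual objective is the one defined in Section 4 of the paper.

```python
from typing import Callable, List, Sequence


def greedy_select(
    candidates: Sequence[str],
    k: int,
    objective: Callable[[List[str]], float],
) -> List[str]:
    """Pick up to k candidates, each time adding the one with the largest marginal gain."""
    selected: List[str] = []
    remaining = list(candidates)
    for _ in range(min(k, len(remaining))):
        base = objective(selected)
        # Marginal gain of c is objective(selected + [c]) - objective(selected).
        best = max(remaining, key=lambda c: objective(selected + [c]) - base)
        selected.append(best)
        remaining.remove(best)
    return selected


def bigram_coverage(selected: List[str]) -> float:
    """Toy objective (an assumption): number of distinct word bigrams covered."""
    covered = set()
    for sentence in selected:
        tokens = sentence.split()
        covered.update(zip(tokens, tokens[1:]))
    return float(len(covered))


# Hypothetical beam of N = 4 candidate sequences; we keep k = 2 of them.
beam = [
    "how do i learn python",
    "how do i learn python fast",
    "what is the best way to study python",
    "how can i learn python",
]
print(greedy_select(beam, k=2, objective=bigram_coverage))
```

With this toy objective the sketch first keeps the longest candidate, then the one that adds the most bigrams not yet covered, so near-duplicates of an already selected sequence contribute little.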

Also on GEM/NL-Augmenter 🦎 → 🐍

For the transformer-model version, please see diverse_paraphrase in NL-Augmenter.

Dependencies

Compatible with Python 3.6.
Dependencies can be installed using requirements.txt.

Dataset

Download the following datasets:

Extract the datasets and place them in the data directory, under data/<dataset-folder-name>. A sample dataset folder might look like data/quora/<train/test/val>/<src.txt/tgt.txt>.
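Before training, it can help to confirm that the folder matches the layout described above. The following is a small sketch, assuming the three splits and two file names from the sample path; the helper name missing_dataset_files is hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List

# Split and file names taken from the sample path in this README.
SPLITS = ("train", "test", "val")
FILES = ("src.txt", "tgt.txt")


def missing_dataset_files(data_dir: str, dataset: str) -> List[Path]:
    """Return the expected files that are absent under data_dir/dataset."""
    root = Path(data_dir) / dataset
    return [
        root / split / name
        for split in SPLITS
        for name in FILES
        if not (root / split / name).is_file()
    ]
```

Calling missing_dataset_files("data", "quora") from the repository root returns an empty list when all six files are in place.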

Download GoogleNews-vectors-negative300.bin.gz into the data directory. In case the above link doesn't work, find the zip file here

Setup:

To get the project's source code, clone the GitHub repository:

$ git clone https://github.com/malllabiisc/DiPS

Install VirtualEnv using the following (optional):

$ [sudo] pip install virtualenv

Create and activate your virtual environment (optional):

$ virtualenv -p python3 venv
$ source venv/bin/activate

Install all the required packages:

$ pip install -r requirements.txt

Install the submodopt package by running the following commands from the root directory of the repository:

$ cd ./packages/submodopt
$ python setup.py install
$ cd ../../
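To check that the installation worked, you can ask Python whether the package is importable from the active environment. This sketch assumes the import name is submodopt, matching the folder name; if the package installs under a different name, adjust accordingly. The helper is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# "submodopt" is assumed to be the import name of packages/submodopt.
print("submodopt importable:", is_installed("submodopt"))
```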

Training the sequence to sequence model

python -m src.main -mode train -gpu 0 -use_attn -bidirectional -dataset quora -run_name <run_name>

Create dictionary for submodular subset selection. Used for Semantic similarity (L₂)

To use trained embeddings:

python -m src.create_dict -model trained -run_name <run_name> -gpu 0

To use pretrained word2vec embeddings:

python -m src.create_dict -model pretrained -run_name <run_name> -gpu 0

This will generate the word2vec.pickle file in data/embeddings.
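If you want to inspect the generated file, a sketch like the one below loads it with the standard pickle module. The internal structure of the pickled object is not documented here, so the code only reports its type and size rather than assuming a word-to-vector mapping; both helper names are hypothetical. Only unpickle files you generated yourself, since pickle can execute arbitrary code on load.

```python
import pickle
from pathlib import Path
from typing import Any


def load_embedding_dict(path: str = "data/embeddings/word2vec.pickle") -> Any:
    """Unpickle the file written by src.create_dict and return the object as-is."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


def describe(obj: Any) -> str:
    """Summarise the loaded object without assuming its exact structure."""
    size = len(obj) if hasattr(obj, "__len__") else "unknown"
    return f"type={type(obj).__name__}, entries={size}"
```

For example, print(describe(load_embedding_dict())) run from the repository root shows what kind of object the dictionary step produced.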

Decoding using submodularity

python -m src.main -mode decode -selec submod -run_name <run_name> -beam_width 10 -gpu 0
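The three stages (training, dictionary creation, decoding) share the same run name, so it can be convenient to generate the commands together. The sketch below only assembles the argument lists shown in this README; it introduces no new flags, and the helper name dips_commands is hypothetical.

```python
import shlex
from typing import List


def dips_commands(
    run_name: str,
    dataset: str = "quora",
    gpu: int = 0,
    embeddings: str = "pretrained",
    beam_width: int = 10,
) -> List[List[str]]:
    """Build the train, create_dict and decode commands for one run."""
    if embeddings not in ("trained", "pretrained"):
        raise ValueError("embeddings must be 'trained' or 'pretrained'")
    train = ["python", "-m", "src.main", "-mode", "train", "-gpu", str(gpu),
             "-use_attn", "-bidirectional", "-dataset", dataset,
             "-run_name", run_name]
    create_dict = ["python", "-m", "src.create_dict", "-model", embeddings,
                   "-run_name", run_name, "-gpu", str(gpu)]
    decode = ["python", "-m", "src.main", "-mode", "decode", "-selec", "submod",
              "-run_name", run_name, "-beam_width", str(beam_width),
              "-gpu", str(gpu)]
    return [train, create_dict, decode]


# Print the commands for a hypothetical run called "demo".
for command in dips_commands("demo"):
    print(" ".join(shlex.quote(part) for part in command))
```

Each list can be passed to subprocess.run(command, check=True) from the repository root to execute the stages in order.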

Citation

Please cite the following paper if you find this work relevant to your application:

@inproceedings{dips2019,
    title = "Submodular Optimization-based Diverse Paraphrasing and its Effectiveness in Data Augmentation",
    author = "Kumar, Ashutosh  and
      Bhattamishra, Satwik  and
      Bhandari, Manik  and
      Talukdar, Partha",
    booktitle = "Proceedings of the 2019 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)",
    month = jun,
    year = "2019",
    address = "Minneapolis, Minnesota",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/N19-1363",
    pages = "3609--3619"
}

For any clarifications, comments, or suggestions, please create an issue or contact [email protected] or Satwik Bhattamishra.

