This repository is an extension of fairseq to enable training with visual text representations.
For further information, please see:
- Salesky et al. (2021): Robust Open-Vocabulary Translation from Visual Text Representations. In Proceedings of EMNLP 2021.
For the multilingual task, please see the multi branch README.
Our approach replaces the source embedding matrix with visual text representations, computed from rendered text with (optional) convolutions. This creates a 'continuous' vocabulary in place of the fixed-size embedding matrix and takes visual similarity into account, which together improve model robustness. There is no preprocessing before rendering: on the source side, we render raw text directly and slice it into overlapping, fixed-width image tokens.
Given typical parallel text, the data loader renders a complete source sentence and then creates strided slices according to the values of `--image-window` (width) and `--image-stride` (stride). Image height is determined automatically from the font size (`--font-size`), and slices are created using the full image height. This creates a set of image 'tokens' for each sentence, one per slice, each of size 'window width' x 'image height'.
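To make the slicing concrete, here is a minimal sketch of rendering a sentence and cutting it into overlapping slices. It is an illustration only, not the repository's data loader; the use of Pillow, the helper name `render_and_slice`, and the rendering details are assumptions.

```python
# Illustration only: render raw text and slice it into overlapping image tokens.
# Assumes Pillow; the actual logic lives in the visrep data loader.
from PIL import Image, ImageDraw, ImageFont

def render_and_slice(text, font_path, font_size=10, window=30, stride=20):
    # Render the raw text; image height follows from the font size.
    font = ImageFont.truetype(font_path, font_size)
    _, _, width, height = font.getbbox(text)
    image = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(image).text((0, 0), text, fill=0, font=font)

    # Cut the full-height image into fixed-width, overlapping slices:
    # one image 'token' per slice, each of size window x height.
    slices = []
    for x in range(0, max(width - window, 0) + 1, stride):
        slices.append(image.crop((x, 0, x + window, height)))
    return slices

tokens = render_and_slice("Das ist ein Test.",
                          "fairseq/data/visual/fonts/NotoSans-Regular.ttf")
print(len(tokens), tokens[0].size)  # number of image tokens, (window, height)
```

With a window of 30 and a stride of 20, for example, consecutive slices overlap by 10 pixels.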
Because the image tokens are generated entirely in the data loader, typical fairseq training and evaluation code remains largely unchanged. Our VisualTextTransformer (enabled with `--task visual_text`) produces the source representations for training from the rendered text (one per image token). After that, everything proceeds as in normal fairseq.
Installation is the same as for fairseq, plus additional requirements specific to visual text.
Requirements:
- PyTorch version >= 1.5.0
- Python version >= 3.6
- For training new models, you'll also want an NVIDIA GPU and NCCL (for multi-GPU training)
To install and develop locally:
```bash
git clone https://github.com/esalesky/visrep
cd visrep
pip install --editable ./
pip install -r examples/visual_text/requirements.txt
```
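As an optional sanity check after installation (assuming the editable install above succeeded), you can confirm that the visual text task was registered in fairseq:

```python
# Optional check: the visrep install should register the visual_text task.
from fairseq.tasks import TASK_REGISTRY
print("visual_text" in TASK_REGISTRY)  # expected: True
```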
Training and evaluation can be called as with normal fairseq. The following parameters are unique to visrep:
- `--task visual_text`
- `--arch visual_text_transformer`
- `--image-window {VALUE}`
- `--image-stride {VALUE}`
- `--image-font-path {VALUE}` (we have included the NotoSans fonts we used in this repo: see `fairseq/data/visual/fonts/`)
- `--image-embed-normalize`
- `--image-embed-type {VALUE}` (options for the number of convolutional blocks, e.g., `direct`, `1layer`, `2layer`, ...)
Visual text parameters are serialized into saved models and do not need to be specified at inference time.
Image samples can also optionally be written to the `MODELDIR/samples/` subdirectory using `--image-samples-path` (directory to write to) and `--image-samples-interval N` (write every Nth image).
You can interact with models on the command line or through a script in the same way as with typical fairseq models, for example:
```bash
echo "Ich bin ein robustes Model" | PYTHONPATH=/exp/esalesky/visrep python -m fairseq_cli.interactive ./ --task 'visual_text' --path ./checkpoint_best.pt -s de -t en --target-dict dict.en.txt --beam 5
```
Check out some of our models on Zenodo.
```bash
# Download model, spm, and dict files from Zenodo
wget https://zenodo.org/record/5770933/files/de-en.zip
unzip de-en.zip
```

```python
# Load the model in python
from fairseq.models.visual import VisualTextTransformerModel
model = VisualTextTransformerModel.from_pretrained(
    checkpoint_file='de-en/checkpoint_best.pt',
    target_dict='de-en/dict.en.txt',
    target_spm='de-en/spm_en.model',
    src='de',
    image_font_path='fairseq/data/visual/fonts/NotoSans-Regular.ttf'
)
model.eval()  # disable dropout (or leave in train mode to finetune)

# Translate
model.translate("Das ist ein Test.")
# 'This is a test.'
```
In addition to running on raw text, you may want to preprocess (binarize) the data for larger experiments. This can be done as normal using fairseq preprocessing, but with the necessary visual text parameters, as below, and then passing `--dataset-impl mmap` instead of `--dataset-impl raw` during training. You may point to prepped (BPE'd) data for source and target here: it will be de-BPE'd on the source side before rendering.
```bash
WINDOW=30 STRIDE=20 FONTSIZE=10 /exp/esalesky/visrep/grid_scripts/preprocess.sh /exp/esalesky/visrep/data/de-en/5k /exp/esalesky/visrep/exp/bin de en --image-samples-interval 100000
```
Best visual text parameters:
- MTTT
  - ar-en: 1layer, window 27, stride 10, fontsize 14, batch 20k
  - de-en: 1layer, window 20, stride 5, fontsize 10, batch 20k
  - fr-en: 1layer, window 15, stride 10, fontsize 10, batch 20k
  - ko-en: 1layer, window 25, stride 8, fontsize 12, batch 20k
  - ja-en: 1layer, window 25, stride 8, fontsize 10, batch 20k
  - ru-en: 1layer, window 20, stride 10, fontsize 10, batch 20k
  - zh-en: 1layer, window 30, stride 6, fontsize 10, batch 20k
- WMT (filtered)
  - de-en: direct, window 30, stride 20, fontsize 8, batch 40k
  - zh-en: direct, window 25, stride 10, fontsize 8, batch 40k
We include our grid scripts, which use the UGE scheduler, in `grid_scripts`. These include `*.qsub`, `train.sh`, `train-big.sh`, `translate.sh`, `translate-big.sh`, and `translate-all-testsets.sh` (to bulk-queue translation of multiple test sets). The `.sh` scripts contain the hyperparameters for the small (MTTT) and larger datasets.
Example:
```bash
export lang=fr; export window=25; export stride=10;
qsub train.qsub /exp/esalesky/visrep/exp/$lang-en/1layernorm.window$window.stride$stride.fontsize10.batch20k $lang en --image-font-size 10 --image-window $window --image-stride $stride --image-embed-type 1layer --update-freq 2
```
The main visual text components are the following:
- The visual text task: handles data loading, instantiates the model for training, and creates the data for inference.
- `fairseq/data/visual/visual_text_dataset.py`: Creates a visual text dataset object for fairseq.
- `fairseq/data/visual/image_generator.py`: Loads the raw data and generates images from text. To generate individual samples from `image_generator.py` directly, it can be called like so: `./image_generator.py --font-size 10 --font-file fonts/NotoSans-Regular.ttf --text "This is a sentence." --prefix english --window 25 --stride 10`. `combine.sh` in the same directory can combine the slices into a single image to visualize what the image tokens for a sentence look like (as in Table 6 in the paper).
- `fairseq/models/visual/visual_transformer.py` (note: `fairseq/models/visual_transformer.py` is UNUSED): Creates the VisualTextTransformerModel, which has a VisualTextTransformerEncoder and a normal decoder. The only thing unique to this encoder is that it calls `self.cnn_embedder` to create the source representations; a toy sketch of this idea follows this list.
- There may be additional obsolete visual files in the repository.
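The sketch below shows the general idea behind a convolutional visual embedder: image tokens of size height x window are mapped to d_model-sized vectors that the encoder consumes as source embeddings. It is a hypothetical stand-in under assumed shapes and layer choices, not the repository's `cnn_embedder`.

```python
import torch
import torch.nn as nn

class ToyVisualEmbedder(nn.Module):
    """Hypothetical stand-in for a convolutional visual embedder:
    maps image tokens (height x window pixels) to d_model vectors."""

    def __init__(self, height=20, window=30, d_model=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.proj = nn.Linear(64 * (height // 2) * (window // 2), d_model)

    def forward(self, image_tokens):
        # image_tokens: (batch, src_len, height, window), one slice per position
        b, s, h, w = image_tokens.shape
        x = image_tokens.view(b * s, 1, h, w)   # add a channel dimension
        x = self.conv(x).flatten(start_dim=1)   # (b*s, 64 * h/2 * w/2)
        x = self.proj(x)                        # (b*s, d_model)
        return x.view(b, s, -1)                 # source embeddings for the encoder

# Example: 2 sentences, 7 image tokens each, 20x30 pixels per token
embeddings = ToyVisualEmbedder()(torch.rand(2, 7, 20, 30))
print(embeddings.shape)  # torch.Size([2, 7, 512])
```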
We induced five types of noise, as below:
- swap: swaps two adjacent characters per token; applies to words of length >=2 (Arabic, French, German, Korean, Russian)
- cmabrigde: permutes word-internal characters, leaving the first and last characters unchanged; applies to words of length >=4 (Arabic, French, German, Korean, Russian)
- diacritization: adds diacritics, applied via camel-tools (Arabic)
- unicode: substitutes visually similar Latin characters for Cyrillic characters (Russian)
- l33tspeak: substitutes numbers or other visually similar characters for Latin characters (French, German)
The scripts to induce noise are in `scripts/visual_text`, where `-p` is the per-token probability of inducing noise; they can be run as below. In our paper we use p from 0.1 to 1.0, in intervals of 0.1.
```bash
cat test.de-en.de | python3 scripts/visual_text/swap.py -p 0.1 > visual/test-sets/swap_10.de-en.de
cat test.ko-en.ko | python3 scripts/visual_text/cmabrigde.py -p 0.1 > visual/test-sets/cam_10.ko-en.ko
cat test.ar-en.ar | python3 scripts/visual_text/diacritization.py -p 0.1 > visual/test-sets/dia_10.ar-en.ar
cat test.ru-en.ru | python3 scripts/visual_text/cyrillic_noise.py -p 0.1 > visual/test-sets/cyr_10.ru-en.ru
cat test.fr-en.fr | python3 scripts/visual_text/l33t.py -p 0.1 > visual/test-sets/l33t_10.fr-en.fr
```
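For reference, here is a minimal sketch of the swap-style noise described above. It follows the description (swap two adjacent characters in a token with probability p, only for tokens of length >=2) but is not a copy of `scripts/visual_text/swap.py`.

```python
# Minimal sketch of swap noise: with probability p per token, swap two
# adjacent characters. Reads lines from stdin, writes noised lines to stdout.
import random
import sys

def swap_noise(line, p=0.1, rng=random):
    out = []
    for tok in line.split():
        if len(tok) >= 2 and rng.random() < p:
            i = rng.randrange(len(tok) - 1)                    # swap position
            tok = tok[:i] + tok[i + 1] + tok[i] + tok[i + 2:]
        out.append(tok)
    return " ".join(out)

if __name__ == "__main__":
    for line in sys.stdin:
        print(swap_noise(line.rstrip("\n"), p=0.1))
```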
fairseq(-py) is MIT-licensed.
Please cite as:
```bibtex
@inproceedings{salesky-etal-2021-robust,
    title = "Robust Open-Vocabulary Translation from Visual Text Representations",
    author = "Salesky, Elizabeth and
      Etter, David and
      Post, Matt",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2104.08211",
}

@inproceedings{ott2019fairseq,
    title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
    author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
    booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
    year = {2019},
}
```