Spellchecking ASR customization model (#6179)
* bug fixes

Signed-off-by: Alexandra Antonova <[email protected]>

* fix bugs, add preparation and evaluation scripts, add readme

Signed-off-by: Alexandra Antonova <[email protected]>

* small fixes

Signed-off-by: Alexandra Antonova <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add real coverage calculation, small fixes, more debug information

Signed-off-by: Alexandra Antonova <[email protected]>

* add option to pass a filelist and output folder - to handle inference from multiple input files

Signed-off-by: Alexandra Antonova <[email protected]>

* added preprocessing for yago wikipedia articles - finding yago entities and their subphrases

Signed-off-by: Alexandra Antonova <[email protected]>

* yago wiki preprocessing, sampling, pseudonormalization

Signed-off-by: Alexandra Antonova <[email protected]>

* more scripts for preparation of training examples

Signed-off-by: Alexandra Antonova <[email protected]>

* bug fixes

Signed-off-by: Alexandra Antonova <[email protected]>

* add some alphabet checks

Signed-off-by: Alexandra Antonova <[email protected]>

* add bert on subwords, concatenate it to bert on characters

Signed-off-by: Alexandra Antonova <[email protected]>

* add calculation of character_pos_to_subword_pos

Signed-off-by: Alexandra Antonova <[email protected]>

* bug fix

Signed-off-by: Alexandra Antonova <[email protected]>

* bug fix

Signed-off-by: Alexandra Antonova <[email protected]>

* pdb

Signed-off-by: Alexandra Antonova <[email protected]>

* tensor join bug fix

Signed-off-by: Alexandra Antonova <[email protected]>

* double hidden_size in classifier

Signed-off-by: Alexandra Antonova <[email protected]>

* pdb

Signed-off-by: Alexandra Antonova <[email protected]>

* default index value 0 instead of -1 because index cannot be negative

Signed-off-by: Alexandra Antonova <[email protected]>

* pad index value 0 instead of -1 because index cannot be negative

Signed-off-by: Alexandra Antonova <[email protected]>

* remove pdb

Signed-off-by: Alexandra Antonova <[email protected]>

* fix bugs, add creation of tarred dataset

Signed-off-by: Alexandra Antonova <[email protected]>

* add possibility to change sequence len at inference

Signed-off-by: Alexandra Antonova <[email protected]>

* change sampling of dummy candidates at inference, add candidate info file

Signed-off-by: Alexandra Antonova <[email protected]>

* fix import

Signed-off-by: Alexandra Antonova <[email protected]>

* fix bug

Signed-off-by: Alexandra Antonova <[email protected]>

* update transcription now uses info

Signed-off-by: Alexandra Antonova <[email protected]>

* write path

Signed-off-by: Alexandra Antonova <[email protected]>

* 1. add tarred dataset support(untested). 2. fix bug with ban_ngrams in indexing

Signed-off-by: Alexandra Antonova <[email protected]>

* skip short_sent if no real candidates

Signed-off-by: Alexandra Antonova <[email protected]>

* fix import

Signed-off-by: Alexandra Antonova <[email protected]>

* add braceexpand

Signed-off-by: Alexandra Antonova <[email protected]>

* fixes

Signed-off-by: Alexandra Antonova <[email protected]>

* fix bug

Signed-off-by: Alexandra Antonova <[email protected]>

* fix bug

Signed-off-by: Alexandra Antonova <[email protected]>

* fix bug in np.ones

Signed-off-by: Alexandra Antonova <[email protected]>

* fix bug in collate

Signed-off-by: Alexandra Antonova <[email protected]>

* change tensor type to long because of error in torch.gather

Signed-off-by: Alexandra Antonova <[email protected]>

* fix for empty spans tensor

Signed-off-by: Alexandra Antonova <[email protected]>

* same fixes in _collate_fn for tarred dataset

Signed-off-by: Alexandra Antonova <[email protected]>

* fix bug from previous commit

Signed-off-by: Alexandra Antonova <[email protected]>

* change int types to be shorter to minimize tar size

Signed-off-by: Alexandra Antonova <[email protected]>

* refactoring of datasets and inference

Signed-off-by: Alexandra Antonova <[email protected]>

* bug fix

Signed-off-by: Alexandra Antonova <[email protected]>

* bug fix

Signed-off-by: Alexandra Antonova <[email protected]>

* bug fix

Signed-off-by: Alexandra Antonova <[email protected]>

* tar by 100k examples, small fixes

Signed-off-by: Alexandra Antonova <[email protected]>

* small fixes, add analytics script

Signed-off-by: Alexandra Antonova <[email protected]>

* Add functions for dynamic programming comparison to get best path by ngrams

Signed-off-by: Alexandra Antonova <[email protected]>

* fixes

Signed-off-by: Alexandra Antonova <[email protected]>

* small fix

Signed-off-by: Alexandra Antonova <[email protected]>

* fixes to support testing on SPGISpeech

Signed-off-by: Alexandra Antonova <[email protected]>

* add preprocessing for userlibri

Signed-off-by: Alexandra Antonova <[email protected]>

* some refactoring

Signed-off-by: Alexandra Antonova <[email protected]>

* some refactoring

Signed-off-by: Alexandra Antonova <[email protected]>

* move some functions to utils to reuse from other project

Signed-off-by: Alexandra Antonova <[email protected]>

* move some functions to utils to reuse from other project

Signed-off-by: Alexandra Antonova <[email protected]>

* move some functions to utils to reuse from other project

Signed-off-by: Alexandra Antonova <[email protected]>

* small refactoring before pr. Add bash-scripts reproducing evaluation

Signed-off-by: Alexandra Antonova <[email protected]>

* style fix

Signed-off-by: Alexandra Antonova <[email protected]>

* small fixes in inference

Signed-off-by: Alexandra Antonova <[email protected]>

* bug fix - didn't move window on last symbol

Signed-off-by: Alexandra Antonova <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bug - shuffle was before truncation of sorted candidates

Signed-off-by: Alexandra Antonova <[email protected]>

* refactoring, fix some bugs

Signed-off-by: Alexandra Antonova <[email protected]>

* various fixes. Add word_indices at inference

Signed-off-by: Alexandra Antonova <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add candidate positions

Signed-off-by: Alexandra Antonova <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Move data preparation and evaluation to other repo

Signed-off-by: Alexandra Antonova <[email protected]>

* add infer_reproduce_paper. Refactoring

Signed-off-by: Alexandra Antonova <[email protected]>

* refactor inference using fragment indices

Signed-off-by: Alexandra Antonova <[email protected]>

* add some helper functions

Signed-off-by: Alexandra Antonova <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bug with parameters order

Signed-off-by: Alexandra Antonova <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bugs

Signed-off-by: Alexandra Antonova <[email protected]>

* refactoring, fix bug

Signed-off-by: Alexandra Antonova <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add multiple variants of adjusting start/end positions

Signed-off-by: Alexandra Antonova <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* more fixes

Signed-off-by: Alexandra Antonova <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add unit tests, other fixes

Signed-off-by: Alexandra Antonova <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Alexandra Antonova <[email protected]>

* fix CodeQl warnings

Signed-off-by: Alexandra Antonova <[email protected]>

* add script for full inference pipeline, refactoring

Signed-off-by: Alexandra Antonova <[email protected]>

* add tutorial

Signed-off-by: Alexandra Antonova <[email protected]>

* take example data from HuggingFace

Signed-off-by: Alexandra Antonova <[email protected]>

* add docs

Signed-off-by: Alexandra Antonova <[email protected]>

* fix comment

Signed-off-by: Alexandra Antonova <[email protected]>

* fix bug

Signed-off-by: Alexandra Antonova <[email protected]>

* small fixes for PR

Signed-off-by: Alexandra Antonova <[email protected]>

* add some more tests

Signed-off-by: Alexandra Antonova <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* try to fix tests adding with_downloads

Signed-off-by: Alexandra Antonova <[email protected]>

* skip tests with tokenizer download

Signed-off-by: Alexandra Antonova <[email protected]>

---------

Signed-off-by: Alexandra Antonova <[email protected]>
Co-authored-by: Alexandra Antonova <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 people authored Jun 2, 2023
1 parent cfbe092 commit 5428a97
Showing 33 changed files with 6,459 additions and 206 deletions.
1 change: 1 addition & 0 deletions docs/source/nlp/models.rst
@@ -9,6 +9,7 @@ NeMo's NLP collection supports provides the following task-specific models:
:maxdepth: 1

punctuation_and_capitalization_models
spellchecking_asr_customization
token_classification
joint_intent_slot
text_classification
128 changes: 128 additions & 0 deletions docs/source/nlp/spellchecking_asr_customization.rst
@@ -0,0 +1,128 @@
.. _spellchecking_asr_customization:

SpellMapper (Spellchecking ASR Customization) Model
=====================================================

SpellMapper is a non-autoregressive model for postprocessing of ASR output. It gets as input a single ASR hypothesis (text) and a custom vocabulary and predicts which fragments in the ASR hypothesis should be replaced by which custom words/phrases if any. Unlike traditional spellchecking approaches, which aim to correct known words using language models, SpellMapper's goal is to correct highly specific user terms, out-of-vocabulary (OOV) words or spelling variations (e.g., "John Koehn", "Jon Cohen").

This model is an alternative to word boosting/shallow fusion approaches:

- does not require retraining the ASR model;
- does not require beam search or a language model (LM);
- can be applied on top of the output of any English ASR model.

Model Architecture
------------------
Though SpellMapper is based on `BERT <https://arxiv.org/abs/1810.04805>`__ :cite:`nlp-ner-devlin2018bert` architecture, it uses some non-standard tricks that make it different from other BERT-based models:

- ten separators (``[SEP]`` tokens) are used to combine the ASR hypothesis and ten candidate phrases into a single input;
- the model works at the character level;
- subword embeddings are concatenated to the embeddings of each character that belongs to that subword.
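
The input layout described in the list above can be sketched in a few lines of Python. This is only an illustration, not the NeMo implementation: the function name ``build_input`` is hypothetical, and the choice to give each ``[SEP]`` the segment id of the candidate that follows it is an assumption read off the example in this section.

```python
def build_input(hypothesis: str, candidates: list[str]) -> tuple[list[str], list[int]]:
    """Lay out '[CLS] hyp [SEP] cand1 [SEP] cand2 ...' at character level.

    Hypothetical helper for illustration; not part of the NeMo API.
    """

    def to_chars(text: str) -> list[str]:
        # Spaces become "_" so that every character is its own token.
        return list(text.replace(" ", "_"))

    tokens = ["[CLS]"] + to_chars(hypothesis)
    segments = [0] * len(tokens)  # the hypothesis is segment 0
    for cand_id, cand in enumerate(candidates, start=1):
        # Assumption: the separator shares the segment id of the candidate after it.
        piece = ["[SEP]"] + to_chars(cand)
        tokens += piece
        segments += [cand_id] * len(piece)
    return tokens, segments


tokens, segments = build_input(
    "astronomers didie somon and tristian gllo",
    ["didier saumon", "astronomie", "tristan guillot"],
)
```

With all ten candidates the same loop produces the ten separators mentioned above.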

.. code::

    Example input:   [CLS] a s t r o n o m e r s _ d i d i e _ s o m o n _ a n d _ t r i s t i a n _ g l l o [SEP] d i d i e r _ s a u m o n [SEP] a s t r o n o m i e [SEP] t r i s t a n _ g u i l l o t [SEP] ...
    Input segments:  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4
    Example output:  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 0 ...

For each character, the model computes logits over 11 labels:

- ``0`` - character doesn't belong to any candidate,
- ``1..10`` - character belongs to candidate with this id.

At inference, average pooling is applied to compute the replacement probability for each whole fragment.
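
A minimal sketch of this pooling step is shown below. It assumes that the per-character logits are first turned into probabilities with a softmax and that the probability of one candidate id is then averaged over the characters of the fragment; whether the actual model pools probabilities or raw logits is not stated here, so treat that choice, and the function names, as assumptions.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one character's 11 label logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fragment_probability(char_logits: list[list[float]], start: int, end: int, candidate_id: int) -> float:
    """Average the probability of candidate_id over characters in [start, end).

    Hypothetical helper: pooling over softmax probabilities is an assumption.
    """
    if start >= end:
        raise ValueError("fragment must contain at least one character")
    probs = [softmax(row)[candidate_id] for row in char_logits[start:end]]
    return sum(probs) / len(probs)


# Four characters with uniform logits: every label gets probability 1/11.
uniform = [[0.0] * 11 for _ in range(4)]
uniform_prob = fragment_probability(uniform, 0, 4, 3)

# Four characters that strongly prefer candidate 8.
peaked = [[10.0 if label == 8 else 0.0 for label in range(11)] for _ in range(4)]
peaked_prob = fragment_probability(peaked, 0, 4, 8)
```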

Quick Start Guide
-----------------

We recommend trying this model in a Jupyter notebook (a GPU is required):
`NeMo/tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb <https://github.com/NVIDIA/NeMo/blob/stable/tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb>`__.

A pretrained English checkpoint can be found at `HuggingFace <https://huggingface.co/bene-ges/spellmapper_asr_customization_en>`__.

An example inference pipeline can be found here: `NeMo/examples/nlp/spellchecking_asr_customization/run_infer.sh <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_infer.sh>`__.

An example script on how to train the model can be found here: `NeMo/examples/nlp/spellchecking_asr_customization/run_training.sh <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_training.sh>`__.

An example script on how to train on large datasets can be found here: `NeMo/examples/nlp/spellchecking_asr_customization/run_training_tarred.sh <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_training_tarred.sh>`__.

The default configuration file for the model can be found here: `NeMo/examples/nlp/spellchecking_asr_customization/conf/spellchecking_asr_customization_config.yaml <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/conf/spellchecking_asr_customization_config.yaml>`__.

.. _dataset_spellchecking_asr_customization:

Input/Output Format at Inference stage
--------------------------------------
Here we describe input/output format of the SpellMapper model.

.. note::

    If you use the `inference pipeline <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_infer.sh>`__, this format is handled internally: you only need to provide an input manifest and a user vocabulary, and you get a corrected manifest back.

An input line should consist of 4 tab-separated columns:
1. text of ASR-hypothesis
2. texts of 10 candidates separated by semicolon
3. 1-based ids of non-dummy candidates, separated by space
4. approximate start/end coordinates of non-dummy candidates (correspond to ids in third column)
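
The four columns above can be read with a small parser. The sketch below is illustrative only: ``InferenceInput`` and ``parse_input_line`` are hypothetical names, and it assumes that the n-th span in the fourth column belongs to the n-th id in the third column.

```python
from typing import NamedTuple


class InferenceInput(NamedTuple):
    hypothesis: str                     # space-separated characters, "_" marks a word boundary
    candidates: list[str]               # ten candidate phrases
    spans: list[tuple[int, int, int]]   # (candidate_id, start, end) for non-dummy candidates


def parse_input_line(line: str) -> InferenceInput:
    """Split one tab-separated input line into its four columns (hypothetical helper)."""
    hypothesis, cand_col, id_col, span_col = line.rstrip("\n").split("\t")
    candidates = cand_col.split(";")
    cand_ids = [int(x) for x in id_col.split()]
    spans = []
    # Assumption: spans are listed in the same order as the ids in column 3.
    for cand_id, item in zip(cand_ids, span_col.split(";")):
        _semiotic_class, start, end = item.split()
        spans.append((cand_id, int(start), int(end)))
    return InferenceInput(hypothesis, candidates, spans)


# A shortened, made-up line with two candidates instead of ten.
example = "t h e _ t a r\th e p;u r a\t1 2\tCUSTOM 0 3;CUSTOM 4 7\n"
parsed = parse_input_line(example)
```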

Example input (in one line):

.. code::

    t h e _ t a r a s i c _ o o r d a _ i s _ a _ p a r t _ o f _ t h e _ a o r t a _ l o c a t e d _ i n _ t h e _ t h o r a x
    h e p a t i c _ c i r r h o s i s;u r a c i l;c a r d i a c _ a r r e s t;w e a n;a p g a r;p s y c h o m o t o r;t h o r a x;t h o r a c i c _ a o r t a;a v f;b l o c k a d e d
    1 2 6 7 8 9 10
    CUSTOM 6 23;CUSTOM 4 10;CUSTOM 4 15;CUSTOM 56 62;CUSTOM 5 19;CUSTOM 28 31;CUSTOM 39 48

Each line in SpellMapper output is tab-separated and consists of 4 columns:
1. ASR-hypothesis (same as in input)
2. 10 candidates separated by semicolon (same as in input)
3. fragment predictions, separated by semicolon, each prediction is a tuple (start, end, candidate_id, probability)
4. letter predictions - candidate_id predicted for each letter (this is only for debug purposes)
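
As an illustration of how the third output column might be consumed, the sketch below parses the fragment predictions and keeps those above a probability threshold. The helper names and the threshold value of 0.5 are assumptions made for this example; the real inference pipeline applies its own post-processing.

```python
def parse_fragment_predictions(column: str) -> list[tuple[int, int, int, float]]:
    """Parse 'start end candidate_id probability' tuples separated by semicolons."""
    predictions = []
    for item in column.split(";"):
        if not item.strip():
            continue  # tolerate an empty column
        start, end, candidate_id, probability = item.split()
        predictions.append((int(start), int(end), int(candidate_id), float(probability)))
    return predictions


def confident_predictions(predictions, threshold: float = 0.5):
    """Keep predictions whose probability reaches the (hypothetical) threshold."""
    return [p for p in predictions if p[3] >= threshold]


column = "56 62 7 0.99998;4 20 8 0.95181;12 20 8 0.44829;4 17 8 0.99464;12 17 8 0.97645"
predictions = parse_fragment_predictions(column)
kept = confident_predictions(predictions)
```

Note that several predictions may overlap and point to the same candidate, so a consumer still has to decide which of them to apply.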

Example output (in one line):

.. code::

    t h e _ t a r a s i c _ o o r d a _ i s _ a _ p a r t _ o f _ t h e _ a o r t a _ l o c a t e d _ i n _ t h e _ t h o r a x
    h e p a t i c _ c i r r h o s i s;u r a c i l;c a r d i a c _ a r r e s t;w e a n;a p g a r;p s y c h o m o t o r;t h o r a x;t h o r a c i c _ a o r t a;a v f;b l o c k a d e d
    56 62 7 0.99998;4 20 8 0.95181;12 20 8 0.44829;4 17 8 0.99464;12 17 8 0.97645
    8 8 8 0 8 8 8 8 8 8 8 8 8 8 8 8 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 7 7 7 7 7

Training Data Format
--------------------

For training, the data should consist of 5 files:

- ``config.json`` - BERT config
- ``label_map.txt`` - labels from 0 to 10, do not change
- ``semiotic_classes.txt`` - currently there are only two classes: ``PLAIN`` and ``CUSTOM``, do not change
- ``train.tsv`` - training examples
- ``test.tsv`` - validation examples

Note that since all these examples are synthetic, we do not reserve a set for final testing. Instead, we run the `inference pipeline <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_infer.sh>`__ and compare the resulting word error rate (WER) to the WER of the baseline ASR output.

One (non-tarred) training example should consist of 4 tab-separated columns:
1. text of ASR-hypothesis
2. texts of 10 candidates separated by semicolon
3. 1-based ids of correct candidates, separated by space, or 0 if none
4. start/end coordinates of correct candidates (correspond to ids in third column)
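
To make the relation between these columns and the per-character labels concrete, here is a small sketch that expands columns 3 and 4 into one label per hypothesis character. It assumes 0-based character positions with an exclusive end, counted without the ``[CLS]`` token, which is how the example in this section lines up; ``char_labels`` is a hypothetical name, not a NeMo function.

```python
def char_labels(hypothesis: str, id_column: str, span_column: str) -> list[int]:
    """Expand (ids, spans) into one label per hypothesis character.

    Assumptions: positions are 0-based, the end is exclusive, and the
    n-th span belongs to the n-th id.
    """
    n_chars = len(hypothesis.split())  # characters are space-separated
    labels = [0] * n_chars             # 0 = character belongs to no candidate
    cand_ids = [int(x) for x in id_column.split()]
    if cand_ids == [0]:                # "0" in column 3 means no correct candidates
        return labels
    for cand_id, item in zip(cand_ids, span_column.split(";")):
        _semiotic_class, start, end = item.split()
        for pos in range(int(start), int(end)):
            labels[pos] = cand_id
    return labels


hyp = "a s t r o n o m e r s _ d i d i e _ s o m o n _ a n d _ t r i s t i a n _ g l l o"
labels = char_labels(hyp, "1 3", "CUSTOM 12 23;CUSTOM 28 41")
```

Here the fragment ``d i d i e _ s o m o n`` receives label 1 and ``t r i s t i a n _ g l l o`` receives label 3; all other characters receive label 0.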

Example (in one line):

.. code::

    a s t r o n o m e r s _ d i d i e _ s o m o n _ a n d _ t r i s t i a n _ g l l o
    d i d i e r _ s a u m o n;a s t r o n o m i e;t r i s t a n _ g u i l l o t;t r i s t e s s e;m o n a d e;c h r i s t i a n;a s t r o n o m e r;s o l o m o n;d i d i d i d i d i;m e r c y
    1 3
    CUSTOM 12 23;CUSTOM 28 41

For data preparation see `this script <https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh>`__.


References
----------

.. bibliography:: nlp_all.bib
    :style: plain
    :labelprefix: NLP-NER
    :keyprefix: nlp-ner-
3 changes: 3 additions & 0 deletions docs/source/starthere/tutorials.rst
@@ -130,6 +130,9 @@ To run a tutorial:
* - NLP
- Punctuation and Capitalization
- `Punctuation and Capitalization <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Punctuation_and_Capitalization.ipynb>`_
* - NLP
- Spellchecking ASR Customization - SpellMapper
- `Spellchecking ASR Customization - SpellMapper <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb>`_
* - NLP
- Entity Linking
- `Entity Linking <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Entity_Linking_Medical.ipynb>`_
32 changes: 32 additions & 0 deletions examples/nlp/spellchecking_asr_customization/README.md
@@ -0,0 +1,32 @@
# SpellMapper - spellchecking model for ASR Customization

This model is inspired by Microsoft's paper https://arxiv.org/pdf/2203.00888.pdf, but does not reproduce its implementation.
The goal is to build a model that takes as input a single ASR hypothesis (text) and a vocabulary of custom words/phrases, and predicts which fragments of the ASR hypothesis should be replaced by which custom words/phrases, if any.
Our model is non-autoregressive (NAR) and based on the transformer architecture (BERT with multiple separators).

As initial data we use about 5 million entities from the [YAGO corpus](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/). These entities are short phrases from Wikipedia headings.
To obtain misspelled predictions, we feed these data to a TTS model and then to an ASR model.
Given this "parallel" corpus of "correct + misspelled" phrases, we use statistical machine translation techniques to create a dictionary of possible n-gram mappings with their respective frequencies.
We create an auxiliary algorithm that takes as input a sentence (ASR hypothesis) and a large custom dictionary (e.g. 5000 phrases) and selects the top 10 candidate phrases that are likely to be contained in the sentence in a misspelled form.
The task of the final neural model is to predict which fragments of the ASR hypothesis should be replaced by which of the top 10 candidate phrases, if any.

The pipeline consists of multiple steps:

1. Download or generate training data.
See `https://github.com/bene-ges/nemo_compatible/tree/main/scripts/nlp/en_spellmapper/dataset_preparation`

2. [Optional] Convert training dataset to tarred files.
`convert_dataset_to_tarred.sh`

3. Train spellchecking model.
`run_training.sh`
or
`run_training_tarred.sh`

4. Run evaluation.
- [test_on_kensho.sh](https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/evaluation/test_on_kensho.sh)
- [test_on_userlibri.sh](https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/evaluation/test_on_userlibri.sh)
- [test_on_spoken_wikipedia.sh](https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/evaluation/test_on_spoken_wikipedia.sh)

5. Run inference.
`bash run_infer.sh`
38 changes: 38 additions & 0 deletions examples/nlp/spellchecking_asr_customization/checkpoint_to_nemo.py
@@ -0,0 +1,38 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


"""
This script converts a .ckpt checkpoint to a .nemo file.
It uses the `examples/nlp/spellchecking_asr_customization/conf/spellchecking_asr_customization_config.yaml`
config file by default. Another config file can be set via the command
line argument `--config-name=CONFIG_FILE_PATH`.
"""

from omegaconf import DictConfig, OmegaConf

from nemo.collections.nlp.models import SpellcheckingAsrCustomizationModel
from nemo.core.config import hydra_runner
from nemo.utils import logging


@hydra_runner(config_path="conf", config_name="spellchecking_asr_customization_config")
def main(cfg: DictConfig) -> None:
    logging.debug(f'Config Params: {OmegaConf.to_yaml(cfg)}')
    SpellcheckingAsrCustomizationModel.load_from_checkpoint(cfg.checkpoint_path).save_to(cfg.target_nemo_path)


if __name__ == "__main__":
    main()
@@ -0,0 +1,97 @@
name: &name spellchecking
lang: ??? # e.g. 'ru', 'en'

# Pretrained Nemo Models
pretrained_model: null

trainer:
  devices: 1 # the number of gpus, 0 for CPU
  num_nodes: 1
  max_epochs: 3 # the number of training epochs
  enable_checkpointing: false # provided by exp_manager
  logger: false # provided by exp_manager
  accumulate_grad_batches: 1 # accumulates grads every k batches
  gradient_clip_val: 0.0
  precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
  accelerator: gpu
  strategy: ddp
  log_every_n_steps: 1 # Interval of logging.
  val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.

model:
  do_training: true
  label_map: ??? # path/.../label_map.txt
  semiotic_classes: ??? # path/.../semiotic_classes.txt
  max_sequence_len: 128
  lang: ${lang}
  hidden_size: 768

  optim:
    name: adamw
    lr: 3e-5
    weight_decay: 0.1

    sched:
      name: WarmupAnnealing

      # pytorch lightning args
      monitor: val_loss
      reduce_on_plateau: false

      # scheduler config override
      warmup_ratio: 0.1
      last_epoch: -1

  language_model:
    pretrained_model_name: bert-base-uncased # For ru, try DeepPavlov/rubert-base-cased | For de or multilingual, try bert-base-multilingual-cased
    lm_checkpoint: null
    config_file: null # json file, precedence over config
    config: null

  tokenizer:
    tokenizer_name: ${model.language_model.pretrained_model_name} # or sentencepiece
    vocab_file: null # path to vocab file
    tokenizer_model: null # only used if tokenizer is sentencepiece
    special_tokens: null

exp_manager:
  exp_dir: nemo_experiments # where to store logs and checkpoints
  name: training # name of experiment
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    save_top_k: 3
    monitor: "val_loss"
    mode: "min"

tokenizer:
  tokenizer_name: ${model.transformer} # or sentencepiece
  vocab_file: null # path to vocab file
  tokenizer_model: null # only used if tokenizer is sentencepiece
  special_tokens: null

# Data
data:
  train_ds:
    data_path: ??? # provide the full path to the file
    batch_size: 8
    shuffle: true
    num_workers: 3
    pin_memory: false
    drop_last: false

  validation_ds:
    data_path: ??? # provide the full path to the file.
    batch_size: 8
    shuffle: false
    num_workers: 3
    pin_memory: false
    drop_last: false


# Inference
inference:
  from_file: null # Path to the raw text, no labels required. Each sentence on a separate line
  out_file: null # Path to the output file
  batch_size: 16 # batch size for inference.from_file
@@ -0,0 +1,50 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Path to NeMo repository
NEMO_PATH=NeMo

DATA_PATH="data_folder"

## data_folder_example
##   ├── tarred_data
##   |   └── (output)
##   ├── config.json
##   ├── label_map.txt
##   ├── semiotic_classes.txt
##   ├── test.tsv
##   ├── 1.tsv
##   ├── ...
##   └── 200.tsv

## Each of the {1..200}.tsv input files is a 110,000-example subset of all.tsv (excluding the validation part),
## generated by https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh
## Note that in this example each input file holds 110,000 examples, but only 100,000 of them are packed into each tar file.
## Some input examples (e.g. those that are too long) may be skipped during preprocessing, and we want every tar file to contain the same fixed number of examples.

for part in {1..200}
do
    python ${NEMO_PATH}/examples/nlp/spellchecking_asr_customization/create_tarred_dataset.py \
        lang="en" \
        data.train_ds.data_path=${DATA_PATH}/${part}.tsv \
        data.validation_ds.data_path=${DATA_PATH}/test.tsv \
        model.max_sequence_len=256 \
        model.language_model.pretrained_model_name=huawei-noah/TinyBERT_General_6L_768D \
        model.language_model.config_file=${DATA_PATH}/config.json \
        model.label_map=${DATA_PATH}/label_map.txt \
        model.semiotic_classes=${DATA_PATH}/semiotic_classes.txt \
        +output_tar_file=${DATA_PATH}/tarred_data/part${part}.tar \
        +take_first_n_lines=100000
done