Spellchecking ASR customization model #6179

Merged

197 commits merged into main from spellchecking_asr_customization_double_bert on Jun 2, 2023

Commits
545598f
bug fixes
Oct 12, 2022
adb1ce2
fix bugs, add preparation and evaluation scripts, add readme
Oct 19, 2022
37693f4
small fixes
Oct 19, 2022
16a75f0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2022
7f02059
add real coverage calculation, small fixes, more debug information
bene-ges Nov 3, 2022
2ee091c
add option to pass a filelist and output folder - to handle inference…
bene-ges Nov 4, 2022
540ed99
added preprocessing for yago wikipedia articles - finding yago entiti…
bene-ges Nov 24, 2022
1932dce
yago wiki preprocessing, sampling, pseudonormalization
bene-ges Nov 28, 2022
047c7c8
more scripts for preparation of training examples
bene-ges Dec 9, 2022
ba1a79b
bug fixes
bene-ges Dec 10, 2022
996aa5e
add some alphabet checks
bene-ges Dec 15, 2022
ee2fe28
add bert on subwords, concatenate it to bert on characters
bene-ges Nov 11, 2022
14d8c80
add calculation of character_pos_to_subword_pos
bene-ges Nov 12, 2022
4c975e1
bug fix
bene-ges Nov 12, 2022
a4069dd
bug fix
bene-ges Nov 12, 2022
82b2b4c
pdb
bene-ges Nov 12, 2022
9ad5b7c
tensor join bug fix
bene-ges Nov 12, 2022
6d63ad5
double hidden_size in classifier
bene-ges Nov 12, 2022
bee59f2
pdb
bene-ges Nov 12, 2022
fcd5e8f
default index value 0 instead of -1 because index cannot be negative
bene-ges Nov 12, 2022
fb3e927
pad index value 0 instead of -1 because index cannot be negative
bene-ges Nov 12, 2022
b2c5c5b
remove pdb
bene-ges Nov 12, 2022
c389b36
fix bugs, add creation of tarred dataset
bene-ges Dec 16, 2022
9d638a7
add possibility to change sequence len at inference
bene-ges Dec 18, 2022
50d1147
change sampling of dummy candidates at inference, add candidate info …
bene-ges Dec 19, 2022
55205fd
fix import
bene-ges Dec 19, 2022
be14005
fix bug
bene-ges Dec 19, 2022
02bc90a
update transcription now uses info
bene-ges Dec 20, 2022
c12b45f
write path
bene-ges Dec 20, 2022
69fd821
1. add tarred dataset support(untested). 2. fix bug with ban_ngrams i…
bene-ges Dec 22, 2022
1c3793c
skip short_sent if no real candidates
bene-ges Dec 22, 2022
0e6a981
fix import
bene-ges Dec 22, 2022
b3f0f28
add braceexpand
bene-ges Dec 22, 2022
f087a6d
fixes
bene-ges Dec 22, 2022
955e59a
fix bug
bene-ges Dec 22, 2022
ecf6ca5
fix bug
bene-ges Dec 22, 2022
d82f47f
fix bug in np.ones
bene-ges Dec 28, 2022
68ee337
fix bug in collate
bene-ges Dec 28, 2022
cd6a265
change tensor type to long because of error in torch.gather
bene-ges Dec 28, 2022
37ae2df
fix for empty spans tensor
bene-ges Dec 28, 2022
4049984
same fixes in _collate_fn for tarred dataset
bene-ges Dec 28, 2022
b0adc87
fix bug from previous commit
bene-ges Dec 28, 2022
9328161
change int types to be shorter to minimize tar size
bene-ges Dec 28, 2022
b198c0c
refactoring of datasets and inference
bene-ges Dec 30, 2022
0c93c11
bug fix
bene-ges Dec 30, 2022
d9cd84e
bug fix
bene-ges Dec 30, 2022
02fb3f8
bug fix
bene-ges Dec 30, 2022
6454517
tar by 100k examples, small fixes
bene-ges Jan 5, 2023
f088491
small fixes, add analytics script
bene-ges Jan 9, 2023
d22c7ff
Add functions for dynamic programming comparison to get best path by …
bene-ges Jan 13, 2023
28deac6
fixes
bene-ges Feb 1, 2023
a5706ff
small fix
bene-ges Feb 1, 2023
4b2dee1
fixes to support testing on SPGISpeech
bene-ges Feb 15, 2023
ca33fb9
add preprocessing for userlibri
bene-ges Feb 21, 2023
82c909b
some refactoring
bene-ges Mar 5, 2023
287d67b
some refactoring
bene-ges Mar 6, 2023
3bdff5d
move some functions to utils to reuse from other project
bene-ges Mar 8, 2023
2acc888
move some functions to utils to reuse from other project
bene-ges Mar 8, 2023
f778141
move some functions to utils to reuse from other project
bene-ges Mar 8, 2023
28c0246
small refactoring before pr. Add bash-scripts reproducing evaluation
bene-ges Mar 12, 2023
3843abd
style fix
bene-ges Mar 13, 2023
bc1f8c1
small fixes in inference
bene-ges Apr 19, 2023
cbc24d8
bug fix - didn't move window on last symbol
bene-ges Apr 21, 2023
0d9c001
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 21, 2023
2dd24ed
fix bug - shuffle was before truncation of sorted candidates
bene-ges Apr 21, 2023
fe75a76
refactoring, fix some bugs
bene-ges Apr 27, 2023
edd48fa
variour fixes. Add word_indices at inference
bene-ges May 2, 2023
c989676
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 21, 2023
a055199
add candidate positions
bene-ges May 3, 2023
1df3e90
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 2, 2023
ef83a8e
Move data preparation and evaluation to other repo
bene-ges May 5, 2023
110f2df
add infer_reproduce_paper. Refactoring
bene-ges May 6, 2023
4fb86fc
refactor inference using fragment indices
bene-ges May 10, 2023
1b8dafe
add some helper functions
bene-ges May 13, 2023
33a6e9f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 10, 2023
fd86468
fix bug with parameters order
bene-ges May 13, 2023
abf4a8f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 13, 2023
3b174a0
fix bugs
bene-ges May 17, 2023
0454953
refactoring, fix bug
bene-ges May 20, 2023
05c8fe7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 13, 2023
e83b38b
add multiple variants of adjusting start/end positions
bene-ges May 20, 2023
4ce665c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 20, 2023
2d01c26
more fixes
bene-ges May 21, 2023
7b09133
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 20, 2023
c918d40
add unit tests, other fixes
bene-ges May 22, 2023
c79b676
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 22, 2023
997b940
fix
bene-ges May 22, 2023
9db922d
Merge branch 'spellchecking_asr_customization_double_bert' of github.…
bene-ges May 22, 2023
ecd58e4
fix CodeQl warnings
bene-ges May 23, 2023
9997419
Merge branch 'spellchecking_asr_customization_double_bert' of github.…
bene-ges May 23, 2023
691d7eb
add script for full inference pipeline, refactoring
bene-ges May 26, 2023
f6b819a
add tutorial
bene-ges May 26, 2023
3fa3b62
take example data from HuggingFace
bene-ges May 27, 2023
f35e331
add docs
bene-ges May 27, 2023
ce13037
fix comment
bene-ges May 27, 2023
8b2c1fa
fix bug
bene-ges May 30, 2023
4eddb80
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges May 30, 2023
428b32e
small fixes for PR
bene-ges May 31, 2023
0365036
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges May 31, 2023
f5d5ffd
add some more tests
bene-ges May 31, 2023
0d34292
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges May 31, 2023
c8a60e5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 31, 2023
c1edc67
try to fix tests adding with_downloads
bene-ges May 31, 2023
c40d2a1
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges May 31, 2023
35a1ec1
Merge branch 'spellchecking_asr_customization_double_bert' of github.…
bene-ges May 31, 2023
ecde7bc
skip tests with tokenizer download
bene-ges Jun 1, 2023
2262ec1
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges Jun 1, 2023
e36ce3d
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges Jun 1, 2023
a8664da
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges Jun 2, 2023
1 change: 1 addition & 0 deletions docs/source/nlp/models.rst
@@ -9,6 +9,7 @@ NeMo's NLP collection provides the following task-specific models:
:maxdepth: 1

punctuation_and_capitalization_models
spellchecking_asr_customization
token_classification
joint_intent_slot
text_classification
128 changes: 128 additions & 0 deletions docs/source/nlp/spellchecking_asr_customization.rst
@@ -0,0 +1,128 @@
.. _spellchecking_asr_customization:

SpellMapper (Spellchecking ASR Customization) Model
=====================================================

SpellMapper is a non-autoregressive model for postprocessing of ASR output. It takes as input a single ASR hypothesis (text) and a custom vocabulary, and predicts which fragments in the ASR hypothesis should be replaced by which custom words/phrases, if any. Unlike traditional spellchecking approaches, which aim to correct known words using language models, SpellMapper's goal is to correct highly specific user terms, out-of-vocabulary (OOV) words, and spelling variations (e.g., "John Koehn" vs. "Jon Cohen").

This model is an alternative to word boosting/shallow fusion approaches:

- does not require retraining the ASR model;
- does not require beam search or an external language model (LM);
- can be applied on top of any English ASR model's output;

Model Architecture
------------------
Though SpellMapper is based on the `BERT <https://arxiv.org/abs/1810.04805>`__ :cite:`nlp-ner-devlin2018bert` architecture, it uses several non-standard tricks that make it different from other BERT-based models:

- ten separators (``[SEP]`` tokens) are used to combine the ASR hypothesis and ten candidate phrases into a single input;
- the model works at the character level;
- subword embeddings are concatenated to the embedding of each character that belongs to that subword;

.. code::

Example input: [CLS] a s t r o n o m e r s _ d i d i e _ s o m o n _ a n d _ t r i s t i a n _ g l l o [SEP] d i d i e r _ s a u m o n [SEP] a s t r o n o m i e [SEP] t r i s t a n _ g u i l l o t [SEP] ...
Input segments: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4
Example output: 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 0 ...
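
A minimal sketch of how such an input might be assembled (illustrative only: the real preprocessing lives in the dataset code, and the function name here is made up):

.. code:: python

    def build_input(hypothesis: str, candidates: list) -> tuple:
        """hypothesis/candidates are character strings with '_' for spaces."""
        tokens = ["[CLS]"] + list(hypothesis)
        segments = [0] * len(tokens)
        for seg_id, cand in enumerate(candidates, start=1):
            # each [SEP] takes the segment id of the candidate it introduces
            tokens += ["[SEP]"] + list(cand)
            segments += [seg_id] * (len(cand) + 1)
        return tokens, segments

    tokens, segments = build_input(
        "astronomers_didie_somon_and_tristian_gllo",
        ["didier_saumon", "astronomie", "tristan_guillot"],  # ... up to 10
    )
    print(" ".join(tokens))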

The model calculates logits over 11 labels for each character:

- ``0`` - the character doesn't belong to any candidate;
- ``1..10`` - the character belongs to the candidate with this id.

At inference time, average pooling over the per-character probabilities is applied to calculate a replacement probability for each whole fragment.
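
For example, given the per-character probabilities, a fragment-level score can be computed like this (a sketch; tensor shapes and names are illustrative):

.. code:: python

    import torch

    # logits: [batch, seq_len, 11] per-character scores over the 11 labels
    logits = torch.randn(1, 62, 11)
    probs = torch.softmax(logits, dim=-1)

    def fragment_probability(probs: torch.Tensor, start: int, end: int, candidate_id: int) -> float:
        # average the probability of `candidate_id` over characters [start, end)
        return probs[0, start:end, candidate_id].mean().item()

    # e.g. probability that characters 56..61 should be replaced by candidate 7
    print(fragment_probability(probs, 56, 62, 7))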

Quick Start Guide
-----------------

We recommend trying this model in a Jupyter notebook (a GPU is required):
`NeMo/tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb <https://github.com/NVIDIA/NeMo/blob/stable/tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb>`__.

A pretrained English checkpoint can be found at `HuggingFace <https://huggingface.co/bene-ges/spellmapper_asr_customization_en>`__.

An example inference pipeline can be found here: `NeMo/examples/nlp/spellchecking_asr_customization/run_infer.sh <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_infer.sh>`__.

An example script on how to train the model can be found here: `NeMo/examples/nlp/spellchecking_asr_customization/run_training.sh <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_training.sh>`__.

An example script on how to train on large datasets can be found here: `NeMo/examples/nlp/spellchecking_asr_customization/run_training_tarred.sh <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_training_tarred.sh>`__.

The default configuration file for the model can be found here: `NeMo/examples/nlp/spellchecking_asr_customization/conf/spellchecking_asr_customization_config.yaml <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/conf/spellchecking_asr_customization_config.yaml>`__.
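
As a quick sanity check, a downloaded checkpoint can be restored in Python; a minimal sketch, assuming the ``.nemo`` file from the HuggingFace page above has already been downloaded locally:

.. code:: python

    from nemo.collections.nlp.models import SpellcheckingAsrCustomizationModel

    # restore_from is the generic NeMo loader for .nemo files
    model = SpellcheckingAsrCustomizationModel.restore_from(
        "spellmapper_asr_customization_en.nemo"
    )
    model.eval()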

.. _dataset_spellchecking_asr_customization:

Input/Output Format at the Inference Stage
------------------------------------------
This section describes the input/output format of the SpellMapper model.

.. note::

   If you use the `inference pipeline <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_infer.sh>`__, this format is handled internally: you only need to provide an input manifest and a user vocabulary, and you will get back a corrected manifest.

An input line should consist of 4 tab-separated columns:

1. text of the ASR hypothesis
2. texts of 10 candidates, separated by semicolons
3. 1-based ids of non-dummy candidates, separated by spaces
4. approximate start/end coordinates of non-dummy candidates (corresponding to the ids in the third column)

Example input (in one line):

.. code::

t h e _ t a r a s i c _ o o r d a _ i s _ a _ p a r t _ o f _ t h e _ a o r t a _ l o c a t e d _ i n _ t h e _ t h o r a x
h e p a t i c _ c i r r h o s i s;u r a c i l;c a r d i a c _ a r r e s t;w e a n;a p g a r;p s y c h o m o t o r;t h o r a x;t h o r a c i c _ a o r t a;a v f;b l o c k a d e d
1 2 6 7 8 9 10
CUSTOM 6 23;CUSTOM 4 10;CUSTOM 4 15;CUSTOM 56 62;CUSTOM 5 19;CUSTOM 28 31;CUSTOM 39 48

Each line in the SpellMapper output is tab-separated and consists of 4 columns:

1. ASR hypothesis (same as in the input)
2. texts of 10 candidates, separated by semicolons (same as in the input)
3. fragment predictions, separated by semicolons; each prediction is a tuple (start, end, candidate_id, probability)
4. letter predictions - the candidate_id predicted for each letter (for debugging purposes only)

Example output (in one line):

.. code::

t h e _ t a r a s i c _ o o r d a _ i s _ a _ p a r t _ o f _ t h e _ a o r t a _ l o c a t e d _ i n _ t h e _ t h o r a x
h e p a t i c _ c i r r h o s i s;u r a c i l;c a r d i a c _ a r r e s t;w e a n;a p g a r;p s y c h o m o t o r;t h o r a x;t h o r a c i c _ a o r t a;a v f;b l o c k a d e d
56 62 7 0.99998;4 20 8 0.95181;12 20 8 0.44829;4 17 8 0.99464;12 17 8 0.97645
8 8 8 0 8 8 8 8 8 8 8 8 8 8 8 8 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 7 7 7 7 7
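
A small sketch of consuming such a line: parse the fragment predictions and apply the most confident non-overlapping replacements. The threshold and the overlap handling are illustrative, not the exact logic of the shipped pipeline; note that end coordinates are exclusive:

.. code:: python

    def apply_spellmapper_output(line: str, threshold: float = 0.9) -> str:
        hyp_col, cand_col, frag_col, _letters = line.split("\t")
        # columns store space-separated characters; collapse them back
        hyp = hyp_col.replace(" ", "")
        candidates = [c.replace(" ", "") for c in cand_col.split(";")]
        spans = []
        for frag in frag_col.split(";"):
            start, end, cand_id, prob = frag.split()
            if float(prob) >= threshold:
                spans.append((float(prob), int(start), int(end), int(cand_id)))
        spans.sort(reverse=True)  # most confident first
        taken = [False] * len(hyp)
        picked = []
        for prob, start, end, cand_id in spans:
            if not any(taken[start:end]):  # skip overlaps with better spans
                picked.append((start, end, candidates[cand_id - 1]))
                taken[start:end] = [True] * (end - start)
        # replace right-to-left so earlier coordinates stay valid
        for start, end, text in sorted(picked, reverse=True):
            hyp = hyp[:start] + text + hyp[end:]
        return hyp.replace("_", " ")

On the example above this yields "the thoracic aorta is a part of the aorta located in the thorax".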

Training Data Format
--------------------

For training, the data should consist of 5 files:

- ``config.json`` - BERT config
- ``label_map.txt`` - labels from 0 to 10, do not change
- ``semiotic_classes.txt`` - currently there are only two classes: ``PLAIN`` and ``CUSTOM``, do not change
- ``train.tsv`` - training examples
- ``test.tsv`` - validation examples

Note that since all these examples are synthetic, we do not reserve a set for final testing. Instead, we run the `inference pipeline <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_infer.sh>`__ and compare the resulting word error rate (WER) to the WER of the baseline ASR output.
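
Any standard WER implementation works for this comparison; a minimal word-level edit-distance sketch:

.. code:: python

    def word_error_rate(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dynamic-programming edit distance over words
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            cur = [i] + [0] * len(hyp)
            for j, h in enumerate(hyp, 1):
                cur[j] = min(prev[j] + 1,              # deletion
                             cur[j - 1] + 1,           # insertion
                             prev[j - 1] + (r != h))   # substitution or match
            prev = cur
        return prev[len(hyp)] / max(len(ref), 1)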

One (non-tarred) training example should consist of 4 tab-separated columns:

1. text of the ASR hypothesis
2. texts of 10 candidates, separated by semicolons
3. 1-based ids of correct candidates, separated by spaces, or 0 if there are none
4. start/end coordinates of correct candidates (corresponding to the ids in the third column)

Example (in one line):

.. code::

a s t r o n o m e r s _ d i d i e _ s o m o n _ a n d _ t r i s t i a n _ g l l o
d i d i e r _ s a u m o n;a s t r o n o m i e;t r i s t a n _ g u i l l o t;t r i s t e s s e;m o n a d e;c h r i s t i a n;a s t r o n o m e r;s o l o m o n;d i d i d i d i d i;m e r c y
1 3
CUSTOM 12 23;CUSTOM 28 41

For data preparation, see `this script <https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh>`__.


References
----------

.. bibliography:: nlp_all.bib
:style: plain
:labelprefix: NLP-NER
:keyprefix: nlp-ner-
3 changes: 3 additions & 0 deletions docs/source/starthere/tutorials.rst
@@ -130,6 +130,9 @@ To run a tutorial:
* - NLP
- Punctuation and Capitalization
- `Punctuation and Capitalization <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Punctuation_and_Capitalization.ipynb>`_
* - NLP
- Spellchecking ASR Customization - SpellMapper
- `Spellchecking ASR Customization - SpellMapper <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb>`_
* - NLP
- Entity Linking
- `Entity Linking <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Entity_Linking_Medical.ipynb>`_
32 changes: 32 additions & 0 deletions examples/nlp/spellchecking_asr_customization/README.md
@@ -0,0 +1,32 @@
# SpellMapper - spellchecking model for ASR Customization

This model is inspired by Microsoft's paper https://arxiv.org/pdf/2203.00888.pdf, but does not repeat its implementation.
The goal is to build a model that takes as input a single ASR hypothesis (text) and a vocabulary of custom words/phrases, and predicts which fragments in the ASR hypothesis should be replaced by which custom words/phrases, if any.
Our model is non-autoregressive (NAR) and based on the transformer architecture (BERT with multiple separators).

As initial data we use about 5 million entities from the [YAGO corpus](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/). These entities are short phrases from Wikipedia headings.
To obtain misspelled variants, we feed these data to a TTS model and then to an ASR model.
Having a "parallel" corpus of "correct + misspelled" phrases, we use statistical machine translation techniques to create a dictionary of possible ngram mappings with their respective frequencies.
We create an auxiliary algorithm that takes as input a sentence (ASR hypothesis) and a large custom dictionary (e.g. 5000 phrases) and selects the top 10 candidate phrases that are most likely contained in this sentence in misspelled form.
The task of our final neural model is to predict which fragments in the ASR hypothesis should be replaced by which of the top-10 candidate phrases, if any.
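
A toy sketch of the ngram-mapping dictionary idea (the aligned spans here are hypothetical and hand-picked; the real pipeline derives alignments statistically from the parallel corpus):

```python
from collections import Counter

# pairs of (correct, misspelled) character ngrams, assumed pre-aligned
aligned_pairs = [
    [("sau", "so"), ("mon", "mon")],   # saumon  -> somon
    [("gui", "g"), ("llot", "llo")],   # guillot -> gllo
]

mapping_counts = Counter()
for spans in aligned_pairs:
    for correct, misspelled in spans:
        mapping_counts[(correct, misspelled)] += 1

for (src, dst), freq in mapping_counts.most_common():
    print(f"{src} -> {dst}: {freq}")
```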

The pipeline consists of multiple steps:

1. Download or generate training data.
See `https://github.com/bene-ges/nemo_compatible/tree/main/scripts/nlp/en_spellmapper/dataset_preparation`

2. [Optional] Convert training dataset to tarred files.
`convert_dataset_to_tarred.sh`

3. Train spellchecking model.
`run_training.sh`
or
`run_training_tarred.sh`

4. Run evaluation.
- [test_on_kensho.sh](https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/evaluation/test_on_kensho.sh)
- [test_on_userlibri.sh](https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/evaluation/test_on_userlibri.sh)
- [test_on_spoken_wikipedia.sh](https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/evaluation/test_on_spoken_wikipedia.sh)

5. Run inference.
`run_infer.sh`
38 changes: 38 additions & 0 deletions examples/nlp/spellchecking_asr_customization/checkpoint_to_nemo.py
@@ -0,0 +1,38 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


"""
This script converts a checkpoint .ckpt file to a .nemo file.

This script uses the `examples/nlp/spellchecking_asr_customization/conf/spellchecking_asr_customization_config.yaml`
config file by default. Another config file can be set via the command-line
argument `--config-name=CONFIG_FILE_PATH`.
"""

from omegaconf import DictConfig, OmegaConf

from nemo.collections.nlp.models import SpellcheckingAsrCustomizationModel
from nemo.core.config import hydra_runner
from nemo.utils import logging


@hydra_runner(config_path="conf", config_name="spellchecking_asr_customization_config")
def main(cfg: DictConfig) -> None:
logging.debug(f'Config Params: {OmegaConf.to_yaml(cfg)}')
SpellcheckingAsrCustomizationModel.load_from_checkpoint(cfg.checkpoint_path).save_to(cfg.target_nemo_path)


if __name__ == "__main__":
main()
97 changes: 97 additions & 0 deletions examples/nlp/spellchecking_asr_customization/conf/spellchecking_asr_customization_config.yaml
@@ -0,0 +1,97 @@
name: &name spellchecking
lang: ??? # e.g. 'ru', 'en'

# Pretrained Nemo Models
pretrained_model: null

trainer:
  devices: 1 # the number of gpus, 0 for CPU
  num_nodes: 1
  max_epochs: 3 # the number of training epochs
  enable_checkpointing: false # provided by exp_manager
  logger: false # provided by exp_manager
  accumulate_grad_batches: 1 # accumulates grads every k batches
  gradient_clip_val: 0.0
  precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
  accelerator: gpu
  strategy: ddp
  log_every_n_steps: 1 # Interval of logging.
  val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.

model:
  do_training: true
  label_map: ??? # path/.../label_map.txt
  semiotic_classes: ??? # path/.../semiotic_classes.txt
  max_sequence_len: 128
  lang: ${lang}
  hidden_size: 768

  optim:
    name: adamw
    lr: 3e-5
    weight_decay: 0.1

    sched:
      name: WarmupAnnealing

      # pytorch lightning args
      monitor: val_loss
      reduce_on_plateau: false

      # scheduler config override
      warmup_ratio: 0.1
      last_epoch: -1

  language_model:
    pretrained_model_name: bert-base-uncased # For ru, try DeepPavlov/rubert-base-cased | For de or multilingual, try bert-base-multilingual-cased
    lm_checkpoint: null
    config_file: null # json file, precedence over config
    config: null

  tokenizer:
    tokenizer_name: ${model.language_model.pretrained_model_name} # or sentencepiece
    vocab_file: null # path to vocab file
    tokenizer_model: null # only used if tokenizer is sentencepiece
    special_tokens: null

exp_manager:
  exp_dir: nemo_experiments # where to store logs and checkpoints
  name: training # name of experiment
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    save_top_k: 3
    monitor: "val_loss"
    mode: "min"

tokenizer:
  tokenizer_name: ${model.transformer} # or sentencepiece
  vocab_file: null # path to vocab file
  tokenizer_model: null # only used if tokenizer is sentencepiece
  special_tokens: null

# Data
data:
  train_ds:
    data_path: ??? # provide the full path to the file
    batch_size: 8
    shuffle: true
    num_workers: 3
    pin_memory: false
    drop_last: false

  validation_ds:
    data_path: ??? # provide the full path to the file.
    batch_size: 8
    shuffle: false
    num_workers: 3
    pin_memory: false
    drop_last: false


# Inference
inference:
  from_file: null # Path to the raw text, no labels required. Each sentence on a separate line
  out_file: null # Path to the output file
  batch_size: 16 # batch size for inference.from_file
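
The config can also be inspected or overridden programmatically; a short sketch using OmegaConf (the paths are placeholders):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.load("conf/spellchecking_asr_customization_config.yaml")
cfg.lang = "en"
cfg.model.label_map = "/path/to/label_map.txt"                 # placeholder
cfg.model.semiotic_classes = "/path/to/semiotic_classes.txt"   # placeholder
print(OmegaConf.to_yaml(cfg.trainer))
```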
50 changes: 50 additions & 0 deletions examples/nlp/spellchecking_asr_customization/convert_dataset_to_tarred.sh
@@ -0,0 +1,50 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Path to NeMo repository
NEMO_PATH=NeMo

DATA_PATH="data_folder"

## data_folder_example
## ├── tarred_data
## │   └── (output)
## ├── config.json
## ├── label_map.txt
## ├── semiotic_classes.txt
## ├── test.tsv
## ├── 1.tsv
## ├── ...
## └── 200.tsv

## Each of the {1..200}.tsv input files is a 110,000-example subset of all.tsv (excluding the validation part),
## generated by https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh
## Note that in this example we take 110,000 examples as input but pack only 100,000 of them into each tar file.
## This is because some input examples (e.g. ones that are too long) can be skipped during preprocessing,
## and we want all tar files to contain a fixed, equal number of examples.

for part in {1..200}
do
python ${NEMO_PATH}/examples/nlp/spellchecking_asr_customization/create_tarred_dataset.py \
lang="en" \
data.train_ds.data_path=${DATA_PATH}/${part}.tsv \
data.validation_ds.data_path=${DATA_PATH}/test.tsv \
model.max_sequence_len=256 \
model.language_model.pretrained_model_name=huawei-noah/TinyBERT_General_6L_768D \
model.language_model.config_file=${DATA_PATH}/config.json \
model.label_map=${DATA_PATH}/label_map.txt \
model.semiotic_classes=${DATA_PATH}/semiotic_classes.txt \
+output_tar_file=${DATA_PATH}/tarred_data/part${part}.tar \
+take_first_n_lines=100000
done
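
The {1..200}.tsv chunks themselves can be produced with a simple splitter; a sketch assuming all.tsv is the combined output of build_training_data.sh (file names are illustrative):

```python
# split all.tsv into 110,000-line chunks named 1.tsv, 2.tsv, ...
CHUNK = 110_000

with open("all.tsv", encoding="utf-8") as src:
    part, out, written = 1, None, 0
    for line in src:
        if out is None:
            out = open(f"{part}.tsv", "w", encoding="utf-8")
        out.write(line)
        written += 1
        if written == CHUNK:
            out.close()
            out, written, part = None, 0, part + 1
    if out is not None:
        out.close()  # trailing partial chunk
```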