Spellchecking ASR customization model #6179

Merged

197 commits merged into main from spellchecking_asr_customization_double_bert on Jun 2, 2023

Commits
545598f
bug fixes
Oct 12, 2022
adb1ce2
fix bugs, add preparation and evaluation scripts, add readme
Oct 19, 2022
37693f4
small fixes
Oct 19, 2022
16a75f0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 22, 2022
7f02059
add real coverage calculation, small fixes, more debug information
bene-ges Nov 3, 2022
2ee091c
add option to pass a filelist and output folder - to handle inference…
bene-ges Nov 4, 2022
540ed99
added preprocessing for yago wikipedia articles - finding yago entiti…
bene-ges Nov 24, 2022
1932dce
yago wiki preprocessing, sampling, pseudonormalization
bene-ges Nov 28, 2022
047c7c8
more scripts for preparation of training examples
bene-ges Dec 9, 2022
ba1a79b
bug fixes
bene-ges Dec 10, 2022
996aa5e
add some alphabet checks
bene-ges Dec 15, 2022
ee2fe28
add bert on subwords, concatenate it to bert on characters
bene-ges Nov 11, 2022
14d8c80
add calculation of character_pos_to_subword_pos
bene-ges Nov 12, 2022
4c975e1
bug fix
bene-ges Nov 12, 2022
a4069dd
bug fix
bene-ges Nov 12, 2022
82b2b4c
pdb
bene-ges Nov 12, 2022
9ad5b7c
tensor join bug fix
bene-ges Nov 12, 2022
6d63ad5
double hidden_size in classifier
bene-ges Nov 12, 2022
bee59f2
pdb
bene-ges Nov 12, 2022
fcd5e8f
default index value 0 instead of -1 because index cannot be negative
bene-ges Nov 12, 2022
fb3e927
pad index value 0 instead of -1 because index cannot be negative
bene-ges Nov 12, 2022
b2c5c5b
remove pdb
bene-ges Nov 12, 2022
c389b36
fix bugs, add creation of tarred dataset
bene-ges Dec 16, 2022
9d638a7
add possibility to change sequence len at inference
bene-ges Dec 18, 2022
50d1147
change sampling of dummy candidates at inference, add candidate info …
bene-ges Dec 19, 2022
55205fd
fix import
bene-ges Dec 19, 2022
be14005
fix bug
bene-ges Dec 19, 2022
02bc90a
update transcription now uses info
bene-ges Dec 20, 2022
c12b45f
write path
bene-ges Dec 20, 2022
69fd821
1. add tarred dataset support(untested). 2. fix bug with ban_ngrams i…
bene-ges Dec 22, 2022
1c3793c
skip short_sent if no real candidates
bene-ges Dec 22, 2022
0e6a981
fix import
bene-ges Dec 22, 2022
b3f0f28
add braceexpand
bene-ges Dec 22, 2022
f087a6d
fixes
bene-ges Dec 22, 2022
955e59a
fix bug
bene-ges Dec 22, 2022
ecf6ca5
fix bug
bene-ges Dec 22, 2022
d82f47f
fix bug in np.ones
bene-ges Dec 28, 2022
68ee337
fix bug in collate
bene-ges Dec 28, 2022
cd6a265
change tensor type to long because of error in torch.gather
bene-ges Dec 28, 2022
37ae2df
fix for empty spans tensor
bene-ges Dec 28, 2022
4049984
same fixes in _collate_fn for tarred dataset
bene-ges Dec 28, 2022
b0adc87
fix bug from previous commit
bene-ges Dec 28, 2022
9328161
change int types to be shorter to minimize tar size
bene-ges Dec 28, 2022
b198c0c
refactoring of datasets and inference
bene-ges Dec 30, 2022
0c93c11
bug fix
bene-ges Dec 30, 2022
d9cd84e
bug fix
bene-ges Dec 30, 2022
02fb3f8
bug fix
bene-ges Dec 30, 2022
6454517
tar by 100k examples, small fixes
bene-ges Jan 5, 2023
f088491
small fixes, add analytics script
bene-ges Jan 9, 2023
d22c7ff
Add functions for dynamic programming comparison to get best path by …
bene-ges Jan 13, 2023
28deac6
fixes
bene-ges Feb 1, 2023
a5706ff
small fix
bene-ges Feb 1, 2023
4b2dee1
fixes to support testing on SPGISpeech
bene-ges Feb 15, 2023
ca33fb9
add preprocessing for userlibri
bene-ges Feb 21, 2023
82c909b
some refactoring
bene-ges Mar 5, 2023
287d67b
some refactoring
bene-ges Mar 6, 2023
3bdff5d
move some functions to utils to reuse from other project
bene-ges Mar 8, 2023
2acc888
move some functions to utils to reuse from other project
bene-ges Mar 8, 2023
f778141
move some functions to utils to reuse from other project
bene-ges Mar 8, 2023
28c0246
small refactoring before pr. Add bash-scripts reproducing evaluation
bene-ges Mar 12, 2023
3843abd
style fix
bene-ges Mar 13, 2023
bc1f8c1
small fixes in inference
bene-ges Apr 19, 2023
cbc24d8
bug fix - didn't move window on last symbol
bene-ges Apr 21, 2023
0d9c001
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 21, 2023
2dd24ed
fix bug - shuffle was before truncation of sorted candidates
bene-ges Apr 21, 2023
fe75a76
refactoring, fix some bugs
bene-ges Apr 27, 2023
edd48fa
variour fixes. Add word_indices at inference
bene-ges May 2, 2023
c989676
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 21, 2023
a055199
add candidate positions
bene-ges May 3, 2023
1df3e90
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 2, 2023
ef83a8e
Move data preparation and evaluation to other repo
bene-ges May 5, 2023
110f2df
add infer_reproduce_paper. Refactoring
bene-ges May 6, 2023
4fb86fc
refactor inference using fragment indices
bene-ges May 10, 2023
1b8dafe
add some helper functions
bene-ges May 13, 2023
33a6e9f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 10, 2023
fd86468
fix bug with parameters order
bene-ges May 13, 2023
abf4a8f
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 13, 2023
3b174a0
fix bugs
bene-ges May 17, 2023
0454953
refactoring, fix bug
bene-ges May 20, 2023
05c8fe7
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 13, 2023
e83b38b
add multiple variants of adjusting start/end positions
bene-ges May 20, 2023
4ce665c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 20, 2023
2d01c26
more fixes
bene-ges May 21, 2023
7b09133
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 20, 2023
c918d40
add unit tests, other fixes
bene-ges May 22, 2023
c79b676
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 22, 2023
997b940
fix
bene-ges May 22, 2023
9db922d
Merge branch 'spellchecking_asr_customization_double_bert' of github.…
bene-ges May 22, 2023
ecd58e4
fix CodeQl warnings
bene-ges May 23, 2023
9997419
Merge branch 'spellchecking_asr_customization_double_bert' of github.…
bene-ges May 23, 2023
691d7eb
add script for full inference pipeline, refactoring
bene-ges May 26, 2023
f6b819a
add tutorial
bene-ges May 26, 2023
3fa3b62
take example data from HuggingFace
bene-ges May 27, 2023
f35e331
add docs
bene-ges May 27, 2023
ce13037
fix comment
bene-ges May 27, 2023
8b2c1fa
fix bug
bene-ges May 30, 2023
4eddb80
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges May 30, 2023
428b32e
small fixes for PR
bene-ges May 31, 2023
0365036
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges May 31, 2023
f5d5ffd
add some more tests
bene-ges May 31, 2023
0d34292
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges May 31, 2023
c8a60e5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 31, 2023
c1edc67
try to fix tests adding with_downloads
bene-ges May 31, 2023
c40d2a1
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges May 31, 2023
35a1ec1
Merge branch 'spellchecking_asr_customization_double_bert' of github.…
bene-ges May 31, 2023
ecde7bc
skip tests with tokenizer download
bene-ges Jun 1, 2023
2262ec1
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges Jun 1, 2023
e36ce3d
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges Jun 1, 2023
a8664da
Merge branch 'main' into spellchecking_asr_customization_double_bert
bene-ges Jun 2, 2023
1 change: 1 addition & 0 deletions docs/source/nlp/models.rst
@@ -9,6 +9,7 @@ NeMo's NLP collection provides the following task-specific models:
:maxdepth: 1

punctuation_and_capitalization_models
spellchecking_asr_customization
token_classification
joint_intent_slot
text_classification
128 changes: 128 additions & 0 deletions docs/source/nlp/spellchecking_asr_customization.rst
@@ -0,0 +1,128 @@
.. _spellchecking_asr_customization:

SpellMapper (Spellchecking ASR Customization) Model
=====================================================

SpellMapper is a non-autoregressive model for postprocessing of ASR output. It takes as input a single ASR hypothesis (text) and a custom vocabulary, and predicts which fragments in the ASR hypothesis should be replaced by which custom words/phrases, if any. Unlike traditional spellchecking approaches, which aim to correct known words using language models, SpellMapper's goal is to correct highly specific user terms, out-of-vocabulary (OOV) words, and spelling variations (e.g., "John Koehn" vs. "Jon Cohen").

This model is an alternative to word boosting/shallow fusion approaches:

- does not require retraining the ASR model;
- does not require beam search or an external language model (LM);
- can be applied on top of any English ASR model's output;

Model Architecture
------------------
Though SpellMapper is based on the `BERT <https://arxiv.org/abs/1810.04805>`__ :cite:`nlp-ner-devlin2018bert` architecture, it uses several non-standard tricks that make it different from other BERT-based models:

- ten separators (``[SEP]`` tokens) are used to combine the ASR hypothesis and ten candidate phrases into a single input;
- the model works at the character level;
- subword embeddings are concatenated to the embedding of each character that belongs to that subword;

.. code::

Example input: [CLS] a s t r o n o m e r s _ d i d i e _ s o m o n _ a n d _ t r i s t i a n _ g l l o [SEP] d i d i e r _ s a u m o n [SEP] a s t r o n o m i e [SEP] t r i s t a n _ g u i l l o t [SEP] ...
Input segments: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4
Example output: 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 0 ...
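
A minimal sketch of how such an input might be assembled (illustrative only: the real preprocessing lives in the dataset code, and the function name here is made up):

.. code:: python

    def build_input(hypothesis: str, candidates: list) -> tuple:
        """hypothesis/candidates are character strings with '_' for spaces."""
        tokens = ["[CLS]"] + list(hypothesis)
        segments = [0] * len(tokens)
        for seg_id, cand in enumerate(candidates, start=1):
            # each [SEP] takes the segment id of the candidate it introduces
            tokens += ["[SEP]"] + list(cand)
            segments += [seg_id] * (len(cand) + 1)
        return tokens, segments

    tokens, segments = build_input(
        "astronomers_didie_somon_and_tristian_gllo",
        ["didier_saumon", "astronomie", "tristan_guillot"],  # ... up to 10
    )
    print(" ".join(tokens))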

The model calculates logits over 11 labels for each character:

- ``0`` - the character doesn't belong to any candidate;
- ``1..10`` - the character belongs to the candidate with this id.

At inference time, average pooling over the per-character probabilities is applied to calculate a replacement probability for each whole fragment.
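
For example, given the per-character probabilities, a fragment-level score can be computed like this (a sketch; tensor shapes and names are illustrative):

.. code:: python

    import torch

    # logits: [batch, seq_len, 11] per-character scores over the 11 labels
    logits = torch.randn(1, 62, 11)
    probs = torch.softmax(logits, dim=-1)

    def fragment_probability(probs: torch.Tensor, start: int, end: int, candidate_id: int) -> float:
        # average the probability of `candidate_id` over characters [start, end)
        return probs[0, start:end, candidate_id].mean().item()

    # e.g. probability that characters 56..61 should be replaced by candidate 7
    print(fragment_probability(probs, 56, 62, 7))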

Quick Start Guide
-----------------

We recommend trying this model in a Jupyter notebook (a GPU is required):
`NeMo/tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb <https://github.com/NVIDIA/NeMo/blob/stable/tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb>`__.

A pretrained English checkpoint can be found at `HuggingFace <https://huggingface.co/bene-ges/spellmapper_asr_customization_en>`__.

An example inference pipeline can be found here: `NeMo/examples/nlp/spellchecking_asr_customization/run_infer.sh <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_infer.sh>`__.

An example script on how to train the model can be found here: `NeMo/examples/nlp/spellchecking_asr_customization/run_training.sh <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_training.sh>`__.

An example script on how to train on large datasets can be found here: `NeMo/examples/nlp/spellchecking_asr_customization/run_training_tarred.sh <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_training_tarred.sh>`__.

The default configuration file for the model can be found here: `NeMo/examples/nlp/spellchecking_asr_customization/conf/spellchecking_asr_customization_config.yaml <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/conf/spellchecking_asr_customization_config.yaml>`__.
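
As a quick sanity check, a downloaded checkpoint can be restored in Python; a minimal sketch, assuming the ``.nemo`` file from the HuggingFace page above has already been downloaded locally:

.. code:: python

    from nemo.collections.nlp.models import SpellcheckingAsrCustomizationModel

    # restore_from is the generic NeMo loader for .nemo files
    model = SpellcheckingAsrCustomizationModel.restore_from(
        "spellmapper_asr_customization_en.nemo"
    )
    model.eval()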

.. _dataset_spellchecking_asr_customization:

Input/Output Format at the Inference Stage
------------------------------------------
This section describes the input/output format of the SpellMapper model.

.. note::

   If you use the `inference pipeline <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_infer.sh>`__, this format is handled internally: you only need to provide an input manifest and a user vocabulary, and you will get back a corrected manifest.

An input line should consist of 4 tab-separated columns:

1. text of the ASR hypothesis
2. texts of 10 candidates, separated by semicolons
3. 1-based ids of non-dummy candidates, separated by spaces
4. approximate start/end coordinates of non-dummy candidates (corresponding to the ids in the third column)

Example input (in one line):

.. code::

t h e _ t a r a s i c _ o o r d a _ i s _ a _ p a r t _ o f _ t h e _ a o r t a _ l o c a t e d _ i n _ t h e _ t h o r a x
h e p a t i c _ c i r r h o s i s;u r a c i l;c a r d i a c _ a r r e s t;w e a n;a p g a r;p s y c h o m o t o r;t h o r a x;t h o r a c i c _ a o r t a;a v f;b l o c k a d e d
1 2 6 7 8 9 10
CUSTOM 6 23;CUSTOM 4 10;CUSTOM 4 15;CUSTOM 56 62;CUSTOM 5 19;CUSTOM 28 31;CUSTOM 39 48

Each line in the SpellMapper output is tab-separated and consists of 4 columns:

1. ASR hypothesis (same as in the input)
2. texts of 10 candidates, separated by semicolons (same as in the input)
3. fragment predictions, separated by semicolons; each prediction is a tuple (start, end, candidate_id, probability)
4. letter predictions - the candidate_id predicted for each letter (for debugging purposes only)

Example output (in one line):

.. code::

t h e _ t a r a s i c _ o o r d a _ i s _ a _ p a r t _ o f _ t h e _ a o r t a _ l o c a t e d _ i n _ t h e _ t h o r a x
h e p a t i c _ c i r r h o s i s;u r a c i l;c a r d i a c _ a r r e s t;w e a n;a p g a r;p s y c h o m o t o r;t h o r a x;t h o r a c i c _ a o r t a;a v f;b l o c k a d e d
56 62 7 0.99998;4 20 8 0.95181;12 20 8 0.44829;4 17 8 0.99464;12 17 8 0.97645
8 8 8 0 8 8 8 8 8 8 8 8 8 8 8 8 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 7 7 7 7 7
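
A small sketch of consuming such a line: parse the fragment predictions and apply the most confident non-overlapping replacements. The threshold and the overlap handling are illustrative, not the exact logic of the shipped pipeline; note that end coordinates are exclusive:

.. code:: python

    def apply_spellmapper_output(line: str, threshold: float = 0.9) -> str:
        hyp_col, cand_col, frag_col, _letters = line.split("\t")
        # columns store space-separated characters; collapse them back
        hyp = hyp_col.replace(" ", "")
        candidates = [c.replace(" ", "") for c in cand_col.split(";")]
        spans = []
        for frag in frag_col.split(";"):
            start, end, cand_id, prob = frag.split()
            if float(prob) >= threshold:
                spans.append((float(prob), int(start), int(end), int(cand_id)))
        spans.sort(reverse=True)  # most confident first
        taken = [False] * len(hyp)
        picked = []
        for prob, start, end, cand_id in spans:
            if not any(taken[start:end]):  # skip overlaps with better spans
                picked.append((start, end, candidates[cand_id - 1]))
                taken[start:end] = [True] * (end - start)
        # replace right-to-left so earlier coordinates stay valid
        for start, end, text in sorted(picked, reverse=True):
            hyp = hyp[:start] + text + hyp[end:]
        return hyp.replace("_", " ")

On the example above this yields "the thoracic aorta is a part of the aorta located in the thorax".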

Training Data Format
--------------------

For training, the data should consist of 5 files:

- ``config.json`` - BERT config
- ``label_map.txt`` - labels from 0 to 10, do not change
- ``semiotic_classes.txt`` - currently there are only two classes: ``PLAIN`` and ``CUSTOM``, do not change
- ``train.tsv`` - training examples
- ``test.tsv`` - validation examples

Note that since all these examples are synthetic, we do not reserve a set for final testing. Instead, we run the `inference pipeline <https://github.com/NVIDIA/NeMo/blob/stable/examples/nlp/spellchecking_asr_customization/run_infer.sh>`__ and compare the resulting word error rate (WER) to the WER of the baseline ASR output.
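
Any standard WER implementation works for this comparison; a minimal word-level edit-distance sketch:

.. code:: python

    def word_error_rate(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dynamic-programming edit distance over words
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            cur = [i] + [0] * len(hyp)
            for j, h in enumerate(hyp, 1):
                cur[j] = min(prev[j] + 1,              # deletion
                             cur[j - 1] + 1,           # insertion
                             prev[j - 1] + (r != h))   # substitution or match
            prev = cur
        return prev[len(hyp)] / max(len(ref), 1)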

One (non-tarred) training example should consist of 4 tab-separated columns:

1. text of the ASR hypothesis
2. texts of 10 candidates, separated by semicolons
3. 1-based ids of correct candidates, separated by spaces, or 0 if there are none
4. start/end coordinates of correct candidates (corresponding to the ids in the third column)

Example (in one line):

.. code::

a s t r o n o m e r s _ d i d i e _ s o m o n _ a n d _ t r i s t i a n _ g l l o
d i d i e r _ s a u m o n;a s t r o n o m i e;t r i s t a n _ g u i l l o t;t r i s t e s s e;m o n a d e;c h r i s t i a n;a s t r o n o m e r;s o l o m o n;d i d i d i d i d i;m e r c y
1 3
CUSTOM 12 23;CUSTOM 28 41

For data preparation, see `this script <https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh>`__.


References
----------

.. bibliography:: nlp_all.bib
:style: plain
:labelprefix: NLP-NER
:keyprefix: nlp-ner-
3 changes: 3 additions & 0 deletions docs/source/starthere/tutorials.rst
@@ -130,6 +130,9 @@ To run a tutorial:
* - NLP
- Punctuation and Capitalization
- `Punctuation and Capitalization <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Punctuation_and_Capitalization.ipynb>`_
* - NLP
- Spellchecking ASR Customization - SpellMapper
- `Spellchecking ASR Customization - SpellMapper <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/SpellMapper_English_ASR_Customization.ipynb>`_
* - NLP
- Entity Linking
- `Entity Linking <https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/nlp/Entity_Linking_Medical.ipynb>`_
32 changes: 32 additions & 0 deletions examples/nlp/spellchecking_asr_customization/README.md
@@ -0,0 +1,32 @@
# SpellMapper - spellchecking model for ASR Customization

This model is inspired by Microsoft's paper https://arxiv.org/pdf/2203.00888.pdf, but does not repeat its implementation.
The goal is to build a model that takes as input a single ASR hypothesis (text) and a vocabulary of custom words/phrases, and predicts which fragments in the ASR hypothesis should be replaced by which custom words/phrases, if any.
Our model is non-autoregressive (NAR) and based on the transformer architecture (BERT with multiple separators).

As initial data we use about 5 million entities from the [YAGO corpus](https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/downloads/). These entities are short phrases from Wikipedia headings.
To obtain misspelled variants, we feed these data to a TTS model and then to an ASR model.
Having a "parallel" corpus of "correct + misspelled" phrases, we use statistical machine translation techniques to create a dictionary of possible ngram mappings with their respective frequencies.
We create an auxiliary algorithm that takes as input a sentence (ASR hypothesis) and a large custom dictionary (e.g. 5000 phrases) and selects the top 10 candidate phrases that are most likely contained in this sentence in misspelled form.
The task of our final neural model is to predict which fragments in the ASR hypothesis should be replaced by which of the top-10 candidate phrases, if any.
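
A toy sketch of the ngram-mapping dictionary idea (the aligned spans here are hypothetical and hand-picked; the real pipeline derives alignments statistically from the parallel corpus):

```python
from collections import Counter

# pairs of (correct, misspelled) character ngrams, assumed pre-aligned
aligned_pairs = [
    [("sau", "so"), ("mon", "mon")],   # saumon  -> somon
    [("gui", "g"), ("llot", "llo")],   # guillot -> gllo
]

mapping_counts = Counter()
for spans in aligned_pairs:
    for correct, misspelled in spans:
        mapping_counts[(correct, misspelled)] += 1

for (src, dst), freq in mapping_counts.most_common():
    print(f"{src} -> {dst}: {freq}")
```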

The pipeline consists of multiple steps:

1. Download or generate training data.
See `https://github.com/bene-ges/nemo_compatible/tree/main/scripts/nlp/en_spellmapper/dataset_preparation`

2. [Optional] Convert training dataset to tarred files.
`convert_dataset_to_tarred.sh`

3. Train spellchecking model.
`run_training.sh`
or
`run_training_tarred.sh`

4. Run evaluation.
- [test_on_kensho.sh](https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/evaluation/test_on_kensho.sh)
- [test_on_userlibri.sh](https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/evaluation/test_on_userlibri.sh)
- [test_on_spoken_wikipedia.sh](https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/evaluation/test_on_spoken_wikipedia.sh)

5. Run inference.
`run_infer.sh`
38 changes: 38 additions & 0 deletions examples/nlp/spellchecking_asr_customization/checkpoint_to_nemo.py
@@ -0,0 +1,38 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


"""
This script converts a checkpoint .ckpt file to a .nemo file.

This script uses the `examples/nlp/spellchecking_asr_customization/conf/spellchecking_asr_customization_config.yaml`
config file by default. Another config file can be set via the command-line
argument `--config-name=CONFIG_FILE_PATH`.
"""

from omegaconf import DictConfig, OmegaConf

from nemo.collections.nlp.models import SpellcheckingAsrCustomizationModel
from nemo.core.config import hydra_runner
from nemo.utils import logging


@hydra_runner(config_path="conf", config_name="spellchecking_asr_customization_config")
def main(cfg: DictConfig) -> None:
logging.debug(f'Config Params: {OmegaConf.to_yaml(cfg)}')
SpellcheckingAsrCustomizationModel.load_from_checkpoint(cfg.checkpoint_path).save_to(cfg.target_nemo_path)


if __name__ == "__main__":
main()
97 changes: 97 additions & 0 deletions examples/nlp/spellchecking_asr_customization/conf/spellchecking_asr_customization_config.yaml
@@ -0,0 +1,97 @@
name: &name spellchecking
lang: ??? # e.g. 'ru', 'en'

# Pretrained Nemo Models
pretrained_model: null

trainer:
  devices: 1 # the number of gpus, 0 for CPU
  num_nodes: 1
  max_epochs: 3 # the number of training epochs
  enable_checkpointing: false # provided by exp_manager
  logger: false # provided by exp_manager
  accumulate_grad_batches: 1 # accumulates grads every k batches
  gradient_clip_val: 0.0
  precision: 32 # Should be set to 16 for O1 and O2 to enable the AMP.
  accelerator: gpu
  strategy: ddp
  log_every_n_steps: 1 # Interval of logging.
  val_check_interval: 1.0 # Set to 0.25 to check 4 times per epoch, or an int for number of iterations
  resume_from_checkpoint: null # The path to a checkpoint file to continue the training, restores the whole state including the epoch, step, LR schedulers, apex, etc.

model:
  do_training: true
  label_map: ??? # path/.../label_map.txt
  semiotic_classes: ??? # path/.../semiotic_classes.txt
  max_sequence_len: 128
  lang: ${lang}
  hidden_size: 768

  optim:
    name: adamw
    lr: 3e-5
    weight_decay: 0.1

    sched:
      name: WarmupAnnealing

      # pytorch lightning args
      monitor: val_loss
      reduce_on_plateau: false

      # scheduler config override
      warmup_ratio: 0.1
      last_epoch: -1

  language_model:
    pretrained_model_name: bert-base-uncased # For ru, try DeepPavlov/rubert-base-cased | For de or multilingual, try bert-base-multilingual-cased
    lm_checkpoint: null
    config_file: null # json file, precedence over config
    config: null

  tokenizer:
    tokenizer_name: ${model.language_model.pretrained_model_name} # or sentencepiece
    vocab_file: null # path to vocab file
    tokenizer_model: null # only used if tokenizer is sentencepiece
    special_tokens: null

exp_manager:
  exp_dir: nemo_experiments # where to store logs and checkpoints
  name: training # name of experiment
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    save_top_k: 3
    monitor: "val_loss"
    mode: "min"

tokenizer:
  tokenizer_name: ${model.transformer} # or sentencepiece
  vocab_file: null # path to vocab file
  tokenizer_model: null # only used if tokenizer is sentencepiece
  special_tokens: null

# Data
data:
  train_ds:
    data_path: ??? # provide the full path to the file
    batch_size: 8
    shuffle: true
    num_workers: 3
    pin_memory: false
    drop_last: false

  validation_ds:
    data_path: ??? # provide the full path to the file.
    batch_size: 8
    shuffle: false
    num_workers: 3
    pin_memory: false
    drop_last: false


# Inference
inference:
  from_file: null # Path to the raw text, no labels required. Each sentence on a separate line
  out_file: null # Path to the output file
  batch_size: 16 # batch size for inference.from_file
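
The config can also be inspected or overridden programmatically; a short sketch using OmegaConf (the paths are placeholders):

```python
from omegaconf import OmegaConf

cfg = OmegaConf.load("conf/spellchecking_asr_customization_config.yaml")
cfg.lang = "en"
cfg.model.label_map = "/path/to/label_map.txt"                 # placeholder
cfg.model.semiotic_classes = "/path/to/semiotic_classes.txt"   # placeholder
print(OmegaConf.to_yaml(cfg.trainer))
```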
50 changes: 50 additions & 0 deletions examples/nlp/spellchecking_asr_customization/convert_dataset_to_tarred.sh
@@ -0,0 +1,50 @@
# Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Path to NeMo repository
NEMO_PATH=NeMo

DATA_PATH="data_folder"

## data_folder_example
## ├── tarred_data
## │   └── (output)
## ├── config.json
## ├── label_map.txt
## ├── semiotic_classes.txt
## ├── test.tsv
## ├── 1.tsv
## ├── ...
## └── 200.tsv

## Each of the {1..200}.tsv input files is a 110,000-example subset of all.tsv (excluding the validation part),
## generated by https://github.com/bene-ges/nemo_compatible/blob/main/scripts/nlp/en_spellmapper/dataset_preparation/build_training_data.sh
## Note that in this example we take 110,000 examples as input but pack only 100,000 of them into each tar file.
## This is because some input examples (e.g. ones that are too long) can be skipped during preprocessing,
## and we want all tar files to contain a fixed, equal number of examples.

for part in {1..200}
do
python ${NEMO_PATH}/examples/nlp/spellchecking_asr_customization/create_tarred_dataset.py \
lang="en" \
data.train_ds.data_path=${DATA_PATH}/${part}.tsv \
data.validation_ds.data_path=${DATA_PATH}/test.tsv \
model.max_sequence_len=256 \
model.language_model.pretrained_model_name=huawei-noah/TinyBERT_General_6L_768D \
model.language_model.config_file=${DATA_PATH}/config.json \
model.label_map=${DATA_PATH}/label_map.txt \
model.semiotic_classes=${DATA_PATH}/semiotic_classes.txt \
+output_tar_file=${DATA_PATH}/tarred_data/part${part}.tar \
+take_first_n_lines=100000
done
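
The {1..200}.tsv chunks themselves can be produced with a simple splitter; a sketch assuming all.tsv is the combined output of build_training_data.sh (file names are illustrative):

```python
# split all.tsv into 110,000-line chunks named 1.tsv, 2.tsv, ...
CHUNK = 110_000

with open("all.tsv", encoding="utf-8") as src:
    part, out, written = 1, None, 0
    for line in src:
        if out is None:
            out = open(f"{part}.tsv", "w", encoding="utf-8")
        out.write(line)
        written += 1
        if written == CHUNK:
            out.close()
            out, written, part = None, 0, part + 1
    if out is not None:
        out.close()  # trailing partial chunk
```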