Commit
Spellchecking ASR customization model (#6179)
* bug fixes Signed-off-by: Alexandra Antonova <[email protected]>
* fix bugs, add preparation and evaluation scripts, add readme Signed-off-by: Alexandra Antonova <[email protected]>
* small fixes Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add real coverage calculation, small fixes, more debug information Signed-off-by: Alexandra Antonova <[email protected]>
* add option to pass a filelist and output folder - to handle inference from multiple input files Signed-off-by: Alexandra Antonova <[email protected]>
* added preprocessing for yago wikipedia articles - finding yago entities and their subphrases Signed-off-by: Alexandra Antonova <[email protected]>
* yago wiki preprocessing, sampling, pseudonormalization Signed-off-by: Alexandra Antonova <[email protected]>
* more scripts for preparation of training examples Signed-off-by: Alexandra Antonova <[email protected]>
* bug fixes Signed-off-by: Alexandra Antonova <[email protected]>
* add some alphabet checks Signed-off-by: Alexandra Antonova <[email protected]>
* add bert on subwords, concatenate it to bert on characters Signed-off-by: Alexandra Antonova <[email protected]>
* add calculation of character_pos_to_subword_pos Signed-off-by: Alexandra Antonova <[email protected]>
* bug fix Signed-off-by: Alexandra Antonova <[email protected]>
* bug fix Signed-off-by: Alexandra Antonova <[email protected]>
* pdb Signed-off-by: Alexandra Antonova <[email protected]>
* tensor join bug fix Signed-off-by: Alexandra Antonova <[email protected]>
* double hidden_size in classifier Signed-off-by: Alexandra Antonova <[email protected]>
* pdb Signed-off-by: Alexandra Antonova <[email protected]>
* default index value 0 instead of -1 because index cannot be negative Signed-off-by: Alexandra Antonova <[email protected]>
* pad index value 0 instead of -1 because index cannot be negative Signed-off-by: Alexandra Antonova <[email protected]>
* remove pdb Signed-off-by: Alexandra Antonova <[email protected]>
* fix bugs, add creation of tarred dataset Signed-off-by: Alexandra Antonova <[email protected]>
* add possibility to change sequence len at inference Signed-off-by: Alexandra Antonova <[email protected]>
* change sampling of dummy candidates at inference, add candidate info file Signed-off-by: Alexandra Antonova <[email protected]>
* fix import Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug Signed-off-by: Alexandra Antonova <[email protected]>
* update transcription now uses info Signed-off-by: Alexandra Antonova <[email protected]>
* write path Signed-off-by: Alexandra Antonova <[email protected]>
* 1. add tarred dataset support(untested). 2. fix bug with ban_ngrams in indexing Signed-off-by: Alexandra Antonova <[email protected]>
* skip short_sent if no real candidates Signed-off-by: Alexandra Antonova <[email protected]>
* fix import Signed-off-by: Alexandra Antonova <[email protected]>
* add braceexpand Signed-off-by: Alexandra Antonova <[email protected]>
* fixes Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug in np.ones Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug in collate Signed-off-by: Alexandra Antonova <[email protected]>
* change tensor type to long because of error in torch.gather Signed-off-by: Alexandra Antonova <[email protected]>
* fix for empty spans tensor Signed-off-by: Alexandra Antonova <[email protected]>
* same fixes in _collate_fn for tarred dataset Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug from previous commit Signed-off-by: Alexandra Antonova <[email protected]>
* change int types to be shorter to minimize tar size Signed-off-by: Alexandra Antonova <[email protected]>
* refactoring of datasets and inference Signed-off-by: Alexandra Antonova <[email protected]>
* bug fix Signed-off-by: Alexandra Antonova <[email protected]>
* bug fix Signed-off-by: Alexandra Antonova <[email protected]>
* bug fix Signed-off-by: Alexandra Antonova <[email protected]>
* tar by 100k examples, small fixes Signed-off-by: Alexandra Antonova <[email protected]>
* small fixes, add analytics script Signed-off-by: Alexandra Antonova <[email protected]>
* Add functions for dynamic programming comparison to get best path by ngrams Signed-off-by: Alexandra Antonova <[email protected]>
* fixes Signed-off-by: Alexandra Antonova <[email protected]>
* small fix Signed-off-by: Alexandra Antonova <[email protected]>
* fixes to support testing on SPGISpeech Signed-off-by: Alexandra Antonova <[email protected]>
* add preprocessing for userlibri Signed-off-by: Alexandra Antonova <[email protected]>
* some refactoring Signed-off-by: Alexandra Antonova <[email protected]>
* some refactoring Signed-off-by: Alexandra Antonova <[email protected]>
* move some functions to utils to reuse from other project Signed-off-by: Alexandra Antonova <[email protected]>
* move some functions to utils to reuse from other project Signed-off-by: Alexandra Antonova <[email protected]>
* move some functions to utils to reuse from other project Signed-off-by: Alexandra Antonova <[email protected]>
* small refactoring before pr. Add bash-scripts reproducing evaluation Signed-off-by: Alexandra Antonova <[email protected]>
* style fix Signed-off-by: Alexandra Antonova <[email protected]>
* small fixes in inference Signed-off-by: Alexandra Antonova <[email protected]>
* bug fix - didn't move window on last symbol Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix bug - shuffle was before truncation of sorted candidates Signed-off-by: Alexandra Antonova <[email protected]>
* refactoring, fix some bugs Signed-off-by: Alexandra Antonova <[email protected]>
* various fixes. Add word_indices at inference Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add candidate positions Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Move data preparation and evaluation to other repo Signed-off-by: Alexandra Antonova <[email protected]>
* add infer_reproduce_paper. Refactoring Signed-off-by: Alexandra Antonova <[email protected]>
* refactor inference using fragment indices Signed-off-by: Alexandra Antonova <[email protected]>
* add some helper functions Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix bug with parameters order Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix bugs Signed-off-by: Alexandra Antonova <[email protected]>
* refactoring, fix bug Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add multiple variants of adjusting start/end positions Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* more fixes Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add unit tests, other fixes Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix Signed-off-by: Alexandra Antonova <[email protected]>
* fix CodeQl warnings Signed-off-by: Alexandra Antonova <[email protected]>
* bug fixes Signed-off-by: Alexandra Antonova <[email protected]>
* fix bugs, add preparation and evaluation scripts, add readme Signed-off-by: Alexandra Antonova <[email protected]>
* small fixes Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add real coverage calculation, small fixes, more debug information Signed-off-by: Alexandra Antonova <[email protected]>
* add option to pass a filelist and output folder - to handle inference from multiple input files Signed-off-by: Alexandra Antonova <[email protected]>
* added preprocessing for yago wikipedia articles - finding yago entities and their subphrases Signed-off-by: Alexandra Antonova <[email protected]>
* yago wiki preprocessing, sampling, pseudonormalization Signed-off-by: Alexandra Antonova <[email protected]>
* more scripts for preparation of training examples Signed-off-by: Alexandra Antonova <[email protected]>
* bug fixes Signed-off-by: Alexandra Antonova <[email protected]>
* add some alphabet checks Signed-off-by: Alexandra Antonova <[email protected]>
* add bert on subwords, concatenate it to bert on characters Signed-off-by: Alexandra Antonova <[email protected]>
* add calculation of character_pos_to_subword_pos Signed-off-by: Alexandra Antonova <[email protected]>
* bug fix Signed-off-by: Alexandra Antonova <[email protected]>
* bug fix Signed-off-by: Alexandra Antonova <[email protected]>
* pdb Signed-off-by: Alexandra Antonova <[email protected]>
* tensor join bug fix Signed-off-by: Alexandra Antonova <[email protected]>
* double hidden_size in classifier Signed-off-by: Alexandra Antonova <[email protected]>
* pdb Signed-off-by: Alexandra Antonova <[email protected]>
* default index value 0 instead of -1 because index cannot be negative Signed-off-by: Alexandra Antonova <[email protected]>
* pad index value 0 instead of -1 because index cannot be negative Signed-off-by: Alexandra Antonova <[email protected]>
* remove pdb Signed-off-by: Alexandra Antonova <[email protected]>
* fix bugs, add creation of tarred dataset Signed-off-by: Alexandra Antonova <[email protected]>
* add possibility to change sequence len at inference Signed-off-by: Alexandra Antonova <[email protected]>
* change sampling of dummy candidates at inference, add candidate info file Signed-off-by: Alexandra Antonova <[email protected]>
* fix import Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug Signed-off-by: Alexandra Antonova <[email protected]>
* update transcription now uses info Signed-off-by: Alexandra Antonova <[email protected]>
* write path Signed-off-by: Alexandra Antonova <[email protected]>
* 1. add tarred dataset support(untested). 2. fix bug with ban_ngrams in indexing Signed-off-by: Alexandra Antonova <[email protected]>
* skip short_sent if no real candidates Signed-off-by: Alexandra Antonova <[email protected]>
* fix import Signed-off-by: Alexandra Antonova <[email protected]>
* add braceexpand Signed-off-by: Alexandra Antonova <[email protected]>
* fixes Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug in np.ones Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug in collate Signed-off-by: Alexandra Antonova <[email protected]>
* change tensor type to long because of error in torch.gather Signed-off-by: Alexandra Antonova <[email protected]>
* fix for empty spans tensor Signed-off-by: Alexandra Antonova <[email protected]>
* same fixes in _collate_fn for tarred dataset Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug from previous commit Signed-off-by: Alexandra Antonova <[email protected]>
* change int types to be shorter to minimize tar size Signed-off-by: Alexandra Antonova <[email protected]>
* refactoring of datasets and inference Signed-off-by: Alexandra Antonova <[email protected]>
* bug fix Signed-off-by: Alexandra Antonova <[email protected]>
* bug fix Signed-off-by: Alexandra Antonova <[email protected]>
* bug fix Signed-off-by: Alexandra Antonova <[email protected]>
* tar by 100k examples, small fixes Signed-off-by: Alexandra Antonova <[email protected]>
* small fixes, add analytics script Signed-off-by: Alexandra Antonova <[email protected]>
* Add functions for dynamic programming comparison to get best path by ngrams Signed-off-by: Alexandra Antonova <[email protected]>
* fixes Signed-off-by: Alexandra Antonova <[email protected]>
* small fix Signed-off-by: Alexandra Antonova <[email protected]>
* fixes to support testing on SPGISpeech Signed-off-by: Alexandra Antonova <[email protected]>
* add preprocessing for userlibri Signed-off-by: Alexandra Antonova <[email protected]>
* some refactoring Signed-off-by: Alexandra Antonova <[email protected]>
* some refactoring Signed-off-by: Alexandra Antonova <[email protected]>
* move some functions to utils to reuse from other project Signed-off-by: Alexandra Antonova <[email protected]>
* move some functions to utils to reuse from other project Signed-off-by: Alexandra Antonova <[email protected]>
* move some functions to utils to reuse from other project Signed-off-by: Alexandra Antonova <[email protected]>
* small refactoring before pr. Add bash-scripts reproducing evaluation Signed-off-by: Alexandra Antonova <[email protected]>
* style fix Signed-off-by: Alexandra Antonova <[email protected]>
* small fixes in inference Signed-off-by: Alexandra Antonova <[email protected]>
* bug fix - didn't move window on last symbol Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix bug - shuffle was before truncation of sorted candidates Signed-off-by: Alexandra Antonova <[email protected]>
* refactoring, fix some bugs Signed-off-by: Alexandra Antonova <[email protected]>
* various fixes. Add word_indices at inference Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add candidate positions Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Move data preparation and evaluation to other repo Signed-off-by: Alexandra Antonova <[email protected]>
* add infer_reproduce_paper. Refactoring Signed-off-by: Alexandra Antonova <[email protected]>
* refactor inference using fragment indices Signed-off-by: Alexandra Antonova <[email protected]>
* add some helper functions Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix bug with parameters order Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix bugs Signed-off-by: Alexandra Antonova <[email protected]>
* refactoring, fix bug Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add multiple variants of adjusting start/end positions Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* more fixes Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* add unit tests, other fixes Signed-off-by: Alexandra Antonova <[email protected]>
* fix Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix CodeQl warnings Signed-off-by: Alexandra Antonova <[email protected]>
* add script for full inference pipeline, refactoring Signed-off-by: Alexandra Antonova <[email protected]>
* add tutorial Signed-off-by: Alexandra Antonova <[email protected]>
* take example data from HuggingFace Signed-off-by: Alexandra Antonova <[email protected]>
* add docs Signed-off-by: Alexandra Antonova <[email protected]>
* fix comment Signed-off-by: Alexandra Antonova <[email protected]>
* fix bug Signed-off-by: Alexandra Antonova <[email protected]>
* small fixes for PR Signed-off-by: Alexandra Antonova <[email protected]>
* add some more tests Signed-off-by: Alexandra Antonova <[email protected]>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* try to fix tests adding with_downloads Signed-off-by: Alexandra Antonova <[email protected]>
* skip tests with tokenizer download Signed-off-by: Alexandra Antonova <[email protected]>
---------
Signed-off-by: Alexandra Antonova <[email protected]>
Signed-off-by: Alexandra Antonova <[email protected]>
Co-authored-by: Alexandra Antonova <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>