Commit

Fix punctuation_capitalization_train_evaluate.py description (#3482)
* fix run script documentation

Signed-off-by: PeganovAnton <[email protected]>

* Add missing parameters to examples in documentation

Signed-off-by: PeganovAnton <[email protected]>

* Standardize format of paths and file names in docs examples

Signed-off-by: PeganovAnton <[email protected]>
PeganovAnton authored and fayejf committed Mar 2, 2022
1 parent 25a739d commit 4c9e2ed
Showing 6 changed files with 57 additions and 32 deletions.
32 changes: 20 additions & 12 deletions docs/source/nlp/punctuation_and_capitalization.rst
@@ -139,8 +139,8 @@ section), run the following command:
  .. code::
  python examples/nlp/token_classification/data/prepare_data_for_punctuation_capitalization.py \
- -s <PATH_TO_THE_SOURCE_FILE>
- -o <PATH_TO_THE_OUTPUT_DIRECTORY>
+ -s <PATH/TO/THE/SOURCE/FILE> \
+ -o <PATH/TO/THE/OUTPUT/DIRECTORY>
  Required Argument for Dataset Conversion
@@ -175,9 +175,9 @@ For creating of tarred dataset you will need data in NeMo format:
  .. code::
  python examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py \
- --text <PATH_TO_LOWERCASED_TEXT_WITHOUT_PUNCTUATION> \
- --labels <PATH_TO_LABELS_IN_NEMO_FORMAT> \
- --output_dir <PATH_TO_DIRECTORY_WITH_OUTPUT_TARRED_DATASET> \
+ --text <PATH/TO/LOWERCASED/TEXT/WITHOUT/PUNCTUATION> \
+ --labels <PATH/TO/LABELS/IN/NEMO/FORMAT> \
+ --output_dir <PATH/TO/DIRECTORY/WITH/OUTPUT/TARRED/DATASET> \
  --num_batches_per_tarfile 100
  All tar files contain similar amount of batches, so up to :code:`--num_batches_per_tarfile - 1` batches will be
@@ -782,7 +782,11 @@ To train the model from scratch, run:
  python examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py \
  model.train_ds.ds_item=<PATH/TO/TRAIN/DATA_DIR> \
+ model.train_ds.text_file=<NAME_OF_TRAIN_INPUT_TEXT_FILE> \
+ model.train_ds.labels_file=<NAME_OF_TRAIN_LABELS_FILE> \
  model.validation_ds.ds_item=<PATH/TO/DEV/DATA_DIR> \
+ model.validation_ds.text_file=<NAME_OF_DEV_TEXT_FILE> \
+ model.validation_ds.labels_file=<NAME_OF_DEV_LABELS_FILE> \
  trainer.gpus=[0,1] \
  optim.name=adam \
  optim.lr=0.0001
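The dataset overrides in the training command above can also be assembled programmatically. What follows is a minimal sketch and not part of the commit: `build_train_command` is a hypothetical helper, and the directory and file names passed to it are assumed placeholders. It only builds and prints the argument list; it does not launch training, and it leaves out the `trainer.*` and `optim.*` overrides.

```python
import shlex

# Script path as it appears in the documentation hunk above.
SCRIPT = "examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py"


def build_train_command(train_dir, dev_dir, train_text, train_labels, dev_text, dev_labels):
    """Return an argv list with the dataset overrides shown in the docs.

    Hypothetical helper: it only mirrors the `key=value` pairs from the diff.
    """
    overrides = {
        "model.train_ds.ds_item": train_dir,
        "model.train_ds.text_file": train_text,
        "model.train_ds.labels_file": train_labels,
        "model.validation_ds.ds_item": dev_dir,
        "model.validation_ds.text_file": dev_text,
        "model.validation_ds.labels_file": dev_labels,
    }
    return ["python", SCRIPT] + [f"{key}={value}" for key, value in overrides.items()]


# Placeholder values, chosen only for illustration.
cmd = build_train_command(
    "data/train", "data/dev",
    "text_train.txt", "labels_train.txt",
    "text_dev.txt", "labels_dev.txt",
)
print(shlex.join(cmd))
```

Keeping the overrides in one mapping makes it harder to forget the `text_file` and `labels_file` entries that this commit adds to the examples.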
@@ -796,7 +800,11 @@ To train from the pre-trained model, run:
  python examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py \
  model.train_ds.ds_item=<PATH/TO/TRAIN/DATA_DIR> \
- model.validation_ds.ds_item=<PATH/TO/DEV/DATA_DIR> \
+ model.train_ds.text_file=<NAME_OF_TRAIN_INPUT_TEXT_FILE> \
+ model.train_ds.labels_file=<NAME_OF_TRAIN_LABELS_FILE> \
+ model.validation_ds.ds_item=<PATH/TO/DEV/DATA/DIR> \
+ model.validation_ds.text_file=<NAME_OF_DEV_TEXT_FILE> \
+ model.validation_ds.labels_file=<NAME_OF_DEV_LABELS_FILE> \
  pretrained_model=<PATH/TO/SAVE/.nemo>
@@ -816,18 +824,18 @@ Inference is performed by a script `examples/nlp/token_classification/punctuate_
  .. code::
  python punctuate_capitalize_infer.py \
- --input_manifest <PATH_TO_INPUT_MANIFEST> \
- --output_manifest <PATH_TO_OUTPUT_MANIFEST> \
+ --input_manifest <PATH/TO/INPUT/MANIFEST> \
+ --output_manifest <PATH/TO/OUTPUT/MANIFEST> \
  --pretrained_name punctuation_en_bert \
  --max_seq_length 64 \
  --margin 16 \
  --step 8
- :code:`<PATH_TO_INPUT_MANIFEST>` is a path to NeMo :ref:`ASR manifest<LibriSpeech_dataset>` with text in which you need to
+ :code:`<PATH/TO/INPUT/MANIFEST>` is a path to NeMo :ref:`ASR manifest<LibriSpeech_dataset>` with text in which you need to
  restore punctuation and capitalization. If manifest contains :code:`'pred_text'` key, then :code:`'pred_text'` elements
  will be processed. Otherwise, punctuation and capitalization will be restored in :code:`'text'` elements.

- :code:`<PATH_TO_OUTPUT_MANIFEST>` is a path to NeMo ASR manifest into which result will be saved. The text with restored
+ :code:`<PATH/TO/OUTPUT/MANIFEST>` is a path to NeMo ASR manifest into which result will be saved. The text with restored
  punctuation and capitalization is saved into :code:`'pred_text'` elements if :code:`'pred_text'` key is present in the
  input manifest. Otherwise result will be saved into :code:`'text'` elements.
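
The `'pred_text'` versus `'text'` rule described above can be illustrated with a short sketch. It is not the code of `punctuate_capitalize_infer.py`. It assumes the manifest is a JSON Lines file with one JSON object per line, and it assumes that `'pred_text'` is chosen only when every entry has that key; the real script may decide differently. `toy_restore` is a hypothetical stand-in for the model.

```python
import json


def pick_text_key(items):
    # Assumption: use 'pred_text' only if every manifest entry carries it.
    if items and all("pred_text" in item for item in items):
        return "pred_text"
    return "text"


def process_manifest_lines(lines, restore):
    """Apply `restore` to the chosen key of every entry and return JSON lines."""
    items = [json.loads(line) for line in lines if line.strip()]
    key = pick_text_key(items)
    for item in items:
        # The result is written back under the same key it was read from.
        item[key] = restore(item[key])
    return [json.dumps(item) for item in items]


def toy_restore(text):
    # Hypothetical stand-in for the model: capitalize and add a period.
    return text[:1].upper() + text[1:] + "."


# An ASR-output style manifest (has 'pred_text') and a plain one (only 'text').
asr_lines = ['{"audio_filepath": "a.wav", "text": "ref one", "pred_text": "hello world"}']
plain_lines = ['{"text": "how are you"}']

result_asr = process_manifest_lines(asr_lines, toy_restore)
result_plain = process_manifest_lines(plain_lines, toy_restore)
```

In the first case only `'pred_text'` is modified and `'text'` is left as it was; in the second case `'text'` is modified because no `'pred_text'` key exists.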

@@ -874,8 +882,8 @@ To start evaluation of the pre-trained model, run:
  +model.to_testing=true \
  model.test_ds.ds_item=<PATH/TO/TEST/DATA/DIR> \
  pretrained_model=punctuation_en_bert \
- model.test_ds.text_file=<text_dev.txt> \
- model.test_ds.labels_file=<labels_dev.txt>
+ model.test_ds.text_file=<NAME_OF_TEST_INPUT_TEXT_FILE> \
+ model.test_ds.labels_file=<NAME_OF_TEST_LABELS_FILE>
  Required Arguments
@@ -51,9 +51,9 @@
  Example of usage:
  python create_punctuation_capitalization_tarred_dataset.py \
- --text <PATH_TO_TEXT_FILE> \
- --labels <PATH_TO_LABELS_FILE> \
- --output_dir <PATH_TO_OUTPUT_DIR> \
+ --text <PATH/TO/TEXT/FILE> \
+ --labels <PATH/TO/LABELS/FILE> \
+ --output_dir <PATH/TO/OUTPUT/DIR> \
  --lines_per_dataset_fragment 10000 \
  --tokens_in_batch 8000 \
  --num_batches_per_tarfile 5 \
@@ -74,8 +74,8 @@
  section), run the following command:
  python examples/nlp/token_classification/data/prepare_data_for_punctuation_capitalization.py \
- -s <PATH_TO_THE_SOURCE_FILE>
- -o <PATH_TO_THE_OUTPUT_DIRECTORY>
+ -s <PATH/TO/THE/SOURCE/FILE> \
+ -o <PATH/TO/THE/OUTPUT/DIRECTORY>
  """

@@ -28,13 +28,13 @@
  Usage example:
  python punctuate_capitalize.py \
- --input_manifest <PATH_TO_INPUT_MANIFEST> \
- --output_manifest <PATH_TO_OUTPUT_MANIFEST>
+ --input_manifest <PATH/TO/INPUT/MANIFEST> \
+ --output_manifest <PATH/TO/OUTPUT/MANIFEST>
- <PATH_TO_INPUT_MANIFEST> is a path to NeMo ASR manifest. Usually it is an output of
+ <PATH/TO/INPUT/MANIFEST> is a path to NeMo ASR manifest. Usually it is an output of
  NeMo/examples/asr/transcribe_speech.py but can be a manifest with 'text' key. Alternatively you can use
  --input_text parameter for passing text for inference.
- <PATH_TO_OUTPUT_MANIFEST> is a path to NeMo ASR manifest into which script output will be written. Alternatively
+ <PATH/TO/OUTPUT/MANIFEST> is a path to NeMo ASR manifest into which script output will be written. Alternatively
  you can use parameter --output_text.
  For more details on this script usage look in argparse help.
@@ -49,14 +49,22 @@
  To run this script and train the model from scratch, use:
  python punctuation_capitalization_train_evaluate.py \
- model.train_ds.ds_item=<PATH_TO_TRAIN_DATA> \
- model.validation_ds.ds_item=<PATH_TO_DEV_DATA>
+ model.train_ds.ds_item=<PATH/TO/TRAIN/DATA> \
+ model.train_ds.text_file=<NAME_OF_TRAIN_INPUT_TEXT_FILE> \
+ model.train_ds.labels_file=<NAME_OF_TRAIN_LABELS_FILE> \
+ model.validation_ds.ds_item=<PATH/TO/DEV/DATA> \
+ model.validation_ds.text_file=<NAME_OF_DEV_INPUT_TEXT_FILE> \
+ model.validation_ds.labels_file=<NAME_OF_DEV_LABELS_FILE>
  To use one of the pretrained versions of the model and finetune it, run:
  python punctuation_capitalization_train_evaluate.py \
  pretrained_model=punctuation_en_bert \
- model.train_ds.ds_item=<PATH_TO_TRAIN_DATA> \
- model.validation_ds.ds_item=<PATH_TO_DEV_DATA>
+ model.train_ds.ds_item=<PATH/TO/TRAIN/DATA> \
+ model.train_ds.text_file=<NAME_OF_TRAIN_INPUT_TEXT_FILE> \
+ model.train_ds.labels_file=<NAME_OF_TRAIN_LABELS_FILE> \
+ model.validation_ds.ds_item=<PATH/TO/DEV/DATA> \
+ model.validation_ds.text_file=<NAME_OF_DEV_INPUT_TEXT_FILE> \
+ model.validation_ds.labels_file=<NAME_OF_DEV_LABELS_FILE>
  pretrained_model - pretrained PunctuationCapitalization model from list_available_models() or
  path to a .nemo file, for example: punctuation_en_bert or model.nemo
@@ -65,15 +73,24 @@
  python punctuation_capitalization_train_evaluate.py \
  +do_testing=true \
  pretrained_model=punctuation_en_bert \
- model.train_ds.ds_item=<PATH_TO_TRAIN_DATA> \
- model.validation_ds.ds_item=<PATH_TO_DEV_DATA>
+ model.train_ds.ds_item=<PATH/TO/TRAIN/DATA> \
+ model.train_ds.text_file=<NAME_OF_TRAIN_INPUT_TEXT_FILE> \
+ model.train_ds.labels_file=<NAME_OF_TRAIN_LABELS_FILE> \
+ model.validation_ds.ds_item=<PATH/TO/DEV/DATA> \
+ model.validation_ds.text_file=<NAME_OF_DEV_INPUT_TEXT_FILE> \
+ model.validation_ds.labels_file=<NAME_OF_DEV_LABELS_FILE> \
+ model.test_ds.ds_item=<PATH/TO/TEST_DATA> \
+ model.test_ds.text_file=<NAME_OF_TEST_INPUT_TEXT_FILE> \
+ model.test_ds.labels_file=<NAME_OF_TEST_LABELS_FILE>
  Set `do_training` to `false` and `do_testing` to `true` to perform evaluation without training:
  python punctuation_capitalization_train_evaluate.py \
  +do_testing=true \
  +do_training=false \
  pretrained_model=punctuation_en_bert \
- model.validation_ds.ds_item=<PATH_TO_DEV_DATA>
+ model.test_ds.ds_item=<PATH/TO/TEST/DATA> \
+ model.test_ds.text_file=<NAME_OF_TEST_INPUT_TEXT_FILE> \
+ model.test_ds.labels_file=<NAME_OF_TEST_LABELS_FILE>
  """

8 changes: 4 additions & 4 deletions tutorials/nlp/Punctuation_and_Capitalization.ipynb
@@ -814,17 +814,17 @@
  "\n",
  "```\n",
  "python punctuate_capitalize_infer.py \\\n",
- " --input_manifest <PATH_TO_INPUT_MANIFEST> \\\n",
- " --output_manifest <PATH_TO_OUTPUT_MANIFEST> \\\n",
+ " --input_manifest <PATH/TO/INPUT/MANIFEST> \\\n",
+ " --output_manifest <PATH/TO/OUTPUT/MANIFEST> \\\n",
  " --pretrained_name punctuation_en_bert \\\n",
  " --max_seq_length 64 \\\n",
  " --margin 16 \\\n",
  " --step 8\n",
  "```\n",
  "\n",
- "`<PATH_TO_INPUT_MANIFEST>` is a path to NeMo [ASR manifest](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html) which contains text in which you need to restore punctuation and capitalization. If manifest contains `'pred_text'` key, then `'pred_text'` elements will be processed. Otherwise, punctuation and capitalization will be restored in `'text'` elements.\n",
+ "`<PATH/TO/INPUT/MANIFEST>` is a path to NeMo [ASR manifest](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html) which contains text in which you need to restore punctuation and capitalization. If manifest contains `'pred_text'` key, then `'pred_text'` elements will be processed. Otherwise, punctuation and capitalization will be restored in `'text'` elements.\n",
  "\n",
- "`<PATH_TO_OUTPUT_MANIFEST>` is a path to NeMo ASR manifest into which result will be saved. The text with restored\n",
+ "`<PATH/TO/OUTPUT/MANIFEST>` is a path to NeMo ASR manifest into which result will be saved. The text with restored\n",
  "punctuation and capitalization is saved into `'pred_text'` elements if `'pred_text'` key is present in\n",
  "input manifest. Otherwise result will be saved into `'text'` elements.\n",
  "\n",
