Fix punctuation_capitalization_train_evaluate.py description (#3482)

* fix run script documentation
  Signed-off-by: PeganovAnton <[email protected]>
* Add missing parameters to examples in documentation
  Signed-off-by: PeganovAnton <[email protected]>
* Standardize format of paths and file names in docs examples
  Signed-off-by: PeganovAnton <[email protected]>
NVIDIA · Mar 2, 2022 · 4c9e2ed
1 parent 25a739d
Showing 6 changed files with 57 additions and 32 deletions.
diff --git a/docs/source/nlp/punctuation_and_capitalization.rst b/docs/source/nlp/punctuation_and_capitalization.rst
@@ -139,8 +139,8 @@ section), run the following command:
 .. code::
 
     python examples/nlp/token_classification/data/prepare_data_for_punctuation_capitalization.py \
-           -s <PATH_TO_THE_SOURCE_FILE>
-           -o <PATH_TO_THE_OUTPUT_DIRECTORY>
+           -s <PATH/TO/THE/SOURCE/FILE> \
+           -o <PATH/TO/THE/OUTPUT/DIRECTORY>
 
 
 Required Argument for Dataset Conversion
@@ -175,9 +175,9 @@ For creating of tarred dataset you will need data in NeMo format:
 .. code::
 
     python examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py \
-        --text <PATH_TO_LOWERCASED_TEXT_WITHOUT_PUNCTUATION> \
-        --labels <PATH_TO_LABELS_IN_NEMO_FORMAT> \
-        --output_dir <PATH_TO_DIRECTORY_WITH_OUTPUT_TARRED_DATASET> \
+        --text <PATH/TO/LOWERCASED/TEXT/WITHOUT/PUNCTUATION> \
+        --labels <PATH/TO/LABELS/IN/NEMO/FORMAT> \
+        --output_dir <PATH/TO/DIRECTORY/WITH/OUTPUT/TARRED/DATASET> \
         --num_batches_per_tarfile 100
 
 All tar files contain similar amount of batches, so up to :code:`--num_batches_per_tarfile - 1` batches will be
@@ -782,7 +782,11 @@ To train the model from scratch, run:
 
       python examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py \
              model.train_ds.ds_item=<PATH/TO/TRAIN/DATA_DIR> \
+             model.train_ds.text_file=<NAME_OF_TRAIN_INPUT_TEXT_FILE> \
+             model.train_ds.labels_file=<NAME_OF_TRAIN_LABELS_FILE> \
              model.validation_ds.ds_item=<PATH/TO/DEV/DATA_DIR> \
+             model.validation_ds.text_file=<NAME_OF_DEV_TEXT_FILE> \
+             model.validation_ds.labels_file=<NAME_OF_DEV_LABELS_FILE> \
              trainer.gpus=[0,1] \
              optim.name=adam \
              optim.lr=0.0001
@@ -796,7 +800,11 @@ To train from the pre-trained model, run:
 
       python examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py \
              model.train_ds.ds_item=<PATH/TO/TRAIN/DATA_DIR> \
-             model.validation_ds.ds_item=<PATH/TO/DEV/DATA_DIR> \
+             model.train_ds.text_file=<NAME_OF_TRAIN_INPUT_TEXT_FILE> \
+             model.train_ds.labels_file=<NAME_OF_TRAIN_LABELS_FILE> \
+             model.validation_ds.ds_item=<PATH/TO/DEV/DATA/DIR> \
+             model.validation_ds.text_file=<NAME_OF_DEV_TEXT_FILE> \
+             model.validation_ds.labels_file=<NAME_OF_DEV_LABELS_FILE> \
              pretrained_model=<PATH/TO/SAVE/.nemo>
 
 
@@ -816,18 +824,18 @@ Inference is performed by a script `examples/nlp/token_classification/punctuate_
 .. code::
 
     python punctuate_capitalize_infer.py \
-        --input_manifest <PATH_TO_INPUT_MANIFEST> \
-        --output_manifest <PATH_TO_OUTPUT_MANIFEST> \
+        --input_manifest <PATH/TO/INPUT/MANIFEST> \
+        --output_manifest <PATH/TO/OUTPUT/MANIFEST> \
         --pretrained_name punctuation_en_bert \
         --max_seq_length 64 \
         --margin 16 \
         --step 8
 
-:code:`<PATH_TO_INPUT_MANIFEST>` is a path to NeMo :ref:`ASR manifest<LibriSpeech_dataset>` with text in which you need to
+:code:`<PATH/TO/INPUT/MANIFEST>` is a path to NeMo :ref:`ASR manifest<LibriSpeech_dataset>` with text in which you need to
 restore punctuation and capitalization. If manifest contains :code:`'pred_text'` key, then :code:`'pred_text'` elements
 will be processed. Otherwise, punctuation and capitalization will be restored in :code:`'text'` elements.
 
-:code:`<PATH_TO_OUTPUT_MANIFEST>` is a path to NeMo ASR manifest into which result will be saved. The text with restored
+:code:`<PATH/TO/OUTPUT/MANIFEST>` is a path to NeMo ASR manifest into which result will be saved. The text with restored
 punctuation and capitalization is saved into :code:`'pred_text'` elements if :code:`'pred_text'` key is present in the
 input manifest. Otherwise result will be saved into :code:`'text'` elements.
 
@@ -874,8 +882,8 @@ To start evaluation of the pre-trained model, run:
            +model.to_testing=true \
            model.test_ds.ds_item=<PATH/TO/TEST/DATA/DIR>  \
            pretrained_model=punctuation_en_bert \
-           model.test_ds.text_file=<text_dev.txt> \
-           model.test_ds.labels_file=<labels_dev.txt>
+           model.test_ds.text_file=<NAME_OF_TEST_INPUT_TEXT_FILE> \
+           model.test_ds.labels_file=<NAME_OF_TEST_LABELS_FILE>
 
 
 Required Arguments
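
The examples above now pass a directory in `ds_item` and separate file names in `text_file` and `labels_file`. A minimal Python sketch of how such a command line could be assembled and sanity-checked before launching the script. It assumes the file names are bare names located inside the `ds_item` directory; the helper `build_overrides` and the sample paths are hypothetical and not part of NeMo:

```python
from pathlib import Path


def build_overrides(split, data_dir, text_file, labels_file):
    """Return Hydra-style overrides for one dataset split (hypothetical helper).

    Assumption: text_file and labels_file are bare file names inside data_dir.
    """
    for name in (text_file, labels_file):
        # Reject anything that looks like a path rather than a file name.
        if Path(name).name != name:
            raise ValueError(f"expected a file name, got a path: {name}")
    prefix = f"model.{split}_ds"
    return [
        f"{prefix}.ds_item={data_dir}",
        f"{prefix}.text_file={text_file}",
        f"{prefix}.labels_file={labels_file}",
    ]


# Illustrative paths and file names only.
train_args = build_overrides("train", "/data/train", "text_train.txt", "labels_train.txt")
dev_args = build_overrides("validation", "/data/dev", "text_dev.txt", "labels_dev.txt")
command = ["python", "punctuation_capitalization_train_evaluate.py", *train_args, *dev_args]
print(" \\\n    ".join(command))
```

The printed command has the same shape as the documented examples, with one `ds_item`, `text_file`, and `labels_file` override per split.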

diff --git a/examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py b/examples/nlp/token_classification/data/create_punctuation_capitalization_tarred_dataset.py
@@ -51,9 +51,9 @@
 Example of usage:
 
 python create_punctuation_capitalization_tarred_dataset.py \
-  --text <PATH_TO_TEXT_FILE> \
-  --labels <PATH_TO_LABELS_FILE> \
-  --output_dir <PATH_TO_OUTPUT_DIR> \
+  --text <PATH/TO/TEXT/FILE> \
+  --labels <PATH/TO/LABELS/FILE> \
+  --output_dir <PATH/TO/OUTPUT/DIR> \
   --lines_per_dataset_fragment 10000 \
   --tokens_in_batch 8000 \
   --num_batches_per_tarfile 5 \

diff --git a/examples/nlp/token_classification/data/prepare_data_for_punctuation_capitalization.py b/examples/nlp/token_classification/data/prepare_data_for_punctuation_capitalization.py
@@ -74,8 +74,8 @@
 section), run the following command:
 
     python examples/nlp/token_classification/data/prepare_data_for_punctuation_capitalization.py \
-           -s <PATH_TO_THE_SOURCE_FILE>
-           -o <PATH_TO_THE_OUTPUT_DIRECTORY>
+           -s <PATH/TO/THE/SOURCE/FILE> \
+           -o <PATH/TO/THE/OUTPUT/DIRECTORY>
 
 """
 

diff --git a/examples/nlp/token_classification/punctuate_capitalize_infer.py b/examples/nlp/token_classification/punctuate_capitalize_infer.py
@@ -28,13 +28,13 @@
 Usage example:
 
 python punctuate_capitalize.py \
-    --input_manifest <PATH_TO_INPUT_MANIFEST> \
-    --output_manifest <PATH_TO_OUTPUT_MANIFEST>
+    --input_manifest <PATH/TO/INPUT/MANIFEST> \
+    --output_manifest <PATH/TO/OUTPUT/MANIFEST>
 
-<PATH_TO_INPUT_MANIFEST> is a path to NeMo ASR manifest. Usually it is an output of
+<PATH/TO/INPUT/MANIFEST> is a path to NeMo ASR manifest. Usually it is an output of
     NeMo/examples/asr/transcribe_speech.py but can be a manifest with 'text' key. Alternatively you can use
     --input_text parameter for passing text for inference.
-<PATH_TO_OUTPUT_MANIFEST> is a path to NeMo ASR manifest into which script output will be written. Alternatively
+<PATH/TO/OUTPUT/MANIFEST> is a path to NeMo ASR manifest into which script output will be written. Alternatively
     you can use parameter --output_text.
 
 For more details on this script usage look in argparse help.
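
The documentation touched by this commit states that `'pred_text'` elements are processed when that key is present in the input manifest, and `'text'` elements otherwise, with results saved under the same key. A rough Python sketch of that selection rule, assuming one JSON object per manifest line and a per-record check; `restore_manifest_lines` and the `restore` callback are hypothetical stand-ins, not the script's actual implementation:

```python
import json


def pick_text_key(record):
    """Return the key whose text would be processed for this record."""
    return "pred_text" if "pred_text" in record else "text"


def restore_manifest_lines(lines, restore):
    """Apply `restore` to the selected field of every manifest line.

    Assumption: each line is a JSON object; `restore` stands in for the model.
    """
    output = []
    for line in lines:
        record = json.loads(line)
        key = pick_text_key(record)
        # The result is written back under the same key it was read from.
        record[key] = restore(record[key])
        output.append(json.dumps(record))
    return output


# Toy stand-in for the model: capitalize the first letter and add a period.
def toy_restore(text):
    return text.capitalize() + "."


manifest = [
    json.dumps({"audio_filepath": "a.wav", "text": "hello world"}),
    json.dumps({"audio_filepath": "b.wav", "text": "ref", "pred_text": "good morning"}),
]
result = [json.loads(line) for line in restore_manifest_lines(manifest, toy_restore)]
print(result)
```

In the second record only `pred_text` is changed and `text` is left untouched, which matches the rule described in the documentation.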

diff --git a/examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py b/examples/nlp/token_classification/punctuation_capitalization_train_evaluate.py
@@ -49,14 +49,22 @@
 
 To run this script and train the model from scratch, use:
     python punctuation_capitalization_train_evaluate.py \
-        model.train_ds.ds_item=<PATH_TO_TRAIN_DATA> \
-        model.validation_ds.ds_item=<PATH_TO_DEV_DATA>
+        model.train_ds.ds_item=<PATH/TO/TRAIN/DATA> \
+        model.train_ds.text_file=<NAME_OF_TRAIN_INPUT_TEXT_FILE> \
+        model.train_ds.labels_file=<NAME_OF_TRAIN_LABELS_FILE> \
+        model.validation_ds.ds_item=<PATH/TO/DEV/DATA> \
+        model.validation_ds.text_file=<NAME_OF_DEV_INPUT_TEXT_FILE> \
+        model.validation_ds.labels_file=<NAME_OF_DEV_LABELS_FILE>
 
 To use one of the pretrained versions of the model and finetune it, run:
     python punctuation_capitalization_train_evaluate.py \
         pretrained_model=punctuation_en_bert \
-        model.train_ds.ds_item=<PATH_TO_TRAIN_DATA> \
-        model.validation_ds.ds_item=<PATH_TO_DEV_DATA>
+        model.train_ds.ds_item=<PATH/TO/TRAIN/DATA> \
+        model.train_ds.text_file=<NAME_OF_TRAIN_INPUT_TEXT_FILE> \
+        model.train_ds.labels_file=<NAME_OF_TRAIN_LABELS_FILE> \
+        model.validation_ds.ds_item=<PATH/TO/DEV/DATA> \
+        model.validation_ds.text_file=<NAME_OF_DEV_INPUT_TEXT_FILE> \
+        model.validation_ds.labels_file=<NAME_OF_DEV_LABELS_FILE>
     
     pretrained_model   - pretrained PunctuationCapitalization model from list_available_models() or 
         path to a .nemo file, for example: punctuation_en_bert or model.nemo
@@ -65,15 +73,24 @@
     python punctuation_capitalization_train_evaluate.py \
         +do_testing=true \
         pretrained_model=punctuation_en_bert \
-        model.train_ds.ds_item=<PATH_TO_TRAIN_DATA> \
-        model.validation_ds.ds_item=<PATH_TO_DEV_DATA>
+        model.train_ds.ds_item=<PATH/TO/TRAIN/DATA> \
+        model.train_ds.text_file=<NAME_OF_TRAIN_INPUT_TEXT_FILE> \
+        model.train_ds.labels_file=<NAME_OF_TRAIN_LABELS_FILE> \
+        model.validation_ds.ds_item=<PATH/TO/DEV/DATA> \
+        model.validation_ds.text_file=<NAME_OF_DEV_INPUT_TEXT_FILE> \
+        model.validation_ds.labels_file=<NAME_OF_DEV_LABELS_FILE> \
+        model.test_ds.ds_item=<PATH/TO/TEST_DATA> \
+        model.test_ds.text_file=<NAME_OF_TEST_INPUT_TEXT_FILE> \
+        model.test_ds.labels_file=<NAME_OF_TEST_LABELS_FILE>
 
 Set `do_training` to `false` and `do_testing` to `true` to perform evaluation without training:
     python punctuation_capitalization_train_evaluate.py \
         +do_testing=true \
         +do_training=false \
         pretrained_model=punctuation_en_bert \
-        model.validation_ds.ds_item=<PATH_TO_DEV_DATA>
+        model.test_ds.ds_item=<PATH/TO/TEST/DATA> \
+        model.test_ds.text_file=<NAME_OF_TEST_INPUT_TEXT_FILE> \
+        model.test_ds.labels_file=<NAME_OF_TEST_LABELS_FILE>
 
 """
 

diff --git a/tutorials/nlp/Punctuation_and_Capitalization.ipynb b/tutorials/nlp/Punctuation_and_Capitalization.ipynb
@@ -814,17 +814,17 @@
     "\n",
     "```\n",
     "python punctuate_capitalize_infer.py \\\n",
-    "    --input_manifest <PATH_TO_INPUT_MANIFEST> \\\n",
-    "    --output_manifest <PATH_TO_OUTPUT_MANIFEST> \\\n",
+    "    --input_manifest <PATH/TO/INPUT/MANIFEST> \\\n",
+    "    --output_manifest <PATH/TO/OUTPUT/MANIFEST> \\\n",
     "    --pretrained_name punctuation_en_bert \\\n",
     "    --max_seq_length 64 \\\n",
     "    --margin 16 \\\n",
     "    --step 8\n",
     "```\n",
     "\n",
-    "`<PATH_TO_INPUT_MANIFEST>` is a path to NeMo [ASR manifest](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html) which contains text in which you need to restore punctuation and capitalization. If manifest contains `'pred_text'` key, then `'pred_text'` elements will be processed. Otherwise, punctuation and capitalization will be restored in `'text'` elements.\n",
+    "`<PATH/TO/INPUT/MANIFEST>` is a path to NeMo [ASR manifest](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/datasets.html) which contains text in which you need to restore punctuation and capitalization. If manifest contains `'pred_text'` key, then `'pred_text'` elements will be processed. Otherwise, punctuation and capitalization will be restored in `'text'` elements.\n",
     "\n",
-    "`<PATH_TO_OUTPUT_MANIFEST>` is a path to NeMo ASR manifest into which result will be saved. The text with restored\n",
+    "`<PATH/TO/OUTPUT/MANIFEST>` is a path to NeMo ASR manifest into which result will be saved. The text with restored\n",
     "punctuation and capitalization is saved into `'pred_text'` elements if `'pred_text'` key is present in\n",
     "input manifest. Otherwise result will be saved into `'text'` elements.\n",
     "\n",