Rewrite train.sh to train.py (#842)
* Add a run_pipeline utility

* Add more tests for training

* Rewrite train.sh into train.py

* Add the pipeline to the PYTHONPATH

* Ensure that the W&B tracker throws errors in CI

* Add the Taskcluster environment variables so test-fast works on the train test

* Address review comments
gregtatum authored Sep 18, 2024
1 parent d7235e0 commit 9d355d8
Showing 20 changed files with 729 additions and 216 deletions.
3 changes: 1 addition & 2 deletions docs/opus-trainer.md
@@ -60,7 +60,7 @@ It likely will be the case when using a pre-trained student model as a backward
OpusTrainer configuration files for the trained models are located in
the [/pipeline/train/configs/opustrainer/](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/opustrainer/) directory.
- `<dataset0>`, `<dataset1>` and `<vocab>` will be replaced by the training datasets and a path to Sentencepiece `vocab.spm` passed in `pipeline/train/train.sh` script.
+ `{dataset0}`, `{dataset1}` and `{vocab}` will be replaced by the training datasets and a path to Sentencepiece `vocab.spm` passed in `pipeline/train/train.py` script.

See more details on configuration in the OpusTrainer [readme](https://github.com/hplt-project/OpusTrainer).
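The curly-brace placeholders are compatible with Python's `str.format`; as a rough sketch of how the substitution could work (the actual mechanism and paths used by `train.py` may differ):

```python
from pathlib import Path

# Hypothetical paths; in the pipeline the real values come from the task inputs.
template = Path("pipeline/train/configs/opustrainer/teacher.one-stage.yml").read_text()
config = template.format(
    dataset0="corpus/original.tsv",        # original parallel corpus
    dataset1="corpus/backtranslated.tsv",  # back-translated data
    vocab="vocab/vocab.spm",               # SentencePiece vocabulary
    seed=1111,
)
Path("opustrainer.config.yml").write_text(config)
```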

@@ -167,4 +167,3 @@ so it should only be used on small evaluation datasets.
- flores_aug-noise_devtest
- flores_aug-inline-noise_devtest
```

8 changes: 4 additions & 4 deletions docs/training-guide.md
@@ -139,10 +139,10 @@ For more details on data cleaning see the documents on [Data cleaning](cleaning.
## 4. Set hyperparameters

The pipeline supports overriding the default [Marian settings](https://marian-nmt.github.io/docs/cmd/marian/) in the training config. The default settings are in the `pipeline/train/configs` directory,
- for example [`teacher.train.yml`] and in the [`train.sh`] script.
+ for example [`teacher.train.yml`] and in the [`train.py`] script.

[teacher.train.yml]: https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/configs/training/teacher.train.yml
- [train.sh]: https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh
+ [train.py]: https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.py
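As a hedged sketch, such an override in the experiment config could look roughly like this (the `marian-args` key and the section names below are assumptions; check the existing configs for the exact schema, and the Marian docs for the flags themselves):

```yaml
marian-args:
  training-teacher:
    disp-freq: "10"       # log training status more often
    optimizer-delay: "2"  # accumulate gradients over 2 batches
  training-student:
    early-stopping: "20"
```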

### Model training

@@ -224,7 +224,7 @@ Find the full description of the pipeline steps [here](pipeline-steps.md).
### Cluster specific configuration

The Marian workspace is usually safe to set to about 3/4 of available GPU memory
- (in a [profile for Snakemake](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh) and throughout the ci steps in Task cluster).
+ (in a [profile for Snakemake](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.py) and throughout the ci steps in Task cluster).
Setting a higher value speeds up training but might lead to an out-of-GPU-memory error.
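For example, on a 16 GB GPU, 3/4 of memory is roughly 12000 MiB, which is the unit Marian's `--workspace` flag expects; where the value is set depends on your setup (Snakemake profile, Taskcluster worker config, or a `marian-args` override as sketched above):

```yaml
marian-args:
  training-teacher:
    workspace: "12000"  # ~3/4 of a 16 GB GPU, in MiB
```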

### Taskcluster
@@ -319,7 +319,7 @@ Taskcluster retries automatically.

Usually, by the time we train the student, there is so much data that it might not fit in 128 GB of RAM.
For very high-resource languages like French this can happen even earlier, at the backward/teacher training stage.
- The workaround is to remove `--shuffle-in-ram` from the [training script](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.sh)
+ The workaround is to remove `--shuffle-in-ram` from the [training script](https://github.com/mozilla/firefox-translations-training/tree/main/pipeline/train/train.py)
and add `--shuffle batches` instead.
More details in the [issue](https://github.com/mozilla/firefox-translations-training/issues/21).
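Conceptually, the change to the Marian flags passed by the training script looks like this (a sketch, not the literal lines in `train.py`):

```diff
- --shuffle-in-ram
+ --shuffle batches
```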

96 changes: 96 additions & 0 deletions pipeline/common/command_runner.py
@@ -0,0 +1,96 @@
import re
from shlex import join
import shlex
import subprocess


def _get_indented_command_string(command_parts: list[str]) -> str:
    """
    Print out a command with the flags indented, so that it's easy to read.
    """
    command = join(command_parts)
    parts = re.split(r"( --\w)", command)

    formatted_command = [parts[0].strip()]

    for i in range(1, len(parts), 2):
        option = parts[i].strip() + parts[i + 1].strip()
        formatted_command.append(f" {option}")

    return "\n".join(formatted_command)


def apply_command_args(dict: dict[str, any]):
    """
    Takes in a dictionary, and applies the keys as command line flags.

    input:  { "key": "value" }
    output: "--key value"

    input:  { "inputs": ["valueA", "valueB"] }
    output: "--inputs valueA valueB"
    """

    for key, value in dict.items():
        yield f"--{key}"
        if value is None:
            continue

        if isinstance(value, list):
            for v in value:
                yield str(v)
            continue

        yield str(value)


def run_command_pipeline(
    commands: list[list[str]], pipe_stderr=False, capture=False, logger=None
) -> str | None:
    """
    Executes a series of shell commands in a pipeline, where the output of one command
    is piped to the next. Optionally captures the final output or logs the pipeline
    process. It raises `CalledProcessError` if any command in the pipeline fails.

    Args:
        commands: A list of command arguments where each command is
            represented as a list of strings.
        pipe_stderr: If True, pipes `stderr` of each command into `stdout`.
        capture: If True, captures and returns the output of the final command in the
            pipeline. If False, output is printed to stdout. Defaults to False.
        logger: A logger instance used for logging the command execution. If provided,
            it will log the constructed pipeline commands. Defaults to None.

    Example:
        python_scripts = run_command_pipeline(
            [
                ["ls", "-l"],
                ["grep", ".py"],
                ["sort"]
            ],
            capture=True
        )
    """
    if pipe_stderr:
        joiner = "2>&1 |"
    else:
        joiner = "|"

    if logger:
        # Log out a nice representation of this command.
        final_command = _get_indented_command_string(commands[0])
        for command_parts in commands[1:]:
            final_command = (
                f"{final_command}\n{joiner} {_get_indented_command_string(command_parts)}"
            )

        logger.info("Running:")
        for line in final_command.split("\n"):
            logger.info(line)

    command_string = f" {joiner} ".join([shlex.join(command) for command in commands])

    if capture:
        return subprocess.check_output(command_string, shell=True).decode("utf-8")

    subprocess.check_call(command_string, shell=True)
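For illustration, a minimal usage sketch of the two public helpers above; the import path, file names, and Marian flags are assumptions rather than part of this commit:

```python
from pipeline.common.command_runner import apply_command_args, run_command_pipeline

# Expand a dictionary into "--flag value" style arguments (values are hypothetical).
marian_command = [
    "marian",
    *apply_command_args(
        {
            "model": "model.npz",
            "devices": [0, 1, 2, 3],
            "shuffle": "batches",
        }
    ),
]
# -> ["marian", "--model", "model.npz", "--devices", "0", "1", "2", "3", "--shuffle", "batches"]

# Pipe a zstd-compressed corpus through `wc -l` and capture the line count as a string.
line_count = run_command_pipeline(
    [
        ["zstdmt", "-dc", "corpus.en.zst"],
        ["wc", "-l"],
    ],
    pipe_stderr=False,
    capture=True,
)
```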
2 changes: 1 addition & 1 deletion pipeline/train/configs/opustrainer/backward.yml
@@ -1,5 +1,5 @@
datasets:
- original: <dataset0> # Original parallel corpus
+ original: {dataset0} # Original parallel corpus

stages:
- train
4 changes: 2 additions & 2 deletions pipeline/train/configs/opustrainer/student.yml
@@ -1,5 +1,5 @@
datasets:
- original: <dataset0> # Original parallel corpus
+ original: {dataset0} # Original parallel corpus

stages:
- train
@@ -26,7 +26,7 @@ modifiers:
# Tags modifier has to be the last one to retokenize the alignments
- Tags: 0.005
augment: 1
- spm_vocab: <vocab>
+ spm_vocab: {vocab}

seed: 1111
# parallel sentences + token alignments
6 changes: 3 additions & 3 deletions pipeline/train/configs/opustrainer/teacher.one-stage.yml
@@ -1,6 +1,6 @@
datasets:
- original: <dataset0> # Original parallel corpus
- backtranslated: <dataset1> # Back-translated data
+ original: {dataset0} # Original parallel corpus
+ backtranslated: {dataset1} # Back-translated data

stages:
- train
@@ -34,6 +34,6 @@ modifiers:


# random seed should be different for different teacher models
- seed: <seed>
+ seed: {seed}
# parallel sentences + token alignments
num_fields: 3
6 changes: 3 additions & 3 deletions pipeline/train/configs/opustrainer/teacher.two-stage.yml
@@ -1,6 +1,6 @@
datasets:
- original: <dataset0> # Original parallel corpus
- backtranslated: <dataset1> # Back-translated data
+ original: {dataset0} # Original parallel corpus
+ backtranslated: {dataset1} # Back-translated data

stages:
- pretrain
@@ -39,6 +39,6 @@ modifiers:


# random seed should be different for different teacher models
- seed: <seed>
+ seed: {seed}
# parallel sentences + token alignments
num_fields: 3
