
Commit 78d79a5

Merge pull request #709 from allenai/shanea/debugging-docs
Add some docs about debugging
2 parents 9147889 + f2aa76a commit 78d79a5

3 files changed: +54 -0 lines

README.md (+4)
@@ -206,6 +206,10 @@ Note: passing CLI overrides like `--reset_trainer_state` is only necessary if yo
Additional tools for evaluating OLMo models are available at the [OLMo Eval](https://github.com/allenai/ai2-olmo-eval) repo.

## Debugging

See [Debugging](https://github.com/allenai/OLMo/blob/main/docs/NOTES.md#debugging).

## Citing

```bibtex

docs/NOTES.md (+22)
@@ -102,3 +102,25 @@ outputs = model.generate(input_tensor, max_steps=3, beam_size=3)
best_generation = outputs.token_ids[0][0].tolist()
print(tokenizer.decode(best_generation))
```

## Debugging

### Finding the cause of hangs

Hangs in distributed training can have several different causes, including bad user code, AMD/Nvidia memory-allocation issues, or problems with the hardware setup. These issues can be difficult to root-cause and even harder to fix.

One approach we use to find the cause of a hang in distributed training is to first identify which processes/nodes are hanging. The [scripts/pyspy_all_processes.sh](https://github.com/allenai/OLMo/blob/main/scripts/pyspy_all_processes.sh) script retrieves the Python state of the relevant Python processes using `py-spy`. A process/node whose state differs from the others may be the one that is hanging.

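For reference, a per-node stack dump along these lines yields the same kind of information. This is a minimal sketch, not the script's actual contents; in particular, the `pgrep` pattern matching `scripts/train.py` is an assumption about how the training processes are launched.

```shell
# Dump the Python stack of every training process on this node (may require
# root or CAP_SYS_PTRACE), then compare the dumps across nodes; a node whose
# stacks look different from the rest is a likely culprit.
for pid in $(pgrep -f 'scripts/train.py'); do
    echo "=== PID ${pid} ==="
    py-spy dump --pid "${pid}"
done
```
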
If a hang is suspected to be in GPU code, you can run `gcore <pid>` on a hanging process to get a core dump, then run `gdb <corefile>` to check where the code is hanging from a C++ perspective. Code that is stuck in a GPU memory allocation (malloc) may indicate a hardware/setup issue rather than a problem in the training code.

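As a rough illustration of that workflow (the dump location and the particular `gdb` flags below are assumptions; adjust for your system):

```shell
# Take a core dump of a hanging rank; gcore attaches and detaches without killing it.
gcore -o /tmp/olmo-hang <pid>

# Print a C/C++ backtrace for every thread in the dump. Threads stuck inside
# CUDA/ROCm allocation calls hint at a hardware/setup issue rather than training code.
gdb $(which python) -c /tmp/olmo-hang.<pid> -ex "thread apply all bt" -ex "quit"
```
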
### Comparing two models that should be identical

There are scenarios where one might want to investigate why two models/setups that should be identical yield different results. A naive solution is to run both setups side by side and compare the results manually, but this may not be possible if, for example, you have just one GPU.

An alternative for comparing OLMo models is to run the training of both models with the `--module_outputs_save_steps=[<list of steps>]` config option. This causes OLMo to save a portion of the inputs & outputs of each OLMo submodule into a `traces/` folder at the model step's save location. Then [scripts/compare_module_outputs.py](https://github.com/allenai/OLMo/blob/main/scripts/compare_module_outputs.py) can be used to compare these inputs & outputs, hopefully isolating the issue to a subset of model modules. See [scripts/compare_module_outputs.py](https://github.com/allenai/OLMo/blob/main/scripts/compare_module_outputs.py) for more details on its usage.

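A sketch of that workflow, using the flags described above. The config name, save folders, and exact trace paths are placeholders/assumptions; adapt them to your runs.

```shell
# Run both setups with submodule traces collected at the same step(s).
torchrun --nproc_per_node=8 scripts/train.py configs/my_config.yaml \
    --save_folder=runs/setup_a --module_outputs_save_steps='[1,2]'
torchrun --nproc_per_node=8 scripts/train.py configs/my_config.yaml \
    --save_folder=runs/setup_b --module_outputs_save_steps='[1,2]'

# Compare the collected inputs/outputs module by module for a given step.
python scripts/compare_module_outputs.py runs/setup_a/traces/step1 runs/setup_b/traces/step1
```
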
When comparing different hardware or dependency setups, it is possible that model state gets corrupted before the first forward pass of training. One can check for this by running training with `--force_save_unsharded --dry_run --load_path=<original_model_path>`, which saves a checkpoint after the original model has loaded but before training starts. Then [scripts/compare_model_state.py](https://github.com/allenai/OLMo/blob/main/scripts/compare_model_state.py) can be used to check whether parameters differ between the two models.

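For example (paths are placeholders, and `compare_model_state.py` is assumed here to take the two saved checkpoint directories as arguments; check the script itself for its actual CLI):

```shell
# On each setup, save the just-loaded model state without doing any training.
torchrun --nproc_per_node=8 scripts/train.py configs/my_config.yaml \
    --load_path=<original_model_path> --force_save_unsharded --dry_run \
    --save_folder=runs/setup_a_check
# ...repeat on the other hardware/dependency setup with --save_folder=runs/setup_b_check...

# Check whether any parameters differ between the two saved states.
python scripts/compare_model_state.py runs/setup_a_check/<unsharded checkpoint> runs/setup_b_check/<unsharded checkpoint>
```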

scripts/compare_module_outputs.py (+28)
@@ -1,3 +1,31 @@
"""
Script for comparing collected outputs of OLMo submodules from 2
different training run steps (of the same or different runs).

This script is useful for identifying where model activations start to differ
within 2 forward passes that should yield identical results, which in turn makes
detecting regressions a lot quicker/easier.

This script requires that traces containing submodule outputs have been collected
during training. The traces can be saved using
`--module_outputs_save_steps=[<list of steps>]`. Be mindful that the saving takes
a lot of storage and is very slow, so collect traces sparingly. If comparing 2
training runs starting from the same checkpoint, a viable approach is to collect
the 2 steps after training resumes. The first step can be used to detect issues
in the forward pass, while discrepancies that appear only in the second step
point to the backward pass as the likely cause.

Example usage (Aug 2024):
```
python scripts/compare_module_outputs.py test_model/traces/step10 test_model_2/traces/step10
```

If this script produces no output stating diffs (without `--verbose`), then the
outputs of the 2 models are identical. If `mis-matching wte elements: ...`
shows a non-zero value, then the input data of the 2 forward passes being compared
is likely different.
"""

import logging
from argparse import ArgumentParser
from pathlib import Path
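Following the docstring's suggestion of collecting the first two steps after a resume, the comparison might look like this (run names and trace paths are placeholders):

```shell
# Step 1 differs -> the forward pass (or the input data) already diverges.
python scripts/compare_module_outputs.py run_a/traces/step1 run_b/traces/step1

# Step 1 matches but step 2 differs -> suspect the backward pass / optimizer update.
python scripts/compare_module_outputs.py run_a/traces/step2 run_b/traces/step2
```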
