
Commit 78d79a5

Merge pull request #709 from allenai/shanea/debugging-docs
Add some docs about debugging
2 parents 9147889 + f2aa76a commit 78d79a5

3 files changed: +54 -0 lines

README.md (+4)
@@ -206,6 +206,10 @@ Note: passing CLI overrides like `--reset_trainer_state` is only necessary if yo
Additional tools for evaluating OLMo models are available at the [OLMo Eval](https://github.com/allenai/ai2-olmo-eval) repo.

## Debugging

See [Debugging](https://github.com/allenai/OLMo/blob/main/docs/NOTES.md#debugging).

## Citing

```bibtex

docs/NOTES.md (+22)
@@ -102,3 +102,25 @@ outputs = model.generate(input_tensor, max_steps=3, beam_size=3)
best_generation = outputs.token_ids[0][0].tolist()
print(tokenizer.decode(best_generation))
```

## Debugging

### Finding the cause of hangs

Hangs in distributed training can have several different causes, including bad user code, AMD/Nvidia memory-allocation issues, or problems with the hardware setup. These issues can be difficult to root-cause and even harder to fix.

One approach we use to find the cause of a hang in distributed training is to first identify which processes/nodes are hanging. The [scripts/pyspy_all_processes.sh](https://github.com/allenai/OLMo/blob/main/scripts/pyspy_all_processes.sh) script retrieves the Python state of the relevant Python processes using `py-spy`. A process/node whose state differs from the others may be the one that is hanging.

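For reference, a per-node stack dump along these lines yields the same kind of information. This is a minimal sketch, not the script's actual contents; in particular, the `pgrep` pattern matching `scripts/train.py` is an assumption about how the training processes are launched.

```shell
# Dump the Python stack of every training process on this node (may require
# root or CAP_SYS_PTRACE), then compare the dumps across nodes; a node whose
# stacks look different from the rest is a likely culprit.
for pid in $(pgrep -f 'scripts/train.py'); do
    echo "=== PID ${pid} ==="
    py-spy dump --pid "${pid}"
done
```
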
If a hang is suspected to be in GPU code, you can run `gcore <pid>` on a hanging process to get a core dump, then run `gdb <corefile>` to check where the code is hanging from a C++ perspective. Code that is stuck in a GPU memory allocation (malloc) may indicate a hardware/setup issue rather than a problem in the training code.

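As a rough illustration of that workflow (the dump location and the particular `gdb` flags below are assumptions; adjust for your system):

```shell
# Take a core dump of a hanging rank; gcore attaches and detaches without killing it.
gcore -o /tmp/olmo-hang <pid>

# Print a C/C++ backtrace for every thread in the dump. Threads stuck inside
# CUDA/ROCm allocation calls hint at a hardware/setup issue rather than training code.
gdb $(which python) -c /tmp/olmo-hang.<pid> -ex "thread apply all bt" -ex "quit"
```
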
### Comparing two models that should be identical

There are scenarios where one might want to investigate why two models/setups that should be identical yield different results. A naive solution is to run both setups side by side and compare the results manually, but this may not be possible if, for example, you have just one GPU.

An alternative for comparing OLMo models is to run the training of both models with the `--module_outputs_save_steps=[<list of steps>]` config option. This causes OLMo to save a portion of the inputs & outputs of each OLMo submodule into a `traces/` folder at the model step's save location. Then [scripts/compare_module_outputs.py](https://github.com/allenai/OLMo/blob/main/scripts/compare_module_outputs.py) can be used to compare these inputs & outputs, hopefully isolating the issue to a subset of model modules. See [scripts/compare_module_outputs.py](https://github.com/allenai/OLMo/blob/main/scripts/compare_module_outputs.py) for more details on its usage.

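A sketch of that workflow, using the flags described above. The config name, save folders, and exact trace paths are placeholders/assumptions; adapt them to your runs.

```shell
# Run both setups with submodule traces collected at the same step(s).
torchrun --nproc_per_node=8 scripts/train.py configs/my_config.yaml \
    --save_folder=runs/setup_a --module_outputs_save_steps='[1,2]'
torchrun --nproc_per_node=8 scripts/train.py configs/my_config.yaml \
    --save_folder=runs/setup_b --module_outputs_save_steps='[1,2]'

# Compare the collected inputs/outputs module by module for a given step.
python scripts/compare_module_outputs.py runs/setup_a/traces/step1 runs/setup_b/traces/step1
```
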
When comparing different hardware or dependency setups, it is possible that model state gets corrupted before the first forward pass of training. One can check for this by running training with `--force_save_unsharded --dry_run --load_path=<original_model_path>`, which saves a checkpoint after the original model has loaded but before training starts. Then [scripts/compare_model_state.py](https://github.com/allenai/OLMo/blob/main/scripts/compare_model_state.py) can be used to check whether parameters differ between the two models.

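For example (paths are placeholders, and `compare_model_state.py` is assumed here to take the two saved checkpoint directories as arguments; check the script itself for its actual CLI):

```shell
# On each setup, save the just-loaded model state without doing any training.
torchrun --nproc_per_node=8 scripts/train.py configs/my_config.yaml \
    --load_path=<original_model_path> --force_save_unsharded --dry_run \
    --save_folder=runs/setup_a_check
# ...repeat on the other hardware/dependency setup with --save_folder=runs/setup_b_check...

# Check whether any parameters differ between the two saved states.
python scripts/compare_model_state.py runs/setup_a_check/<unsharded checkpoint> runs/setup_b_check/<unsharded checkpoint>
```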

scripts/compare_module_outputs.py (+28)
@@ -1,3 +1,31 @@
"""
Script for comparing collected outputs of OLMo submodules from 2
different training run steps (of the same or different runs).

This script is useful for identifying where model activations start to differ
within 2 forward passes that should yield identical results, which in turn makes
detecting regressions a lot quicker/easier.

This script requires that traces containing submodule outputs have been collected
during training. The traces can be saved using
`--module_outputs_save_steps=[<list of steps>]`. Be mindful that the saving takes
a lot of storage and is very slow, so collect traces sparingly. If comparing 2
training runs starting from the same checkpoint, a viable approach is to collect
the 2 steps after training resumes. The first step can be used to detect issues
in the forward pass, while discrepancies that appear only in the second step
point to the backward pass as the likely cause.

Example usage (Aug 2024):
```
python scripts/compare_module_outputs.py test_model/traces/step10 test_model_2/traces/step10
```

If this script produces no output stating diffs (without `--verbose`), then the
outputs of the 2 models are identical. If `mis-matching wte elements: ...`
shows a non-zero value, then the input data of the 2 forward passes being compared
is likely different.
"""

import logging
from argparse import ArgumentParser
from pathlib import Path
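Following the docstring's suggestion of collecting the first two steps after a resume, the comparison might look like this (run names and trace paths are placeholders):

```shell
# Step 1 differs -> the forward pass (or the input data) already diverges.
python scripts/compare_module_outputs.py run_a/traces/step1 run_b/traces/step1

# Step 1 matches but step 2 differs -> suspect the backward pass / optimizer update.
python scripts/compare_module_outputs.py run_a/traces/step2 run_b/traces/step2
```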
