Lambda torchtitan fork

Set up

First, create and activate a Python virtual environment so that the dependencies install in isolation.

python -m venv venv-torchtitan
source venv-torchtitan/bin/activate

Install torch from the CUDA 12.8 (cu128) wheel index:

pip install torch --index-url https://download.pytorch.org/whl/cu128

Then install torchtitan and its dependencies:

cd torchtitan
pip install -r requirements.txt
pip install torchao
pip install -e .
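
Before launching a run, it can help to confirm that the installed torch build is the CUDA one and that a GPU is visible. The helper below is a minimal sketch and not part of this repository; the name `check_torch` is hypothetical. It relies only on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```shell
# Hypothetical helper (not part of the repo): report the installed torch build.
# Prints: <torch version> <CUDA version torch was built with> <GPU visible?>
# Returns non-zero if the interpreter is missing or cannot import torch.
check_torch() {
    py="${1:-python}"
    "$py" -c 'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())'
}

# Usage, inside the activated virtual environment:
#   check_torch
```

With the cu128 wheels, the second field is expected to read 12.8. A `None` there would suggest a CPU-only build was installed instead.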

Running the Lambda fork

git checkout main
sudo nvidia-smi boost-slider --vboost 1
export PYTORCH_ALLOC_CONF=expandable_segments:True
torchrun --nproc-per-node=gpu -m torchtitan.train <config file>

Optimized config files can be found under ./configs
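
The launch steps above can be wrapped in a small function that checks the config path before starting. This is a sketch under assumptions: `run_fork` and `DRY_RUN` are hypothetical names, not part of this repository, and the config is passed to `torchtitan.train` exactly as in the command above.

```shell
# Hypothetical wrapper (not part of the repo) around the launch command above.
# Set DRY_RUN=1 to print the torchrun command instead of executing it.
run_fork() {
    config="$1"
    if [ ! -f "$config" ]; then
        echo "config file not found: $config" >&2
        return 1
    fi
    export PYTORCH_ALLOC_CONF=expandable_segments:True
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is set and non-empty.
    ${DRY_RUN:+echo} torchrun --nproc-per-node=gpu -m torchtitan.train "$config"
}

# Usage:
#   DRY_RUN=1 run_fork <config file>   # print the command only
#   run_fork <config file>             # launch training
```

The git checkout and the `nvidia-smi` boost setting are left out on purpose, since they change repository and system state and are better run explicitly.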

To run 16xB200 configurations, use run_train_c0.sh and run_train_c1.sh in the torchtitan directory instead. Start both scripts at the same time, running run_train_c0.sh on node-001 and run_train_c1.sh on node-002:

./run_train_c0.sh --config <config file you want to use for multi-node setup>   # run this on node-001
./run_train_c1.sh --config <config file you want to use for multi-node setup>   # run this on node-002

The 16xB200 config files can also be found under ./configs. For configurations larger than this, starting a script by hand on every node becomes impractical, so it is better to write a Slurm batch script that launches all nodes concurrently.
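
For larger jobs, every node has to run the same torchrun command pointed at a shared rendezvous endpoint. The sketch below only prints such a command. It is not taken from run_train_c0.sh or run_train_c1.sh, and the function name `multinode_cmd`, the port 29500, and the rendezvous id `torchtitan` are assumptions chosen for illustration. Config arguments are passed through unchanged.

```shell
# Hypothetical helper (not part of the repo): print the torchrun command
# that every node in an N-node job would run.
#   $1 = number of nodes
#   $2 = hostname of the rendezvous (head) node
#   remaining args = config arguments, passed through unchanged
multinode_cmd() {
    nnodes="$1"
    head_node="$2"
    shift 2
    echo torchrun \
        --nnodes="$nnodes" \
        --nproc-per-node=gpu \
        --rdzv-backend=c10d \
        --rdzv-endpoint="$head_node:29500" \
        --rdzv-id=torchtitan \
        -m torchtitan.train "$@"
}

# Inside a Slurm batch script, the head node is typically the first allocated host:
#   head_node="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
#   srun --ntasks-per-node=1 $(multinode_cmd "$SLURM_JOB_NUM_NODES" "$head_node" <config file>)
```

Printing the command rather than executing it makes it easy to inspect what each node would run before submitting the job.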

Running baselines

git checkout torchtitan-e7ee95a
sudo nvidia-smi boost-slider --vboost 0
export PYTORCH_ALLOC_CONF=expandable_segments:True
torchrun --nproc-per-node=gpu -m torchtitan.train <config file>

Config files for baselines can be found under: ./torchtitan/models/llama3/train_configs
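
The fork and baseline runs differ only in the git ref that is checked out and in the vboost setting. When switching between the two often, that mapping can be kept in one place. The helper name `variant_settings` is hypothetical and not part of this repository; the values are the ones given in the two sections above.

```shell
# Hypothetical helper (not part of the repo): map a variant name to the
# git ref and vboost value given in this README.
variant_settings() {
    case "$1" in
        fork)     echo "main 1" ;;
        baseline) echo "torchtitan-e7ee95a 0" ;;
        *)        echo "unknown variant: $1" >&2; return 1 ;;
    esac
}

# Usage sketch:
#   set -- $(variant_settings baseline)
#   git checkout "$1"
#   sudo nvidia-smi boost-slider --vboost "$2"
```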

