djmatusz-lambda/torchtitan

 
 

Lambda torchtitan fork

Set up

First, create and activate a Python virtual environment so that the dependencies install in isolation:

python -m venv venv-torchtitan
source venv-torchtitan/bin/activate

Then install PyTorch from the CUDA 12.8 wheel index, followed by the remaining dependencies and torchtitan itself:

pip install torch --index-url https://download.pytorch.org/whl/cu128
cd torchtitan
pip install -r requirements.txt
pip install torchao
pip install -e .
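
Before moving on, it can help to confirm that the virtual environment is active and that the installed torch build sees the GPUs. The following is a minimal sketch, not part of the repository; the function names in_venv and report_torch are hypothetical, and it assumes the venv's activate script has exported VIRTUAL_ENV as usual.

```shell
# Hypothetical helper: succeeds only when a virtual environment is active.
# The venv "activate" script exports VIRTUAL_ENV, so an empty value means
# the environment has not been sourced in this shell.
in_venv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

# Hypothetical helper: print the torch version, the CUDA version the wheel
# was built against, and whether any GPU is visible to PyTorch.
report_torch() {
    python -c 'import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())'
}
```

Calling `in_venv && report_torch` after the install steps should print a CUDA 12.8 build if the cu128 wheel index was used; the exact version string depends on what pip resolved.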

Running the Lambda fork

git checkout main
sudo nvidia-smi boost-slider --vboost 1
export PYTORCH_ALLOC_CONF=expandable_segments:True
torchrun --nproc-per-node=gpu -m torchtitan.train <config file>

Optimized config files can be found under ./configs.
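
The single-node steps above can be wrapped in a small launcher so the boost setting and allocator option are never forgotten. This is a sketch under stated assumptions, not a script shipped with the fork: the function name run_training and its arguments are hypothetical, and the config file is passed to torchtitan.train positionally, exactly as in the command above.

```shell
#!/usr/bin/env bash
# Hypothetical single-node launcher (not part of the repository).
# Usage: run_training PATH_TO_CONFIG [VBOOST]
#   VBOOST defaults to 1, the value used for the Lambda fork above.
run_training() {
    local config_file="$1"
    local vboost="${2:-1}"

    # Fail early instead of letting torchrun start with a bad path.
    if [ ! -f "$config_file" ]; then
        echo "config file not found: $config_file" >&2
        return 1
    fi

    # Same three steps as the manual instructions.
    sudo nvidia-smi boost-slider --vboost "$vboost"
    export PYTORCH_ALLOC_CONF=expandable_segments:True
    torchrun --nproc-per-node=gpu -m torchtitan.train "$config_file"
}
```

Passing 0 as the second argument reproduces the boost setting used for the baseline runs; the branch still has to be checked out separately.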

To run 16xB200 configurations, use run_train_c0.sh and run_train_c1.sh in the torchtitan directory instead. Start both scripts at the same time, run_train_c0.sh on node-001 and run_train_c1.sh on node-002:

./run_train_c0.sh --config <multi-node config file>    # run on node-001
./run_train_c1.sh --config <multi-node config file>    # run on node-002

The 16xB200 config files can also be found under ./configs. For configurations larger than this, a Slurm batch script is the better way to launch all nodes concurrently.
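
As a starting point for such a Slurm file, the sketch below launches one torchrun per node and lets the nodes find each other through a c10d rendezvous on the first host of the allocation. It is an assumption-laden illustration, not the contents of run_train_c0.sh or run_train_c1.sh: the helper names head_node and launch, the port 29500, and the resource directives are all hypothetical and need adjusting to the cluster. It also leaves out the nvidia-smi boost-slider step, which needs sudo on every node.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=torchtitan
#SBATCH --nodes=2                # 2 nodes x 8 GPUs = 16 GPUs (assumed layout)
#SBATCH --ntasks-per-node=1      # one torchrun launcher per node
#SBATCH --gpus-per-node=8

# Hypothetical helper: pick the first hostname from a newline-separated list.
head_node() {
    printf '%s\n' "$1" | head -n 1
}

# Hypothetical helper: start torchrun on every node of the allocation.
launch() {
    local config_file="$1"
    local nodes head
    nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
    head=$(head_node "$nodes")

    export PYTORCH_ALLOC_CONF=expandable_segments:True
    srun torchrun \
        --nnodes="$SLURM_NNODES" \
        --nproc-per-node=gpu \
        --rdzv-id="$SLURM_JOB_ID" \
        --rdzv-backend=c10d \
        --rdzv-endpoint="$head:29500" \
        -m torchtitan.train "$config_file"
}

# Only launch inside a Slurm allocation; the first argument is the config file.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    launch "$1"
fi
```

Saved under a name of your choosing, the script would be submitted with sbatch and the config file path as its only argument; raising --nodes scales the same launch to larger configurations.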

Running baselines

git checkout torchtitan-e7ee95a
sudo nvidia-smi boost-slider --vboost 0
export PYTORCH_ALLOC_CONF=expandable_segments:True
torchrun --nproc-per-node=gpu -m torchtitan.train <config file>

Config files for baselines can be found under: ./torchtitan/models/llama3/train_configs

About

A PyTorch native platform for training generative AI models
