Distributed implementation of the LLaMA 3.x model, optimized to support both pipeline-parallel and tensor-parallel inference using PyTorch.
torchrun-hpc -N1 -n2 --rdv tcp chat_server.py --model-dir <path/to/model>
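To illustrate how pipeline and tensor parallelism can be composed over a set of launched processes, here is a minimal sketch of one possible rank layout. The names `rank_to_coords` and `layout`, and the choice to place tensor-parallel ranks next to each other, are assumptions made for illustration. They are not taken from this repository and may differ from how it assigns ranks.

```python
from typing import List, Tuple


def rank_to_coords(rank: int, world_size: int, tp_size: int) -> Tuple[int, int]:
    """Map a flat process rank to (pipeline_stage, tensor_rank).

    Assumed layout (hypothetical): tensor-parallel ranks are contiguous,
    so ranks 0..tp_size-1 form pipeline stage 0, and so on.
    """
    if tp_size <= 0 or world_size % tp_size != 0:
        raise ValueError("world_size must be a positive multiple of tp_size")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # divmod returns (rank // tp_size, rank % tp_size)
    return divmod(rank, tp_size)


def layout(world_size: int, tp_size: int) -> List[Tuple[int, int]]:
    """Return the (pipeline_stage, tensor_rank) pair for every rank."""
    return [rank_to_coords(r, world_size, tp_size) for r in range(world_size)]


if __name__ == "__main__":
    # Two processes, pure tensor parallelism: one stage, two shards.
    print(layout(world_size=2, tp_size=2))
    # Two processes, pure pipeline parallelism: two stages, one shard each.
    print(layout(world_size=2, tp_size=1))
```

With two processes, the same world size can be used either as a single pipeline stage split across two tensor shards or as two pipeline stages with no tensor splitting. Larger world sizes allow both at once.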
The Livermore Big Artificial Neural Network toolkit (LBANN) is an open-source, HPC-centric, deep learning training framework that is optimized to compose multiple levels of parallelism.
LBANN provides model-parallel acceleration through domain decomposition to optimize for strong scaling of network training. It also allows model parallelism to be composed with both data parallelism and ensemble training methods, for training large neural networks with massive amounts of data. LBANN is able to take advantage of tightly-coupled accelerators, low-latency high-bandwidth networking, and high-bandwidth parallel file systems.
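As a rough illustration of the idea behind domain decomposition, the sketch below splits one dimension of a sample into contiguous, near-equal chunks, one per process. This is a generic partitioning scheme written in plain Python. The function name `split_domain` is hypothetical and it is not part of the LBANN API, which may partition data differently.

```python
from typing import List, Tuple


def split_domain(extent: int, parts: int) -> List[Tuple[int, int]]:
    """Split the index range [0, extent) into `parts` contiguous chunks.

    Chunk sizes differ by at most one; any remainder goes to the
    lowest-numbered chunks. Returns (start, stop) pairs, stop exclusive.
    """
    if parts <= 0:
        raise ValueError("parts must be positive")
    if extent < 0:
        raise ValueError("extent must be non-negative")
    base, extra = divmod(extent, parts)
    ranges = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # A dimension of length 10 shared by 3 processes.
    print(split_domain(10, 3))
```

Because each process holds only its own slice of every sample, adding processes reduces the per-process work for a fixed problem size, which is the strong-scaling case described above.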
A list of publications, presentations, and posters is shown here.
Issues, questions, and bugs can be raised on the GitHub issue tracker.