
llama2.jl

Inference Llama 2 in one file of pure C. Nahh wait, now fresh in Julia!

[Cute Llama image]

Tired of low-level languages? Ever wanted to infer a baby Llama 2 model in pure Julia? Great news – you can now do so in under 300 lines of Julia.

This is a fork of Andrej's llama2.c, ported to (for now) a slightly hacky version of Julia. This README is heavily inspired by the Rust port llama.rs.

Don't want to read? Got ya back!

    git clone https://github.com/juvi21/llama2.jl && cd llama2.jl && wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin && julia jl_helpers/install_pkg.jl && julia run.jl stories15M.bin tokenizer.bin

How to run?

  1. Grab Andrej's baby Llama2 (see the original instructions) pretrained on the TinyStories dataset:

    wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories15M.bin
    wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories42M.bin
    wget https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.bin
  2. Ensure you have the tokenizer binary - tokenizer.bin (if not, see tokenizer.py).

  3. Run run.jl:

    Single-threaded:

    julia run.jl <model> <tokenizer> --temp [temperature]

    Multi-threaded: In Progress
    CUDA: In Progress
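For example, to sample a story from the 15M model (the temperature value here is just an illustration):

    julia run.jl stories15M.bin tokenizer.bin --temp 0.8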

Performance

On my current workstation, performance is quite fast. However, I have been away visiting my parents for a few days, so I have only had the chance to test it on one of my very first and least powerful workstations. More benchmarks are coming soon! NOTE: I compiled llama2.c with the command provided in Andrej's README, which is only the basic one to get started and not very optimized:

    gcc -O3 -o run run.c -lm
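The llama2.c -Ofast column in the table below was presumably built the same way with -Ofast in place of -O3:

    gcc -Ofast -o run run.c -lm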
    system                        model            llama2.c (-O3)    llama2.c (-Ofast)   llama2.jl
    Ubuntu 22.04, AMD Ryzen 2600  stories15M.bin   85.418752 tok/s   189.591078 tok/s    257.445516 tok/s
    Ubuntu 22.04, AMD Ryzen 2600  stories42M.bin   30.761836 tok/s   78.485688 tok/s     92.567484 tok/s
    Ubuntu 22.04, AMD Ryzen 2600  stories110M.bin  11.585283 tok/s   30.375223 tok/s     38.543434 tok/s
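The tok/s numbers are simply tokens generated divided by wall-clock seconds. If you want to reproduce them, here is a minimal sketch of the measurement in Julia (generate is a hypothetical stand-in for the sampling loop in run.jl, not its actual API):

    # Hypothetical benchmark: `generate` stands in for run.jl's sampling loop.
    n_tokens = 256
    t0 = time()
    generate(model, tokenizer, n_tokens)  # assumed to produce n_tokens tokens
    elapsed = time() - t0
    println(n_tokens / elapsed, " tok/s")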

Contributions

Join the dark side and code in Julia. Contributions are highly encouraged!

Contribution Ideas:

  • Make it faster.
  • Add CUDA support.
  • Introduce multi-threaded support (a starting-point sketch follows this list).
  • Custom prompts.
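For the multi-threading idea, the natural first step in Julia is to parallelize the matrix-vector product that dominates inference time with Threads.@threads. A minimal sketch, assuming a dense out = W * x kernel (the matmul! name and signature are illustrative, not the actual code in run.jl):

    # Illustrative row-parallel matmul: computes out = W * x.
    # Start Julia with several threads, e.g. `julia -t auto run.jl ...`.
    function matmul!(out::Vector{Float32}, W::Matrix{Float32}, x::Vector{Float32})
        Threads.@threads for i in 1:size(W, 1)
            acc = 0.0f0
            @inbounds for j in 1:size(W, 2)
                acc += W[i, j] * x[j]
            end
            out[i] = acc
        end
        return out
    end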

Art

@Midjourney
