This repository has been archived by the owner on Jun 24, 2024. It is now read-only.
Structural Overhaul #162 (Merged)
Commits (changes shown from 33 of the 40 commits):
- d84aa7f Create a Model trait (danforbes)
- e0713a1 Bloom model (danforbes)
- 6bfda75 cargo fmt (danforbes)
- 73f59c3 Rename llama-rs to llm-base (danforbes)
- e670c25 Clippy (danforbes)
- c4b4176 Remove redundant associated Model type from Model trait (danforbes)
- 1cf305f Remove associated Layer type from Model trait (danforbes)
- 0d4dde9 cargo fmt (danforbes)
- 849c28d Docs (danforbes)
- 54ad890 Tests and examples (danforbes)
- 4ba7c1c Layers are private (danforbes)
- dcf85ff Merge branch 'main' of github.com:rustformers/llama-rs into dfo/model… (philpax)
- 43ecac1 Merge branch 'main' into dfo/model/bloom (philpax)
- 440bd69 Fix build (philpax)
- 5658484 refactor: introduce llm(-cli) (philpax)
- bcf5627 Fix model name in LLaMA inference example (danforbes)
- 5ac4b79 feat: wire up both bloom/llama to CLI (philpax)
- 1601240 Merge branch 'dfo/model/bloom' of github.com:danforbes/llama-rs into … (philpax)
- 1761512 Add example for testing BLOOM inference (danforbes)
- 8d2d9c6 cargo fmt (danforbes)
- 813bdd1 Add launch.json for debugging loading and inference (danforbes)
- c608b4b Merge branch 'main' into dfo/model/bloom (danforbes)
- e19418c Check tensor dimensions when loading (danforbes)
- e35f93b `Model` -> `KnownModel`, `ErasedModel -> Model` (danforbes)
- 288df7f Merge branch 'main' into dfo/model/bloom (danforbes)
- 0aea8f7 Refactor ggml stuff into a single crate (danforbes)
- 8594ac8 Use latest upstream ggml with alibi (danforbes)
- a542c98 Improve examples (danforbes)
- 16fca15 Latest upstream ggml (danforbes)
- 974d2f7 Cleanup README (danforbes)
- 1abaa41 Rebase fix (danforbes)
- f994fa8 GPT2/Cerebras loading and inference (danforbes)
- ff99a80 Rebase & remove BLOOM (danforbes)
- 454f3a9 GitHub Action should support Git submodules (danforbes)
- e69d487 Fix binary file name in README (danforbes)
- 608090b ggml-rs -> ggml (danforbes)
- 78db42c Add back BLOOM (danforbes)
- 1eb2e11 feat: re-enable BLOOM for now (philpax)
- 181d823 refactor: reintroduce ggml-sys and bindgen tool (philpax)
- 9314c68 fix: check out submodules for clippy CI (philpax)
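Much of the work above revolves around the model abstraction: commits c4b4176 and 1cf305f slim down the original `Model` trait, and e35f93b splits it into a per-architecture trait (`KnownModel`) plus a type-erased `Model` that callers can use without knowing which architecture they hold. The Rust sketch below shows the general shape of such a split; every name and signature in it is an illustrative guess, not the crate's actual API.

```rust
/// Hypothetical per-architecture trait, in the spirit of `KnownModel`.
/// Each concrete model family (LLaMA, BLOOM, GPT-2, ...) would implement it.
trait KnownModel {
    type Session;

    /// Begin a new inference session against this model.
    fn start_session(&self) -> Self::Session;

    /// Human-readable architecture name.
    fn architecture(&self) -> &'static str;
}

/// Hypothetical object-safe trait, in the spirit of the erased `Model`:
/// no associated types, so it can live behind `dyn`.
trait Model {
    fn architecture(&self) -> &'static str;
}

/// Every known model is automatically usable through the erased interface.
impl<T: KnownModel> Model for T {
    fn architecture(&self) -> &'static str {
        KnownModel::architecture(self)
    }
}

/// A stand-in architecture to show the pattern end to end.
struct Llama;

impl KnownModel for Llama {
    type Session = ();

    fn start_session(&self) -> Self::Session {}

    fn architecture(&self) -> &'static str {
        "llama"
    }
}

fn main() {
    // A CLI like `llm-cli` can hold any architecture behind `dyn Model`.
    let model: Box<dyn Model> = Box::new(Llama);
    println!("loaded a {} model", model.architecture());
}
```

The point of such a split is that per-architecture details (layers, sessions) stay statically typed, while the CLI still gets runtime dispatch over whichever model file it was handed.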
`.gitmodules`:
```diff
@@ -0,0 +1,3 @@
+[submodule "ggml-rs/ggml"]
+	path = ggml-rs/ggml
+	url = [email protected]:ggerganov/ggml.git
```
`launch.json` (VS Code debug configurations):
```diff
@@ -0,0 +1,44 @@
+{
+    // Use IntelliSense to learn about possible attributes.
+    // Hover to view descriptions of existing attributes.
+    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
+    "version": "0.2.0",
+    "configurations": [
+        {
+            "type": "lldb",
+            "request": "launch",
+            "name": "Debug example 'gpt2_inference'",
+            "cargo": {
+                "args": [
+                    "build",
+                    "--example=gpt2_inference",
+                    "--package=gpt2"
+                ],
+                "filter": {
+                    "name": "gpt2_inference",
+                    "kind": "example"
+                }
+            },
+            "args": ["${env:HOME}/.ggml-models/cerebras-gpt-13b.bin"],
+            "cwd": "${workspaceFolder}"
+        },
+        {
+            "type": "lldb",
+            "request": "launch",
+            "name": "Debug example 'llama_inference'",
+            "cargo": {
+                "args": [
+                    "build",
+                    "--example=llama_inference",
+                    "--package=llama"
+                ],
+                "filter": {
+                    "name": "llama_inference",
+                    "kind": "example"
+                }
+            },
+            "args": ["${env:HOME}/.ggml-models/gpt4all-7b.bin"],
+            "cwd": "${workspaceFolder}"
+        }
+    ]
+}
```
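Each configuration mirrors a plain Cargo invocation: the `cargo.args` array drives `cargo build --example=... --package=...`, and the outer `args` array supplies the model path as the program argument. The same run can therefore be reproduced from a terminal with, e.g., `cargo run --example gpt2_inference -p gpt2 -- ~/.ggml-models/cerebras-gpt-13b.bin` (the paths under `~/.ggml-models/` are simply wherever the developer keeps local model files).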
`Cargo.toml` (workspace manifest):
```diff
@@ -1,16 +1,20 @@
 [workspace]
 members = [
-    "ggml-sys",
-    "ggml",
-    "ggml-format",
-    "llama-rs",
-    "llama-cli",
-    "generate-ggml-bindings"
+    # Crates
+    "ggml-rs",
+    "llm-base",
+    "gpt2",
+    "llama",
+    "llm",
+    "llm-cli",
 ]
 resolver = "2"
 
 [workspace.package]
 version = "0.1.0"
 
 [workspace.dependencies]
+bytemuck = "1.13.1"
+log = "0.4"
 rand = "0.8.5"
+serde = { version = "1.0", features = ["derive"] }
```
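The `[workspace.dependencies]` table lets member crates inherit shared versions with entries like `serde = { workspace = true }` in their own manifests (a Cargo 1.64+ feature), keeping `llm-base`, `gpt2`, and `llama` pinned to the same dependency releases.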
`README.md`:
```diff
@@ -1,39 +1,31 @@
 # LLaMA-rs
-
-> Do the LLaMA thing, but now in Rust 🦀🚀🦙
-
-![A llama riding a crab, AI-generated](./doc/resources/logo2.png)
-
-> _Image by [@darthdeus](https://github.com/darthdeus/), using Stable Diffusion_
+<!-- markdownlint-disable-file MD026 -->
+This project is a Rust port of
+[llama.cpp](https://github.com/ggerganov/llama.cpp) 🦙🦀🚀
+
+Just like its C++ counterpart, it is powered by the
+[`ggml`](https://github.com/ggerganov/ggml) tensor library, which allows running
+inference for Facebook's [LLaMA](https://github.com/facebookresearch/llama)
+model on a CPU with good performance using full precision, f16 or 4-bit
+quantized versions of the model.
 
 [![ko-fi](https://ko-fi.com/img/githubbutton_sm.svg)](https://ko-fi.com/F1F8DNO5D)
 [![Latest version](https://img.shields.io/crates/v/llama-rs.svg)](https://crates.io/crates/llama_rs)
 ![MIT/Apache2](https://shields.io/badge/license-MIT%2FApache--2.0-blue)
 [![Discord](https://img.shields.io/discord/1085885067601137734)](https://discord.gg/YB9WaXYAWU)
 
-![Gif showcasing language generation using llama-rs](./doc/resources/llama_gif.gif)
-
-**LLaMA-rs** is a Rust port of the
-[llama.cpp](https://github.com/ggerganov/llama.cpp) project. This allows running
-inference for Facebook's [LLaMA](https://github.com/facebookresearch/llama)
-model on a CPU with good performance using full precision, f16 or 4-bit
-quantized versions of the model.
-
-Just like its C++ counterpart, it is powered by the
-[`ggml`](https://github.com/ggerganov/ggml) tensor library, achieving the same
-performance as the original code.
+![A llama riding a crab, AI-generated](./doc/resources/logo2.png)
+
+> _Image by [@darthdeus](https://github.com/darthdeus/), using Stable Diffusion_
 
 ## Getting started
 
 Make sure you have Rust 1.65.0 or above and a C toolchain[^1] set up.
 
-`llama-rs` is a Rust library, while `llama-cli` is a CLI application that wraps
-`llama-rs` and offers basic inference capabilities.
+`llm-base`, `gpt2`, and `llama` are Rust libraries, while `llm-cli` is a CLI
+application that wraps `gpt2` and `llama` and offers basic inference
+capabilities.
 
-The following instructions explain how to build `llama-cli`.
+The following instructions explain how to build the CLI application.
 
 **NOTE**: For best results, make sure to build and run in release mode.
 Debug builds are going to be very slow.
```
````diff
@@ -43,41 +35,45 @@ Debug builds are going to be very slow.
 Run
 
 ```shell
-cargo install --git https://github.com/rustformers/llama-rs llama-cli
+cargo install --git https://github.com/rustformers/llama-rs llm-cli
 ```
 
-to install `llama-cli` to your Cargo `bin` directory, which `rustup` is likely to
+to install `llm-cli` to your Cargo `bin` directory, which `rustup` is likely to
 have added to your `PATH`.
 
-It can then be run through `llama-cli`.
+The CLI application can then be run through `llm-cli`.
+
+![Gif showcasing language generation using llama-rs](./doc/resources/llama_gif.gif)
 
 ### Building from repository
 
-Clone the repository, and then build it through
+Clone the repository and then build it with
 
 ```shell
-cargo build --release --bin llama-cli
+git clone --recurse-submodules [email protected]:rustformers/llama-rs.git
+cargo build --release
 ```
 
-The resulting binary will be at `target/release/llama-cli[.exe]`.
+The resulting binary will be at `target/release/llm-cli[.exe]`.
 
 It can also be run directly through Cargo, using
 
 ```shell
-cargo run --release --bin llama-cli -- <ARGS>
+cargo run --release --bin llm-cli -- <ARGS>
 ```
 
 This is useful for development.
 
-### Getting the weights
+### Getting LLaMA weights
 
 In order to run the inference code in `llama-rs`, a copy of the model's weights
 is required.
 
 #### From Hugging Face
 
 Compatible weights - not necessarily the original LLaMA weights - can be found
-on [Hugging Face by searching for GGML](https://huggingface.co/models?search=ggml). At present, LLaMA-architecture models are supported.
+on [Hugging Face by searching for GGML](https://huggingface.co/models?search=ggml).
+At present, LLaMA-architecture models are supported.
 
 #### LLaMA original weights
````
```diff
@@ -107,6 +103,13 @@ cargo run -p llama-cli quantize /path/to/your/models/7B/ggml-model-f16.bin /path
 > The [llama.cpp repository](https://github.com/ggerganov/llama.cpp) has
 > additional information on how to obtain and run specific models.
 
+### GPT2
+
+OpenAI's [GPT-2](https://jalammar.github.io/illustrated-gpt2/) architecture is
+also supported. The open-source family of
+[Cerebras](https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/)
+models is built on this architecture.
+
 _Support for other open source models is currently planned. For models where
 weights can be legally distributed, this section will be updated with scripts to
 make the install process as user-friendly as possible. Due to the model's legal
```
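The new GPT2 section corresponds to commit f994fa8 and the new `gpt2` workspace crate; presumably it plugs into the same `KnownModel`-style abstraction sketched earlier, which is exactly what this restructuring is meant to enable: each architecture lives in its own crate, and `llm`/`llm-cli` dispatch over them.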
```diff
@@ -133,9 +136,9 @@
 
 ![Gif showcasing alpaca repl mode](./doc/resources/alpaca_repl_screencap.gif)
 
-- Sessions can be loaded (`--load-session`) or saved (`--save-session`) to file. To automatically load
-  and save the same session, use `--persist-session`. This can be used to cache prompts to reduce load
-  time, too:
+- Sessions can be loaded (`--load-session`) or saved (`--save-session`) to file.
+  To automatically load and save the same session, use `--persist-session`.
+  This can be used to cache prompts to reduce load time, too:
 
 ![Gif showcasing prompt caching](./doc/resources/prompt_caching_screencap.gif)
```
A review comment on the PR:

> I'm thinking we'll remove this wording as we grow to accommodate more LLMs. I'll revise the wording on this after this PR lands, so nothing for you to do here - just mentioning it.