SoTA Transformers with C-backend for fast inference on your CPU.
We identify three pillars that enable fast inference of SoTA AI models on your CPU:
- Fast C/C++ LLM inference kernels for CPU.
- Machine Learning Research & Exploration front - Compression through quantization, sparsification, training on more data, collecting data and training instruction & chat models.
- Easy-to-use API for fast AI inference in a dynamically typed language like Python.
This project aims to address the third using LLaMa.cpp and GGML.
- Inference Speed! Focus on inference, not training.
- Precompressed models.
- Minimal setup required. Soon, pip install cformers should be enough to get started.
- Easily switch between models and quantization types.
- Support a variety of prompts.
And most importantly:
- You, the users, get to decide which direction we take this project.
Setup
pip install transformers wget
git clone https://github.com/nolanoOrg/cformers.git
cd cformers/cformers/cpp && make && cd ..
Usage:
from interface import AutoInference as AI
ai = AI('EleutherAI/gpt-j-6B')
x = ai.generate('def parse_html(html_doc):', num_tokens_to_generate=500)
print(x['token_str'])
OR
from interface import AutoInference as AI
ai = AI('OpenAssistant/oasst-sft-1-pythia-12b')
x = ai.generate("<|prompter|>What's the Earth total population<|endoftext|><|assistant|>", num_tokens_to_generate=100)
print(x['token_str'])
OR
python chat.py
chat.py accepts the following parameters:
- -t 100: number of tokens to generate
- -p Tell me a joke: for a single prompt interaction
- -m pythia: to load one of the available models (bloom, pythia or gptj)
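The three flags described above could be parsed along the following lines. This is only a minimal sketch using argparse: the destination names (num_tokens, prompt, model) and the default values are assumptions made for illustration, and the real chat.py may define its arguments differently.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction of the -t, -p and -m flags;
    # names and defaults are assumptions, not taken from chat.py.
    parser = argparse.ArgumentParser(description="Chat with a quantized model on CPU")
    parser.add_argument("-t", dest="num_tokens", type=int, default=100,
                        help="number of tokens to generate")
    parser.add_argument("-p", dest="prompt", type=str, default=None,
                        help="prompt for a single, non-interactive exchange")
    parser.add_argument("-m", dest="model", type=str, default="pythia",
                        choices=["bloom", "pythia", "gptj"],
                        help="which of the available models to load")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-t", "100", "-p", "Tell me a joke", "-m", "pythia"])
print(args.num_tokens, args.prompt, args.model)
```

On a real command line the prompt would normally need quoting so that the shell passes it as a single argument.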
We are working on adding support for pip install cformers.
The following architectures are supported:
- GPT-J
- BLOOM
- GPT-NeoX/Pythia/Open-Assistant/Open-Chat-Kit
- CodeGen
Currently, the following Hugging Face models are supported:
- EleutherAI/gpt-j-6B
- bigscience/bloom-560m
- bigscience/bloom-1b1
- bigscience/bloom-1b7
- bigscience/bloom-3b
- BigScience/bloom-7b1
- OpenAssistant/oasst-sft-1-pythia-12b
- togethercomputer/GPT-NeoXT-Chat-Base-20B (thanks to @HCBlackFox)
- Salesforce/codegen-350M-mono
- Salesforce/codegen-2B-mono
- Salesforce/codegen-6B-mono
- Salesforce/codegen-16B-mono
- gpt2 (thanks to @kamalojasv181)
We still need to quantize the remaining models based on the supported architectures and upload them to Hugging Face. We would appreciate your help in this regard.
Features:
- Switch between models
- Chat-mode (interactive mode)
- Various tools to support Prompt-engineering, chaining, saving and sharing.
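As a small illustration of the kind of prompt tooling meant here, the Open-Assistant prompt format from the usage example can be wrapped in a helper. This is a sketch only: format_oasst_prompt is a hypothetical name that is not part of the interface, and the layout used for earlier turns (each assistant reply closed with the end-of-text token) is an assumption extrapolated from the single-turn example.

```python
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
END = "<|endoftext|>"


def format_oasst_prompt(user_message, history=()):
    """Build an Open-Assistant style prompt string.

    history is a sequence of (user, assistant) pairs from earlier turns.
    The multi-turn layout is an assumption, not confirmed by the project.
    """
    parts = []
    for past_user, past_assistant in history:
        parts.append(f"{PROMPTER}{past_user}{END}{ASSISTANT}{past_assistant}{END}")
    # The final turn is left open so the model continues as the assistant.
    parts.append(f"{PROMPTER}{user_message}{END}{ASSISTANT}")
    return "".join(parts)


prompt = format_oasst_prompt("What's the Earth total population")
print(prompt)
```

The resulting string can be passed as the first argument to ai.generate, exactly like the hand-written prompt in the usage example.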
Code-base restructuring:
- Switch to Pybind11 rather than Subprocess - expected speedup: 3-4x
- Restructure the codebase for reuse.
- Add llama.cpp as a git submodule or dependency.
For now, we are focussing on AutoRegressive-style generative models.
- GPT-J
- BLOOM
- GPT-NeoX/Pythia/Open-Assistant/Open-Chat-Kit (Architecture supported, need to quantize and upload models.)
- CodeGen
- LLaMa & Alpaca
- OPT & Galactica
- T5
- RWKV
- GPT-2
- And more (including multimodal)...
- Int4 with fixed zero-offset
- Int4 with variable zero-offset
- GPTQ-Int4 with fixed zero-offset
- GPTQ-Int4 with variable zero-offset
- Int3 quantization, proxy quantization and binning.
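To make the difference between fixed and variable zero-offset concrete, here is a minimal pure-Python sketch of 4-bit quantization of one group of weights. It reflects one common reading of these terms (a symmetric scheme with the zero point pinned at code 8 versus a scheme that stores the group minimum as an offset), similar in spirit to GGML's 4-bit formats. The function names are hypothetical and the project's exact scheme may differ.

```python
def quantize_fixed_offset(values):
    """4-bit codes in [0, 15] with the zero point fixed at code 8."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [min(15, max(0, round(v / scale) + 8)) for v in values]
    return codes, scale


def dequantize_fixed_offset(codes, scale):
    return [(c - 8) * scale for c in codes]


def quantize_variable_offset(values):
    """4-bit codes in [0, 15] with a per-group offset (the group minimum)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_variable_offset(codes, scale, offset):
    return [c * scale + offset for c in codes]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# One group of 32 weights that are all positive and tightly clustered.
group = [0.5 + 0.01 * i for i in range(32)]

fixed_codes, fixed_scale = quantize_fixed_offset(group)
var_codes, var_scale, var_offset = quantize_variable_offset(group)

err_fixed = max_abs_error(group, dequantize_fixed_offset(fixed_codes, fixed_scale))
err_var = max_abs_error(group, dequantize_variable_offset(var_codes, var_scale, var_offset))
print("fixed offset error:", err_fixed)
print("variable offset error:", err_var)
```

For a group like this one, whose values sit far from zero, the variable offset spends all 16 codes on the range that is actually used, at the cost of storing one extra number per group.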
We encourage contributions from the community.
- Let us know what features you want, what models you want to use.
- Reporting bugs, raising issues and sending Pull Requests.
The following are some easy first issues through which you can help improve CTransformers:
- Pick an existing HF model, quantize it, upload to HF and add it to the mapping in ctransformers/map_model_to_url.py
- Add support for new models.
- Add support for new quantization types.
- Try out GPTQ on these models and upload the resulting models to HF.
- Benchmark the quantized models. #2
- Can we merge Query and Key Matrices for GPT-J/LLaMa? #3
- Explore CALM (Confident Adaptive Language Modelling) with 4-bit precision models #4
- Saving Keys and Values in memory at lower precision (refer to FlexGen) #6
- Try out other quantization techniques like proxy quantization, etc.
- Explore SparseGPT #5
- Explore Quantization of Multimodal Models
If you are allergic to Python, you can:
- Port support for fast loading here: ggerganov/llama.cpp#91 (comment)
You can also contribute to LLaMa.cpp and we will port those niceties here.
- Add support for int4 quantized weights with a bin/group size greater than 32 in GGML/LLaMa.cpp (potential pitfalls: the intermediate representation may not be losslessly groupable to a bin size >32, only the weight matrix may be groupable to a bin size >32, etc.)
- Speed up quantized matrix multiplication in GGML/LLaMa.cpp
- Add Int3 and Int2 quantization support to GGML/LLaMa.cpp
- Add fast Ampere-sparse quantized matrix multiplication functions in GGML/LLaMa.cpp
Our interface is still limited to generation. We are working to support other features:
- Allow stopping-generation midway
- Anti-prompt for conversation models
- Returning embeddings and/or logits.
- [Dev] Switch to pybindings over the C++ kernels and calling them from Python.
- [Dev] Re-use the code in main.cpp
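The stopping and anti-prompt items above could work roughly as follows. This is a sketch under stated assumptions: generate_until is a hypothetical helper, the interface does not currently expose a token stream like this, and the anti-prompt string is borrowed from the Open-Assistant prompt format purely for illustration.

```python
def generate_until(token_stream, anti_prompts, max_tokens=None):
    """Accumulate text pieces until an anti-prompt appears or the limit is hit.

    Returns (text, matched_anti_prompt); the second item is None when
    generation ended without hitting any anti-prompt.
    """
    text = ""
    for count, piece in enumerate(token_stream):
        if max_tokens is not None and count >= max_tokens:
            break
        text += piece
        # Search the accumulated text, since an anti-prompt can be
        # split across several generated pieces.
        for stop in anti_prompts:
            index = text.find(stop)
            if index != -1:
                return text[:index], stop
    return text, None


# Stand-in for a model's output stream; the anti-prompt arrives in two pieces.
pieces = ["Sure", ", here", " you go.", "<|prom", "pter|>", "ignored"]
reply, matched = generate_until(iter(pieces), ["<|prompter|>"])
print(repr(reply), matched)
```

Checking the accumulated text rather than each piece on its own is the main design point, because a tokenizer is free to split the anti-prompt across token boundaries.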
We would love to hear from you about ways in which we can speed up and improve the interface.
MIT License
Discord: https://discord.gg/HGujTPQtR6