The `lemonade` SDK provides everything needed to get up and running quickly with LLMs on OnnxRuntime GenAI (OGA):

- Quick installation from PyPI.
- CLI with tools for prompting, benchmarking, and accuracy tests.
- REST API with OpenAI compatibility.
- Python API based on `from_pretrained()` for easy integration with Python apps.
You can quickly get started with `lemonade` by installing the `turnkeyml` PyPI package with the appropriate extras for your backend, or you can install from source by cloning and installing this repository.
To install `lemonade` from PyPI:

- Create and activate a miniconda environment:

  ```bash
  conda create -n lemon python=3.10
  conda activate lemon
  ```
- Install `lemonade` for your backend of choice:

  - OnnxRuntime GenAI with CPU backend:

    ```bash
    pip install turnkeyml[llm-oga-cpu]
    ```

  - OnnxRuntime GenAI with Integrated GPU (iGPU, DirectML) backend:

    Note: Requires Windows and a DirectML-compatible iGPU.

    ```bash
    pip install turnkeyml[llm-oga-igpu]
    ```

  - OnnxRuntime GenAI with Ryzen AI Hybrid (NPU + iGPU) backend:

    Note: Ryzen AI Hybrid requires a Windows 11 PC with an AMD Ryzen™ AI 9 HX375, Ryzen AI 9 HX370, or Ryzen AI 9 365 processor.

    - Install the Ryzen AI driver >= 32.0.203.237 (you can check your driver version under Device Manager > Neural Processors).
    - Visit the AMD Hugging Face page for supported checkpoints.

    ```bash
    pip install turnkeyml[llm-oga-hybrid]
    lemonade-install --ryzenai hybrid
    ```

  - Hugging Face (PyTorch) LLMs for CPU backend:

    ```bash
    pip install turnkeyml[llm]
    ```

  - llama.cpp: see instructions.
- Use `lemonade -h` to explore the LLM tools, and see the command and API examples below.
To install `lemonade` from source code:

- Clone the repository and enter the repo root:

  ```bash
  git clone https://github.com/onnx/turnkeyml.git
  cd turnkeyml
  ```

  Note: be sure to run the installation commands below from the repo root of your clone.

- Follow the same instructions as in the PyPI installation, except replace `turnkeyml` with `-e .` to get an editable install from your clone. For example:

  ```bash
  pip install -e .[llm-oga-igpu]
  ```
The Lemonade Server is available as a standalone tool with a one-click Windows installer (`.exe`). Check out the `Lemonade_Server_Installer.exe` guide for installation instructions and the server spec to learn more about the functionality.
The `lemonade` CLI uses a unique command syntax that enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options.

Each unit of functionality (e.g., loading a model, running a test, deploying a server, etc.) is called a `Tool`, and a single call to `lemonade` can invoke any number of `Tools`. Each `Tool` will perform its functionality, then pass its state to the next `Tool` in the command.
You can read each command out loud to understand what it is doing. For example, a command like this:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
```

can be read like this:
Run `lemonade` on the input (`-i`) checkpoint `microsoft/Phi-3-mini-4k-instruct`. First, load it in the OnnxRuntime GenAI framework (`oga-load`), onto the integrated GPU device (`--device igpu`), in the int4 data type (`--dtype int4`). Then, pass the OGA model to the prompting tool (`llm-prompt`) with the prompt (`-p`) "Hello, my thoughts are" and print the response.
The `lemonade -h` command will show you which options and Tools are available, and `lemonade TOOL -h` will tell you more about that specific Tool.
To prompt your LLM, try:

OGA iGPU:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
```

Hugging Face:

```bash
lemonade -i facebook/opt-125m huggingface-load llm-prompt -p "Hello, my thoughts are"
```

The LLM will run with your provided prompt, and the LLM's response will be printed to the screen. You can replace `"Hello, my thoughts are"` with any prompt you like.
You can also replace the `facebook/opt-125m` with any Hugging Face checkpoint you like, including LLaMA-2, Phi-2, Qwen, Mamba, etc.
You can also set the `--device` argument in `oga-load` and `huggingface-load` to load your LLM on a different device.
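For example, if you installed the CPU backend (the `llm-oga-cpu` extra), something like the following should load the same checkpoint on CPU instead. The `--device cpu` value is an assumption here; run `lemonade oga-load -h` for the list of devices your install actually supports:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device cpu --dtype int4 llm-prompt -p "Hello, my thoughts are"
```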
Run `lemonade huggingface-load -h` and `lemonade llm-prompt -h` to learn more about those tools.
To measure the accuracy of an LLM using MMLU, try this:

OGA iGPU:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 accuracy-mmlu --tests management
```

Hugging Face:

```bash
lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management
```

That command will run just the management test from MMLU on your LLM and save the score to the lemonade cache at `~/.cache/lemonade`.
You can run the full suite of MMLU subjects by omitting the `--tests` argument. You can learn more about this with `lemonade accuracy-mmlu -h`.
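For example, assuming `--tests` accepts multiple subject names (which `lemonade accuracy-mmlu -h` can confirm), a command along these lines would score several subjects in one pass:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 accuracy-mmlu --tests management marketing
```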
To measure the time-to-first-token and tokens/second of an LLM, try this:

OGA iGPU:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench
```

Hugging Face:

```bash
lemonade -i facebook/opt-125m huggingface-load huggingface-bench
```

That command will run a few warmup iterations, then a few generation iterations where performance data is collected.
The prompt size, number of output tokens, and number of iterations are all parameters. Learn more by running `lemonade oga-bench -h` or `lemonade huggingface-bench -h`.
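As a sketch, overriding those parameters might look like the following; the flag names here are assumptions, so confirm them against `lemonade oga-bench -h` before relying on them:

```bash
lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench --iterations 10 --output-tokens 64
```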
To see a report that contains all the benchmarking results and all the accuracy results, use the `report` tool with the `--perf` flag:

```bash
lemonade report --perf
```

The results can be filtered by model name, device type, and data type. See how by running `lemonade report -h`.
The peak memory used by the `lemonade` build is captured in the build output. To capture more granular memory usage information, use the `--memory` flag. For example:

OGA iGPU:

```bash
lemonade --memory -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench
```

Hugging Face:

```bash
lemonade --memory -i facebook/opt-125m huggingface-load huggingface-bench
```

In this case, a `memory_usage.png` file will be generated and stored in the build folder. This file contains a figure plotting the memory usage over the build time. Learn more by running `lemonade -h`.
You can launch an OpenAI-compatible server with:

```bash
lemonade serve
```

Visit the server spec to learn more about the endpoints provided.
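Because the server is OpenAI-compatible, any OpenAI-style client should be able to talk to it. Below is a minimal sketch in Python; the base URL, port, and model name are assumptions, so check the server spec for the actual values:

```python
import requests

# Assumed default address and route; consult the server spec for the real ones.
URL = "http://localhost:8000/api/v0/chat/completions"

payload = {
    "model": "microsoft/Phi-3-mini-4k-instruct",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello, my thoughts are"}],
}

response = requests.post(URL, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```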
Lemonade is also available via API.

The high-level lemonade API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid) using the popular `from_pretrained()` function. This makes it easy to integrate lemonade LLMs into Python applications.
OGA iGPU:

```python
from lemonade.api import from_pretrained

model, tokenizer = from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", recipe="oga-igpu")

input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
response = model.generate(input_ids, max_new_tokens=30)

print(tokenizer.decode(response[0]))
```
You can learn more about the high-level APIs here.
The low-level API is useful for designing custom experiments, such as sweeping over specific checkpoints, devices, and/or tools.
Here's a quick example of how to prompt a Hugging Face LLM using the low-level API, which calls the load and prompt tools one by one:
```python
import lemonade.tools.torch_llm as tl
import lemonade.tools.prompt as pt
from turnkeyml.state import State

state = State(cache_dir="cache", build_name="test")

state = tl.HuggingfaceLoad().run(state, input="facebook/opt-125m")
state = pt.Prompt().run(state, prompt="hi", max_new_tokens=15)

print("Response:", state.response)
```
Contributions are welcome! If you decide to contribute, please:
- Do so via a pull request.
- Write your code in keeping with the same style as the rest of this repo's code.
- Add a test under `test/lemonade` that provides coverage of your new feature.
The best way to contribute is to add new tools to cover more devices and usage scenarios.
To add a new tool:

- (Optional) Create a new `.py` file under `src/lemonade/tools` (or use an existing file if your tool fits into a pre-existing family of tools).
- Define a new class that inherits the `Tool` class from `TurnkeyML`.
- Register the class by adding it to the list of `tools` near the top of `src/lemonade/cli.py`.
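To make those steps concrete, here is a minimal sketch of what a new tool might look like. The import paths, `unique_name` attribute, and `parser()` convention are assumptions based on the low-level API example above; mirror an existing tool in `src/lemonade/tools` for the authoritative interface:

```python
import argparse

from turnkeyml.state import State
from turnkeyml.tools import Tool  # assumed import path for the Tool base class


class AddNote(Tool):
    """Illustrative tool that stores a note on the build state."""

    unique_name = "add-note"  # assumed: the name used to invoke the tool on the CLI

    @staticmethod
    def parser(add_help: bool = True) -> argparse.ArgumentParser:
        parser = argparse.ArgumentParser(
            prog="add-note", description="Store a note on the state", add_help=add_help
        )
        parser.add_argument("--note", default="hello", help="Note to store")
        return parser

    def run(self, state: State, note: str = "hello") -> State:
        # Each tool receives the state, does its work, and returns the state
        # so that the next tool in the command can pick it up.
        state.note = note
        return state
```

Once registered in the `tools` list in `src/lemonade/cli.py`, the tool would be invoked like any other, e.g. `lemonade -i facebook/opt-125m huggingface-load add-note --note hi`.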
You can learn more about contributing on the repository's contribution guide.