Lemonade SDK

The lemonade SDK provides everything needed to get up and running quickly with LLMs on OnnxRuntime GenAI (OGA).

Install

You can quickly get started with lemonade by installing the turnkeyml PyPI package with the appropriate extras for your backend, or you can install from source by cloning and installing this repository.

From PyPI

To install lemonade from PyPI:

  1. Create and activate a miniconda environment.

    conda create -n lemon python=3.10
    conda activate lemon
  2. Install lemonade with the extras for your backend of choice (an example command is shown after this list).

  3. Use lemonade -h to explore the LLM tools, and see the command and API examples below.
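
For example, to install the OGA iGPU backend from PyPI (a sketch: the llm-oga-igpu extra mirrors the from-source example below, and other backends use their own extras):

    pip install turnkeyml[llm-oga-igpu]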

From Source Code

To install lemonade from source code:

  1. Clone: git clone https://github.com/onnx/turnkeyml.git
  2. cd turnkeyml (where turnkeyml is the repo root of your clone)
    • Note: be sure to run these installation instructions from the repo root.
  3. Follow the same instructions as in the PyPI installation, except replace turnkeyml in the pip install command with a . (the current directory).
    • For example: pip install -e .[llm-oga-igpu]

From Lemonade_Server_Installer.exe

The Lemonade Server is available as a standalone tool with a one-click Windows installer .exe. Check out the Lemonade_Server_Installer.exe guide for installation instructions and the server spec to learn more about the functionality.

CLI Commands

The lemonade CLI uses a unique command syntax that enables convenient interoperability between models, frameworks, devices, accuracy tests, and deployment options.

Each unit of functionality (e.g., loading a model, running a test, deploying a server, etc.) is called a Tool, and a single call to lemonade can invoke any number of Tools. Each Tool will perform its functionality, then pass its state to the next Tool in the command.

You can read each command out loud to understand what it is doing. For example, a command like this:

    lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"

Can be read like this:

Run lemonade on the input (-i) checkpoint microsoft/Phi-3-mini-4k-instruct. First, load it in the OnnxRuntime GenAI framework (oga-load), onto the integrated GPU device (--device igpu) in the int4 data type (--dtype int4). Then, pass the OGA model to the prompting tool (llm-prompt) with the prompt (-p) "Hello, my thoughts are" and print the response.

The lemonade -h command will show you which options and Tools are available, and lemonade TOOL -h will tell you more about that specific Tool.
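
Because each Tool hands its state to the next, you can chain several of the tools described below into one command. As an illustrative sketch that combines tools documented later in this guide, the following loads a model once, runs an MMLU test, and then benchmarks it:

    lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 accuracy-mmlu --tests management oga-bench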

Prompting

To prompt your LLM, try:

OGA iGPU:

    lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 llm-prompt -p "Hello, my thoughts are"

Hugging Face:

    lemonade -i facebook/opt-125m huggingface-load llm-prompt -p "Hello, my thoughts are"

The LLM will run with your provided prompt, and its response will be printed to the screen. You can replace "Hello, my thoughts are" with any prompt you like.

You can also replace facebook/opt-125m with any Hugging Face checkpoint you like, including LLaMA-2, Phi-2, Qwen, Mamba, etc.

You can also set the --device argument in oga-load and huggingface-load to load your LLM on a different device.

Run lemonade huggingface-load -h and lemonade llm-prompt -h to learn more about those tools.

Accuracy

To measure the accuracy of an LLM using MMLU, try this:

OGA iGPU:

    lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 accuracy-mmlu --tests management

Hugging Face:

    lemonade -i facebook/opt-125m huggingface-load accuracy-mmlu --tests management

Each of those commands runs just the management test from MMLU on your LLM and saves the score to the lemonade cache at ~/.cache/lemonade.

You can run the full suite of MMLU subjects by omitting the --tests argument. You can learn more about this with lemonade accuracy-mmlu -h.

Benchmarking

To measure the time-to-first-token and tokens/second of an LLM, try this:

OGA iGPU:

    lemonade -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench

Hugging Face:

    lemonade -i facebook/opt-125m huggingface-load huggingface-bench

Each of those commands runs a few warmup iterations, then a few generation iterations during which performance data is collected.

The prompt size, number of output tokens, and number of iterations are all parameters. Learn more by running lemonade oga-bench -h or lemonade huggingface-bench -h.

LLM Report

To see a report that contains all the benchmarking results and all the accuracy results, use the report tool with the --perf flag:

    lemonade report --perf

The results can be filtered by model name, device type and data type. See how by running lemonade report -h.

Memory Usage

The peak memory used by the lemonade build is captured in the build output. To capture more granular memory usage information, use the --memory flag. For example:

OGA iGPU:

    lemonade --memory -i microsoft/Phi-3-mini-4k-instruct oga-load --device igpu --dtype int4 oga-bench

Hugging Face:

    lemonade --memory -i facebook/opt-125m huggingface-load huggingface-bench

In this case, a memory_usage.png file is generated and stored in the build folder. This file contains a plot of memory usage over the course of the build. Learn more by running lemonade -h.

Serving

You can launch an OpenAI-compatible server with:

    lemonade serve

Visit the server spec to learn more about the endpoints provided.
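
As a quick sanity check, you can exercise an OpenAI-compatible endpoint with the openai Python client. This is only a sketch: the base URL, port, and model name below are placeholders, so use the address printed when lemonade serve starts and a model name from the server spec.

    from openai import OpenAI

    # Placeholder base URL and API key: use the address printed by `lemonade serve`.
    client = OpenAI(base_url="http://localhost:8000/api/v0", api_key="not-needed")

    # Placeholder model name: pick one listed in the server spec.
    completion = client.chat.completions.create(
        model="<model-name>",
        messages=[{"role": "user", "content": "Hello, my thoughts are"}],
    )

    print(completion.choices[0].message.content)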

API

Lemonade is also available via API.

High-Level APIs

The high-level lemonade API abstracts loading models from any supported framework (e.g., Hugging Face, OGA) and backend (e.g., CPU, iGPU, Hybrid) using the popular from_pretrained() function. This makes it easy to integrate lemonade LLMs into Python applications.

OGA iGPU:

    from lemonade.api import from_pretrained

    # Load the checkpoint onto the integrated GPU using the OGA iGPU recipe
    model, tokenizer = from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", recipe="oga-igpu")

    # Tokenize a prompt, generate up to 30 new tokens, and print the result
    input_ids = tokenizer("This is my prompt", return_tensors="pt").input_ids
    response = model.generate(input_ids, max_new_tokens=30)

    print(tokenizer.decode(response[0]))

You can learn more about the high-level APIs here.

Low-Level API

The low-level API is useful for designing custom experiments, such as sweeping over specific checkpoints, devices, and/or tools.

Here's a quick example of how to prompt a Hugging Face LLM using the low-level API, which calls the load and prompt tools one by one:

    import lemonade.tools.torch_llm as tl
    import lemonade.tools.prompt as pt
    from turnkeyml.state import State

    # Create a State object that points at a local cache directory for this build
    state = State(cache_dir="cache", build_name="test")

    # Run the load and prompt tools one after the other, passing state between them
    state = tl.HuggingfaceLoad().run(state, input="facebook/opt-125m")
    state = pt.Prompt().run(state, prompt="hi", max_new_tokens=15)

    print("Response:", state.response)

Contributing

Contributions are welcome! If you decide to contribute, please:

  • Do so via a pull request.
  • Write your code in keeping with the same style as the rest of this repo's code.
  • Add a test under test/lemonade that provides coverage of your new feature.

The best way to contribute is to add new tools to cover more devices and usage scenarios.

To add a new tool:

  1. (Optional) Create a new .py file under src/lemonade/tools (or use an existing file if your tool fits into a pre-existing family of tools).
  2. Define a new class that inherits the Tool class from TurnkeyML (a minimal sketch is shown after this list).
  3. Register the class by adding it to the list of tools near the top of src/lemonade/cli.py.
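
As a minimal, hedged sketch of step 2 (the import path and the exact Tool interface shown here are assumptions; study the existing tools under src/lemonade/tools for the real pattern, including any required name and argument parser), a new tool broadly looks like:

    from turnkeyml.tools import Tool  # assumed import path; check the repo
    from turnkeyml.state import State


    class MyNewTool(Tool):
        """Hypothetical tool: receives state from the previous tool and returns it."""

        # The real Tool base class likely requires additional members
        # (e.g., a unique name and an argparse parser); see existing tools.

        def run(self, state: State) -> State:
            # Do this tool's work here, then pass state along the chain
            return state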

You can learn more about contributing on the repository's contribution guide.