Model Support:
- Gemma & Gemma 2 Models
- Llama & Llama2 & Llama3 Models
- Mistral & Mixtral Models
- Qwen2 Models
- IBM Granite Models
- GPT-2 Models
- BERT Models
- BPE Tokenizers
- WordPiece Tokenizers
Implements:
- Paged Attention
- Mixture of Experts
- Tool Calling
- Generate Embeddings
- Classifier Support
- Huggingface SafeTensors model and tokenizer format
- Support for F32, F16, BF16 types
- Support for Q8, Q4 model quantization
- Fast GEMM operations
- Distributed Inference!
Jlama requires Java 20 or later and utilizes the new Vector API for faster inference.
Add LLM Inference directly to your Java application.
Jlama includes a command line tool that makes it easy to use.
The CLI can be run with jbang.
#Install jbang (or https://www.jbang.dev/download/)
curl -Ls https://sh.jbang.dev | bash -s - app setup
#Install Jlama CLI (will ask if you trust the source)
jbang app install --force jlama@tjake
Now that you have jlama installed you can download a model from huggingface and chat with it. Note I have pre-quantized models available at https://hf.co/tjake
# Run the openai chat api and UI on a model
jlama restapi tjake/Llama-3.2-1B-Instruct-JQ4 --auto-download
open browser to http://localhost:8080/
Usage:
jlama [COMMAND]
Description:
Jlama is a modern LLM inference engine for Java!
Quantized models are maintained at https://hf.co/tjake
Choose from the available commands:
Inference:
chat Interact with the specified model
restapi Starts a openai compatible rest api for interacting with this model
complete Completes a prompt using the specified model
Distributed Inference:
cluster-coordinator Starts a distributed rest api for a model using cluster workers
cluster-worker Connects to a cluster coordinator to perform distributed inference
Other:
download Downloads a HuggingFace model - use owner/name format
list Lists local models
quantize Quantize the specified model
The main purpose of Jlama is to provide a simple way to use large language models in Java.
The simplest way to embed Jlama in your app is with the Langchain4j Integration.
If you would like to embed Jlama without langchain4j, add the following maven dependencies to your project:
<dependency>
<groupId>com.github.tjake</groupId>
<artifactId>jlama-core</artifactId>
<version>${jlama.version}</version>
</dependency>
<dependency>
<groupId>com.github.tjake</groupId>
<artifactId>jlama-native</artifactId>
<!-- supports linux-x86_64, macos-x86_64/aarch_64, windows-x86_64
Use https://github.com/trustin/os-maven-plugin to detect os and arch -->
<classifier>${os.detected.name}-${os.detected.arch}</classifier>
<version>${jlama.version}</version>
</dependency>
jlama uses Java 21 preview features. You can enable the features globally with:
export JDK_JAVA_OPTIONS="--add-modules jdk.incubator.vector --enable-preview"
or enable the preview features by configuring maven compiler and failsafe plugins.
Then you can use the Model classes to run models:
public void sample() throws IOException {
String model = "tjake/Llama-3.2-1B-Instruct-JQ4";
String workingDirectory = "./models";
String prompt = "What is the best season to plant avocados?";
// Downloads the model or just returns the local path if it's already downloaded
File localModelPath = new Downloader(workingDirectory, model).huggingFaceModel();
// Loads the quantized model and specified use of quantized memory
AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);
PromptContext ctx;
// Checks if the model supports chat prompting and adds prompt in the expected format for this model
if (m.promptSupport().isPresent()) {
ctx = m.promptSupport()
.get()
.builder()
.addSystemMessage("You are a helpful chatbot who writes short responses.")
.addUserMessage(prompt)
.build();
} else {
ctx = PromptContext.of(prompt);
}
System.out.println("Prompt: " + ctx.getPrompt() + "\n");
// Generates a response to the prompt and prints it
// The api allows for streaming or non-streaming responses
// The response is generated with a temperature of 0.7 and a max token length of 256
Generator.Response r = m.generate(UUID.randomUUID(), ctx, 0.0f, 256, (s, f) -> {});
System.out.println(r.responseText);
}
Or you can use a Builder API:
public void sample() throws IOException {
String model = "tjake/Llama-3.2-1B-Instruct-JQ4";
String workingDirectory = "./models";
String prompt = "What is the best season to plant avocados?";
// Downloads the model or just returns the local path if it's already downloaded
File localModelPath = new Downloader(workingDirectory, model).huggingFaceModel();
// Loads the quantized model and specified use of quantized memory
AbstractModel m = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);
PromptContext ctx;
// Checks if the model supports chat prompting and adds prompt in the expected format for this model
if (m.promptSupport().isPresent()) {
ctx = m.promptSupport()
.get()
.builder()
.addSystemMessage("You are a helpful chatbot who writes short responses.")
.addUserMessage(prompt)
.build();
} else {
ctx = PromptContext.of(prompt);
}
System.out.println("Prompt: " + ctx.getPrompt() + "\n");
// Generates a response to the prompt and prints it
// The api allows for streaming or non-streaming responses
// The response is generated with a temperature of 0.7 and a max token length of 256
Generator.Response r = m.generateBuilder()
.session(UUID.randomUUID()) //By default, UUID.randomUUID()
.promptContext(ctx) // Required or use prompt(String text)
.ntokens(256) //By default, 256
.temperature(0.0f) //By default, 0.0f
.onTokenWithTimings((s, aFloat) -> {}) //By default, (s, aFloat) -> {}, nothing
.generate();
System.out.println(r.responseText);
}
If you like or are using this project to build your own, please give us a star. It's a free way to show your support.
- Support more and more models
Add pure java tokenizersSupport Quantization (e.g. k-quantization)- Add LoRA support
- GraalVM support
Add distributed inference
The code is available under Apache License.
If you find this project helpful in your research, please cite this work at
@misc{jlama2024,
title = {Jlama: A modern Java inference engine for large language models},
url = {https://github.com/tjake/jlama},
author = {T Jake Luciani},
month = {January},
year = {2024}
}