Functions exposed in the libgemma.a? #70
This is a great question. There are two big TODOs that will make using gemma.cpp as a library much better:
So if you don't want to deal with the sharper edges, you might wait for the above updates. That said, here are a few notes to get started. First, have a look at DEVELOPERS.md for some high-level notes (we'll be adding these notes and additional detail there). Keep in mind that gemma.cpp is oriented at more experimental / prototype / research applications; if you're targeting production, there are more standard paths for NN deployments via JAX / PyTorch / Keras. Unless you are doing lower-level research, from an application standpoint you can think of gemma.h and gemma.cc as the "core" of the library. The Gemma struct contains all the state of the inference engine - tokenizer, weights, and activations:
Line 237 in b6aaf6b
This creates a Gemma model object, which is a wrapper around four things - the tokenizer object, weights, activations, and KV cache: Line 267 in b6aaf6b
In a vanilla LLM app, you'll probably use a Gemma object directly. In more exotic data-processing or research applications, you might work with the weights, KV cache, and activations more directly rather than only through a Gemma object (e.g. you might have multiple KV caches and activation sets for a single set of weights). Use the tokenizer in the Gemma object (or interact with the Tokenizer object directly). You pretty much only do a couple of things with the tokenizer - encoding strings to token ids and decoding token ids back to strings: Line 194 in b6aaf6b
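To make the "multiple KV caches and activation sets for a single set of weights" idea concrete, here is a minimal self-contained sketch of that decomposition. The struct names mirror the concepts above but are illustrative mocks, not the real types from gemma.h:

```cpp
#include <cstddef>
#include <vector>

// Illustrative mock of the decomposed inference state described above
// (the real types live in gemma.h): one read-only set of weights can be
// shared by several independent KV caches and activation buffers, e.g.
// to run parallel conversations against a single loaded model.
struct Weights {            // loaded once, read-only during inference
  std::vector<float> params;
};
struct KVCache {            // per-conversation attention cache
  std::vector<float> keys, values;
  std::size_t pos = 0;      // number of tokens cached so far
};
struct Activations {        // scratch buffers mutated on every forward pass
  std::vector<float> hidden;
};

// A "session" pairs the shared weights with its own cache/activations.
struct Session {
  const Weights* weights;   // shared, not owned
  KVCache cache;
  Activations activations;
};

std::vector<Session> MakeSessions(const Weights& w, std::size_t n) {
  std::vector<Session> sessions(n);
  for (auto& s : sessions) s.weights = &w;
  return sessions;
}
```

The design point is simply that weights are immutable and shareable, while caches and activations are mutable per-stream state.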
The main entrypoint for generation is:

GenerateGemma(model, args, prompt, abs_pos, pool, inner_pool, stream_token,

Calling it with a tokenized prompt will 1) mutate the activation values in model
and 2) invoke StreamFunc - a lambda callback - for each generated token.
Your application defines its own StreamFunc as a lambda callback to do something every time a token string is streamed from the engine (e.g. print to the screen, write data to disk, send the string to a server, etc.). You can see in run.cc
that the StreamFunc lambda takes care of printing each token to the screen as it arrives:
Line 117 in b6aaf6b
auto stream_token = [&abs_pos, &current_pos, &args, &gen, &prompt_size,
Optionally you can define accept_token as another lambda - this is mostly for constrained-decoding use cases where you want to force the generation to fit a grammar. If you're not doing this, you can pass an empty lambda as a no-op, which is what run.cc does.
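The stream_token / accept_token division of labor can be sketched with a self-contained toy generation loop. Everything here (the StreamFunc/AcceptFunc aliases and the Generate() function) is a mock of the pattern, not the real gemma.h signatures:

```cpp
#include <functional>
#include <vector>

// Mock of the callback pattern GenerateGemma uses: the engine calls
// accept_token to filter candidates (constrained decoding) and
// stream_token to hand each emitted token to the application.
using StreamFunc = std::function<bool(int token, float prob)>;
using AcceptFunc = std::function<bool(int token)>;

// Toy "generation" loop: emits candidate tokens 0..n-1, skips any the
// accept_token predicate rejects, and stops early once stream_token
// returns false (how the app signals it is done consuming tokens).
std::vector<int> Generate(int n, const StreamFunc& stream_token,
                          const AcceptFunc& accept_token) {
  std::vector<int> emitted;
  for (int tok = 0; tok < n; ++tok) {
    if (!accept_token(tok)) continue;    // constrained-decoding hook
    emitted.push_back(tok);
    if (!stream_token(tok, 1.0f)) break; // app-defined per-token callback
  }
  return emitted;
}
```

An application would pass a lambda that prints or stores each token, and either a real grammar predicate or an always-true lambda as the no-op accept_token.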
If you want to invoke the neural network forward function directly, call the Transformer()
function.
For high-level applications, you might only call GenerateGemma()
and never interact directly with the neural network, but if you're doing something a bit more custom you can call Transformer(), which performs a single inference operation on a single token and mutates the Activations and the KVCache through the neural network computation.
For low-level operations, such as defining new architectures, call ops.h
functions directly.
You use ops.h
if you're writing other NN architectures or modifying the inference path of the Gemma model.
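To give a feel for the level of abstraction ops.h works at, here is a scalar matrix-vector product of the kind such a file provides as a primitive. This is an assumption-laden toy: the real ops.h in gemma.cpp is built on Highway SIMD, so this is only a sketch of the role these primitives play:

```cpp
#include <cstddef>
#include <vector>

// Toy analogue of an ops.h-style primitive: y = M x, with M stored
// row-major as rows x cols floats. New architectures are built by
// composing primitives like this (matmuls, softmax, norms, etc.).
std::vector<float> MatVec(const std::vector<float>& m, std::size_t rows,
                          std::size_t cols, const std::vector<float>& x) {
  std::vector<float> y(rows, 0.0f);
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
      y[r] += m[r * cols + c] * x[c];
  return y;
}
```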
Discussion
If you have additional questions or anything is unclear, feel free to follow up! We're also trying out a Discord server for discussion here: https://discord.gg/H5jCBAWxAe
I'm working on adding an example of using libgemma here: #82 in addition to refactoring library usage. Still a bit more work before merging, but if you're interested in libgemma, it might be worth tracking the implementation there. Closing this issue for now, but feel free to chime in if you're blocked on something.
Forgive my weak C++-fu. I've compiled gemma into libgemma.a to call from my C++ application. Is there documentation that details the function calls available in the library?