Functions exposed in the libgemma.a? #70

Closed
Christopheraburns opened this issue Mar 1, 2024 · 2 comments

@Christopheraburns

Forgive my weak C++-fu. I've compiled gemma into libgemma.a to call from my C++ application. Is there documentation that details the function calls available in the library?

@austinvhuang (Collaborator) commented Mar 1, 2024

This is a great question.

There are two big TODOs that will make using gemma.cpp as a library much better:

  • We have some example demo applications in the works. They're pretty trivial, but they're meant to illustrate swapping in your own application calling into the API in place of the interactive TUI. One is a silly message-of-the-day app; another basically does the "what does this code do?" RAG task from the README demo, but as a program. Once some of the P0s are cleared (like Generate compressed weights file from finetune #11), this is high on the list.

  • There are some aspects of gemma.h and gemma.cc that are a little too coupled to run.cc, so some changes to the API will happen once that gets decoupled a bit.

So if you don't want to deal with the sharper edges, you might wait for the above updates. That said, here are a few notes to get started. First, have a look at DEVELOPERS.md for some high-level notes (I'll be adding these notes and additional detail there).

Unless you are doing lower-level research, from an application standpoint you can think of gemma.h and gemma.cc as the "core" of the library. You can think of run.cc as an example application that your application is substituting for, so the invocations into gemma.h and gemma.cc you see in run.cc are probably the functions you'll be invoking.

Keep in mind gemma.cpp is oriented toward more experimental / prototype / research applications. If you're targeting production, there are more standard paths via JAX / PyTorch / Keras for NN deployments.

The Gemma struct contains all the state of the inference engine: the tokenizer, weights, and activations

Gemma(...) - constructor, called here:

gcpp::Gemma model(loader, pool);

creates a Gemma model object, which is a wrapper around the tokenizer object, weights, activations, and KV cache:

hwy::AlignedFreeUniquePtr<uint8_t[]> compressed_weights;

In a vanilla LLM app, you'll probably use a Gemma object directly. In more exotic data-processing or research applications, you might work with the weights, KV cache, and activations more directly (e.g. you might have multiple KV caches and activation sets for a single set of weights) rather than only using a Gemma object.
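To make that concrete, here's a minimal construction sketch. The header locations, the LoaderArgs flag parsing, and the constructor signature follow what run.cc does in the current tree; treat them as assumptions to check against your checkout rather than a stable interface:

```cpp
// Minimal construction sketch; header locations and constructor arguments
// follow run.cc in the current tree and may change as the API is decoupled.
#include <thread>

#include "gemma.h"     // gcpp::Gemma
#include "util/app.h"  // gcpp::LoaderArgs (flag parsing; location may vary by version)
#include "hwy/contrib/thread_pool/thread_pool.h"

int main(int argc, char** argv) {
  // Parses the tokenizer / weights / model command-line flags, like run.cc.
  gcpp::LoaderArgs loader(argc, argv);

  // Highway thread pool that drives the parallel parts of inference.
  hwy::ThreadPool pool(std::thread::hardware_concurrency());

  // Bundles the tokenizer, weights, activations, and KV cache.
  gcpp::Gemma model(loader, pool);
  return 0;
}
```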

Use the tokenizer in the Gemma object (or interact with the Tokenizer object directly)

You pretty much only do two things with the tokenizer: call Encode() to go from string prompts to token-id vectors, or Decode() to go from the token-id vectors output by the model back to strings. See:

HWY_ASSERT(model.Tokenizer().Encode(prompt_string, &prompt).ok());
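For example (a sketch, assuming the tokenizer keeps the sentencepiece-style Encode()/Decode() signatures used in run.cc; Tokenize and Detokenize are just illustrative helper names, and `model` is the Gemma object from the snippet above):

```cpp
#include <string>
#include <vector>

#include "gemma.h"     // gcpp::Gemma
#include "hwy/base.h"  // HWY_ASSERT

// String prompt -> token ids (the form GenerateGemma expects as input).
std::vector<int> Tokenize(gcpp::Gemma& model, const std::string& prompt_string) {
  std::vector<int> prompt;
  HWY_ASSERT(model.Tokenizer().Encode(prompt_string, &prompt).ok());
  return prompt;
}

// Token ids -> string (e.g. to turn generated ids back into text).
std::string Detokenize(gcpp::Gemma& model, const std::vector<int>& ids) {
  std::string text;
  HWY_ASSERT(model.Tokenizer().Decode(ids, &text).ok());
  return text;
}
```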

The main entrypoint for generation is GenerateGemma()

Calling into GenerateGemma as is done here:

GenerateGemma(model, args, prompt, abs_pos, pool, inner_pool, stream_token,

with a tokenized prompt will 1) mutate the activation values in the model and 2) invoke StreamFunc, a lambda callback, for each generated token.

Your application defines its own StreamFunc as a lambda callback to do something every time a token string is streamed from the engine (e.g. print to the screen, write data to disk, send the string to a server, etc.). You can see in run.cc that the StreamFunc lambda takes care of printing each token to the screen as it arrives:

auto stream_token = [&abs_pos, &current_pos, &args, &gen, &prompt_size,

Optionally, you can define accept_token as another lambda. This is mostly for constrained-decoding use cases where you want to force the generation to fit a grammar. If you're not doing this, you can pass an empty lambda as a no-op, which is what run.cc does.
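Putting the pieces together, a skeleton caller might look like the sketch below. The argument order mirrors the run.cc call shown above; the trailing arguments (the accept_token lambda, a std::mt19937 used for sampling, and a verbosity level) and the exact callback signatures are assumptions to verify against your copy of gemma.h:

```cpp
#include <iostream>
#include <random>
#include <string>
#include <vector>

#include "gemma.h"     // gcpp::Gemma, gcpp::GenerateGemma
#include "util/app.h"  // gcpp::InferenceArgs (location may vary by version)
#include "hwy/base.h"
#include "hwy/contrib/thread_pool/thread_pool.h"

// Sketch only: generates from a tokenized prompt and prints tokens as they
// arrive, roughly what run.cc's interactive loop does.
void GenerateToStdout(gcpp::Gemma& model, const gcpp::InferenceArgs& args,
                      const std::vector<int>& prompt, hwy::ThreadPool& pool,
                      hwy::ThreadPool& inner_pool) {
  size_t abs_pos = 0;    // absolute position in the KV cache
  std::mt19937 gen(42);  // RNG used when sampling tokens

  // StreamFunc: called for each generated token id; return true to continue.
  auto stream_token = [&](int token, float /*probability*/) {
    std::string piece;
    HWY_ASSERT(model.Tokenizer().Decode(std::vector<int>{token}, &piece).ok());
    std::cout << piece << std::flush;
    return true;
  };

  // AcceptFunc: accept every candidate token (no constrained decoding).
  auto accept_token = [](int) { return true; };

  gcpp::GenerateGemma(model, args, prompt, abs_pos, pool, inner_pool,
                      stream_token, accept_token, gen, /*verbosity=*/0);
}
```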

If you want to invoke the neural network forward function directly, call the Transformer() function

For high-level applications, you might only call GenerateGemma() and never interact directly with the neural network, but if you're doing something a bit more custom, you can call Transformer(), which performs a single inference step on a single token and mutates the Activations and the KVCache through the neural network computation.

For low-level operations or defining new architectures, call ops.h functions directly

You use ops.h if you're writing other NN architectures or modifying the inference path of the Gemma model.

Discussion

If you have additional questions or this is unclear, feel free to follow up! We're also trying out a Discord server for discussion here - https://discord.gg/H5jCBAWxAe

@austinvhuang (Collaborator)

I'm working on adding an example of using libgemma here: #82 in addition to refactoring library usage.

Still a bit more work to do before merging, but if you're interested in libgemma, it might be worth tracking the implementation there. Closing this issue for now, but feel free to chime in if you're blocked on something.

@tilakrayal added the documentation label Apr 24, 2024