Functions exposed in the libgemma.a? #70

Closed
Christopheraburns opened this issue Mar 1, 2024 · 2 comments

@Christopheraburns

Forgive my weak C++-fu. I've compiled gemma into libgemma.a to call from my C++ application. Is there documentation that details the function calls available in the library?

@austinvhuang (Collaborator) commented Mar 1, 2024

This is a great question.

There are two big TODOs that will make using gemma.cpp as a library much better:

  • We have some example demo applications in the works. They're pretty trivial, but they're meant to illustrate swapping in your own application calling into the API in place of the interactive TUI. One is a silly message-of-the-day app; another basically does the "what does this code do?" RAG task from the README demo, but as a program. Once some of the P0s are cleared (like Generate compressed weights file from finetune #11), this is high on the list.

  • There are some aspects of gemma.h and gemma.cc that are a little too coupled to run.cc, so some changes to the API will happen once that gets decoupled a bit.

So if you don't want to deal with the sharper edges, you might wait for the above updates. That said, here are a few notes to get started. First, have a look at DEVELOPERS.md for some high-level notes (I'll be adding these notes and additional detail there).

Unless you are doing lower-level research, from an application standpoint you can think of gemma.h and gemma.cc as the "core" of the library. You can think of run.cc as an example application that your application is substituting for, so the invocations into gemma.h and gemma.cc you see in run.cc are probably the functions you'll be invoking.

Keep in mind gemma.cpp is oriented toward more experimental / prototype / research applications. If you're targeting production, there are more standard paths via JAX / PyTorch / Keras for NN deployments.

The Gemma struct contains all the state of the inference engine: the tokenizer, weights, and activations

Gemma(...) - constructor, called here:

gcpp::Gemma model(loader, pool);

creates a Gemma model object, which is a wrapper around the tokenizer object, weights, activations, and KV cache:

hwy::AlignedFreeUniquePtr<uint8_t[]> compressed_weights;

In a vanilla LLM app, you'll probably use a Gemma object directly. In more exotic data-processing or research applications, you might work with the weights, KV cache, and activations more directly (e.g. you might have multiple KV caches and activation sets for a single set of weights) rather than only using a Gemma object.
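To make that concrete, here's a minimal construction sketch. The header locations, the LoaderArgs flag parsing, and the constructor signature follow what run.cc does in the current tree; treat them as assumptions to check against your checkout rather than a stable interface:

```cpp
// Minimal construction sketch; header locations and constructor arguments
// follow run.cc in the current tree and may change as the API is decoupled.
#include <thread>

#include "gemma.h"     // gcpp::Gemma
#include "util/app.h"  // gcpp::LoaderArgs (flag parsing; location may vary by version)
#include "hwy/contrib/thread_pool/thread_pool.h"

int main(int argc, char** argv) {
  // Parses the tokenizer / weights / model command-line flags, like run.cc.
  gcpp::LoaderArgs loader(argc, argv);

  // Highway thread pool that drives the parallel parts of inference.
  hwy::ThreadPool pool(std::thread::hardware_concurrency());

  // Bundles the tokenizer, weights, activations, and KV cache.
  gcpp::Gemma model(loader, pool);
  return 0;
}
```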

Use the tokenizer in the Gemma object (or interact with the Tokenizer object directly)

You pretty much only do two things with the tokenizer: call Encode() to go from string prompts to token-id vectors, or Decode() to go from the token-id vectors output by the model back to strings. See:

HWY_ASSERT(model.Tokenizer().Encode(prompt_string, &prompt).ok());
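For example (a sketch, assuming the tokenizer keeps the sentencepiece-style Encode()/Decode() signatures used in run.cc; Tokenize and Detokenize are just illustrative helper names, and `model` is the Gemma object from the snippet above):

```cpp
#include <string>
#include <vector>

#include "gemma.h"     // gcpp::Gemma
#include "hwy/base.h"  // HWY_ASSERT

// String prompt -> token ids (the form GenerateGemma expects as input).
std::vector<int> Tokenize(gcpp::Gemma& model, const std::string& prompt_string) {
  std::vector<int> prompt;
  HWY_ASSERT(model.Tokenizer().Encode(prompt_string, &prompt).ok());
  return prompt;
}

// Token ids -> string (e.g. to turn generated ids back into text).
std::string Detokenize(gcpp::Gemma& model, const std::vector<int>& ids) {
  std::string text;
  HWY_ASSERT(model.Tokenizer().Decode(ids, &text).ok());
  return text;
}
```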

The main entrypoint for generation is GenerateGemma()

Calling into GenerateGemma as is done here:

GenerateGemma(model, args, prompt, abs_pos, pool, inner_pool, stream_token,

with a tokenized prompt will 1) mutate the activation values in the model and 2) invoke StreamFunc, a lambda callback, for each generated token.

Your application defines its own StreamFunc as a lambda callback to do something every time a token string is streamed from the engine (e.g. print to the screen, write data to disk, send the string to a server, etc.). You can see in run.cc that the StreamFunc lambda takes care of printing each token to the screen as it arrives:

auto stream_token = [&abs_pos, &current_pos, &args, &gen, &prompt_size,

Optionally, you can define accept_token as another lambda. This is mostly for constrained-decoding use cases where you want to force the generation to fit a grammar. If you're not doing this, you can pass an empty lambda as a no-op, which is what run.cc does.
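Putting the pieces together, a skeleton caller might look like the sketch below. The argument order mirrors the run.cc call shown above; the trailing arguments (the accept_token lambda, a std::mt19937 used for sampling, and a verbosity level) and the exact callback signatures are assumptions to verify against your copy of gemma.h:

```cpp
#include <iostream>
#include <random>
#include <string>
#include <vector>

#include "gemma.h"     // gcpp::Gemma, gcpp::GenerateGemma
#include "util/app.h"  // gcpp::InferenceArgs (location may vary by version)
#include "hwy/base.h"
#include "hwy/contrib/thread_pool/thread_pool.h"

// Sketch only: generates from a tokenized prompt and prints tokens as they
// arrive, roughly what run.cc's interactive loop does.
void GenerateToStdout(gcpp::Gemma& model, const gcpp::InferenceArgs& args,
                      const std::vector<int>& prompt, hwy::ThreadPool& pool,
                      hwy::ThreadPool& inner_pool) {
  size_t abs_pos = 0;    // absolute position in the KV cache
  std::mt19937 gen(42);  // RNG used when sampling tokens

  // StreamFunc: called for each generated token id; return true to continue.
  auto stream_token = [&](int token, float /*probability*/) {
    std::string piece;
    HWY_ASSERT(model.Tokenizer().Decode(std::vector<int>{token}, &piece).ok());
    std::cout << piece << std::flush;
    return true;
  };

  // AcceptFunc: accept every candidate token (no constrained decoding).
  auto accept_token = [](int) { return true; };

  gcpp::GenerateGemma(model, args, prompt, abs_pos, pool, inner_pool,
                      stream_token, accept_token, gen, /*verbosity=*/0);
}
```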

If you want to invoke the neural network forward function directly, call the Transformer() function

For high-level applications, you might only call GenerateGemma() and never interact directly with the neural network, but if you're doing something a bit more custom, you can call Transformer(), which performs a single inference step on a single token and mutates the Activations and the KVCache through the neural network computation.

For low-level operations or defining new architectures, call ops.h functions directly

You use ops.h if you're writing other NN architectures or modifying the inference path of the Gemma model.

Discussion

If you have additional questions or this is unclear, feel free to follow up! We're also trying out a Discord server for discussion here - https://discord.gg/H5jCBAWxAe

@austinvhuang (Collaborator)

I'm working on adding an example of using libgemma here: #82 in addition to refactoring library usage.

Still a bit more work to do before merging, but if you're interested in libgemma, it might be worth tracking the implementation there. Closing this issue for now, but feel free to chime in if you're blocked on something.

@tilakrayal added the documentation label Apr 24, 2024