LLaMA (Large Language Model Meta AI) is a large language model from Meta. A model in this sense is a binary file containing the weights and biases of the network. These models come in different sizes and are trained on different datasets; the larger variants have more parameters, were trained on more data, and generally produce better results.
LLaMA is really a family of pre-trained models at various scales (number of weights), from 7B to 70B parameters.
It is based on the transformer architecture with some improvements:
- RMSNorm pre-normalization (the pre-normalization approach is apparently inspired by GPT-3)
- SwiGLU activation function (apparently from Google's PaLM)
- grouped-query attention (GQA) instead of standard multi-head attention (in the larger Llama 2 models)
- Rotary Positional Embeddings (RoPE) instead of standard positional embeddings (apparently inspired by GPT-Neo), see rope.md
- AdamW optimizer
I've seen the architecture of a transformer where there is an encoder and a decoder, but Llama only uses the decoder part (it is a decoder-only model).
+-----------+
| Softmax |
+-----------+
↑
+-----------+
| Linear |
+-----------+
↑
+-----------+
| RMS Norm |
+-----------+
↑
|
|
+-----------------------|--------------------------------------------+
| +--------------->+ |
| | ↑ |
| | +--------------------+ |
| | | Feed Forward SwiGLU| |
| | +--------------------+ |
| | ↑ |
| | +--------------------+ |
| | | RMS Norm | |
| | +--------------------+ |
| | ↑ |
| | | |
| +----------------| |
| +----------------->+ |
| | ↑ |
| | +-----------------------------------------------+ | N times
| | | Self Attention (Grouped Multi Query Attention)| |
| | | with KV Cache | |
| | +-----------------------------------------------+ |
| | ↑ ROPE ↑ ROPE ↑ |
| | Q| K| V| |
| | | | | |
| | ----------------------------------------------- |
| | | |
| | +--------------------+ |
| | | RMS Norm | |
| | +--------------------+ |
| | | |
| +-----------------------| |
| | |
+----------------------------|---------------------------------------+
|
+--------------------+
| Embeddings |
+--------------------+
llama.cpp is a library written in C/C++ that contains useful programs/examples for running inference on a Llama model, as well as for quantizing a model. The examples directory shows how to do both.
This is just for getting an orientation of the headers and source cpp files that are used.
It can be run locally:
$ cd ai/llama.cpp
$ make -j8
Next we need to download a model to use and store it in the models directory. I tried the following model:
https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf
This is the 13B-parameter Llama 2 model fine-tuned for chat, quantized with Q4_0. It is in GGUF format which is suitable for CPU usage. More details about GGUF can be found in gptq.md.
Example of running:
$ cd ~/ai/llama.cpp
$ ./main -m models/llama-2-13b-chat.Q4_0.gguf --prompt "What is LoRA?</s>"
main can be found in examples/main/main.cpp and the output of the log statements can be found in main-timestamp.log.
We can build the main executable with debug symbols enabled:
$ env GGML_DEBUG=1 LLAMA_DEBUG=1 DEBUG=1 make -B main
After that we are able to run the main executable using a debugger:
$ gdb --args ./main -m models/llama-2-13b-chat.Q4_0.gguf --prompt "What is LoRA?</s>"
Reading symbols from ./main...
(gdb)
Now we can break in main and then run the program:
(gdb) break main
Breakpoint 1 at 0x40a4c9: file examples/main/main.cpp, line 105.
(gdb) run
The first line of code we encounter is:
gpt_params params;
We can inspect the type of this variable:
(gdb) ptype gpt_params
type = struct gpt_params {
uint32_t seed; // Random seed value
int32_t n_threads;
int32_t n_predict;
int32_t n_ctx;
int32_t n_batch;
int32_t n_keep;
int32_t n_draft;
int32_t n_chunks;
int32_t n_gpu_layers;
int32_t main_gpu;
float tensor_split[1];
I don't intend to step through the code but merely to have the ability to inspect/debug it at a later time.
Sampling parameters.
(gdb) p *ctx
$6 = {mem_size = 16, mem_buffer = 0x65c360, mem_buffer_owned = true, no_alloc = false, no_alloc_save = false,
n_objects = 0, objects_begin = 0x0, objects_end = 0x0, scratch = {offs = 0, size = 0, data = 0x0},
scratch_save = {offs = 0, size = 0, data = 0x0}}
(gdb) ptype *ctx
type = struct ggml_context {
size_t mem_size;
void *mem_buffer;
_Bool mem_buffer_owned;
_Bool no_alloc;
_Bool no_alloc_save;
int n_objects;
ggml_object *objects_begin;
ggml_object *objects_end;
ggml_scratch scratch;
ggml_scratch scratch_save;
}
So this struct contains pointers to memory buffers, which is where all tensors will be allocated.
The presence penalty is a hyperparameter used to control whether new tokens appear in the generated text. A value of 0 has no effect on token generation. A positive value penalizes tokens that have already appeared, encouraging the model to introduce new tokens, while a negative value encourages it to reuse tokens it has already generated.
The repeat penalty is a hyperparameter used to control the repetition of tokens in the generated text. Setting this to 1 has no effect on token generation. A value less than 1 encourages the model to repeat tokens, while a value greater than 1 discourages repetition.
I ran into this issue with the llm-chains-chat-demo where the llm would just repeat the new-line token over and over until the context size was reached. Adding a repeat penalty of 1.1 fixed this issue and is something to be aware of.
The frequency penalty is a hyperparameter used to control how often tokens are repeated based on how frequently they already appear in the generated text. A value of 0 has no effect on token generation. A positive value penalizes tokens in proportion to how often they have already been generated, discouraging frequent tokens, while a negative value encourages them.
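To make the effect of these penalties a bit more concrete, here is a minimal sketch of how such penalties can be applied to a logits array. This is not the actual llama.cpp sampling code (the real implementation tracks the recent tokens and handles special tokens differently); the function name apply_penalties is made up for this example.
```c++
#include <unordered_map>
#include <vector>

// Sketch: apply repeat/presence/frequency penalties to the logits of tokens
// that already appear in prev_tokens. Not the llama.cpp implementation.
void apply_penalties(std::vector<float> & logits,
                     const std::vector<int> & prev_tokens,
                     float repeat_penalty,      // 1.0 = no effect
                     float presence_penalty,    // 0.0 = no effect
                     float frequency_penalty) { // 0.0 = no effect
    std::unordered_map<int, int> counts;
    for (int tok : prev_tokens) {
        counts[tok]++;
    }
    for (const auto & [tok, count] : counts) {
        float & logit = logits[tok];
        // repeat penalty: divide positive logits, multiply negative ones,
        // so values > 1.0 always make an already-seen token less likely.
        if (logit > 0.0f) {
            logit /= repeat_penalty;
        } else {
            logit *= repeat_penalty;
        }
        // presence penalty: flat penalty for having appeared at all.
        // frequency penalty: scales with how often the token appeared.
        logit -= presence_penalty + frequency_penalty * count;
    }
}
```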
llm-chain has what it refers to as drivers; at the time of this writing there are two drivers: OpenAI and Llama. Llama uses a binding to llama.cpp, which is created using bindgen. The crate llm-chain-llama-sys contains the binding and llm-chain-llama contains the Rust API.
This section tries to explain the key-value cache used in the llama 2 architecture. This is the caching of the key and value matrices that are used in the attention architecture.
First let's take a look at inference without the key-value cache. In the following example we start with input tokens with a dimension of 2, so for each token we have a vector of size 2. We will assume that the first token is the start-of-sentence token and that the model is going to predict the next token. What it predicts does not matter for the point being made here, so just ignore the actual values.
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
input token [1 2] (Start of Sentence)
Time=1
[ Q K^T QK^T ] * V
[1 2] . [1] = [5] * [1 2] = [5 10]
[2]
(dot product) (scalar multiplication because [5] is a single number)
So we can see that we have performed the dot product of the query and the key and then multiplied that by the value matrix. Lets pretend that the output of this is then the probability of the next token, and that token was selected from the models vocabulary (we are ignoring softmax etc here).
So for the next token we then append the predicted token to the input tokens and run the inference again:
Time=2
[ Q K^T QK^T ] * V
[1 2 ] . [1 5] = [5 25 ] * [1 2] = [5*1 + 25*5 5*2 + 25*10] = [130 260]
[5 10] [2 10] [25 125] [5 10] [25*1 + 125*5 25*2 + 125*10] [650 1300]
(dot product) (matrix multiplication)
Once again we will take the last token and append it to the input tokens and run the inference again:
Time=3
[ Q K^T QK^T ] * V
[1 2 ] . [1 5 650] = [5 25 3250 ] =
[5 10] [2 10 1300] [25 125 16250 ]
[650 1300] [3250 16250 2112500]
Notice that we have calculated the dot product for the first and second token again, in the Key and Value matrices:
[ Q K^T QK^T ] * V
[1 2 ] . [1 5 ---] = [5 25 ---- ] =
[5 10] [2 10 ----] [25 125 ----- ]
[650 1300] [---- ----- -------]
The transformer architecture needs to have all the previous tokens in the sequence to predict the next token but we don't have to recalculate the dot products every time. We can cache the key and value matrices and then just use a single input token, the last token predicted, and not append that token to the input:
Time=2
[ Q K^T QK^T ] * V
[5 10 ] . [1 5] = [25 125] * [1 2 ] = [650 1300]
[2 10] [5 10]
By only appending the new token's entries to the Key and Value matrices, instead of recomputing everything for the whole sequence, we perform fewer operations. This is the key-value cache.
So every token that is processed needs to be added to the Key and Value cache matrices to be used in the next prediction.
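The same idea as a small standalone sketch (my own toy code, not the llama.cpp implementation): the K and V rows of previously processed tokens are kept in a cache and only the new token's row is appended each step, so only one query row is needed. Softmax and scaling are left out, just like in the example above.
```c++
#include <cstdio>
#include <vector>

using Vec = std::vector<float>;

// Toy cache: one K row and one V row per processed token.
struct KVCache {
    std::vector<Vec> K;
    std::vector<Vec> V;
};

// Process one new token: append its K/V rows, then compute
// scores = q . K^T followed by scores * V (no softmax/scaling here).
Vec attend(KVCache & cache, const Vec & q, const Vec & k, const Vec & v) {
    cache.K.push_back(k);
    cache.V.push_back(v);

    std::vector<float> scores(cache.K.size(), 0.0f);
    for (size_t i = 0; i < cache.K.size(); i++) {
        for (size_t d = 0; d < q.size(); d++) {
            scores[i] += q[d] * cache.K[i][d];
        }
    }
    Vec out(v.size(), 0.0f);
    for (size_t i = 0; i < cache.V.size(); i++) {
        for (size_t d = 0; d < out.size(); d++) {
            out[d] += scores[i] * cache.V[i][d];
        }
    }
    return out;
}

int main() {
    KVCache cache;
    // Time=1: token [1 2] (using the same Q=K=V values as the example above).
    Vec out1 = attend(cache, {1, 2}, {1, 2}, {1, 2});
    // Time=2: only the new token [5 10] is passed in, the cache holds the rest.
    Vec out2 = attend(cache, {5, 10}, {5, 10}, {5, 10});
    std::printf("t1: [%g %g]\n", out1[0], out1[1]);   // [5 10]
    std::printf("t2: [%g %g]\n", out2[0], out2[1]);   // [650 1300]
    return 0;
}
```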
Let's take a look at how this is implemented in llama.cpp. I'll be using simple-prompt to demonstrate this.
$ gdb --args ./simple-prompt
(gdb) br simple-prompt.cpp:34
(gdb) r
(gdb) f
#1 0x0000000000408fab in main (argc=1, argv=0x7fffffffd198) at src/simple-prompt.cpp:34
34 llama_context * ctx = llama_new_context_with_model(model, ctx_params);
(gdb) s
gdb) l
8736 return nullptr;
8737 }
8738
8739 llama_context * ctx = new llama_context(*model);
if (!llama_kv_cache_init(ctx->kv_self, ctx, type_k, type_v, kv_size, cparams.offload_kqv)) {
LLAMA_LOG_ERROR("%s: llama_kv_cache_init() failed for self-attention cache\n", __func__);
llama_free(ctx);
return nullptr;
}
(gdb) p type_k
$11 = GGML_TYPE_F16
(gdb) p type_v
$12 = GGML_TYPE_F16
(gdb) p kv_size
$13 = 1024
(gdb) p cparams.offload_kqv
$14 = true
(gdb) p ctx->kv_self
$15 = {has_shift = false, do_defrag = false, do_copy = false, recurrent = false,
v_trans = true, head = 0, size = 0, used = 0, n = 0,
type_k = GGML_TYPE_F16,
type_v = GGML_TYPE_F16,
cells = std::vector of length 0, capacity 0,
k_l = std::vector of length 0, capacity 0,
v_l = std::vector of length 0, capacity 0,
ctxs = std::vector of length 0, capacity 0,
bufs = std::vector of length 0, capacity 0}
static bool llama_kv_cache_init(
struct llama_kv_cache & cache,
const llama_context * ctx,
ggml_type type_k,
ggml_type type_v,
uint32_t kv_size,
bool offload) {
...
const uint32_t n_embd_k_gqa = hparams.n_embd_k_gqa() + hparams.n_embd_k_s();
const uint32_t n_embd_v_gqa = hparams.n_embd_v_gqa() + hparams.n_embd_v_s();
const int64_t n_layer = hparams.n_layer;
(gdb) p n_embd_k_gqa
$19 = 5120
(gdb) p n_embd_v_gqa
$20 = 5120
(gdb) p n_layer
$21 = 40
cache.head = 0;
cache.size = kv_size;
cache.used = 0;
cache.type_k = type_k;
cache.type_v = type_v;
cache.cells.clear();
cache.cells.resize(kv_size);
After the resize the cache cells will contain 1024 cells:
(gdb) p cache.cells.size()
$28 = 1024
cache.k_l.reserve(n_layer);
cache.v_l.reserve(n_layer);
for (int i = 0; i < (int) n_layer; i++) {
struct ggml_context * ctx = offload ? ctx_map.at(model.buft_layer[i].buft) : cache.ctxs.front();
ggml_tensor * k = ggml_new_tensor_1d(ctx, type_k, n_embd_k_gqa*kv_size);
ggml_tensor * v = ggml_new_tensor_1d(ctx, type_v, n_embd_v_gqa*kv_size);
ggml_format_name(k, "cache_k_l%d", i);
ggml_format_name(v, "cache_v_l%d", i);
cache.k_l.push_back(k);
cache.v_l.push_back(v);
}
At this stage the kv_self is uninitialized. We can inspect this struct using:
gdb) ptype ctx.kv_self
type = struct llama_kv_cache {
bool has_shift;
bool do_defrag;
bool do_copy;
bool recurrent;
uint32_t head;
uint32_t size;
uint32_t used;
uint32_t n;
ggml_type type_k;
ggml_type type_v;
std::vector<llama_kv_cell> cells;
std::vector<ggml_tensor*> k_l;
std::vector<ggml_tensor*> v_l;
std::vector<ggml_context*> ctxs;
std::vector<ggml_backend_buffer*> bufs;
public:
size_t total_size(void) const;
~llama_kv_cache(void);
}
We can see we have two vectors of pointers to ggml_tensor. These are the key and value matrices for each layer.
A little further down we have:
(gdb) s
8789 if (!llama_kv_cache_init(ctx->model.hparams, ctx->kv_self, memory_type, cparams.n_ctx, model->n_gpu_layers)) {
Notice that we are passing in `ctx->kv_self`, and the cparams.n_ctx which is
the context length set to 1024 in this case.
(gdb) s
1518 static bool llama_kv_cache_init(
(gdb) l
1519 const struct llama_hparams & hparams,
1520 struct llama_kv_cache & cache,
1521 ggml_type wtype,
1522 uint32_t n_ctx,
1523 int n_gpu_layers) {
1524 const uint32_t n_embd = hparams.n_embd_gqa();
1525 const uint32_t n_layer = hparams.n_layer;
1526
1527 const int64_t n_mem = n_layer*n_ctx;
1528 const int64_t n_elements = n_embd*n_mem;
The first line had me asking what gqa is, and I think it stands for grouped query attention.
(gdb) s
1532 cache.head = 0;
(gdb) s
1533 cache.size = n_ctx;
(gdb) s
1534 cache.used = 0;
(gdb) s
1536 cache.cells.clear();
(gdb) p cache
$20 = (llama_kv_cache &) @0xc624c8: {has_shift = false, head = 0, size = 1024, used = 0, n = 0,
cells = std::vector of length 0, capacity 0, k = 0x0, v = 0x0, ctx = 0x0, buf = {data = 0x0, size = 0,
fallback = false}}
1536 cache.cells.clear();
(gdb) n
1537 cache.cells.resize(n_ctx);
(gdb) n
(gdb) p cache
$21 = (llama_kv_cache &) @0xc624c8: {has_shift = false, head = 0, size = 1024, used = 0, n = 0,
cells = std::vector of length 1024, capacity 1024 = {{pos = -1, delta = 0, seq_id = std::set with 0 elements}, {
pos = -1, delta = 0, seq_id = std::set with 0 elements}, {pos = -1, delta = 0,
seq_id = std::set with 0 elements}, {pos = -1, delta = 0, seq_id = std::set with 0 elements}, {pos = -1,
...
1544 params.mem_buffer = cache.buf.data;
(gdb) s
1545 params.no_alloc = false;
(gdb) s
1547 cache.ctx = ggml_init(params);
So we can see here that we are going to initialize a context for ggml. I had not noticed that ggml_context *ctx was a member of llama_kv_cache.
Next we are going to create a one dimensional tensor of GGML_TYPE_F16 (half
precision float) with 209715200 elements.
1554 cache.k = ggml_new_tensor_1d(cache.ctx, wtype, n_elements);
(gdb) p n_elements
$31 = 209715200
(gdb) p wtype
$32 = GGML_TYPE_F16
Hmm, the size of this tensor didn't make sense to me at first. The 1d tensor is like a list of numbers and its size is 209715200. The type of these slots is F16, so that is 2 bytes per slot.
llama_new_context_with_model: kv self size = 800.00 MiB
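The numbers do add up (assuming the 13B model from above, with n_embd_gqa = 5120 and n_layer = 40, which is what the gdb output earlier showed):
n_elements = n_embd * n_layer * n_ctx = 5120 * 40 * 1024 = 209715200
F16 is 2 bytes per element, so the K cache is 209715200 * 2 = 419430400 bytes = 400 MiB, and the V cache is the same size again, which gives the 800 MiB reported in the log line above.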
gdb) s
8804 ctx->logits.reserve(hparams.n_vocab);
(gdb) p hparams.n_vocab
$40 = 32000
That is pretty much it for the initialization of the llama_context. This will return us to simple-prompt.cpp:
34 llama_context * ctx = llama_new_context_with_model(model, ctx_params);
35 if (ctx == NULL) {
36 fprintf(stderr , "%s: error: failed to create the llama_context\n" , __func__);
37 return 1;
38 }
Now, lets look what happens when decode is called and how this interacts with the key-value cache.
(gdb) br simple-prompt.cpp:115
Breakpoint 2 at 0x40949e: file src/simple-prompt.cpp, line 115.
(gdb) c
Continuing.
batch.n_tokens: 6
batch.tokens: [1, 1724, 338, 4309, 4717, 29973, ]
prompt: What is LoRA?
Breakpoint 2, main (argc=1, argv=0x7fffffffd198) at src/simple-prompt.cpp:115
115 if (llama_decode(ctx, batch) != 0) {
And the batch looks like this:
(gdb) p batch
$54 = {n_tokens = 6, token = 0xc63980, embd = 0x0, pos = 0x8a86f0, n_seq_id = 0x8ab7b0, seq_id = 0x8abfc0,
logits = 0x8a9790 "", all_pos_0 = 0, all_pos_1 = 0, all_seq_id = 0}
(gdb) s
5678 const auto n_batch = cparams.n_batch;
(gdb) s
(gdb) p n_batch
$56 = 512
So, n_batch is the maximum number of tokens that can be in a single batch, and n_tokens is the number of tokens in the current batch.
(gdb) n
5682 int n_threads = n_tokens == 1 ? cparams.n_threads : cparams.n_threads_batch;
I found this a little interesting because I've always called decode with a number of tokens, never a single token. But thinking back to the example with the key-value cache, it would pass in a single token as the input while the key and value matrices contain all the previous tokens.
(gdb) n
5695 auto & kv_self = lctx.kv_self;
5734 // if we have enough unused cells before the current head ->
5735 // better to start searching from the beginning of the cache, hoping to fill it
5736 if (kv_self.head > kv_self.used + 2*n_tokens) {
5737 kv_self.head = 0;
5738 }
5739
5740 if (!llama_kv_cache_find_slot(kv_self, batch)) {
1584 // find an empty slot of size "n_tokens" in the cache
1585 // updates the cache head
1586 // Note: On success, it's important that cache.head points
1587 // to the first cell of the slot.
1588 static bool llama_kv_cache_find_slot(
1589 struct llama_kv_cache & cache,
1590 const struct llama_batch & batch) {
1591 const uint32_t n_ctx = cache.size;
1592 const uint32_t n_tokens = batch.n_tokens;
1593
1594 if (n_tokens > n_ctx) {
1595 LLAMA_LOG_ERROR("%s: n_tokens=%d > n_ctx=%d\n", __func__, n_tokens, n_ctx);
1596 return false;
1597 }
In this case we have the max context size in token of 1024 and the number of tokens in the batch is 6:
(gdb) p n_ctx
$69 = 1024
(gdb) p n_tokens
$70 = 6
The number of tokens in the batch cannot exceed the max context size.
(gdb) l
1599 uint32_t n_tested = 0;
1600
1601 while (true) {
1602 if (cache.head + n_tokens > n_ctx) {
1603 n_tested += n_ctx - cache.head;
1604 cache.head = 0;
1605 continue;
1606 }
So we are going to loop, and the first thing we do is check if the head plus the number of tokens in the batch exceeds the max number of tokens allowed. If this is the case then n_tested is incremented with the max context size minus the cache head, and the head is reset to 0.
Lets pretend that we have a head that is 1020 and the number of tokens is 6 and n_ctx is 1024. Then 1020+6=1026 and 1026 > 1024. And n_tested will become 1024-1020=4. And the head will be set to 0. And then the loop will continue but this time head will be zero. And the if statement will compare 6 > 1024 which is false and skip the body of the if statement.
1608 bool found = true;
1609 for (uint32_t i = 0; i < n_tokens; i++) {
1610 if (cache.cells[cache.head + i].pos >= 0) {
1611 found = false;
1612 cache.head += i + 1;
(gdb) l
1613 n_tested += i + 1;
1614 break;
1615 }
1616 }
So we are going to loop over all the 6 tokens in the batch.
(gdb) p i
$83 = 0
(gdb) p cache.head
$84 = 0
(gdb) p cache.cells[cache.head + i].pos
$85 = -1
cache.cells is a vector of size 1024, the max number of tokens allowed.
(gdb) p lctx.kv_self.cells.size()
$24 = 1024
If an entry in this vector is currently not set, its position is -1. So in our case the positions of the cells checked will not be greater than or equal to zero and the if block will not be executed. This is making sure that from the current head there are n_tokens slots available.
That will lead us to the following code:
1628 for (uint32_t i = 0; i < n_tokens; i++) {
1629 cache.cells[cache.head + i].pos = batch.pos[i];
1630
1631 for (int32_t j = 0; j < batch.n_seq_id[i]; j++) {
1632 cache.cells[cache.head + i].seq_id.insert(batch.seq_id[i][j]);
1633 }
1634 }
1635
1636 cache.used += n_tokens;
1637
1638 return true;
And it makes sense that we will again loop over the 6 tokens in the batch and now add them to the cells.
gdb) p cache.cells
$106 = std::vector of length 1024, capacity 1024 = {
{pos = 0, delta = 0, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 1, delta = 0, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 2, delta = 0, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 3, delta = 0, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 4, delta = 0, seq_id = std::set with 1 element = {[0] = 0}},
{pos = 5, delta = 0, seq_id = std::set with 1 element = {[0] = 0}},
That is it for finding a slot in the key-value cache.
(gdb) s
5747 kv_self.n = std::min(
(int32_t) cparams.n_ctx,
std::max(32, GGML_PAD(llama_kv_cache_cell_max(kv_self), 32))
);
A cell is considered in use if its position is greater than or equal to zero and its set of sequence ids is not empty. So that should return 6 in our case:
(gdb) p llama_kv_cache_cell_max(kv_self)
$108 = 6
(gdb) p cparams.n_ctx
$109 = 1024
(gdb) p kv_self.n
$110 = 32
What is kv_self.n? It is the number of cells that the following computation will consider for this batch, clamped to at least 32 and padded to a multiple of 32: min(1024, max(32, GGML_PAD(6, 32))) = 32, which matches the value printed above.
(gdb) f
#0 llama_decode_internal (lctx=..., batch=...) at llama.cpp:5751
5751 ggml_allocr_reset(lctx.alloc);
(gdb) n
5753 ggml_cgraph * gf = llama_build_graph(lctx, batch);
So we are building a compute graph for this batch.
(gdb) n
5757 struct ggml_tensor * res = gf->nodes[gf->n_nodes - 1];
The result tensor is the last node in the graph. The embedding tensor is the second to last node in the graph:
(gdb) n
5758 struct ggml_tensor * embeddings = gf->nodes[gf->n_nodes - 2];
(gdb) n
5816 ggml_graph_compute_helper(lctx.work_buffer, gf, n_threads);
This will now run the compute graph which involves starting threads:
gdb) n
[New Thread 0x7ffe072eb6c0 (LWP 2085673)]
[New Thread 0x7ffe06aea6c0 (LWP 2085674)]
[New Thread 0x7ffe062e96c0 (LWP 2085675)]
[Thread 0x7ffe062e96c0 (LWP 2085675) exited]
[Thread 0x7ffe06aea6c0 (LWP 2085674) exited]
[Thread 0x7ffe072eb6c0 (LWP 2085673) exited]
5825 if (kv_self.has_shift) {
Next, we have:
5823 // update the kv ring buffer
5824 {
5825 if (kv_self.has_shift) {
5826 kv_self.has_shift = false;
5827 for (uint32_t i = 0; i < kv_self.size; ++i) {
5828 kv_self.cells[i].delta = 0;
5829 }
The llama_kv_cache is a ring buffer, and when the buffer is full and we need to add data, the oldest data is overwritten, which I believe is what the shifting refers to. This is false in our case.
5832 kv_self.head += n_tokens;
(gdb) s
5835 if (kv_self.head >= kv_self.size) {
(gdb) p kv_self.head
$118 = 6
(gdb) p kv_self.size
$119 = 1024
So that was the initial prompt which has now been decoded and then we will use the logits to predict the next token. This will be a single token which we will then pass into llama_decode:
(gdb)
206 if (llama_decode(ctx, batch)) {
(gdb) p batch
$125 = {n_tokens = 1, token = 0xc63980, embd = 0x0, pos = 0x8a86f0, n_seq_id = 0x8ab7b0, seq_id = 0x8abfc0,
logits = 0x8a9790 "\001", all_pos_0 = 0, all_pos_1 = 0, all_seq_id = 0}
(gdb) p batch.pos[0]
$126 = 6
Notice that this time around the kv_self is:
(gdb) p lctx.kv_self
$128 = {has_shift = false, head = 6, size = 1024, used = 6, n = 32,
...
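For reference, the overall generation flow in simple-prompt.cpp is roughly the sketch below: decode the prompt batch once, then repeatedly sample a token from the logits and decode it as a single-token batch with an increasing position. The function names (llama_get_logits, llama_sample_token_greedy, llama_batch_get_one, llama_n_vocab) are from the llama.cpp API of roughly this vintage; treat this as a sketch of the flow with made-up loop bounds, not the exact code in simple-prompt.cpp, and check llama.h for the exact signatures.
```c++
// Sketch: assumes <vector> and llama.h are included, and that `ctx`, `model`
// and the already decoded prompt `batch` exist (with the logits flag set for
// the last prompt token).
int n_past = batch.n_tokens;              // next position in the sequence
const int n_vocab = llama_n_vocab(model);

for (int i = 0; i < 32; i++) {
    // logits for the last token that requested them
    float * logits = llama_get_logits(ctx);

    std::vector<llama_token_data> candidates;
    candidates.reserve(n_vocab);
    for (llama_token id = 0; id < n_vocab; id++) {
        candidates.push_back(llama_token_data{ id, logits[id], 0.0f });
    }
    llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };

    // greedy sampling: pick the token with the highest logit
    llama_token new_token = llama_sample_token_greedy(ctx, &candidates_p);

    // decode the single new token at position n_past, sequence id 0
    if (llama_decode(ctx, llama_batch_get_one(&new_token, 1, n_past, 0))) {
        break;
    }
    n_past += 1;
}
```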
I'm trying to understand where the kv_self is used. I can see that it is updated as we saw above. Where is the Q and the K^T calculation done? For this we have to take a look at how the compute graph is built. This is done in build_llama. wip
This struct holds input data for llama_decode and is defined as:
typedef struct llama_batch {
int32_t n_tokens;
llama_token * token;
float * embd;
llama_pos * pos;
int32_t * n_seq_id;
llama_seq_id * seq_id;
int8_t * logits;
} llama_batch;
typedef int32_t llama_token;
n_tokens is a counter of the number of tokens that this batch contains.
llama_token is defined as an integer and is like an index into the vocabulary of the model. token is a pointer to an int, so think of it as an array. A llama_batch can contain either tokens or embeddings (embd), which are floats.
pos is the position of each token in the sequence. Notice that this is also a pointer and we can think of it as an array:
pos[0] position in the sequence for token[0]
Next we have the number of sequence id's, which is also an array so we have one
entry for each token in the batch. This indicates how many sequences this token
belongs to, or is part of. If we have a value that is greater than 1 this means
that this token is part of multiple sequences of this batch. This is something
that has confused me for a while and I'm trying to understand the use case for
this. One thought I had was that this was done to save space, like if the same
token is used in multiple sequences we might get away with only storing it once
in the tokens array. But this does not seem to be the case as each token is
still stored in the array and the seq_id
array is of the same size.
Are these hints to the inference engine perhaps?
To try this out with two prompts/sequences:
prompt1: Yesterday was Friday
prompt2: Yesterday was Monday
So in this case we have two tokens in common for both sequences, Yesterday and was.
Using the batch example, this is what the batch that we will pass to llama_decode will look like:
Creating new llama_batch with 2 sequences
Token id [43688] in common at positions:
Sequence 1, Index 0
Sequence 0, Index 0
Token id [369] in common at positions:
Sequence 1, Index 1
Sequence 0, Index 1
Processing prompt 0, nr tokens: 3 (batch_n_tokens: 0)
idx: 0, token_id: 43688
n_seq_id: 2
seq_id[0]: 1
seq_id[1]: 0
idx: 1, token_id: 369
n_seq_id: 2
seq_id[0]: 1
seq_id[1]: 0
idx: 2, token_id: 6794
n_seq_id: 1
seq_id[0]: 0
Processing prompt 1, nr tokens: 3 (batch_n_tokens: 3)
idx: 3, token_id: 43688
n_seq_id: 2
seq_id[0]: 1
seq_id[1]: 0
idx: 4, token_id: 369
n_seq_id: 2
seq_id[0]: 1
seq_id[1]: 0
idx: 5, token_id: 7216
n_seq_id: 1
seq_id[0]: 1
batch.n_tokens: 6
batch.tokens: [43688, 369, 6794, 43688, 369, 7216, ]
We can see that we still have the tokens 43688 and 369 in the batch twice.
(gdb) p batch.token[0]
$2 = 43688
(gdb) p model.vocab.id_to_token[batch.token[0]]
$3 = {text = "Yesterday", score = 0, attr = LLAMA_TOKEN_ATTR_NORMAL}
(gdb) p batch.token[1]
$5 = 369
(gdb) p model.vocab.id_to_token[batch.token[1]]
$4 = {text = "Ġwas", score = 0, attr = LLAMA_TOKEN_ATTR_NORMAL}
And these tokens have the positions, 0, 1, 3, and 4.
In llama_sbatch::from_batch there will be an in-place sort of the ids. The vector of ids will initially look like this:
(gdb) p ids
$3 = std::vector of length 6, capacity 6 = {0, 1, 2, 3, 4, 5}
This vector of ids will now be sorted using the following function:
std::sort(ids.begin(), ids.end(),
[&batch](size_t a, size_t b) {
int32_t n_seq_a = batch.n_seq_id ? batch.n_seq_id[a] : 1;
int32_t n_seq_b = batch.n_seq_id ? batch.n_seq_id[b] : 1;
// sort by seq_id, then by pos
if (n_seq_a == n_seq_b) {
if (batch.seq_id) {
for (int32_t i = 0; i < n_seq_a; ++i) {
llama_seq_id seq_id_a = batch.seq_id[a][i];
llama_seq_id seq_id_b = batch.seq_id[b][i];
// smaller seq_ids go first
if (seq_id_a != seq_id_b) {
return seq_id_a < seq_id_b;
}
}
}
// when all else is equal, sort by pos
if (batch.pos) {
return batch.pos[a] < batch.pos[b];
}
// no pos, sort by id
return a < b;
}
// shared prompts go first
return n_seq_a > n_seq_b;
}
);
(gdb) br llama.cpp:3195 -commands
End with a line saying just "end".
>silent
>printf "a=%zu, b=%zu, n_seq_a=%d, n_seq_b=%d\n", a, b, n_seq_a, n_seq_b
>continue
>end
(gdb) c
Continuing.
a=1, b=0, n_seq_a=2, n_seq_b=2
a=1, b=0, n_seq_a=2, n_seq_b=2
a=2, b=0, n_seq_a=1, n_seq_b=2
a=2, b=1, n_seq_a=1, n_seq_b=2
a=3, b=0, n_seq_a=2, n_seq_b=2
a=3, b=2, n_seq_a=2, n_seq_b=1
a=3, b=1, n_seq_a=2, n_seq_b=2
a=3, b=0, n_seq_a=2, n_seq_b=2
a=4, b=0, n_seq_a=2, n_seq_b=2
a=4, b=2, n_seq_a=2, n_seq_b=1
a=4, b=1, n_seq_a=2, n_seq_b=2
a=5, b=0, n_seq_a=1, n_seq_b=2
a=5, b=2, n_seq_a=1, n_seq_b=1
And after the sort the ids vector will look like this:
(gdb) p ids
$16 = std::vector of length 6, capacity 6 = {0, 3, 1, 4, 2, 5}
Notice that the tokens that are shared between sequences are the first ids in the vector now. After this the ids will be iterated over, and in our case this will start with 0, which has 2 sequences ids:
(gdb) p batch.n_seq_id[0]
$72 = 2
Before this loop last_seq is set to nullptr:
// init seq
llama_sbatch_seq * last_seq = nullptr;
for (size_t i = 0; i < n_tokens; ++i) {
const size_t bi = ids[i];
const int32_t n_seqs = batch.n_seq_id[bi];
llama_seq_id * seq_ids = batch.seq_id[bi];
... // skipping the case where last_seq != nullptr
llama_sbatch_seq new_seq = {n_seqs, seq_ids, i, 1};
seq.push_back(new_seq);
last_seq = &seq.back();
}
(gdb) p new_seq
$75 = {n_seq_id = 2, seq_id = 0x55555665c010, offset = 0, length = 1}
(gdb) p seq
$76 = std::vector of length 0, capacity 0
So my understanding of this is that we are creating a new llama_sbatch_seq for the token at offset 0 in the original llama_batch (which llama_sbatch has a pointer to as a member field). The initial length of this sequence is 1 (but this will be updated later), and seq_id is also a pointer into the original batch. And this token has 2 sequence ids, that is, this token is part of another sequence as well.
This is then added to the llama_sbatch seq vector:
(gdb) p this
$79 = (llama_sbatch * const) 0x555555ad5aa0
(gdb) p this.seq
$80 = std::vector of length 1, capacity 1 = {
{n_seq_id = 2, seq_id = 0x55555665c010, offset = 0, length = 1}
}
For the next iteration it will process the second element of the ids vector which is 3, but this time last_seq is not null:
for (size_t i = 0; i < n_tokens; ++i) {
const size_t bi = ids[i];
const int32_t n_seqs = batch.n_seq_id[bi];
llama_seq_id * seq_ids = batch.seq_id[bi];
if (last_seq != nullptr) {
bool same = n_seqs == last_seq->n_seq_id;
for (int32_t j = 0; same && j < n_seqs; ++j) {
if (seq_ids[j] != last_seq->seq_id[j]) {
same = false;
}
}
if (same) {
last_seq->length += 1;
continue;
}
}
llama_sbatch_seq new_seq = {n_seqs, seq_ids, i, 1};
seq.push_back(new_seq);
last_seq = &seq.back();
}
So the same variable is created by comparing the number of sequences for token 3 with the number for the last token; if they are equal then same will be true:
(gdb) p n_seqs == last_seq->n_seq_id
$88 = true
(gdb) p n_seqs
$89 = 2
(gdb) p last_seq->n_seq_id
$90 = 2
(gdb)
Then this will iterate over all the sequence ids and compare the sequence ids of the current token (3) with those of the last sequence. In our case they are all the same, so same will remain true and last_seq->length will get incremented by one.
We then move on to the third token in ids which is 1. The same will happen again for this token and the length will be incremented once more.
Next the fourth token in ids will be processed, which is 4. The same thing will happen again and last_seq will look like this at this point:
(gdb) p * last_seq
$114 = {n_seq_id = 2, seq_id = 0x55555665c010, offset = 0, length = 4}
Next we have the fifth token in ids which is 2. This token has only one sequence so same will be false. So this will create a new llama_sbatch_seq:
(gdb) p new_seq
$122 = {n_seq_id = 1, seq_id = 0x555555ad5eb0, offset = 4, length = 1}
And last_seq will be updated to point to this new sequence.
(gdb) p *last_seq
$124 = {n_seq_id = 1, seq_id = 0x555555ad5eb0, offset = 4, length = 1}
Next we have the last token in ids which is 5. This token has only one sequence
but this time the last seq also has only one sequence so same
will be true.
But in the check these two tokens are not part of the same sequence:
(gdb) p seq_ids[j]
$131 = 1
(gdb) p last_seq->seq_id[j]
$132 = 0
So same will be set to false and a new llama_sbatch_seq will be created:
(gdb) p new_seq
$133 = {n_seq_id = 1, seq_id = 0x555555a0c660, offset = 5, length = 1}
The loop will then exit. Let's inspect this.seq after the final sorting:
(gdb) p seq
$136 = std::vector of length 3, capacity 4 = {
{n_seq_id = 1, seq_id = 0x555555ad5eb0, offset = 4, length = 1},
{n_seq_id = 1, seq_id = 0x555555a0c660, offset = 5, length = 1},
{n_seq_id = 2, seq_id = 0x55555665c010, offset = 0, length = 4}}
Keep in mind that ids is a member of sbatch; after this function returns the values in this vector will be:
this.ids : { 0, 3, 1, 4, 2, 5 }
(token ids) : [43688, 43688, 369, 369, 6794, 7216, ]
So the first entry in the seq vector is for token 6794 which is not part of any other sequence and therefore has a length of 1. Same with the last token 7216.
Later, back in llama_decode_internal, this setup of batches that I've done with the sequence ids will cause the first tokens to be set on the input tensor as follows:
ggml_backend_tensor_set(lctx.inp_tokens, batch.token, 0, n_tokens*ggml_element_size(lctx.inp_tokens));
(gdb) p n_tokens
$194 = 4
$195 = (const llama_ubatch &) @0x7fffffffd550: {equal_seqs = true,
n_tokens = 4, n_seq_tokens = 4, n_seqs = 1, token = 0x555555a8be80, embd = 0x0,
pos = 0x55555665c030, n_seq_id = 0x5555559fa260, seq_id = 0x555555a2a460,
output = 0x55555665c050 ""}
(gdb) p batch.token[0]
$190 = 43688
(gdb) p batch.token[1]
$191 = 43688
(gdb) p batch.token[2]
$192 = 369
(gdb) p batch.token[3]
$193 = 369
But notice that this is passing tokens which don't make sense to be processed. I think I'm way off on this idea and how sequences are intended to be used.
This feels a little embarrassing and I now realize that I've been thinking about this in the wrong way; actually I think I looked into the sequence ids before and should have recalled this. I've been looking at this when using a batch to call llama_decode once, and one usage of the sequence ids is to be able to pass multiple sequences to the model in a single batch. And we can specify the sequence id in following llama_decode calls as well.
This struct holds input data for llama_decode. For example, if we pass in a prompt of "What is LoRA" that would first be tokenized and then the tokens will be added to the batch. An example of this can be found in simple-prompt.cpp.
I'm still not sure when using multiple sequence ids for a single token is useful. An instance of a batch contains a count of the number of tokens (or embeddings) that this batch holds. In the above case n_tokens would be 7.
llama_batch batch = llama_batch_init(512, /*embd*/ 0, /*n_seq_max*/ 1);
for (int i = 0; i < n_tokens; i++) {
// the token of this batch entry.
batch.token[i] = input_tokens[i];
// the position in the sequence of this batch entry.
batch.pos[i] = i,
// the number of sequence id's of this batch entry.
batch.n_seq_id[i] = 1;
batch.seq_id[i][0] = 0; // the sequence id
// Determines if the logits for this token should be generated or not.
batch.logits[i] = false;
// Increment the number of tokens in the batch.
batch.n_tokens++;
}
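Note that the loop above sets batch.logits[i] = false for every entry. To actually get logits to sample the next token from, the flag for (at least) the last token of the prompt is normally set to true afterwards, something like `batch.logits[batch.n_tokens - 1] = true;` (this is the common pattern in the llama.cpp examples; check simple-prompt.cpp for the exact code).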
We can take a look at the third token in this batch using:
$ gdb --args ./simple-prompt
(gdb) br llama.cpp:5514
(gdb) r
(gdb) p batch
$10 = {n_tokens = 7, token = 0xc4fb20, embd = 0x0, pos = 0x894ab0, n_seq_id = 0x897d10, seq_id = 0x898520,
logits = 0x895b50 "", all_pos_0 = 0, all_pos_1 = 0, all_seq_id = 0}
(gdb) p batch.token[3]
$6 = 4309
(gdb) p batch.pos[3]
$7 = 3
Now, I think I understand that position is the position of this token in the input sequence.
But I'm not sure what n_seq_id is. There is one for each token in the batch so it has a size of 7 (n_tokens).
(gdb) p batch.n_seq_id[3]
$13 = 1
(gdb) p *batch.seq_id[3]
$9 = 0
A batch can consist of multiple sequences, and a token in a batch can belong to multiple sequences.
This is an enum in llama.h which looks like this:
enum llama_pooling_type {
LLAMA_POOLING_TYPE_UNSPECIFIED = -1,
LLAMA_POOLING_TYPE_NONE = 0,
LLAMA_POOLING_TYPE_MEAN = 1,
LLAMA_POOLING_TYPE_CLS = 2,
};
Now, my understanding of these is that pooling is used for tasks like sentiment analysis where we have a sequence of tokens and we want to classify the sentiment of the text. For this we might use MEAN or CLS pooling. What happens is that the output of the transformer is reduced to a single vector with the dimension of the embedding size, where the values are the mean over the tokens in the case of the MEAN type. So I think this is only applicable when we are generating embeddings (cparams.embeddings).
token0 [1 2 3 4 5]
token1 [6 7 8 9 10]
token2 [11 12 13 14 15]
token3 [16 17 18 19 20]
token4 [21 22 23 24 25]
And for each feature we would take the mean and that would be the value of the output vector.
output [11 12 13 14 15]
For CLS (in the context for BERT this is the "classification" token) the output is the "token" vector for the CLS:
CLS [1 2 3 4 5]
token1 [6 7 8 9 10]
token2 [11 12 13 14 15]
token3 [16 17 18 19 20]
token4 [21 22 23 24 25]
token5 [26 27 28 29 30]
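A minimal sketch of what MEAN pooling does with the per-token vectors above (plain C++ of my own, not the llama.cpp code): average every feature over the tokens so the whole sequence is reduced to a single embedding vector.
```c++
#include <vector>

// Mean pooling: `embeddings` holds one vector of size n_embd per token;
// the result is a single vector of size n_embd with the per-feature mean.
std::vector<float> mean_pool(const std::vector<std::vector<float>> & embeddings) {
    const size_t n_tokens = embeddings.size();
    const size_t n_embd   = embeddings[0].size();
    std::vector<float> out(n_embd, 0.0f);
    for (const auto & token : embeddings) {
        for (size_t i = 0; i < n_embd; i++) {
            out[i] += token[i];
        }
    }
    for (size_t i = 0; i < n_embd; i++) {
        out[i] /= (float) n_tokens;
    }
    return out;
}
// With the 5 example token vectors above this returns [11 12 13 14 15].
```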
There is a model params value (tensor_split) which is a pointer to floats, and the size of this array is the value of LLAMA_MAX_DEVICES. This value is defined in llama.h:
#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#define LLAMA_MAX_DEVICES GGML_CUDA_MAX_DEVICES
#else
#define LLAMA_MAX_DEVICES 1
#endif // GGML_USE_CUBLAS
...
struct llama_model_params {
int32_t n_gpu_layers; // number of layers to store in VRAM
int32_t main_gpu; // the GPU that is used for scratch and small tensors
const float * tensor_split; // how to split layers across multiple GPUs (size: LLAMA_MAX_DEVICES)
...
}
So in the case where cuBLAS (CUDA Basic Linear Algebra Subroutine) is used, the size of this array will be the maximum number of devices that can be used. The values in this array are floats describing how the layers of the neural network should be split across the devices. This allows for specifying that more layers should be stored on one device than another. For example [0.7, 0.3] would mean that 70% of the layers should be stored on the first device and 30% on the second device.
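As a hedged sketch of how this could be used when loading a model (field names are taken from the struct above; llama_model_default_params and llama_load_model_from_file are the usual entry points for this era of the API, but check your llama.h):
```c++
// Sketch: split layers across two GPUs, 70% on device 0 and 30% on device 1.
// Only meaningful when llama.cpp is built with GPU support (LLAMA_MAX_DEVICES > 1).
static const float splits[2] = { 0.7f, 0.3f };

llama_model_params mparams = llama_model_default_params();
mparams.n_gpu_layers = 35;       // how many layers to offload in total
mparams.main_gpu     = 0;        // device used for scratch and small tensors
mparams.tensor_split = splits;   // proportions per device

llama_model * model = llama_load_model_from_file(
    "models/llama-2-13b-chat.Q4_0.gguf", mparams);
```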
The llama_batch struct looks like this:
// Input data for llama_decode
// A llama_batch object can contain input about one or many sequences
// The provided arrays (i.e. token, embd, pos, etc.) must have size of n_tokens
//
// - token : the token ids of the input (used when embd is NULL)
// - embd : token embeddings (i.e. float vector of size n_embd) (used when token is NULL)
// - pos : the positions of the respective token in the sequence
// - seq_id : the sequence to which the respective token belongs
// - logits : if zero, the logits for the respective token will not be output
//
typedef struct llama_batch {
int32_t n_tokens;
llama_token * token;
float * embd;
llama_pos * pos;
int32_t * n_seq_id;
llama_seq_id ** seq_id;
int8_t * logits;
// NOTE: helpers for smooth API transition - can be deprecated in the future
// for future-proof code, use the above fields instead and ignore everything below
//
// pos[i] = all_pos_0 + i*all_pos_1
//
llama_pos all_pos_0; // used if pos == NULL
llama_pos all_pos_1; // used if pos == NULL
llama_seq_id all_seq_id; // used if seq_id == NULL
} llama_batch;
n_tokens is the number of tokens in this batch.
token is an int pointer to the tokens.
embd is a float pointer to embeddings.
pos is a pointer to the positions of the tokens in the sequence. So each token in the batch will have a value in this array, which is of size n_tokens.
n_seq_id is the number of sequence ids for each token. I still don't understand what the sequence ids are used for. Perhaps they are used for parallel processing?
seq_id is the sequence id(s) for each token in the batch. So each token will have a sequence id.
all_pos_0 is only used if pos is NULL.
all_pos_1 is only used if pos is NULL.
So if the pos array/pointer is null then the decode function will check for this condition and populate a std::vector<llama_pos> with the size of n_tokens.
It will then iterate through the range of 0..n_tokens and set the pos values:
all_pos_0 = 0
all_pos_1 = 1
pos[i] = all_pos_0 + i * all_pos_1
pos[0] = 0 + 0 * 1 = 0
pos[1] = 0 + 1 * 1 = 1
pos[2] = 0 + 2 * 1 = 2
And we can specify that the position starts with a value other than 0:
all_pos_0 = 4
all_pos_1 = 1
pos[0] = 4 + 0 * 1 = 4
pos[1] = 4 + 1 * 1 = 5
pos[2] = 4 + 2 * 1 = 6
So this is just a way of specifying the position of the tokens in the sequence and something we could have done manually ourselves. Also note that this is deprecated in llama.cpp.
// NOTE: helpers for smooth API transition - can be deprecated in the future
// for future-proof code, use the above fields instead and ignore everything below
//
// pos[i] = all_pos_0 + i*all_pos_1
//
llama_pos all_pos_0; // used if pos == NULL
llama_pos all_pos_1; // used if pos == NULL
llama_seq_id all_seq_id; // used if seq_id == NULL
all_seq_id is only used if seq_id is NULL.
std::vector<int32_t> n_seq_id;
std::vector<llama_seq_id *> seq_id_arr;
std::vector<std::vector<llama_seq_id>> seq_id;
if (batch.seq_id == nullptr) {
n_seq_id.resize(n_tokens);
seq_id.resize(n_tokens);
seq_id_arr.resize(n_tokens);
for (uint32_t i = 0; i < n_tokens; i++) {
n_seq_id[i] = 1;
seq_id[i].resize(1);
seq_id[i][0] = batch.all_seq_id;
seq_id_arr[i] = seq_id[i].data();
}
batch.n_seq_id = n_seq_id.data();
batch.seq_id = seq_id_arr.data();
}
So we will first create the n_seq_id, seq_id, and seq_id_arr vectors, of size 3 in this case. And batch.all_seq_id is 0.
seq_id[ [0],
[0],
[0],
]
Let's start with n_seq_id, which is an array where each token in a batch has an entry. The value in that position specifies the number of sequences that the token is part of (still not sure exactly what this means or how it is used, but hopefully that will clear up).
Lets say that we have a batch of 3 tokens and the second token is part of two
sequences:
n_seq_id[1] = 2;
The corresponding entry in the seq_id vector will point to a vector of size 2 in that case.
seq_id[ [0],
[1, 2],
[0],
]
One usage of the n_seq_id is in llama_kv_cache_find_slot:
for (uint32_t i = 0; i < n_tokens; i++) {
cache.cells[cache.head + i].pos = batch.pos[i];
for (int32_t j = 0; j < batch.n_seq_id[i]; j++) {
cache.cells[cache.head + i].seq_id.insert(batch.seq_id[i][j]);
}
}
cache.used += n_tokens;
So for each token in the batch, the entry at cache.head + i will be updated with the current token's position. And notice that it will loop through the number of sequences that the current token has (the value of n_seq_id) and insert each of them into the seq_id set, which is a member of the llama_kv_cell struct:
struct llama_kv_cell {
llama_pos pos = -1;
llama_pos delta = 0;
std::set<llama_seq_id> seq_id;
};
Hmm, I think what I've been missing might be that a batch can contain one or more tokens, and each token has a position related to it. For example, we could have an initial prompt that we pass as input to llama_decode, which we know takes a batch:
batch_0:
n_tokens: 3
token: [102, 23, 2993]
pos: [0, 1, 2]
n_seq_id: [1, 1, 1]
seq_id: [[0], [0], [0]]
And then we call one of the sampling functions, like llama_sample_token_greedy, to get a new token id. We will then send this token id and have the llm infer the next token. This time we only send a single token as input:
batch_1:
n_tokens: 1
token: [1186]
pos: [3]
n_seq_id: [1]
seq_id: [[0]]
But notice that we have updated the position of the token in the sequence to be 3 and the sequence id is still 0. So we are still in the same sequence but we have moved to the next token in the sequence. And we can continue to do this until we have a new sequence. For example, we could have a new sequence that starts with the token 0:
batch_2:
n_tokens: 1
token: [0]
pos: [0]
n_seq_id: [1]
seq_id: [[1]]
Now, how about the case where we have multiple sequence ids in a batch. For example:
sequence_0: Dan loves ice cream
batch.n_tokens: 4
batch.pos: [0, 1, 2, 3]
batch.n_seq_id: [1, 1, 1, 1]
batch.seq_id: [[0], [0], [0], [0]]
An entry in the kv cache contains a position and also a set of sequence ids, as we also saw above:
struct llama_kv_cell {
llama_pos pos = -1;
llama_pos delta = 0;
std::set<llama_seq_id> seq_id;
};
When a batch is processed and added into the kv cache, all the tokens in the batch will be iterated over, the next available slot (head+i) will be found, and that kv cache entry's pos will be set to the current token's position. The same iteration will also add the sequence ids of the current token to the same kv cache entry.
We can add a second sequence to a batch by using the following:
llama_batch batch = llama_batch_init(512, 0, 2);
for (int i = 0; i < n_tokens; i++) {
batch.token[i] = input_tokens[i];
batch.pos[i] = i;
batch.n_seq_id[i] = 2;
batch.seq_id[i][0] = 0;
batch.seq_id[i][1] = 1;
batch.logits[i] = false;
batch.n_tokens++;
}
Now, I can see the usefulness of having sequence ids: for example, if I start with one query and then ask a completely different question, I want it to be evaluated separately from the first, but I might also want to come back to the first.
TODO: Figure out the usage of multiple sequence ids in a batch.
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors
from ./models/llama-2-7b-chat.Q4_0.gguf (version GGUF V2)
llama_model_loader: - tensor 0: token_embd.weight q4_0 [4096, 32000, 1, 1]
These are the token embeddings for the model. Recall that the model has an embedding size of 4096 and a vocab size of 32000, so for each token in the vocabulary there is an embedding with a dimension of 4096.
Sometimes you might have a token id and want to know what it represents. This can be done by opening a program in a debugger. For example:
$ gdb -args ./simple-prompt
Reading symbols from ./simple-prompt...
(gdb) br simple-prompt.cpp:21
Breakpoint 1 at 0x408ea2: file src/simple-prompt.cpp, line 21.
(gdb) r
After the model has loaded we can then inspect the token embeddings, in this case I wanted to know what the token id 29871 represents
(gdb) p model.vocab.id_to_token[29871]
$6 = {text = "▁", score = -1e+09, type = LLAMA_TOKEN_TYPE_NORMAL}
This page https://gpus.llm-utils.org/llama-2-prompt-template/, and also https://huggingface.co/blog/llama2#how-to-prompt-llama-2, specifies that a prompt look as follows:
<s>[INST] <<SYS>>
{your_system_message}
<</SYS>>
{user_message_1} [/INST]
This is how the model was trained and so this is what it expects. This might sound obvious but I ran into this issue when trying to create a prompt that would use retrieval augmented generation (RAG). I was trying to add some additional examples of interactions for the model as a system message but I originally specified them something like this:
let sys_prompt = r#"
[INST] <<SYS>>
{{ system_prompt }}
Only respond with the YAML and nothing else.
Here are some previous interactions between the Assistant and a User:
User: What is RHSA-1820:1234?
Assistant:
```yaml
command: VectorStoreTool
input:
query: "RHSA-1820:1234"
limit: 4
```
User: Can you show me the details about advisory RHSA-1721:4231?
Assistant:
```yaml
command: VectorStoreTool
input:
query: "RHSA-1721:4231"
limit: 4
```
User: Is is RHSA-1721:4231 about an OpenSSL exploit?
Assistant:
```yaml
command: VectorStoreTool
input:
query: "RHSA-1721:4231"
limit: 4
```
Your output should only be YAML and include query, and limit fields. Do not output any other text or other information.
<</SYS>>
{{ user_message }} [/INST]"#;
Sometimes the llm would get this right but most of the time it would not and would not create valid YAML. After a while I realized my mistake and changed the user messages in the examples to include the [INST] and [/INST] tags:
let sys_prompt = r#"
[INST] <<SYS>>
{{ system_prompt }}
Only respond with the YAML and nothing else.
Here are some previous interactions between the Assistant and a User:
[INST] User: What is RHSA-1820:1234? [/INST]
Assistant:
```yaml
command: VectorStoreTool
input:
query: "RHSA-1820:1234"
limit: 4
```
[INST] User: Can you show me the details about advisory RHSA-1721:4231? [/INST]
Assistant:
```yaml
command: VectorStoreTool
input:
query: "RHSA-1721:4231"
limit: 4
```
[INST] User: Is is RHSA-1721:4231 about an OpenSSL exploit? [/INST]
Assistant:
```yaml
command: VectorStoreTool
input:
query: "RHSA-1721:4231"
limit: 4
```
Your output should only be YAML and include query, and limit fields. Do not output any other text or other information.
<</SYS>>
{{ user_message }} [/INST]"#;
With those changes the llm was able to generate valid YAML for the examples.
The llama model can be quantized to reduce the size of the model. The names of these model files include the quantization bit width and also some additional letters. For example:
llama-2-7b-chat.Q2_K.gguf # 2-bit quantization using Q2_K method
llama-2-7b-chat.Q3_K_L.gguf # 3-bit quantization using Q3_K_L method
llama-2-7b-chat.Q3_K_M.gguf # 3-bit quantization using Q3_K_M method
llama-2-7b-chat.Q3_K_S.gguf # 3-bit quantization using Q3_K_S method
llama-2-7b-chat.Q4_0.gguf
llama-2-7b-chat.Q4_K_M.gguf
L = large
M = medium
S = small
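These files are produced with the quantize tool that is built together with llama.cpp. A typical invocation looks something like the following (hedged: the file names here are placeholders and the exact usage is printed by running the tool without arguments):
$ ./quantize models/llama-2-7b-chat.f16.gguf models/llama-2-7b-chat.Q4_K_M.gguf Q4_K_M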
Lets take a look at the following line of code:
const auto tn = LLM_TN(LLM_ARCH_LLAMA);
switch (model.arch) {
case LLM_ARCH_LLAMA:
case LLM_ARCH_REFACT:
model.tok_embd = ml.create_tensor(ctx_input, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
Notice that we are calling an operator on tn, specifying an llm_tensor as the first argument:
enum llm_tensor {
LLM_TENSOR_TOKEN_EMBD,
...
}
And the second argument is a string.
And LLM_TN is defined as:
struct LLM_TN {
LLM_TN(llm_arch arch) : arch(arch) {}
llm_arch arch;
std::string operator()(llm_tensor tensor, const std::string & suffix) const {
return LLM_TENSOR_NAMES[arch].at(tensor) + "." + suffix;
}
...
};
In this case that would be like calling:
LLM_TENSOR_NAMES[LLM_ARCH_LLAMA].at(LLM_TENSOR_TOKEN_EMBD) + "." + "weight";
LLM_TENSOR_NAMES is defined as:
static std::map<llm_arch, std::map<llm_tensor, std::string>> LLM_TENSOR_NAMES = {
{
LLM_ARCH_LLAMA,
{
{ LLM_TENSOR_TOKEN_EMBD, "token_embd" },
{ LLM_TENSOR_OUTPUT_NORM, "output_norm" },
{ LLM_TENSOR_OUTPUT, "output" },
{ LLM_TENSOR_ROPE_FREQS, "rope_freqs" },
{ LLM_TENSOR_ATTN_NORM, "blk.%d.attn_norm" },
{ LLM_TENSOR_ATTN_Q, "blk.%d.attn_q" },
{ LLM_TENSOR_ATTN_K, "blk.%d.attn_k" },
{ LLM_TENSOR_ATTN_V, "blk.%d.attn_v" },
{ LLM_TENSOR_ATTN_OUT, "blk.%d.attn_output" },
{ LLM_TENSOR_ATTN_ROT_EMBD, "blk.%d.attn_rot_embd" },
{ LLM_TENSOR_FFN_GATE_INP, "blk.%d.ffn_gate_inp" },
{ LLM_TENSOR_FFN_NORM, "blk.%d.ffn_norm" },
{ LLM_TENSOR_FFN_GATE, "blk.%d.ffn_gate" },
{ LLM_TENSOR_FFN_DOWN, "blk.%d.ffn_down" },
{ LLM_TENSOR_FFN_UP, "blk.%d.ffn_up" },
{ LLM_TENSOR_FFN_GATE_EXP, "blk.%d.ffn_gate.%d" },
{ LLM_TENSOR_FFN_DOWN_EXP, "blk.%d.ffn_down.%d" },
{ LLM_TENSOR_FFN_UP_EXP, "blk.%d.ffn_up.%d" },
},
},
...
};
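Putting the pieces together, here is a small standalone sketch (my own, not llama.cpp code) of what the operator call effectively produces for a couple of tensors; the per-layer entries keep a %d placeholder which llama.cpp later formats with the layer index:
```c++
#include <cstdio>
#include <string>

int main() {
    // LLM_TENSOR_TOKEN_EMBD maps to "token_embd" in the table above
    std::string token_embd = std::string("token_embd") + "." + "weight";
    std::printf("%s\n", token_embd.c_str());     // token_embd.weight

    // per-layer names keep a %d placeholder filled in with the layer index
    std::string attn_q = std::string("blk.%d.attn_q") + "." + "weight";
    char name[64];
    std::snprintf(name, sizeof(name), attn_q.c_str(), 0);
    std::printf("%s\n", name);                   // blk.0.attn_q.weight
    return 0;
}
```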
So that call would return token_embd.weight. Two more examples of names built the same way:
"output_norm.weight"
"output.weight"
The build_llama function is used to build the computation graph for the llama model and is a member function of llm_build_context:
struct llm_build_context {
const llama_model & model;
llama_context & lctx;
const llama_hparams & hparams;
const llama_cparams & cparams;
const llama_batch & batch;
const llama_kv_cache & kv_self;
...
const llm_build_cb & cb;
...
struct ggml_cgraph * build_llama() {
...
}
...
}
Let's take a closer look at build_llama:
struct ggml_cgraph * build_llama() {
struct ggml_cgraph * gf = ggml_new_graph_custom(ctx0, LLAMA_MAX_NODES, false);
const int64_t n_embd_head = hparams.n_embd_head_v;
GGML_ASSERT(n_embd_head == hparams.n_embd_head_k);
GGML_ASSERT(n_embd_head == hparams.n_rot);
struct ggml_tensor * cur;
struct ggml_tensor * inpL;
inpL = llm_build_inp_embd(ctx0, lctx, hparams, batch, model.tok_embd, cb);
inpL I think is the input layer, and notice that this is created by calling llm_build_inp_embd. So this is a function that is used by multiple models to build this input layer. Note that a callback function is also passed in; this is a member of the llm_build_context and the callback is passed into the constructor. To see where this callback function is defined we have to back up to llama_build_graph where we have:
llm_build_cb cb = [&](struct ggml_tensor * cur, const char * name, int il) {
// The following is setting the tensor name and if this is not the
// first layer then use the layer index in the name.
if (il >= 0) {
ggml_format_name(cur, "%s-%d", name, il);
} else {
ggml_set_name(cur, name);
}
if (!lctx.cparams.offload_kqv) {
if (strcmp(name, "kqv_merged_cont") == 0) {
// all nodes between the KV store and the attention output are run on the CPU
ggml_backend_sched_set_tensor_backend(lctx.sched, cur, lctx.backend_cpu);
}
}
const bool full_offload = lctx.model.n_gpu_layers > (int)lctx.model.hparams.n_layer;
if (batch.n_tokens < 32 || full_offload) {
if (il != -1 && strcmp(name, "norm") == 0) {
for (auto * backend : lctx.backends) {
if (ggml_backend_buft_supports_backend(lctx.model.buft_layer[il].buft, backend)) {
ggml_backend_sched_set_tensor_backend(lctx.sched, cur, backend);
break;
}
}
}
}
};
struct llm_build_context llm(lctx, batch, cb, worst_case);
llm.init();
switch (model.arch) {
case LLM_ARCH_LLAMA:
{
result = llm.build_llama();
} break;
case LLM_ARCH_BAICHUAN:
So we can see that an llm_build_context is created using the llm_build_cb, and this is what is passed into llm_build_inp_embd:
inpL = llm_build_inp_embd(ctx0, lctx, hparams, batch, model.tok_embd, cb);
To see what batch is we have to back up the stack to llama_new_context_with_model:
int n_tokens = (int)std::min(cparams.n_ctx, cparams.n_ubatch);
int n_past = cparams.n_ctx - n_tokens;
llama_token token = llama_token_bos(&ctx->model); // not actually used by llama_build_graph, but required to choose between token and embedding inputs graph
ggml_cgraph * gf = llama_build_graph(*ctx, llama_batch_get_one(&token, n_tokens, n_past, 0), true);
Lets inspect some of the variables:
gdb) p n_tokens
$11 = 512
(gdb) p n_past
$12 = 0
(gdb) p token
$13 = 1
(gdb) p ctx->model->vocab.special_bos_id
$14 = 1
(gdb) p llama_batch_get_one(&token, n_tokens, n_past, 0)
$15 = {n_tokens = 512, token = 0x7fffffff655c, embd = 0x0, pos = 0x0, n_seq_id = 0x0, seq_id = 0x0,
logits = 0x0, all_pos_0 = 0, all_pos_1 = 1, all_seq_id = 0}
n_past is the position in the sequence.
if (batch.token) {
lctx.inp_tokens = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, batch.n_tokens);
cb(lctx.inp_tokens, "inp_tokens", -1);
ggml_set_input(lctx.inp_tokens);
inpL = ggml_get_rows(ctx, tok_embd, lctx.inp_tokens);
Note that the 1d tensor created is a list of i32 and the size in this case is
512. If we think about the input to the model, it is a sequence of tokens which
are the indexes into the vocabulary. So this tensor is a list of tokens.
And the callback will set the name of the tensor to inp_tokens.
Now, the call to ggml_get_rows was something I've not come across before and I needed to look it up. I've created a standalone example of how this function can be used, which is in get-rows.c. What it is doing is extracting rows from tok_embd, and the rows to extract are specified by the lctx.inp_tokens tensor, which acts like a list of indices.
(gdb) p ggml_n_dims(tok_embd)
$38 = 2
(gdb) p tok_embd->ne[0]
$39 = 4096
(gdb) p tok_embd->ne[1]
$40 = 32000
And I find visualizing it like this helps me understand it better:
          ne[0] = 4096 (features per row)  ->
 ne[1]   +---------------------------------+
   |     | row 0                           |
 32000   | row 1                           |
   ↓     | ...                             |
         | row 31999                       |
         +---------------------------------+
And then we have the tensor with the indices:
(gdb) p ggml_n_dims(lctx.inp_tokens)
$43 = 1
(gdb) p lctx.inp_tokens.ne[0]
$44 = 512
But keep in mind that this is just building up the computation graph so that
there are no actual values in the tensors yet, at least not in inp_tokens.
So this makes sense now I think: we have the input tokens, which is a list of indices into the vocabulary. The vocabulary in llama has 32000 tokens and each token has an embedding dimension of 4096. What the get rows operation is doing is extracting the embeddings for each token in the input.
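For example (with hypothetical values), if inp_tokens ended up holding the token ids [5, 42, 7], ggml_get_rows would pick rows 5, 42 and 7 of the 4096x32000 tok_embd tensor and produce a 4096x3 tensor, one embedding per input token; with the full 512-token input here the result is 4096x512.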
After this we have:
inpL = llm_build_inp_embd(ctx0, lctx, hparams, batch, model.tok_embd, cb);
// inp_pos - contains the positions
struct ggml_tensor * inp_pos = build_inp_pos();
struct ggml_tensor * build_inp_pos() {
lctx.inp_pos = ggml_new_tensor_1d(ctx0, GGML_TYPE_I32, n_tokens);
cb(lctx.inp_pos, "inp_pos", -1);
ggml_set_input(lctx.inp_pos);
return lctx.inp_pos;
}
And inp_pos is a tensor of i32 with the size of n_tokens, which is 512, and these are the positional encoding values.
Next we have the KQ_mask:
// KQ_mask (mask for 1 head, it will be broadcasted to all heads)
struct ggml_tensor * KQ_mask = build_inp_KQ_mask();
So this is about the Key and Query masking, which is there to prevent the model from considering tokens in the future, which would otherwise be possible since all tokens of the prompt are processed at once.
struct ggml_tensor * build_inp_KQ_mask(bool causal = true) {
if (causal) {
lctx.inp_KQ_mask = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_kv, n_tokens);
} else {
lctx.inp_KQ_mask = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_tokens, n_tokens);
}
cb(lctx.inp_KQ_mask, "KQ_mask", -1);
ggml_set_input(lctx.inp_KQ_mask);
return lctx.inp_KQ_mask;
}
The term "causal" in this context refers to the model's ability to generate or
process data in a sequence where the generation of each piece of data depends
only on the previously generated or processed data. It's about establishing a
cause-and-effect relationship in the sequence of data, where each token (in the
case of language models) is predicted based on the tokens that precede it.
With that in mind, and noticing that causal has a default value of true, this will create a new 2d tensor of type f32 with the size n_kv by n_tokens:
(gdb) p n_kv
$49 = 512
(gdb) p n_tokens
$50 = 512
So this will be a 512x512 tensor. The name of this tensor will then be set to "KQ_mask". Notice that ggml_set_input is called on this tensor as well:
void ggml_set_input(struct ggml_tensor * tensor) {
tensor->flags |= GGML_TENSOR_FLAG_INPUT;
}
enum ggml_tensor_flag {
GGML_TENSOR_FLAG_INPUT = 1,
GGML_TENSOR_FLAG_OUTPUT = 2,
GGML_TENSOR_FLAG_PARAM = 4,
};
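The actual mask values are filled in later when the input tensors are set. Conceptually, for causal attention the mask holds 0 for positions a query token is allowed to attend to (itself and earlier positions) and -infinity for future positions, so those scores become zero after the softmax. A minimal sketch of building such a mask for the simple case where kv positions coincide with token indices (my own code, not the llama.cpp implementation):
```c++
#include <cmath>
#include <vector>

// Build an n_tokens x n_kv causal mask: row i (the i:th query token) may
// attend to positions 0..i, every later position gets -INFINITY.
std::vector<float> build_causal_mask(int n_kv, int n_tokens) {
    std::vector<float> mask((size_t) n_kv * n_tokens, 0.0f);
    for (int i = 0; i < n_tokens; i++) {
        for (int j = 0; j < n_kv; j++) {
            if (j > i) {
                mask[(size_t) i * n_kv + j] = -INFINITY;
            }
        }
    }
    return mask;
}
```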
After that we will be back in the build_llama function and we are going to loop over all the n_layer layers:
for (int il = 0; il < n_layer; ++il) {
struct ggml_tensor * inpSA = inpL;
(gdb) p n_layer
$53 = 32
And recall that inpL is the input embeddings tensor (512 (rows/tokens) x 4096 (columns/features)). So there is a context size of 512 and each token has an embedding size of 4096.
And the first thing that happens is that a new pointer to inpL is saved in inpSA, which I think might stand for "input to self-attention". This is later used to build the feed forward/residual layer after the self attention.
So the example has a default context of 512 for the context size which can be seen in common.h:
struct gpt_params {
...
int32_t n_ctx = 512; // context size
...
And the embedding size, sometimes called the hidden size, is 4096.
Next we have:
cur = llm_build_norm(ctx0, inpL, hparams,
model.layers[il].attn_norm, NULL,
LLM_NORM_RMS, cb, il);
cb(cur, "attn_norm", il);
static struct ggml_tensor * llm_build_norm(
struct ggml_context * ctx,
struct ggml_tensor * cur, // X (input)
const llama_hparams & hparams,
struct ggml_tensor * mw, // weights
struct ggml_tensor * mb, // biases
llm_norm_type type,
const llm_build_cb & cb,
int il) {
switch (type) {
case LLM_NORM: cur = ggml_norm (ctx, cur, hparams.f_norm_eps); break;
case LLM_NORM_RMS: cur = ggml_rms_norm(ctx, cur, hparams.f_norm_rms_eps); break;
}
In this case type is LLM_NORM_RMS
which is a Root Mean Squared normalization.
Just to remind ourselves what RMS normalization is, lets take a look at an example:
token matrix normalized matrix
f₁ f₂ f₃ f₁ f₂ f₃
+-----------+ +-----+ +-----------+
token 1 |v₁ |v₂ |v₃ | |μ₁|σ₁| |z₁ |z₂ |z₃ |
token 2 | | | | | | | | | | |
token 3 | | | | | | | | | | |
token 4 | | | | | | | | | | |
token 5 | | | | | | | | | | |
+-----------+ +-----+ +-----------+
So we want to normalize the values by scaling them by the root mean square, in contrast to standard normalization where we would subtract the mean and divide by the standard deviation (giving a mean of 0 and a std of 1). TODO: link to notes about standard normalization.
The middle matrix is just intended to clarify that each row is normalized independently. The mean and standard deviation are calculated as follows (for the first token only):
token 1:
v₁ = 4
v₂ = 6
v₃ = 8

μ        = (4 + 6 + 8) / 3 = 6
(4 - 6)² = 4
(6 - 6)² = 0
(8 - 6)² = 4
variance = (4 + 0 + 4) / 3 = 8/3
σ        = √(8/3) ≈ 1.63
Now for each of the values (v₁, v₂, v₃) we will pass them to RMSNorm:

                 xᵢ
RMSNorm(xᵢ) =  -----
                RMS

RMS = √(1/3 (4² + 6² + 8²))
    = √(1/3 (16 + 36 + 64))
    = √(116/3)
    ≈ √38.67
    ≈ 6.22

z₁ = 4/6.22 = 0.64
z₂ = 6/6.22 = 0.96
z₃ = 8/6.22 = 1.29
So we apply the normalization per token/row: we compute the RMS over each token's features once per row and then simply divide each feature by that value, which is the scaling part.
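A minimal sketch of RMS normalization for a single row of features, matching the numbers above (my own illustration; adding an eps term inside the square root, as done here, is an assumption about how the real kernel guards against division by zero):

#include <math.h>
#include <stdio.h>

static void rms_norm(const float * x, float * out, int n, float eps) {
    float sum_sq = 0.0f;
    for (int i = 0; i < n; i++) {
        sum_sq += x[i] * x[i];
    }
    const float rms = sqrtf(sum_sq / n + eps);   /* root mean square of the row */
    for (int i = 0; i < n; i++) {
        out[i] = x[i] / rms;
    }
}

int main(void) {
    float x[3] = {4.0f, 6.0f, 8.0f};
    float z[3];
    rms_norm(x, z, 3, 1e-6f);
    printf("%.2f %.2f %.2f\n", z[0], z[1], z[2]);   /* ~0.64 0.96 1.29 */
    return 0;
}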
With that in mind let's take a look at:
case LLM_NORM_RMS: cur = ggml_rms_norm(ctx, cur, hparams.f_norm_rms_eps); break;
The input to this function is the cur
tensor which is currently (in the first
iteration) the input embeddings tensor:
(gdb) p *cur
$64 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512, 1, 1},
nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_GET_ROWS, op_params = {0 <repeats 16 times>},
flags = 0, grad = 0x0, src = {0xb48d50, 0x7ffef1c40180, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0,
name = "inp_embd", '\000' <repeats 55 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
struct ggml_tensor * ggml_rms_norm(
struct ggml_context * ctx,
struct ggml_tensor * a,
float eps) {
return ggml_rms_norm_impl(ctx, a, eps, false);
}
static struct ggml_tensor * ggml_rms_norm_impl(
struct ggml_context * ctx,
struct ggml_tensor * a,
float eps,
bool inplace) {
bool is_node = false;
if (!inplace && (a->grad)) {
is_node = true;
}
struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);
ggml_set_op_params(result, &eps, sizeof(eps));
result->op = GGML_OP_RMS_NORM;
result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
result->src[0] = a;
return result;
}
inplace is false, which we can see is passed in from ggml_rms_norm, so this will call ggml_dup_tensor. The result tensor will look like this after ggml_dup_tensor:
(gdb) p *result
$66 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512, 1, 1},
nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>},
flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0,
perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0,
name = '\000' <repeats 63 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}
Next we are going to set a parameter of the operation (GGML_OP_RMS_NORM) on this tensor:
ggml_set_op_params(result, &eps, sizeof(eps));
(gdb) p eps
$67 = 9.99999997e-07
(gdb) p sizeof(eps)
$68 = 4
And apart from an assert and a check (not shown) the above values are set on the tensor:
static void ggml_set_op_params(struct ggml_tensor * tensor, const void * params, size_t params_size) {
memcpy(tensor->op_params, params, params_size);
}
op_params
is defined as:
int32_t op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)];
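op_params is just raw storage, so a float like eps can be stored into it and read back bit-for-bit with memcpy. A small standalone sketch of the idea (the value 64 for GGML_MAX_OP_PARAMS is an assumption that matches the 16 int32 slots seen in the gdb dumps):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define GGML_MAX_OP_PARAMS 64   /* assumed: 16 int32 slots as seen in the dumps */

int main(void) {
    int32_t op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)] = {0};

    /* what ggml_set_op_params does for GGML_OP_RMS_NORM */
    float eps = 1e-6f;
    memcpy(op_params, &eps, sizeof(eps));

    /* what the compute side does when it needs eps back */
    float eps_back;
    memcpy(&eps_back, op_params, sizeof(eps_back));
    printf("eps = %g\n", eps_back);
    return 0;
}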
Then we set the type of operation, set the source of this tensor to the input tensor:
result->op = GGML_OP_RMS_NORM;
result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
result->src[0] = a;
So when a computation graph is computed this tensor will be the output of the RMSNorm and the source of this tensor is the input tensor. And this operation will take epsilon as a parameter. The resulting tensor will be the normalized input tensor.
So that returns us to llm_build_norm:
if (mw || mb) {
cb(cur, "norm", il);
}
In this case mb, which I think is the bias, is null:
(gdb) p mb
$77 = (ggml_tensor *) 0x0
And mw, which I think is the weights, is:
$79 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x799d80, ne = {4096, 1, 1,
1}, nb = {4, 16384, 16384, 16384}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>},
flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0,
perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x7fff0a704ec0,
name = "blk.0.attn_norm.weight", '\000' <repeats 41 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
So cur's name will be set to norm-0 in this case.
Now, as part of the normalization, the values are also scaled by a learned parameter called gamma (the weight) and shifted by beta (the bias):
γ = 1.5, β = 0.5
z₁ = (4/6.22) γ + β = 1.46
z₂ = (6/6.22) γ + β = 1.95
z₃ = (8/6.22) γ + β = 2.43
And these values would be in a tensor of size 4096x1, one for each feature. And this is what mw and mb are for:
if (mw) {
cur = ggml_mul(ctx, cur, mw);
if (mb) {
cb(cur, "norm_w", il);
}
}
if (mb) {
cur = ggml_add(ctx, cur, mb);
}
return cur;
So if the weights are not null then we multiply the tensor by the weights. The next line I find confusing as it is setting the name of the output tensor to norm_w-0 only if the bias is not null, and otherwise it has no name:
(gdb) p *cur
$88 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512, 1, 1},
nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_MUL, op_params = {0 <repeats 16 times>},
flags = 0, grad = 0x0, src = {0x7ffef1c407c0, 0xb49200, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0,
name = '\000' <repeats 63 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}
So why would we name this norm_w only if there is also a bias tensor? Perhaps this is done to indicate that a bias will be added later, but why not set the name in the later if block? Ah, but the weights matrix might be null, in which case the current block would not have been executed but the bias block would (assuming the bias is not null). So that does make sense now.
In this case, since mb is null, no addition will be performed when the computation is executed, so no operation for the addition is added either. After that we are done in llm_build_norm and back in llm_build_context::build_llama:
// norm
cur = llm_build_norm(ctx0, inpL, hparams,
model.layers[il].attn_norm, NULL,
LLM_NORM_RMS, cb, il);
cb(cur, "attn_norm", il);
And here we can see that the returned tensor is getting the name attn_norm-0
.
Next, we have the self attention block for the current layer:
// self-attention
{
// compute Q and K and RoPE them
struct ggml_tensor * Qcur = ggml_mul_mat(ctx0, model.layers[il].wq, cur);
cb(Qcur, "Qcur", il);
if (model.layers[il].bq) {
Qcur = ggml_add(ctx0, Qcur, model.layers[il].bq);
cb(Qcur, "Qcur", il);
}
Let's take a look at model.layers[il].wq first:
(gdb) p *model.layers[il].wq
$90 = {type = GGML_TYPE_Q4_0, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x799d80, ne = {4096, 4096,
1, 1}, nb = {18, 2304, 9437184, 9437184}, op = GGML_OP_NONE, op_params = {
0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0,
data = 0x7fff1019cec0, name = "blk.0.attn_q.weight", '\000' <repeats 44 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
And we are in the first attention block of 32:
(gdb) p model.layers.size()
$93 = 32
And the first operation is a matrix multiplication of the input and the query matrix. cur in this case is the output from the normalization layer that we went through earlier. The output tensor will be named Qcur-0:
(gdb) p *Qcur
$107 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512, 1,
1}, nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_MUL_MAT, op_params = {
0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0xb49390, 0x7ffef1c40950, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0,
view_offs = 0, data = 0x0, name = "Qcur-0", '\000' <repeats 57 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
Next we are going to do the same for the key matrix and also add a bias if it is not null:
struct ggml_tensor * Kcur = ggml_mul_mat(ctx0, model.layers[il].wk, cur);
cb(Kcur, "Kcur", il);
if (model.layers[il].bk) {
Kcur = ggml_add(ctx0, Kcur, model.layers[il].bk);
cb(Kcur, "Kcur", il);
}
Then the same is done for the value matrix:
struct ggml_tensor * Vcur = ggml_mul_mat(ctx0, model.layers[il].wv, cur);
cb(Vcur, "Vcur", il);
if (model.layers[il].bv) {
Vcur = ggml_add(ctx0, Vcur, model.layers[il].bv);
cb(Vcur, "Vcur", il);
}
After that we have the following which is related to rotary positional embeddings. Recall that RoPE is applied in each attention block, unlike the original transformer where absolute positional encodings were added to the input embeddings.
Qcur = ggml_rope_custom(
ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos,
n_rot, rope_type, 0, n_orig_ctx, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow
);
cb(Qcur, "Qcur", il);
Let's start by taking a look at ggml_reshape_3d. So this is going to take the Qcur tensor:
(gdb) p *Qcur
$110 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512, 1,
1}, nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_MUL_MAT, op_params = {
0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0xb49390, 0x7ffef1c40950, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0,
view_offs = 0, data = 0x0, name = "Qcur-0", '\000' <repeats 57 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
(gdb) p ggml_n_dims(Qcur)
$111 = 2
(gdb) p n_embd_head
$112 = 128
(gdb) p n_head
$113 = 32
(gdb) p n_tokens
$114 = 512
So reshape will take the 2d tensor and reshape it to a 3d tensor with the following dimensions, x=128, y=32, z=512:
ggml_reshape_3d(ctx0, Qcur, 128, 32, 512)
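The reshape does not move any data; it just reinterprets each token's 4096 features as 32 heads of 128 features. A small standalone sketch (my own illustration) of the index mapping:

#include <stdio.h>

int main(void) {
    const int n_embd_head = 128;   /* features per head */
    const int n_head      = 32;    /* number of heads   */

    /* a flat feature index f in [0, 4096) for some token maps to: */
    for (int f = 0; f < n_embd_head * n_head; f += 1000) {
        int head         = f / n_embd_head;   /* second dimension after the reshape */
        int feat_in_head = f % n_embd_head;   /* first dimension after the reshape  */
        printf("feature %4d -> head %2d, index %3d within the head\n", f, head, feat_in_head);
    }
    return 0;
}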
And that reshaped tensor will be passed as the second argument to ggml_rope_custom:
struct ggml_tensor * ggml_rope_custom(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
int n_dims,
int mode,
int n_ctx,
int n_orig_ctx,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow) {
return ggml_rope_impl(
ctx, a, b, n_dims, mode, n_ctx, n_orig_ctx, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow, 0.0f, false, false
);
}
b is a tensor of i32 with a size of 512 and it holds the token positions, from which the rotation angle for each position is calculated:
(gdb) p *b
$117 = {type = GGML_TYPE_I32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {512, 1, 1, 1},
nb = {4, 2048, 2048, 2048}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 1,
grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0,
perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0,
name = "inp_pos", '\000' <repeats 56 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
The rest of the values can be seen below:
(gdb) s
ggml_rope_custom (ctx=0x747a28 <g_state+200>, a=0x7ffef1c40f90, b=0x7ffef1c404a0, n_dims=128,
mode=0, n_ctx=0, n_orig_ctx=4096, freq_base=10000, freq_scale=1, ext_factor=0, attn_factor=1,
beta_fast=32, beta_slow=1) at ggml.c:5518
mode
is documented as:
// rotary position embedding
// if mode & 1 == 1, skip n_past elements (DEPRECATED)
// if mode & 2 == 1, GPT-NeoX style
// if mode & 4 == 1, ChatGLM style
//
// b is an int32 vector with size a->ne[2], it contains the positions
GGML_API struct ggml_tensor * ggml_rope(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
int n_dims,
int mode,
int n_ctx);
ggml_rope_custom
delegates to ggml_rope_impl
:
5504 struct ggml_tensor * ggml_rope_custom(
5505 struct ggml_context * ctx,
5506 struct ggml_tensor * a,
5507 struct ggml_tensor * b,
5508 int n_dims,
5509 int mode,
5510 int n_ctx,
5511 int n_orig_ctx,
5512 float freq_base,
5513 float freq_scale,
5514 float ext_factor,
5515 float attn_factor,
5516 float beta_fast,
5517 float beta_slow) {
5518 return ggml_rope_impl(
5519 ctx, a, b, n_dims, mode, n_ctx, n_orig_ctx, freq_base, freq_scale,
5520 ext_factor, attn_factor, beta_fast, beta_slow, 0.0f, false, false
5521 );
5522 }
static struct ggml_tensor * ggml_rope_impl(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * b,
int n_dims,
int mode,
int n_ctx,
int n_orig_ctx,
float freq_base,
float freq_scale,
float ext_factor,
float attn_factor,
float beta_fast,
float beta_slow,
float xpos_base,
bool xpos_down,
bool inplace) {
bool is_node = false;
if (a->grad) {
is_node = true;
}
struct ggml_tensor * result = inplace ? ggml_view_tensor(ctx, a) : ggml_dup_tensor(ctx, a);
int32_t params[13] = { /*n_past*/ 0, n_dims, mode, n_ctx, n_orig_ctx };
memcpy(params + 5, &freq_base, sizeof(float));
memcpy(params + 6, &freq_scale, sizeof(float));
memcpy(params + 7, &ext_factor, sizeof(float));
memcpy(params + 8, &attn_factor, sizeof(float));
memcpy(params + 9, &beta_fast, sizeof(float));
memcpy(params + 10, &beta_slow, sizeof(float));
memcpy(params + 11, &xpos_base, sizeof(float));
memcpy(params + 12, &xpos_down, sizeof(bool));
ggml_set_op_params(result, params, sizeof(params));
result->op = GGML_OP_ROPE;
result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
result->src[0] = a;
result->src[1] = b;
return result;
}
This is starting to look familiar: this is duplicating the tensor that was reshaped previously and then setting the operation parameters and the operation type. Alright, there are quite a few parameters here, let's take a look at them.
rope.md contains a detailed explanation of the RoPE operation and the angles are calculated as follows:
Θ = { θᵢ = 10000^(-2(i-1)/d), i ∈ {1, 2, ..., d/2} }
Now, I think that 10000 is the freq_base parameter in llama.cpp: freq_base is the base of the exponential function, determining the starting point of the positional frequencies.
A more negative freq_scale (e.g., -2) results in a steeper decay. This means the theta values decrease more rapidly as you move across the dimensions. This setting emphasizes positional encoding more strongly in the lower dimensions.
A less negative freq_scale (e.g., -0.5) shows a slower decay rate. The theta values decrease more gradually, indicating a more uniform distribution of positional information across dimensions.
-2: ['1.00000', '0.74989', '0.56234', '0.42170', '0.31623', '0.23714', '0.17783', '0.13335', '0.10000', '0.07499', '0.05623', '0.04217', '0.03162', '0.02371', '0.01778', '0.01334', '0.01000', '0.00750', '0.00562', '0.00422', '0.00316', '0.00237', '0.00178', '0.00133', '0.00100', '0.00075', '0.00056', '0.00042', '0.00032', '0.00024', '0.00018', '0.00013']
-1: ['1.00000', '0.86596', '0.74989', '0.64938', '0.56234', '0.48697', '0.42170', '0.36517', '0.31623', '0.27384', '0.23714', '0.20535', '0.17783', '0.15399', '0.13335', '0.11548', '0.10000', '0.08660', '0.07499', '0.06494', '0.05623', '0.04870', '0.04217', '0.03652', '0.03162', '0.02738', '0.02371', '0.02054', '0.01778', '0.01540', '0.01334', '0.01155']
-0.5: ['1.00000', '0.93057', '0.86596', '0.80584', '0.74989', '0.69783', '0.64938', '0.60430', '0.56234', '0.52330', '0.48697', '0.45316', '0.42170', '0.39242', '0.36517', '0.33982', '0.31623', '0.29427', '0.27384', '0.25483', '0.23714', '0.22067', '0.20535', '0.19110', '0.17783', '0.16548', '0.15399', '0.14330', '0.13335', '0.12409', '0.11548', '0.10746']
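The lists above can be reproduced with a small sketch (my own, not code from llama.cpp) that varies the scale in the exponent for a head dimension of 128:

#include <math.h>
#include <stdio.h>

int main(void) {
    const float base     = 10000.0f;  /* freq_base */
    const int   d        = 128;       /* head dimension, so d/2 = 64 angle pairs */
    const float scales[] = {-2.0f, -1.0f, -0.5f};

    for (int s = 0; s < 3; s++) {
        printf("%5.1f: ", scales[s]);
        for (int i = 0; i < 32; i++) {
            /* theta_i = base^(scale * i * 2 / d); scale = -2 is the standard RoPE value */
            float theta = powf(base, scales[s] * (float)i * 2.0f / (float)d);
            printf("%.5f ", theta);
        }
        printf("\n");
    }
    return 0;
}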
So that was one of the parameters, what are the rest for?
TODO: complete the following list once I've sorted out where these parameters come from (RoPE, YARN, XPos, etc.)
freq_base    This is from RoPE and is the base frequency, which is 10000 in the paper
freq_scale   From YARN and is the scaling factor for the angles
ext_factor   From YARN and is the extrapolation factor
attn_factor  From XPos?
beta_fast
beta_slow
xpos_base
xpos_down
I think that freq_scale is actually for YARN scaling and not, like I wrote above, a parameter of the angle calculation.
Imagine a model trained up to a context length of 512 tokens, and you wish to extend its capabilities to handle up to 1024 tokens. A blending range might be set from 400 to 600 tokens. In this range:
Positions closer to 400 would use predominantly interpolated embeddings, as they're closer to the trained range. As positions move towards 600, there's an increasing reliance on extrapolated embeddings. Beyond 600 tokens, the model uses purely extrapolated embeddings for positional information. The parameters beta_fast and beta_slow control the blending of interpolated and extrapolated embeddings.
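A sketch of that kind of blending, assuming a simple linear ramp between the two ends of the range (this only illustrates the idea, it is not the exact YaRN formula used in ggml):

#include <stdio.h>

/* 0.0 inside the trained range, 1.0 beyond the blend range, linear in between */
static float ramp(float pos, float lo, float hi) {
    if (pos <= lo) return 0.0f;
    if (pos >= hi) return 1.0f;
    return (pos - lo) / (hi - lo);
}

int main(void) {
    const float lo = 400.0f, hi = 600.0f;   /* the blending range from the example above */
    for (float pos = 300.0f; pos <= 700.0f; pos += 100.0f) {
        float t = ramp(pos, lo, hi);
        /* t = 0 -> purely interpolated, t = 1 -> purely extrapolated */
        printf("pos %4.0f: %.2f extrapolated, %.2f interpolated\n", pos, t, 1.0f - t);
    }
    return 0;
}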
The attn_factor
is used to adjust the overall strength or influence of the
positional embeddings on the self-attention mechanism within the transformer
model. Essentially, it acts as a scaling factor that can amplify or diminish the
effect that positional information has on how the model weighs and integrates
different parts of the input sequence.
A higher attn_factor means that the positional embeddings will have a more
pronounced impact on the attention scores, potentially making the model more
sensitive to the sequence order and the relative distances between tokens.
When extending the context length beyond what the model was originally trained
for, adjusting the attn_factor can help the model better accommodate and make
use of the extended positional embeddings, ensuring that the attention mechanism
remains effective over longer sequences.
That leaves xpos_base
and xpos_down
.
I'm currently in:
(gdb) bt
#0 ggml_rope_impl (ctx=0x747a28 <g_state+200>, a=0x7ffef1c40f90, b=0x7ffef1c404a0, n_dims=128,
mode=0, n_ctx=0, n_orig_ctx=4096, freq_base=10000, freq_scale=1, ext_factor=0, attn_factor=1,
beta_fast=32, beta_slow=1, xpos_base=0, xpos_down=false, inplace=false) at ggml.c:5461
#1 0x0000000000417bd6 in ggml_rope_custom (ctx=0x747a28 <g_state+200>, a=0x7ffef1c40f90,
b=0x7ffef1c404a0, n_dims=128, mode=0, n_ctx=0, n_orig_ctx=4096, freq_base=10000, freq_scale=1,
ext_factor=0, attn_factor=1, beta_fast=32, beta_slow=1) at ggml.c:5518
#2 0x00000000004b3274 in llm_build_context::build_llama (this=0x7fffffff63e0) at llama.cpp:5875
#3 0x0000000000485e7f in llama_build_graph (lctx=..., batch=..., worst_case=true) at llama.cpp:8641
#4 0x0000000000495385 in llama_new_context_with_model (model=0x797fd0, params=...)
at llama.cpp:13290
#5 0x000000000052f81f in llama_init_from_gpt_params (params=...) at common/common.cpp:1877
#6 0x00000000005c0ad9 in main (argc=5, argv=0x7fffffffb318) at examples/main/main.cpp:199
(gdb) br
Breakpoint 4 at 0x417930: file ggml.c, line 5461.
(gdb) save breakpoints brs.txt
Saved to file 'brs.txt'.
To restart where I left off:
$ gdb --args ./main -m models/llama-2-7b-chat.Q4_0.gguf -p "What is your name?"
(gdb) source brs.txt
So we were in build_llama in llama.cpp:
Qcur = ggml_rope_custom(
ctx0, ggml_reshape_3d(ctx0, Qcur, n_embd_head, n_head, n_tokens), inp_pos,
n_rot, rope_type, 0, n_orig_ctx, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow
);
cb(Qcur, "Qcur", il);
So the tensor a is the reshaped tensor Qcur and b is the tensor inp_pos.
So Qcur is a 512x4096 matrix, but notice that it will be reshaped before being passed into ggml_rope_custom.
(gdb) p *Qcur
$4 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512, 1, 1},
nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_MUL_MAT, op_params = {0 <repeats 16 times>},
flags = 0, grad = 0x0, src = {0xb49390, 0x7ffef1c40950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0,
name = "Qcur-0", '\000' <repeats 57 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}
And a will then look like:
(gdb) p *a
$6 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {128, 32, 512, 1},
nb = {4, 512, 16384, 8388608}, op = GGML_OP_RESHAPE, op_params = {0 <repeats 16 times>}, flags = 0,
grad = 0x0, src = {0x7ffef1c40ae0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0,
perf_cycles = 0, perf_time_us = 0, view_src = 0x7ffef1c40ae0, view_offs = 0, data = 0x0,
name = "Qcur-0 (reshaped)", '\000' <repeats 46 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
Now I think this is done because multihead attention is used and there are 32 heads.
The rest of this function does the same as what we have seen before: the operation is set to GGML_OP_ROPE, the source tensors are set to a and b, and finally the result is returned.
Back in build_llama
the name of the tensor will be set:
cb(Qcur, "Qcur", il);
Then the same thing is done for the Key matrix, which is pretty much what we expected:
Kcur = ggml_rope_custom(
ctx0, ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens), inp_pos,
n_rot, rope_type, 0, n_orig_ctx, freq_base, freq_scale,
ext_factor, attn_factor, beta_fast, beta_slow
);
cb(Kcur, "Kcur", il);
After this we have the Key-Value cache, and it might be useful to re-read the key-value cache notes for a refresher before this section:
cur = llm_build_kv(ctx0, model, hparams, kv_self, gf,
model.layers[il].wo, model.layers[il].bo,
Kcur, Vcur, Qcur, KQ_mask, nullptr, n_ctx, n_tokens, kv_head, n_kv, 1.0f/sqrtf(float(n_embd_head)), cb, il);
Lets take a look at llm_build_kv
:
static struct ggml_tensor * llm_build_kv(
struct ggml_context * ctx,
const llama_model & model,
const llama_hparams & hparams,
const llama_kv_cache & kv,
struct ggml_cgraph * graph,
struct ggml_tensor * wo,
struct ggml_tensor * wo_b,
struct ggml_tensor * k_cur,
struct ggml_tensor * v_cur,
struct ggml_tensor * q_cur,
struct ggml_tensor * kq_mask,
struct ggml_tensor * kq_pos,
int64_t n_ctx,
int32_t n_tokens,
int32_t kv_head,
int32_t n_kv,
float kq_scale,
const llm_build_cb & cb,
int il) {
Notice that the wo
is taken from the model:
(gdb) p *model.layers[il].wo
$12 = {type = GGML_TYPE_Q4_0, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x799d80, ne = {4096, 4096, 1,
1}, nb = {18, 2304, 9437184, 9437184}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>},
flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0,
perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x7fff0f89cec0,
name = "blk.0.attn_output.weight", '\000' <repeats 39 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
Notice that the name of this tensor is blk.0.attn_output.weight and there will be 32 of these. In this case bo, the bias tensor, is NULL.
And k_cur
, q_cur
are the tensors that we saw above.
(gdb) p *q_cur
$18 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {128, 32, 512, 1},
nb = {4, 512, 16384, 8388608}, op = GGML_OP_ROPE, op_params = {0, 128, 0, 0, 4096, 1176256512,
1065353216, 0, 1065353216, 1107296256, 1065353216, 0, 0, 0, 0, 0}, flags = 0, grad = 0x0, src = {
0x7ffef1c40f90, 0x7ffef1c404a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0,
perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0,
name = "Qcur-0", '\000' <repeats 57 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}
(gdb) p *k_cur
$19 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {128, 32, 512, 1},
nb = {4, 512, 16384, 8388608}, op = GGML_OP_ROPE, op_params = {0, 128, 0, 0, 4096, 1176256512,
1065353216, 0, 1065353216, 1107296256, 1065353216, 0, 0, 0, 0, 0}, flags = 0, grad = 0x0, src = {
0x7ffef1c412b0, 0x7ffef1c404a0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0,
perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0,
name = "Kcur-0", '\000' <repeats 57 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}
Looking at the dimensions of these tensors we can see they are 128x32x512, and this is the order in which dimensions are stored in a ggml tensor, so we first have the x-axis, then the y-axis, and then the z-axis.
↑ .
↑| .
↑|| .
||| 512
32 ||| .
(heads) ||| .
||| .
||------------------------->
|------------------------>
------------------------->
128
(features)
So we have 32 heads, and the per-head embedding dimension is 128; recall that we split the features, which are originally 4096, into 32 heads containing 128 features each. And the context length is 512, the number of token embeddings.
And v_cur
looks like this:
(gdb) p *v_cur
$21 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512, 1, 1},
nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_MUL_MAT, op_params = {0 <repeats 16 times>},
flags = 0, grad = 0x0, src = {0xb496b0, 0x7ffef1c40950, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0,
name = "Vcur-0", '\000' <repeats 57 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}
And we can see that this is a 512x4096 matrix (x-axis 4096, y-axis 512); these are the tokens in the sequence and the token embeddings for them.
And the kq_mask looks like this:
(gdb) p *kq_mask
$23 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {512, 512, 1, 1},
nb = {4, 2048, 1048576, 1048576}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 1,
grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0,
perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0,
name = "KQ_mask", '\000' <repeats 56 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}
Recall that the context length in this case is 512, which makes the attention score matrix a 512x512 matrix. And this matrix allows for masking out tokens that come later in the sequence. I've written about this before and should add a link here. TODO: add link.
kq_pos
is NULL in this case.
(gdb) p n_ctx
$25 = 512
(gdb) p n_tokens
$26 = 512
(gdb) p kv_head
$27 = 0
(gdb) p n_kv
$28 = 512
(gdb) p kq_scale
$29 = 0.0883883461
So the first thing that happens in llm_build_kv
is:
ggml_build_forward_expand(graph, q_cur);
(gdb) p hparams.n_embd_k_gqa()
$33 = 4096
uint32_t n_embd_k_gqa() const { // dimension of key embeddings across all k-v heads
return n_embd_head_k * n_head_kv;
}
(gdb) p hparams.n_embd_v_gqa()
$34 = 4096
struct ggml_tensor * v_cur_t = ggml_transpose(ctx, ggml_reshape_2d(ctx, v_cur, n_embd_v_gqa, n_tokens));
So we have the value matrix which is a 512x4096 matrix.
(gdb) p v_cur.ne
$41 = {4096, 512, 1, 1}
(gdb) p ggml_n_dims(v_cur)
$39 = 2
And we are going to reshape this to a 4096x512 matrix? Let's call this and see what we get back:
(gdb) p *ggml_reshape_2d(ctx, v_cur, n_embd_v_gqa, n_tokens)
$44 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512,
1, 1}, nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_RESHAPE, op_params = {
0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x7ffef1c40e00, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x7ffef1c40e00, view_offs = 0, data = 0x0,
name = "Vcur-0 (reshaped)", '\000' <repeats 46 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
And comparing this to the original v_cur
tensor:
(gdb) p *v_cur
$46 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 512,
1, 1}, nb = {4, 16384, 8388608, 8388608}, op = GGML_OP_MUL_MAT, op_params = {
0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0xb496b0, 0x7ffef1c40950, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x0, view_offs = 0, data = 0x0, name = "Vcur-0", '\000' <repeats 57 times>,
extra = 0x0, padding = "\000\000\000\000\000\000\000"}
Now the op is set to GGML_OP_RESHAPE and the name is Vcur-0 (reshaped), but other than that I can't really see any difference. Actually, there is also a difference in the src and view_src fields.
There is also a comment further down in the code asking if this reshaping is really needed.
//struct ggml_tensor * v_cur_t = ggml_transpose(ctx, v_cur); // TODO: reshape above is likely not needed
I've opened a pull request to suggest changing this which was later accepted.
We can take a look at this transposed tensor:
(gdb) p *ggml_transpose(ctx, ggml_reshape_2d(ctx, v_cur, n_embd_v_gqa, n_tokens))
$48 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {512, 4096, 1, 1}, nb = {16384, 4,
8388608, 8388608}, op = GGML_OP_TRANSPOSE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {
0x7ffef1c41c10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x7ffef1c40e00, view_offs = 0, data = 0x0,
name = "Vcur-0 (reshaped) (transposed)", '\000' <repeats 33 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
And this is named v_cur_t, so this is just the value matrix transposed, not the value cache:
4096 (features)
|
|
|
|
|
↓
--------->
512 (sequence tokens)
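Notice in the dump further above that the transpose only swapped ne and nb (the byte strides); no data is moved. A small standalone sketch (my own illustration) of how an element is addressed through swapped strides:

#include <stdio.h>

int main(void) {
    /* a 2x3 row-major float matrix */
    float data[6] = {1, 2, 3, 4, 5, 6};

    /* original view: ne = {3, 2}, nb = {4, 12} (byte strides) */
    size_t ne0 = 3, ne1 = 2, nb0 = 4, nb1 = 12;

    /* "transposed" view: just swap the dims and strides, same data pointer */
    size_t t_ne0 = ne1, t_ne1 = ne0, t_nb0 = nb1, t_nb1 = nb0;

    /* element (i0, i1) of the transposed view is found via the strides */
    for (size_t i1 = 0; i1 < t_ne1; i1++) {
        for (size_t i0 = 0; i0 < t_ne0; i0++) {
            float v = *(float *)((char *)data + i0*t_nb0 + i1*t_nb1);
            printf("%.0f ", v);
        }
        printf("\n");
    }
    (void)ne0; (void)nb0;
    return 0;
}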
Next we have the key cache view:
struct ggml_tensor * k_cache_view = ggml_view_1d(ctx, kv.k_l[il], n_tokens*n_embd_k_gqa,
(ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa))*kv_head);
cb(k_cache_view, "k_cache_view", il);
(gdb) p *kv.k_l[il]
$17 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU,
buffer = 0x7c43f0, ne = {2097152, 1, 1, 1}, nb = {2,
4194304, 4194304, 4194304}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0,
src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x0, view_offs = 0, data = 0x7ffef6000020, name = "cache_k_l0", '\000' <repeats 53 times>,
extra = 0x0, padding = "\000\000\000\000\000\000\000"}
Note that the ne
is 2097152 which is 512x4096. The above is creating a
view of the kv.k_l
with 2097152 elements, and the offset used is:
(gdb) p ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa)
$22 = 8192
(gdb) p kv_head
$23 = 0
So in this case the offset will be zero (the row size is 4096 elements x 2 bytes for F16 = 8192 bytes, and kv_head is 0, so the view starts at the beginning of the cache). Why is this a 1d tensor? This might be done for performance/memory optimization reasons and I hope to understand this better when looking at more code.
So that was the tensor view for the key cache.
(gdb) p *k_cache_view
$41 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {2097152, 1, 1, 1}, nb = {2,
4194304, 4194304, 4194304}, op = GGML_OP_VIEW, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0,
src = {0x7c44d0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x7c44d0, view_offs = 0, data = 0x7ffef6000020,
name = "k_cache_view-0\000w)", '\000' <repeats 46 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}
Next we have the Value cache.
struct ggml_tensor * v_cache_view = ggml_view_2d(ctx, kv.v_l[il], n_tokens, n_embd_v_gqa,
( n_ctx)*ggml_element_size(kv.v_l[il]),
(kv_head)*ggml_element_size(kv.v_l[il]));
(gdb) p *v_cache_view
$39 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {512, 4096, 1, 1}, nb = {2, 1024,
4194304, 4194304}, op = GGML_OP_VIEW, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {
0x7c4660, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x7c4660, view_offs = 0, data = 0x7ffef6400020, name = "cache_v_l0 (view)", '\000' <repeats 46 times>,
extra = 0x0, padding = "\000\000\000\000\000\000\000"}
Just keep in mind that this is building up the computation graph and the operations required for key-value caching. So we have views into the key and value caches which will be used later.
Following that, back in llm_build_kv, we have:
struct ggml_tensor * cur;
cur = llm_build_kqv(ctx, model, hparams, kv, graph, wo, wo_b,
q_cur, kq_mask, kq_pos, n_ctx, n_tokens, n_kv, kq_scale, cb, il);
cb(cur, "kqv_out", il);
return cur;
This is the actual QKV computation, which recall looks like this:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
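Before stepping through the ggml version, here is a tiny single-head reference of that formula (my own sketch with toy dimensions and no masking, not the llama.cpp code):

#include <math.h>
#include <stdio.h>

#define N_TOK 3   /* sequence length */
#define D_K   4   /* head dimension  */

int main(void) {
    /* toy Q, K, V: one row per token */
    float Q[N_TOK][D_K] = {{1,0,0,0},{0,1,0,0},{0,0,1,0}};
    float K[N_TOK][D_K] = {{1,0,0,0},{0,1,0,0},{0,0,1,0}};
    float V[N_TOK][D_K] = {{1,2,3,4},{5,6,7,8},{9,10,11,12}};
    float out[N_TOK][D_K] = {{0}};

    const float scale = 1.0f / sqrtf((float)D_K);   /* 1/sqrt(d_k) */

    for (int i = 0; i < N_TOK; i++) {
        /* scores = QK^T / sqrt(d_k), then softmax over the key positions */
        float scores[N_TOK], sum = 0.0f;
        for (int j = 0; j < N_TOK; j++) {
            float dot = 0.0f;
            for (int d = 0; d < D_K; d++) dot += Q[i][d] * K[j][d];
            scores[j] = expf(dot * scale);
            sum += scores[j];
        }
        /* weighted sum of the value rows */
        for (int j = 0; j < N_TOK; j++) {
            float w = scores[j] / sum;
            for (int d = 0; d < D_K; d++) out[i][d] += w * V[j][d];
        }
    }
    printf("out[0][0] = %.3f\n", out[0][0]);
    return 0;
}

Now back to llm_build_kqv: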
static struct ggml_tensor * llm_build_kqv(
struct ggml_context * ctx,
const llama_model & model,
const llama_hparams & hparams,
const llama_kv_cache & kv,
struct ggml_cgraph * graph,
struct ggml_tensor * wo,
struct ggml_tensor * wo_b,
struct ggml_tensor * q_cur,
struct ggml_tensor * kq_mask,
struct ggml_tensor * kq_pos,
int64_t n_ctx,
int32_t n_tokens,
int32_t n_kv,
float kq_scale,
const llm_build_cb & cb,
int il) {
const int64_t n_head = hparams.n_head;
const int64_t n_head_kv = hparams.n_head_kv;
const int64_t n_embd_head_k = hparams.n_embd_head_k;
const int64_t n_embd_k_gqa = hparams.n_embd_k_gqa();
const int64_t n_embd_head_v = hparams.n_embd_head_v;
struct ggml_tensor * q = ggml_permute(ctx, q_cur, 0, 2, 1, 3);
cb(q, "q", il);
So let's start with ggml_permute; what this is doing is taking the q_cur tensor's dimensions, which look like this:
(gdb) p *q_cur
$48 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {128, 32, 512, 1}, nb = {4, 512, 16384, 8388608}, op = GGML_OP_ROPE,
op_params = {0, 128, 0, 0, 4096, 1176256512, 1065353216, 0, 1065353216,
1107296256, 1065353216, 0, 0, 0, 0, 0}, flags = 0, grad = 0x0, src = {0x7ffef5c40f90, 0x7ffef5c404a0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0,
data = 0x0, name = "Qcur-0", '\000' <repeats 57 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}
(gdb) p q_cur.ne
$46 = {128, 32, 512, 1}
↑ ↑ ↑ ↑
index: 0 1 2 3
And permute that into:
$46 = {128, 32, 512, 1}
0 1 2 3
| | | |
| +|---+ |
| |+---+ |
| | | |
↓ ↓ ↓ ↓
{128, 512, 32, 1}
↑ ↑ ↑ ↑
index: 0 1 2 3
The permuted q will look like this after the operation:
(gdb) p *q
$49 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {128, 512, 32, 1}, nb = {4, 16384, 512, 8388608},
op = GGML_OP_PERMUTE, op_params = {0, 2, 1, 3, 0 <repeats 12 times>}, flags = 0, grad = 0x0,
src = {0x7ffef5c41120, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0,
perf_time_us = 0, view_src = 0x7ffef5c41120, view_offs = 0, data = 0x0,
name = "Qcur-0 (permuted)", '\000' <repeats 46 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}
There is a standalone permute example to
get a better understanding of how this works.
So that is the Q
tensor.
Next, we have the K
tensor:
struct ggml_tensor * k =
ggml_view_3d(ctx, kv.k_l[il],
n_embd_head_k, n_kv, n_head_kv,
ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa),
ggml_row_size(kv.k_l[il]->type, n_embd_head_k),
0);
cb(k, "k", il);
And then we have the multiplication of Q and K matrices:
struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
cb(kq, "kq", il);
After this there are a couple of if
statements that depend on the model arch
which I will skip for now as I'm using the llama-2-7b-chat.Q4_0.gguf
model.
TODO: look into LLM_ARCH_PHI2
and LLM_ARCH_GROK
in this regard.
After that there is a check whether ALiBi is used:
#if defined(GGML_USE_KOMPUTE)
#pragma message("TODO: ALiBi support in ggml_soft_max_ext is not implemented for Kompute")
#pragma message(" Falling back to ggml_alibi(). Will become an error in Mar 2024")
#pragma message("ref: https://github.com/ggerganov/llama.cpp/pull/5488")
if (hparams.f_max_alibi_bias > 0.0f) {
kq = ggml_scale(ctx, kq, kq_scale);
cb(kq, "kq_scaled", il);
kq = ggml_alibi(ctx, kq, /*n_past*/ 0, n_head, hparams.f_max_alibi_bias);
cb(kq, "kq_scaled_alibi", il);
kq = ggml_add(ctx, kq, kq_mask);
cb(kq, "kq_masked", il);
kq = ggml_soft_max(ctx, kq);
cb(kq, "kq_soft_max", il);
} else
#endif
{
kq = ggml_soft_max_ext(ctx, kq, kq_mask, kq_pos, kq_scale, hparams.f_max_alibi_bias);
cb(kq, "kq_soft_max_ext", il);
}
This will build the softmax operation:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
And the declaration of ggml_soft_max_ext
looks like this:
struct ggml_tensor * ggml_soft_max_ext(
struct ggml_context * ctx,
struct ggml_tensor * a,
struct ggml_tensor * mask,
struct ggml_tensor * pos,
float scale,
float max_bias) {
return ggml_soft_max_impl(ctx, a, mask, pos, scale, max_bias, false);
}
Let's take a look at the arguments to this function:
(gdb) p *kq
$20 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {512, 512, 32, 1}, nb = {4, 2048, 1048576, 33554432},
op = GGML_OP_MUL_MAT, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {
0x7ffef5c41f30, 0x7ffef5c41da0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0,
perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0, name = "kq-0", '\000' <repeats 59 times>,
extra = 0x0, padding = "\000\000\000\000\000\000\000"}
This is the attention score matrix, the result of QK^T, and it is the input to the softmax function.
Then we have a mask tensor which must be part of the extension of the softmax that I've not seen before:
(gdb) p *kq_mask
$21 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {512, 512, 1, 1}, nb = {4, 2048, 1048576, 1048576},
op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 1, grad = 0x0, src = {0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x0, view_offs = 0, data = 0x0, name = "KQ_mask", '\000' <repeats 56 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
This may be used to ignore padding.
Next is the positional encoding tensor; if absolute positional encoding is used this would be set, but since RoPE is used this is NULL (0x0):
(gdb) p *kq_pos
Cannot access memory at address 0x0
Next is the kq_scale, which I think is the 1/sqrt(d_k) from above (1/sqrt(128) ≈ 0.0884, which matches):
(gdb) p kq_scale
$22 = 0.0883883461
And finally the max_bias
which is set to zero and looks like it is related
to Alibi bias.
(gdb) p hparams.f_max_alibi_bias
$23 = 0
Next a 3d view of kv.v_l[il]
will be created:
(gdb) p *kv.v_l[il]
$33 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x7c7970,
ne = {2097152, 1, 1, 1}, nb = {2, 4194304, 4194304, 4194304},
op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0,
src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0,
perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0,
data = 0x7ffef6400020, name = "cache_v_l0", '\000' <repeats 53 times>,
extra = 0x0, padding = "\000\000\000\000\000\000\000"}
So this is a field in llama_kv_cache which is a 1d tensor with 2097152 elements.
// split cached v into n_head heads
struct ggml_tensor * v =
ggml_view_3d(ctx, kv.v_l[il],
n_kv, n_embd_head_v, n_head_kv,
ggml_element_size(kv.v_l[il])*n_ctx,
ggml_element_size(kv.v_l[il])*n_ctx*n_embd_head_v,
0);
cb(v, "v", il);
(gdb) p *v
$30 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {512, 128, 32, 1}, nb = {2, 1024, 131072, 4194304},
op = GGML_OP_VIEW, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {
0x7c7be0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x7c7be0, view_offs = 0, data = 0x7ffef6400020,
name = "cache_v_l0 (view)", '\000' <repeats 46 times>,
extra = 0x0, padding = "\000\000\000\000\000\000\000"}
Then we have the multiplication of V with the result of the softmax (the attention score):
struct ggml_tensor * kqv = ggml_mul_mat(ctx, v, kq);
cb(kqv, "kqv", il);
After this kqv will have the following shape:
(gdb) p *kqv
$43 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {128, 512, 32, 1}, nb = {4, 512, 262144, 8388608},
op = GGML_OP_MUL_MAT, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0,
src = { 0x7ffef5c423e0, 0x7ffef5c42250, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0,
view_offs = 0, data = 0x0, name = '\000' <repeats 63 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
struct ggml_tensor * kqv_merged = ggml_permute(ctx, kqv, 0, 2, 1, 3);
cb(kqv_merged, "kqv_merged", il);
(gdb) p *kqv_merged
$46 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {128, 32, 512, 1}, nb = {4, 262144, 512, 8388608}, op = GGML_OP_PERMUTE,
op_params = {0, 2, 1, 3, 0 <repeats 12 times>}, flags = 0, grad = 0x0,
src = {0x7ffef5c42570, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0,
perf_time_us = 0, view_src = 0x7ffef5c42570, view_offs = 0, data = 0x0,
name = "kqv-0 (permuted)", '\000' <repeats 47 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}
The following is making the kqv_merged
tensor contiguous:
struct ggml_tensor * cur = ggml_cont_2d(ctx, kqv_merged, n_embd_head_k*n_head, n_tokens);
cb(cur, "kqv_merged_cont", il);
(gdb) p n_embd_head_k
$50 = 128
(gdb) p n_head
$51 = 32
(gdb) p n_tokens
$52 = 512
So the above will take the kqv_merged tensor, which has the shape 128x32x512, and turn it into a 4096x512 matrix:
(gdb) p *cur
$53 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608, 8388608},
op = GGML_OP_CONT, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0,
src = { 0x7ffef5c42700, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x0, view_offs = 0, data = 0x0,
name = "kqv_merged-0 (cont)", '\000' <repeats 44 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
This is then multiplied by the weights:
cur = ggml_mul_mat(ctx, wo, cur);
And if the bias weights are not null then they are added:
if (wo_b) {
cb(cur, "kqv_wo", il);
}
if (wo_b) {
cur = ggml_add(ctx, cur, wo_b);
}
Notice that the name kqv_wo is only set if there is a bias tensor.
And cur is then returned and we will be back in llm_build_kv:
cur = llm_build_kqv(ctx, model, hparams, kv, graph, wo, wo_b,
q_cur, kq_mask, kq_pos, n_ctx, n_tokens, n_kv, kq_scale, cb, il);
cb(cur, "kqv_out", il);
This is setting the name and then returning cur, and this will land us back in build_llama.
And that ends the self-attention block.
The following is the addition of the input, the skip connection which goes around self-attention.
struct ggml_tensor * ffn_inp = ggml_add(ctx0, cur, inpSA);
cb(ffn_inp, "ffn_inp", il);
And after this we have the RMSNorm (refer to the diagram at the top of this document to get an overview of the components/operations of the llama architecture):
if (model.layers[il].ffn_gate_inp == nullptr) {
cur = llm_build_norm(ctx0, ffn_inp, hparams,
model.layers[il].ffn_norm, NULL,
LLM_NORM_RMS, cb, il);
cb(cur, "ffn_norm", il);
cur = llm_build_ffn(ctx0, cur,
model.layers[il].ffn_up, NULL,
model.layers[il].ffn_gate, NULL,
model.layers[il].ffn_down, NULL,
NULL,
LLM_FFN_SILU, LLM_FFN_PAR, cb, il);
cb(cur, "ffn_out", il);
} else {
// MoE branch
In this case ffn_gate_inp is null so we will enter the first branch (the non-MoE one).
And llm_build_norm is the same as we went through above; recall that there is an RMSNorm before the feed-forward part of the transformer block as well.
After that we have llm_build_ffn which we have not come across before. I've written about feed-forward layers before, but essentially they consist of an operation that expands (up) the dimensions of the input, applies a non-linear activation (the gate), and then reduces (down) the dimensions back to the original size.
(gdb) p *model.layers[il].ffn_up
$66 = {type = GGML_TYPE_Q4_0, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x7b9310, ne = {4096, 11008, 1, 1}, nb = {
18, 2304, 25362432, 25362432}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0,
src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x0, view_offs = 0, data = 0x7fff0d768ec0, name = "blk.0.ffn_up.weight", '\000' <repeats 44 times>,
extra = 0x0, padding = "\000\000\000\000\000\000\000"}
(gdb) p *model.layers[il].ffn_down
$67 = {type = GGML_TYPE_Q4_0, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x7b9310, ne = {11008, 4096, 1, 1}, nb = {
18, 6192, 25362432, 25362432}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0,
src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x0, view_offs = 0, data = 0x7fff0a708ec0, name = "blk.0.ffn_down.weight", '\000' <repeats 42 times>,
extra = 0x0, padding = "\000\000\000\000\000\000\000"}
(gdb) p *model.layers[il].ffn_gate
$68 = {type = GGML_TYPE_Q4_0, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x7b9310, ne = {4096, 11008, 1, 1}, nb = {
18, 2304, 25362432, 25362432}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0,
src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x0, view_offs = 0, data = 0x7fff0bf38ec0, name = "blk.0.ffn_gate.weight", '\000' <repeats 42 times>,
extra = 0x0, padding = "\000\000\000\000\000\000\000"}
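Putting the shapes above together (4096 -> 11008 -> 4096), the SwiGLU feed-forward roughly computes down(silu(gate(x)) * up(x)). A minimal per-token sketch with tiny made-up dimensions and weights (my own illustration, not the llama.cpp code):

#include <math.h>
#include <stdio.h>

#define N_EMBD 2   /* llama uses 4096  */
#define N_FF   3   /* llama uses 11008 */

static float silu(float x) { return x / (1.0f + expf(-x)); }

int main(void) {
    /* toy weights: w_up and w_gate are N_FF x N_EMBD, w_down is N_EMBD x N_FF */
    float w_up[N_FF][N_EMBD]   = {{1,0},{0,1},{1,1}};
    float w_gate[N_FF][N_EMBD] = {{1,1},{1,0},{0,1}};
    float w_down[N_EMBD][N_FF] = {{1,0,1},{0,1,1}};
    float x[N_EMBD] = {0.5f, -0.25f};

    float up[N_FF], gate[N_FF], out[N_EMBD] = {0};
    for (int i = 0; i < N_FF; i++) {
        up[i] = gate[i] = 0.0f;
        for (int j = 0; j < N_EMBD; j++) {
            up[i]   += w_up[i][j]   * x[j];   /* expand (up)   */
            gate[i] += w_gate[i][j] * x[j];
        }
        gate[i] = silu(gate[i]);              /* the non-linear "gate" */
    }
    for (int i = 0; i < N_EMBD; i++) {
        for (int j = 0; j < N_FF; j++) {
            out[i] += w_down[i][j] * (gate[j] * up[j]);   /* reduce (down) */
        }
    }
    printf("out = [%f, %f]\n", out[0], out[1]);
    return 0;
}

In build_llama the gate activation passed in is LLM_FFN_SILU, which corresponds to the silu above.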
There are a few different types of gate operations in llama.cpp:
enum llm_ffn_op_type {
LLM_FFN_SILU,
LLM_FFN_GELU,
LLM_FFN_RELU,
LLM_FFN_RELU_SQR,
};
After that (skipping the MoE branch for now as it is not used in this example) we have another skip connection addition:
cur = ggml_add(ctx0, cur, ffn_inp);
cb(cur, "ffn_out", il);
ggml_tensor * layer_dir = lctx.cvec.tensor_for(il);
if (layer_dir != nullptr) {
cur = ggml_add(ctx0, cur, layer_dir);
}
cb(cur, "l_out", il);
// input for next layer
inpL = cur;
And that was the first layer, and the same will be done for n_layer
(32) in
the model.
(gdb) until 6330
(gdb) p *cur
$75 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0,
ne = {4096, 512, 1, 1}, nb = {4, 16384, 8388608, 8388608},
op = GGML_OP_ADD, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0,
src = { 0x7ffef5ca4950, 0x7ffef5ca3e60, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0,
data = 0x0, name = "l_out-31\0001", '\000' <repeats 53 times>,
extra = 0x0, padding = "\000\000\000\000\000\000\000"}
After this we should have an RMSNorm layer, a linear layer and then a softmax.
So first we have the RMSNorm layer:
cur = inpL;
cur = llm_build_norm(ctx0, cur, hparams,
model.output_norm, NULL,
LLM_NORM_RMS, cb, -1);
cb(cur, "result_norm", -1);
Then we have the linear layer:
// lm_head
cur = ggml_mul_mat(ctx0, model.output, cur);
cb(cur, "result_output", -1);
ggml_build_forward_expand(gf, cur);
return gf;
But we are missing the softmax? The graph ends at the logits (result_output); I believe the softmax is applied later as part of sampling and is not part of this graph.
The return will take us back into llm_build_context
:
switch (model.arch) {
case LLM_ARCH_LLAMA:
{
result = llm.build_llama();
} break;
...
}
llm.free();
return result;
}
Notice that this was just building the graph for the llama model. For other models there are similar functions that will be called.
And this return will take us back into llama_new_context_with_model
:
ggml_cgraph * gf = llama_build_graph(*ctx, llama_batch_get_one(&token, n_tokens, n_past, 0), true);
Lets take a look at the compute graph:
(gdb) p *gf
$83 = {size = 8192, n_nodes = 1030, n_leafs = 359, nodes = 0x7ffef5c00080,
grads = 0x0, leafs = 0x7ffef5c10080,
visited_hash_table = {size = 16411, keys = 0x7ffef5c20080},
order = GGML_CGRAPH_EVAL_ORDER_LEFT_TO_RIGHT,
perf_runs = 0, perf_cycles = 0, perf_time_us = 0}
(gdb) p ggml_graph_dump_dot(graph, 0x0, "llama.dot")
ggml_graph_dump_dot: dot -Tpng llama.dot -o llama.dot.png && open llama.dot.png
(gdb) shell dot -Tpng llama.dot -o llama.dot.png && open llama.dot.png
I'm going to skip the ggml backend code and address this in a separate section after going though backends in isolation.
// initialize scheduler with the worst-case graph
if (!ggml_backend_sched_reserve(ctx->sched, gf)) {
LLAMA_LOG_ERROR("%s: failed to allocate compute buffers\n", __func__);
llama_free(ctx);
return nullptr;
}
...
return ctx;
And that is the last line of llama_new_context_with_model
and we will return
control to llama_init_from_gpt_params
.
The following function is called from main.cpp and I was wondering a little about it:
void llama_backend_init(void) {
ggml_time_init();
// needed to initialize f16 tables
{
struct ggml_init_params params = { 0, NULL, false };
struct ggml_context * ctx = ggml_init(params);
ggml_free(ctx);
}
#ifdef GGML_USE_MPI
ggml_mpi_backend_init();
#endif
}
At first I thought that it looked odd that it was creating a ggml_context just to free it again. But this is done because ggml has some static global variables (so only accessible from ggml.c) that get initialized:
(gdb) info variables ggml
All variables matching regular expression "ggml":
File ggml-backend.c:
406: static struct ggml_backend_reg ggml_backend_registry[16];
407: static size_t ggml_backend_registry_count;
File ggml.c:
318: float ggml_table_f32_f16[65536];
315: static ggml_fp16_t ggml_table_exp_f16[65536];
306: static ggml_fp16_t ggml_table_gelu_f16[65536];
309: static ggml_fp16_t ggml_table_gelu_quick_f16[65536];
312: static ggml_fp16_t ggml_table_silu_f16[65536];
We can see the declaration of ggml_table_gelu_f16
for example:
static ggml_fp16_t ggml_table_gelu_f16[1 << 16];
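The idea behind these tables is to precompute an expensive per-value function once for every possible 16-bit input pattern, so that at inference time the operation becomes a single array lookup. A sketch of the idea (my own illustration; the f16 decoder below is a simplified hypothetical helper, not the ggml one, and ggml stores its results as f16 rather than f32):

#include <stdint.h>
#include <string.h>
#include <math.h>
#include <stdio.h>

/* simplified f16 -> f32 decoder for the sketch (subnormals treated as zero) */
static float half_bits_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t e    = (h >> 10) & 0x1f;
    uint32_t mant = h & 0x3ff;
    uint32_t bits;
    if (e == 0) {
        bits = sign;                                        /* zero / subnormal -> 0 */
    } else if (e == 31) {
        bits = sign | 0x7f800000u | (mant << 13);           /* inf / nan */
    } else {
        bits = sign | ((e + 112u) << 23) | (mant << 13);    /* rebias exponent 15 -> 127 */
    }
    float f;
    memcpy(&f, &bits, sizeof(f));
    return f;
}

static float table_gelu_f16[1 << 16];

int main(void) {
    /* fill the table once: one entry per possible 16-bit input pattern */
    for (uint32_t i = 0; i < (1u << 16); i++) {
        float x = half_bits_to_float((uint16_t)i);
        table_gelu_f16[i] = 0.5f*x*(1.0f + tanhf(0.7978845608f*(x + 0.044715f*x*x*x)));
    }
    /* at inference time applying gelu to an f16 value is just an index: */
    uint16_t h = 0x3c00;   /* 1.0 in f16 */
    printf("gelu(1.0) ~= %f\n", table_gelu_f16[h]);
    return 0;
}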
This section is going to take a detailed look at how a computation graph is
built in llama.cpp. I'll be using the main
example to step through the code.
$ gdb --args ./main -m models/llama-2-7b-chat.Q4_0.gguf -p "What is your name?"
Reading symbols from ./main...
(gdb) br llama.cpp:5810
Breakpoint 1 at 0x4b365f: file llama.cpp, line 5810.
(gdb) r
(gdb) f
#4 0x00000000005c08ef in main (argc=5, argv=0x7fffffffc958) at examples/main/main.cpp:199
199 std::tie(model, ctx) = llama_init_from_gpt_params(params);
std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(gpt_params & params) {
auto mparams = llama_model_params_from_gpt_params(params);
llama_model * model = llama_load_model_from_file(params.model.c_str(), mparams);
if (model == NULL) {
fprintf(stderr, "%s: error: failed to load model '%s'\n", __func__, params.model.c_str());
return std::make_tuple(nullptr, nullptr);
}
auto cparams = llama_context_params_from_gpt_params(params);
llama_context * lctx = llama_new_context_with_model(model, cparams);
So we load the model from the file system and then set up all the parameters, like the model parameters and the computation parameters. We pass the model and the cparams to the llama_new_context_with_model function, which we will now take a closer look at.
struct llama_context * llama_new_context_with_model(
struct llama_model * model,
struct llama_context_params params) {
...
llama_context * ctx = new llama_context(*model);
// Following this we have a number of fields of the context that are set
// to the values of the parameters that were passed in.
...
if (!hparams.vocab_only) {
// initialize backends
Backends are initialized here and there are different backends for CUDA, Kompute, SYCL, Metal, and Vulkan. So the llama_context has members for backends:
std::vector<ggml_backend_t> backends;
#ifdef GGML_USE_METAL
ggml_backend_t backend_metal = nullptr;
#endif
ggml_backend_t backend_cpu = nullptr;
So we have a vector of backends which will initially be empty; optionally, if Metal is used, there is also a backend_metal, and there is also a backend_cpu.
Notice that the type of the backends is ggml_backend_t, which is a pointer type. In ggml-alloc.h we have:
typedef struct ggml_backend * ggml_backend_t;
llama.h includes ggml-backend.h.
struct ggml_backend {
ggml_guid_t guid;
struct ggml_backend_i iface;
ggml_backend_context_t context;
};
ggml_backend_context_t is a void pointer, which I think is there to be able to support different types of backends. The iface is a struct of function pointers which are used to interact with the backend. The guid is a unique identifier for the backend. So let's look at how these backends are initialized.
#elif defined(GGML_USE_CUBLAS)
if (model->n_gpu_layers > 0) {
// with split_mode LLAMA_SPLIT_MODE_NONE or LLAMA_SPLIT_MODE_ROW, only the main GPU backend is used
if (model->split_mode == LLAMA_SPLIT_MODE_NONE || model->split_mode == LLAMA_SPLIT_MODE_ROW) {
ggml_backend_t backend = ggml_backend_cuda_init(model->main_gpu);
if (backend == nullptr) {
LLAMA_LOG_ERROR("%s: failed to initialize CUDA%d backend\n", __func__, model->main_gpu);
llama_free(ctx);
return nullptr;
}
ctx->backends.push_back(backend);
} else {
// LLAMA_SPLIT_MODE_LAYER requires a backend for each GPU
for (int device = 0; device < ggml_backend_cuda_get_device_count(); ++device) {
ggml_backend_t backend = ggml_backend_cuda_init(device);
if (backend == nullptr) {
LLAMA_LOG_ERROR("%s: failed to initialize CUDA%d backend\n", __func__, device);
llama_free(ctx);
return nullptr;
}
ctx->backends.push_back(backend);
}
}
}
#elif defined(GGML_USE_VULKAN)
...
#endif
ctx->backend_cpu = ggml_backend_cpu_init();
if (ctx->backend_cpu == nullptr) {
LLAMA_LOG_ERROR("%s: failed to initialize CPU backend\n", __func__);
llama_free(ctx);
return nullptr;
}
ctx->backends.push_back(ctx->backend_cpu);
if (!llama_kv_cache_init(ctx->kv_self, ctx->model, type_k, type_v, kv_size, cparams.offload_kqv)) {
LLAMA_LOG_ERROR("%s: llama_kv_cache_init() failed for self-attention cache\n", __func__);
llama_free(ctx);
return nullptr;
}
This section takes a closer look at the llama_model
struct and the
llm_load_tensors
function.
Just printing the model gives a little too much information at once, so let's start by looking at the type:
$ gdb --args ./main -m models/llama-2-7b-chat.Q4_0.gguf -p "What is your name?"
(gdb) br llm_load_tensors
Breakpoint 2 at 0x452b77: file llama.cpp, line 4341.
(gdb) r
(gdb) ptype struct llama_model
type = struct llama_model {
e_model type;
llm_arch arch;
llama_ftype ftype;
std::string name;
llama_hparams hparams;
llama_vocab vocab;
ggml_tensor *tok_embd;
ggml_tensor *type_embd;
ggml_tensor *pos_embd;
ggml_tensor *tok_norm;
ggml_tensor *tok_norm_b;
ggml_tensor *output_norm;
ggml_tensor *output_norm_b;
ggml_tensor *output;
ggml_tensor *output_b;
std::vector<llama_layer> layers;
llama_split_mode split_mode;
int main_gpu;
int n_gpu_layers;
std::unordered_map<std::string, std::string> gguf_kv;
llama_model::layer_buft buft_input;
llama_model::layer_buft buft_output;
std::vector<llama_model::layer_buft> buft_layer;
std::vector<ggml_context*> ctxs;
std::vector<ggml_backend_buffer*> bufs;
llama_mmaps mappings;
llama_mlocks mlock_bufs;
llama_mlocks mlock_mmaps;
std::vector<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ggml_tensor*>> tensors_by_name;
int64_t t_load_us;
int64_t t_start_us;
public:
~llama_model(void);
}
(gdb) p model->type
$6 = MODEL_7B
(gdb) p model->arch
$7 = LLM_ARCH_LLAMA
(gdb) p model->name
$8 = "LLaMA v2"
(gdb) p model.ftype
$9 = LLAMA_FTYPE_MOSTLY_Q4_0
(gdb) p model->vocab.id_to_token.size()
$13 = 32000
Now, let's look closer at the tensors that the model has:
(gdb) p model->tok_embd
Notice that a model has a vector of llama_layers. Before looking at this I had to take a detour and look more closely at ggml backends.
The following will choose a backend based on the backends that llama.cpp was
compiled with, and if none then the ggml_backend_cpu_buffer_type
will be
returned:
// there is very little benefit to offloading the input layer, so always keep it on the CPU
model.buft_input = llama_default_buffer_type_cpu(true);
Notice that the above will call the conversion constructor of layer_buft
:
// layer -> buffer type mapping
struct layer_buft {
layer_buft(ggml_backend_buffer_type_t matrix) : buft_matrix(matrix), buft(matrix) {}
// matrices only - used by split buffers and backends that support only matrix multiplication
ggml_backend_buffer_type_t buft_matrix;
// everything else
ggml_backend_buffer_type_t buft;
};
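As a small standalone sketch (not llama.cpp code, and with made-up names) of how a single-argument constructor allows this kind of implicit conversion:
#include <cstdio>

// Hypothetical stand-in for ggml_backend_buffer_type_t, just for illustration.
typedef int buffer_type_t;

struct layer_buft_example {
    // Single-argument (non-explicit) constructor: allows implicit conversion from buffer_type_t.
    layer_buft_example(buffer_type_t matrix) : buft_matrix(matrix), buft(matrix) {}
    buffer_type_t buft_matrix;
    buffer_type_t buft;
};

int main() {
    buffer_type_t cpu_buft = 42;           // pretend this came from llama_default_buffer_type_cpu(true)
    layer_buft_example input = cpu_buft;   // the conversion constructor is called here
    printf("buft_matrix=%d buft=%d\n", input.buft_matrix, input.buft);
    return 0;
}
So assigning a buffer type directly to model.buft_input works because the compiler inserts the constructor call for us.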
Next, the model's buft_layer vector will be resized from 0 to the number of layers in the model:
std::vector<llama_model::layer_buft> buft_layer;
And the resizing is done like this:
model.buft_layer.resize(n_layer);
So each of the 32 layers will have a layer_buft:
(gdb) p model.buft_layer
$13 = std::vector of length 32, capacity 32 = {{buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {
buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {
buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {
buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {
buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {
buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {
buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {
buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {
buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {
buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {
buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}, {buft_matrix = 0x0, buft = 0x0}}
Next, each layer is assigned a buffer type:
// assign cpu layers
for (int64_t i = 0; i < i_gpu_start; ++i) {
model.buft_layer[i] = llama_default_buffer_type_cpu(true);
}
// count used buffer types
std::map<ggml_backend_buffer_type_t, int> buft_layer_count;
buft_layer_count[model.buft_input.buft]++;
buft_layer_count[model.buft_input.buft_matrix]++;
buft_layer_count[model.buft_output.buft]++;
buft_layer_count[model.buft_output.buft_matrix]++;
for (int64_t i = 0; i < n_layer; ++i) {
buft_layer_count[model.buft_layer[i].buft]++;
buft_layer_count[model.buft_layer[i].buft_matrix]++;
}
The above is counting the number of ggml_backend_buffer_type_ts, and after the first four have been added we have:
(gdb) p buft_layer_count
$38 = std::map with 1 element = {[0x68cec0 <ggml_backend_cpu_buffer_type>] = 4}
And after the ones in the for loop:
(gdb) p buft_layer_count
$40 = std::map with 1 element = {[0x68cec0 <ggml_backend_cpu_buffer_type>] = 68}
There is also a map of ggml_contexts, one for each element in the buft_layer_count map:
// create one context per buffer type
size_t ctx_size = ggml_tensor_overhead()*(ml.n_tensors + 1); // +1 for models where tok_embd is duplicated as output
// for moe merged tensors
ctx_size += ggml_tensor_overhead()*hparams.n_expert*n_layer;
std::map<ggml_backend_buffer_type_t, ggml_context *> ctx_map;
for (auto & it : buft_layer_count) {
struct ggml_init_params params = {
/*.mem_size =*/ ctx_size,
/*.mem_buffer =*/ NULL,
/*.no_alloc =*/ true,
};
ggml_context * ctx = ggml_init(params);
if (!ctx) {
throw std::runtime_error(format("failed to create context"));
}
ctx_map[it.first] = ctx;
model.ctxs.push_back(ctx);
}
(gdb) n
llm_load_tensors: ggml ctx size = 0.11 MiB
Next, the tensors are going to be created for the weights. Let's start by looking at the first part of this section:
// create tensors for the weights
{
const int64_t n_embd = hparams.n_embd;
const int64_t n_embd_k_gqa = hparams.n_embd_k_gqa();
const int64_t n_embd_v_gqa = hparams.n_embd_v_gqa();
const int64_t n_embd_gqa = n_embd_v_gqa;
const int64_t n_vocab = hparams.n_vocab;
const int64_t n_vocab_type = hparams.n_vocab_type;
const int64_t n_ff = hparams.n_ff;
const int64_t n_expert = hparams.n_expert;
if (n_expert > 0 && hparams.n_expert_used == 0) {
throw std::runtime_error("model has expert layers but no expert layers are used");
}
GGML_ASSERT(n_embd_gqa == n_embd_k_gqa);
ggml_context * ctx_input = ctx_map.at(model.buft_input.buft);
ggml_context * ctx_output = ctx_map.at(model.buft_output.buft);
ggml_context * ctx_output_split = ctx_map.at(model.buft_output.buft_matrix);
auto ctx_for_layer = [&](int i) { return ctx_map.at(model.buft_layer[i].buft); };
auto ctx_for_layer_split = [&](int i) { return ctx_map.at(model.buft_layer[i].buft_matrix); };
model.layers.resize(n_layer);
const auto tn = LLM_TN(model.arch);
switch (model.arch) {
case LLM_ARCH_LLAMA:
case LLM_ARCH_REFACT:
case LLM_ARCH_MINICPM:
{
model.tok_embd = ml.create_tensor(ctx_input, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
(gdb) p n_embd
$74 = 4096
(gdb) p n_embd_k_gqa
$75 = 4096
(gdb) p n_embd_v_gqa
$76 = 4096
(gdb) p n_embd_gqa
$77 = 4096
(gdb) p n_vocab
$78 = 32000
(gdb) p n_vocab_type
$79 = 0
(gdb) p n_ff
$80 = 11008
(gdb) p n_expert
$82 = 0
Then we have the ggml_contexts for the input, output, and output split, followed by model.layers being resized to 32:
(gdb) ptype model.layers
type = std::vector<llama_layer>
(gdb) p model.layers
$88 = std::vector of length 0, capacity 0
(gdb) p n_layer
$89 = 32
(There is a section about LLM_TN earlier in this document.)
(gdb) p model.arch
$93 = LLM_ARCH_LLAMA
(gdb) p tn
$94 = {arch = LLM_ARCH_LLAMA}
model.tok_embd = ml.create_tensor(ctx_input, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
This function is a member of llama_model_loader and has a default value of true for the required parameter. Note that it takes an std::vector<int64_t>, which has an initializer list constructor, and this is used to create the vector from the values of n_embd and n_vocab. This argument is named ne which I think stands for number of elements.
struct ggml_tensor * create_tensor(struct ggml_context * ctx,
const std::string & name,
const std::vector<int64_t> & ne,
bool required = true) {
const struct ggml_tensor * cur = check_tensor_dims(name, ne, required);
if (cur == NULL) {
return NULL;
}
return create_tensor_for(ctx, cur);
}
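To illustrate just the initializer list part, here is a minimal standalone sketch using the same values as n_embd and n_vocab above (not llama.cpp code):
#include <cstdint>
#include <cstdio>
#include <vector>

static void print_ne(const std::vector<int64_t>& ne) {
    for (int64_t v : ne) printf("%ld ", (long) v);
    printf("\n");
}

int main() {
    int64_t n_embd  = 4096;
    int64_t n_vocab = 32000;
    // The braces call std::vector's initializer_list constructor,
    // just like the {n_embd, n_vocab} argument passed to create_tensor.
    print_ne({n_embd, n_vocab});  // prints: 4096 32000
    return 0;
}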
check_tensor_dims is a function in llama.cpp.
const struct ggml_tensor * check_tensor_dims(const std::string & name,
const std::vector<int64_t> & ne,
bool required) const {
const struct ggml_tensor * cur = get_tensor_meta(name.c_str());
...
}
(gdb) p name
$97 = "token_embd.weight"
(gdb) p ne
$98 = std::vector of length 2, capacity 2 = {4096, 32000}
llama_model_loader has a field named weights which is of type:
(gdb) ptype llama_model_loader
type = struct llama_model_loader {
int n_kv;
int n_tensors;
int n_created;
int64_t n_elements;
size_t n_bytes;
bool use_mmap;
llama_files files;
llama_ftype ftype;
llama_fver fver;
llama_mmaps mappings;
std::vector<llama_model_loader::llama_tensor_weight> weights;
...
(gdb) ptype llama_model_loader::llama_tensor_weight
type = struct llama_model_loader::llama_tensor_weight {
uint16_t idx;
size_t offs;
ggml_tensor *tensor;
public:
llama_tensor_weight(uint16_t, const char *, const gguf_context *, ggml_tensor *);
}
When a llama_tensor_weight is created we can see that it gets passed a char* which is the name of the tensor. This is used to search for an index in the gguf context. This happens when the model is loaded, which I've not looked at yet.
We can inspect one of the weights using:
(gdb) p weights[0]
$113 = {idx = 0, offs = 741056, tensor = 0x8f21a0}
(gdb) p *weights[0].tensor
$112 = {type = GGML_TYPE_Q4_0, backend = GGML_BACKEND_TYPE_CPU,
buffer = 0x0, ne = {4096, 32000, 1, 1},
nb = {18, 2304, 73728000, 73728000},
op = GGML_OP_NONE, op_params = {0 <repeats 16 times>},
flags = 0, grad = 0x0, src = { 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x0, view_offs = 0,
data = 0x0,
name = "token_embd.weight", '\000' <repeats 46 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
(gdb) f
#0 llama_model_loader::check_tensor_dims (this=0x7fffffff8d60, name="token_embd.weight",
ne=std::vector of length 2, capacity 2 = {...}, required=true) at llama.cpp:3226
3226 const struct ggml_tensor * cur = get_tensor_meta(name.c_str());
(gdb) tbreak get_tensor_meta
Note: breakpoint 4 also set at pc 0x4af908.
Setting a temporary breakpoint (it will be deleted after it is hit) like this is useful when you don't want to step into the name.c_str() function.
struct ggml_tensor * get_tensor_meta(const char * name) const {
const auto * weight = get_weight(name);
if (!weight) {
return nullptr;
}
return weight->tensor;
}
const llama_tensor_weight * get_weight(const char * name) const {
for (const auto & weight : weights) {
if (strcmp(name, weight.tensor->name) == 0) {
return &weight;
}
}
return nullptr;
}
And recall that we are currently in llama_model_loader. Next, in check_tensor_dims we have:
bool is_ok = true;
for (size_t i = 0; i < GGML_MAX_DIMS; ++i) {
if ((i < ne.size() && ne[i] != cur->ne[i]) || (i >= ne.size() && cur->ne[i] != 1)) {
is_ok = false;
break;
}
}
Recall that GGML_MAX_DIMS is 4 and the ne vector has a size of 2. So the number of dimensions in ne cannot be greater than 4, and the number of elements in the ith dimension of the vector must match the number of elements in the ith dimension of the tensor. This is true in the current case so the check passes. For any i greater than or equal to ne.size(), the tensor's dimension must be 1, otherwise the check fails.
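A minimal standalone version of this check, using the dimensions of the current tensor (and assuming GGML_MAX_DIMS is 4), might look like:
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int GGML_MAX_DIMS = 4;                      // as in ggml
    std::vector<int64_t> ne = {4096, 32000};          // requested dims for token_embd.weight
    int64_t cur_ne[4] = {4096, 32000, 1, 1};          // dims stored in the gguf metadata

    bool is_ok = true;
    for (int i = 0; i < GGML_MAX_DIMS; ++i) {
        // Requested dims must match, and any trailing dims must be 1.
        if ((i < (int) ne.size() && ne[i] != cur_ne[i]) ||
            (i >= (int) ne.size() && cur_ne[i] != 1)) {
            is_ok = false;
            break;
        }
    }
    printf("is_ok = %s\n", is_ok ? "true" : "false"); // true for this tensor
    return 0;
}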
After this we will return to create_tensor.
return create_tensor_for(ctx, cur);
}
struct ggml_tensor * create_tensor_for(struct ggml_context * ctx, const struct ggml_tensor * cur) {
struct ggml_tensor * tensor = ggml_dup_tensor(ctx, cur);
ggml_set_name(tensor, ggml_get_name(cur));
n_created++;
return tensor;
}
The above function is using ggml_dup_tensor to create a new tensor with the same type and dimensions as the cur tensor. The name is not duplicated, so it is set explicitly to the name of the cur tensor:
(gdb) p *tensor
$131 = {type = GGML_TYPE_Q4_0, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 32000, 1, 1}, nb = {18,
2304, 73728000, 73728000}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x0, view_offs = 0, data = 0x0, name = "token_embd.weight", '\000' <repeats 46 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
This tensor is returned and assigned to the model.tok_embd tensor. We can visualize it as a matrix with 32000 rows, one row for each token in the vocabulary, and 4096 columns which are the features for each token:
0 4096
| .....................
| .....................
| .....................
| .....................
| .....................
| .....................
| .....................
↓
32000
Notice that the data field is 0x0/NULL, so at this point the tensor does not contain any embeddings.
Next we are going to do something similar to the above, but the following will create a 1d tensor with 4096 elements:
// output
{
model.output_norm = ml.create_tensor(ctx_output, tn(LLM_TENSOR_OUTPUT_NORM, "weight"), {n_embd});
...
If we jump forward to after all the tensors have been created we find the following:
ml.init_mappings(true, use_mlock ? &model.mlock_mmaps : nullptr);
model.mappings.reserve(ml.mappings.size());
void init_mappings(bool prefetch = true, llama_mlocks * mlock_mmaps = nullptr) {
if (use_mmap) {
mappings.reserve(files.size());
mmaps_used.reserve(files.size());
for (const auto & file : files) {
std::unique_ptr<llama_mmap> mapping(new llama_mmap(file.get(), prefetch ? -1 : 0, ggml_is_numa()));
mmaps_used.emplace_back(mapping->size, 0);
if (mlock_mmaps) {
std::unique_ptr<llama_mlock> mlock_mmap(new llama_mlock());
mlock_mmap->init(mapping->addr);
mlock_mmaps->emplace_back(std::move(mlock_mmap));
}
mappings.emplace_back(std::move(mapping));
}
}
// compute the total size of all tensors for progress reporting
for (auto & w : weights) {
size_data += ggml_nbytes(w.tensor);
}
}
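llama_mmap wraps memory mapping of the model file; as a rough standalone sketch of the underlying POSIX calls (not the actual llama.cpp implementation):
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* path = "models/llama-2-7b-chat.Q4_0.gguf";
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // Map the whole file read-only; pages are loaded lazily on first access.
    void* addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    printf("mapped %ld bytes at %p\n", (long) st.st_size, addr);
    munmap(addr, st.st_size);
    close(fd);
    return 0;
}
This is also why the tensor data pointers can later point straight into the mapping instead of being copied.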
In llm_load_tensors we later have:
// populate tensors_by_name
for (ggml_context * ctx : model.ctxs) {
for (auto * cur = ggml_get_first_tensor(ctx); cur != NULL; cur = ggml_get_next_tensor(ctx, cur)) {
model.tensors_by_name.emplace_back(ggml_get_name(cur), cur);
}
}
This will populate the tensors_by_name vector with the name of the tensor and the tensor itself. This is useful for looking up tensors by name:
(gdb) p ggml_get_name(cur)
$24 = 0xbe2930 "token_embd.weight"
(gdb) until 4788
After all tensors have been added we will have the following number of tensors in this vector:
(gdb) p model.tensors_by_name.size()
$26 = 291
And we can inspect one of the tensors like this:
(gdb) p model.tensors_by_name[0]
$27 = {first = "token_embd.weight", second = 0xbe2810}
(gdb) p model.tensors_by_name[0].second
$28 = (ggml_tensor *) 0xbe2810
(gdb) p *model.tensors_by_name[0].second
$29 = {type = GGML_TYPE_Q4_0, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {4096, 32000, 1, 1}, nb = {18,
2304, 73728000, 73728000}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0,
view_src = 0x0, view_offs = 0, data = 0x0, name = "token_embd.weight", '\000' <repeats 46 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
And notice that data is still null.
Next we have the loading of tensors:
// load tensor data
for (auto & it : ctx_bufs) {
ggml_context * ctx = it.first;
auto & bufs = it.second;
if (!ml.load_all_data(ctx, bufs,
use_mlock ? &model.mlock_mmaps : NULL,
progress_callback,
progress_callback_user_data)) {
return false;
}
}
And load_all_data is where the tensor data is populated.
if (buf_mmap && cur->data == nullptr) {
ggml_backend_tensor_alloc(buf_mmap, cur, (uint8_t *) mapping->addr + weight->offs);
This will then return the model which has now been completely loaded.
wip
$ gdb --args ./main -m models/llama-2-7b-chat.Q4_0.gguf -p "What is your name?"
(gdb) br llama_load_model_from_file
Breakpoint 1 at 0x4bda64
(gdb) r
struct llama_model * llama_load_model_from_file(
const char * path_model,
struct llama_model_params params) {
llama_model * model = new llama_model;
(gdb) p mparams
$1 = {n_gpu_layers = 0, split_mode = LLAMA_SPLIT_MODE_LAYER, main_gpu = 0, tensor_split = 0x7fffffffacbc,
progress_callback = 0x0, progress_callback_user_data = 0x0, kv_overrides = 0x0, vocab_only = false,
use_mmap = true, use_mlock = false}
struct llama_model_params {
int32_t n_gpu_layers; // number of layers to store in VRAM
enum llama_split_mode split_mode; // how to split the model across multiple GPUs
// main_gpu interpretation depends on split_mode:
// LLAMA_SPLIT_NONE: the GPU that is used for the entire model
// LLAMA_SPLIT_ROW: the GPU that is used for small tensors and intermediate results
// LLAMA_SPLIT_LAYER: ignored
int32_t main_gpu;
// proportion of the model (layers or rows) to offload to each GPU, size: llama_max_devices()
const float * tensor_split;
// Called with a progress value between 0.0 and 1.0. Pass NULL to disable.
// If the provided progress_callback returns true, model loading continues.
// If it returns false, model loading is immediately aborted.
llama_progress_callback progress_callback;
// context pointer passed to the progress callback
void * progress_callback_user_data;
// override key-value pairs of the model meta data
const struct llama_model_kv_override * kv_overrides;
// Keep the booleans together to avoid misalignment during copy-by-value.
bool vocab_only; // only load the vocabulary, no weights
bool use_mmap; // use mmap if possible
bool use_mlock; // force system to keep model in RAM
};
use_mlock will use mlock to keep a virtual address space section/range in RAM and prevent it from being swapped to disk.
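As a small standalone sketch of what mlock itself does (unrelated to the llama.cpp code paths):
#include <cstdio>
#include <cstdlib>
#include <sys/mman.h>

int main() {
    const size_t size = 1 << 20;              // 1 MiB
    void* buf = malloc(size);

    // Pin the pages backing this range in RAM so they cannot be swapped out.
    if (mlock(buf, size) != 0) {
        perror("mlock");                      // may fail if RLIMIT_MEMLOCK is too low
    } else {
        printf("locked %zu bytes\n", size);
        munlock(buf, size);
    }
    free(buf);
    return 0;
}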
enum llama_split_mode {
LLAMA_SPLIT_MODE_NONE = 0, // single GPU
LLAMA_SPLIT_MODE_LAYER = 1, // split layers and KV across GPUs
LLAMA_SPLIT_MODE_ROW = 2, // split rows across GPUs
};
A new llama_model is created which is initially "empty" and just contains default values. These will be populated with information from the model .gguf file:
(gdb) p path_model
$8 = 0x7921d0 "models/llama-2-7b-chat.Q4_0.gguf"
In our case the progress_callback param is null so this will first be set up, but I'm going to skip this for now and jump to the following line:
(gdb) until 14225
llama_load_model_from_file (path_model=0x7921d0 "models/llama-2-7b-chat.Q4_0.gguf", params=...) at llama.cpp:14225
14225 int status = llama_model_load(path_model, *model, params);
// Returns 0 on success, -1 on error, and -2 on cancellation via llama_progress_callback
static int llama_model_load(const std::string & fname, llama_model & model, llama_model_params & params) {
try {
llama_model_loader ml(fname, params.use_mmap, params.kv_overrides);
model.hparams.vocab_only = params.vocab_only;
So a llama_model_loader is created using its constructor:
llama_model_loader(const std::string & fname,
bool use_mmap,
const struct llama_model_kv_override * param_overrides_p) {
...
This is a fairly long constructor and I'm not going to go through it line by line, but will step through it and create temporary breakpoints at places that I think are of interest:
(gdb) tbreak gguf_init_from_file
Temporary breakpoint 2 at 0x4517d5: file ggml.c, line 20574.
The file is read into a FILE pointer and a gguf_context is malloced:
struct gguf_context * ctx = GGML_ALIGNED_MALLOC(sizeof(struct gguf_context));
For a good reference of the gguf file format see: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md
(gdb) p sizeof(struct gguf_context)
$14 = 72
(gdb) p *ctx
$13 = {header = {magic = "r\030y", version = 0, n_tensors = 0, n_kv = 0}, kv = 0x0, infos = 0x13,
alignment = 7938376, offset = 15, size = 7362932551652699234, data = 0x6e776f645f6e66}
(gdb) ptype *ctx
type = struct gguf_context {
struct gguf_header header;
struct gguf_kv *kv;
struct gguf_tensor_info *infos;
size_t alignment;
size_t offset;
size_t size;
void *data;
}
Next the header is read.
(gdb) ptype ctx->header
type = struct gguf_header {
char magic[4];
uint32_t version;
uint64_t n_tensors;
uint64_t n_kv;
}
Next, version, n_tensors, and n_kv are read from the file using gguf_fread_el.
(gdb) p ctx.header
$17 = {magic = "GGUF", version = 2, n_tensors = 291, n_kv = 19}
n_kv is the number of key-value pairs in the file and has nothing to do with the kv-cache.
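As a hedged standalone sketch, reading just these header fields from a .gguf file (assuming the GGUF v2 layout shown above and a little-endian host) could look like:
#include <cstdint>
#include <cstdio>

int main() {
    FILE* f = fopen("models/llama-2-7b-chat.Q4_0.gguf", "rb");
    if (!f) { perror("fopen"); return 1; }

    char     magic[4];
    uint32_t version;
    uint64_t n_tensors;
    uint64_t n_kv;

    // The fields are stored in this order at the start of the file.
    fread(magic,      1, 4, f);
    fread(&version,   sizeof(version),   1, f);
    fread(&n_tensors, sizeof(n_tensors), 1, f);
    fread(&n_kv,      sizeof(n_kv),      1, f);

    printf("magic=%.4s version=%u n_tensors=%lu n_kv=%lu\n",
           magic, version, (unsigned long) n_tensors, (unsigned long) n_kv);
    fclose(f);
    return 0;
}
For the model used here this would print magic=GGUF version=2 n_tensors=291 n_kv=19, matching the gdb output above.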
After all the key-value pairs have been read, which I think are the metadata of the model, the tensor infos will be read from the file.
ctx->infos = GGML_MALLOC(ctx->header.n_tensors * sizeof(struct gguf_tensor_info));
for (uint64_t i = 0; i < ctx->header.n_tensors; ++i) {
struct gguf_tensor_info * info = &ctx->infos[i];
(gdb) ptype *ctx->infos
type = struct gguf_tensor_info {
struct gguf_str name;
uint32_t n_dims;
uint64_t ne[4];
enum ggml_type type;
uint64_t offset;
const void *data;
size_t size;
}
(gdb) p ctx->infos[0]
$29 = {name = {n = 17, data = 0x90acc0 "token_embd.weight"}, n_dims = 2, ne = {4096, 32000, 1, 1},
type = GGML_TYPE_Q4_0, offset = 0, data = 0x0, size = 0}
(gdb) p ctx->infos[1]
$30 = {name = {n = 22, data = 0x90ace0 "blk.0.attn_norm.weight"}, n_dims = 1, ne = {4096, 1, 1, 1},
type = GGML_TYPE_F32, offset = 73728000, data = 0x0, size = 0}
(gdb) p ctx->infos[2]
$31 = {name = {n = 21, data = 0x90ad00 "blk.0.ffn_down.weight"}, n_dims = 2, ne = {11008, 4096, 1, 1},
type = GGML_TYPE_Q4_0, offset = 73744384, data = 0x0, size = 0}
(gdb) p ctx->infos[3]
$32 = {name = {n = 21, data = 0x90ad20 "blk.0.ffn_gate.weight"}, n_dims = 2, ne = {4096, 11008, 1, 1},
type = GGML_TYPE_Q4_0, offset = 99106816, data = 0x0, size = 0}
(gdb) p ctx->infos[4]
$33 = {name = {n = 19, data = 0x90ad40 "blk.0.ffn_up.weight"}, n_dims = 2, ne = {4096, 11008, 1, 1},
type = GGML_TYPE_Q4_0, offset = 124469248, data = 0x0, size = 0}
(gdb) p ctx->infos[5]
$34 = {name = {n = 21, data = 0x90ad60 "blk.0.ffn_norm.weight"}, n_dims = 1, ne = {4096, 1, 1, 1},
type = GGML_TYPE_F32, offset = 149831680, data = 0x0, size = 0}
(gdb) p ctx->infos[6]
$35 = {name = {n = 19, data = 0x90ad80 "blk.0.attn_k.weight"}, n_dims = 2, ne = {4096, 4096, 1, 1},
type = GGML_TYPE_Q4_0, offset = 149848064, data = 0x0, size = 0}
(gdb) p ctx->infos[7]
$36 = {name = {n = 24, data = 0x90ada0 "blk.0.attn_output.weight"}, n_dims = 2, ne = {4096, 4096, 1, 1},
type = GGML_TYPE_Q4_0, offset = 159285248, data = 0x0, size = 0}
Now, the offset is an offset into the file where the data for this tensor can be found. So at this point the data has not been read; this happens later but may also be skipped.
Hmm, the data is not set because params.no_alloc is true:
(gdb) p params.no_alloc
$44 = true
So when is the data read? I mean the weights need to have their values loaded, don't they? We are currently in the llama_model_loader and later there is a call to llm_load_tensors which I left to look at the model loading. Perhaps this is where the tensors are loaded with data.
This was set earlier in the llama_model_loader constructor:
struct gguf_init_params params = {
/*.no_alloc = */ true,
/*.ctx = */ &ctx,
};
When we return from this function the gguf_context is assigned to meta:
meta = gguf_init_from_file(fname.c_str(), params);
std::vector<llama_tensor_weight> weights;
...
for (ggml_tensor * cur = ggml_get_first_tensor(ctx); cur; cur = ggml_get_next_tensor(ctx, cur)) {
weights.emplace_back(0, cur->name, meta, cur);
}
Recall that emplace_back will create a new llama_tensor_weight and add it to the weights vector. The constructor for llama_tensor_weight looks like this:
struct llama_tensor_weight {
uint16_t idx; // source file index
size_t offs; // tensor data offset in the original file
ggml_tensor * tensor;
llama_tensor_weight(uint16_t idx,
const char * name,
const struct gguf_context * gguf_ctx,
ggml_tensor * tensor) : idx(idx), tensor(tensor) {
const int tensor_idx = gguf_find_tensor(gguf_ctx, name);
offs = gguf_get_data_offset(gguf_ctx) + gguf_get_tensor_offset(gguf_ctx, tensor_idx);
}
};
If vocab_only is set then the tensors will not be loaded and llama_model_load will return early:
if (params.vocab_only) {
LLAMA_LOG_INFO("%s: vocab only - skipping tensors\n", __func__);
return 0;
}
if (!llm_load_tensors(
ml, model, params.n_gpu_layers, params.split_mode, params.main_gpu, params.tensor_split, params.use_mlock,
params.progress_callback, params.progress_callback_user_data
)) {
return -2;
}
This macro uses the __ms_abi__ calling convention, which is the Microsoft ABI, if the GGML_MULTIPLATFORM macro is defined:
#ifdef GGML_MULTIPLATFORM
# if defined(_WIN32)
# define GGML_CALL
# else
# define GGML_CALL __attribute__((__ms_abi__))
# endif
#else
# define GGML_CALL
#endif
This initially looked very odd to me as I thought that __ms_abi__ would only be used on Windows. But this is done to ensure interoperability between different platforms: we want a common calling convention between NVCC (NVIDIA CUDA Compiler), ROCm (for AMD GPUs), XCode (for macOS), and the common GCC and Clang compilers on various platforms.
By defining -DGGML_MULTIPLATFORM during the build, the system enforces that back-references and function pointers conform to the Microsoft ABI (ms_abi). This conformance is crucial for ensuring compatibility and predictable behavior when the system interfaces with GPU modules compiled with tools that support the MS ABI, such as NVCC (NVIDIA CUDA Compiler), ROCm (for AMD GPUs), XCode (for macOS), and the common GCC and Clang compilers on various platforms.
$ source flake_venv/bin/activate
$ flake8 --ignore E203,E211,E221,E225,E231,E241,E251,E261,E266,E501,E701,E704,W503 --exclude examples/*,examples/*/**,*/**/__init__.py .
editorconfig-checker is a tool used to check if files adhere to the rules specified in an .editorconfig file.
It can be run in llama.cpp using the following command:
$ source editorvm/bin/activate
(editorvm) $ editorconfig-checker -verbose
This is related to negative prompts (at least I think so at this stage), and can be used to instruct the model what not to include. I think this is quite common to use with diffusion models and image generation where you can first describe what you want with the prompt, and then also specify things that you don't want.
In sampling.h we have:
// Classifier-Free Guidance
// https://arxiv.org/abs/2306.17806
std::string cfg_negative_prompt; // string to help guidance
float cfg_scale = 1.f; // how strong is guidance
So cfg_scale has a default value of 1.0 which means guidance is disabled (think of this as a multiplication/scaling by 1/identity).
-cfg-scale N strength of guidance (default: 1.000000, 1.0 = disable)
Let's set a breakpoint in main.cpp:
std::tie(model, ctx) = llama_init_from_gpt_params(params);
if (sparams.cfg_scale > 1.f) {
struct llama_context_params lparams = llama_context_params_from_gpt_params(params);
ctx_guidance = llama_new_context_with_model(model, lparams);
}
And later in main.cpp we have:
if (ctx_guidance) {
LOG("cfg_negative_prompt: \"%s\"\n", log_tostr(sparams.cfg_negative_prompt));
guidance_inp = ::llama_tokenize(ctx_guidance, sparams.cfg_negative_prompt, true, true);
LOG("guidance_inp tokenized: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx_guidance, guidance_inp).c_str());
std::vector<llama_token> original_inp = ::llama_tokenize(ctx, params.prompt, true, true);
LOG("original_inp tokenized: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, original_inp).c_str());
original_prompt_len = original_inp.size();
guidance_offset = (int)guidance_inp.size() - original_prompt_len;
LOG("original_prompt_len: %s", log_tostr(original_prompt_len));
LOG("guidance_offset: %s", log_tostr(guidance_offset));
}
Let's inspect the guidance_inp and original_inp:
(gdb) call LOG_TOKENS_TOSTR_PRETTY<llama_context*, std::vector<int, std::allocator<int> > >($13, original_inp)
$15 = "[ '<s>':1, ' What':1724, ' is':338, ' Lo':4309, 'RA':4717, '?':29973 ]"
That is pretty painful to type so let's define a command for it:
(gdb) define call_log_tokens
Type commands for definition of "call_log_tokens".
End with a line saying just "end".
>call LOG_TOKENS_TOSTR_PRETTY<llama_context*, std::vector<int, std::allocator<int> > >($arg0, $arg1)
>end
(gdb) call_log_tokens ctx original_inp
$16 = "[ '<s>':1, ' What':1724, ' is':338, ' Lo':4309, 'RA':4717, '?':29973 ]"
And we can save it in .gdbinit:
$ cat .gdbinit
define call_log_tokens
call LOG_TOKENS_TOSTR_PRETTY<llama_context*, std::vector<int, std::allocator<int> > >($arg0, $arg1)
end
And then we can call this using:
(gdb) call_log_tokens ctx original_inp
$2 = "[ '<s>':1, ' What':1724, ' is':338, ' Lo':4309, 'RA':4717, '?':29973 ]
(gdb) call_log_tokens ctx_guidance guidance_inp
$4 = "[ '<s>':1 ]"
Guidance is something that is decoded before the input tokens are, and it uses a separate llama_context:
for (int i = 0; i < input_size; i += params.n_batch) {
int n_eval = std::min(input_size - i, params.n_batch);
if (llama_decode(ctx_guidance, llama_batch_get_one(input_buf + i, n_eval, n_past_guidance, 0))) {
LOG_TEE("%s : failed to eval\n", __func__);
return 1;
}
n_past_guidance += n_eval;
}
Let's take a closer look at how main.cpp decodes the input and first print some of the variables used:
(gdb) p embd.size()
$40 = 6
(gdb) p params.n_batch
$41 = 2048
(gdb) call_log_tokens ctx embd
$39 = "[ '<s>':1, ' What':1724, ' is':338, ' Lo':4309, 'RA':4717, '?':29973 ]"
for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
int n_eval = (int) embd.size() - i;
if (n_eval > params.n_batch) {
n_eval = params.n_batch;
}
LOG("eval: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, embd).c_str());
if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval, n_past, 0))) {
LOG_TEE("%s : failed to eval\n", __func__);
return 1;
}
n_past += n_eval;
LOG("n_past = %d\n", n_past);
// Display total tokens alongside total time
if (params.n_print > 0 && n_past % params.n_print == 0) {
LOG_TEE("\n\033[31mTokens consumed so far = %d / %d \033[0m\n", n_past, n_ctx);
}
}
Don't get confused by the call to llama_batch_get_one and keep in mind that embd is a vector of type std::vector<int> and that embd[i] is just an integer representing the token id:
(gdb) p embd[0]
$46 = 1
(gdb) p &embd[0]
$47 = (int *) 0x555555a1a810
(gdb) p *$47
$49 = 1
(gdb) p $48+1
$50 = (int *) 0x555555a1a814
(gdb) p *$50
$51 = 1724
And note that these match the token ids shown above in the call_log_tokens output.
So we decode the tokens in the input embedding. n_past is zero at this stage and is then incremented by the number of tokens in the input embedding, which is 6.
And that is how the input is decoded. i will be incremented by params.n_batch which in this case is 2048. So i will be 2048 and embd.size() is 6, so the loop will exit.
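A standalone sketch of just this chunking arithmetic, using the sizes from this session (the llama_decode call is only indicated as a comment):
#include <algorithm>
#include <cstdio>

int main() {
    const int embd_size = 6;     // tokens in the prompt
    const int n_batch   = 2048;  // params.n_batch

    int n_past = 0;
    for (int i = 0; i < embd_size; i += n_batch) {
        int n_eval = std::min(embd_size - i, n_batch);
        // llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval, n_past, 0)) would go here.
        n_past += n_eval;
        printf("i=%d n_eval=%d n_past=%d\n", i, n_eval, n_past);
    }
    // Single iteration: i=0 n_eval=6 n_past=6, then i becomes 2048 and the loop exits.
    return 0;
}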
static llama_token_data_array llama_sampling_prepare_impl(
struct llama_sampling_context * ctx_sampling,
struct llama_context * ctx_main,
struct llama_context * ctx_cfg,
const int idx,
bool apply_grammar,
std::vector<float> * original_logits) {
...
if (ctx_cfg) {
float * logits_guidance = llama_get_logits_ith(ctx_cfg, idx);
llama_sample_apply_guidance(ctx_main, logits, logits_guidance, params.cfg_scale);
}
So we have the logits (original_logits) and the logits_guidance, which are both vectors of floats representing the log probabilities of the two prompts:
(gdb) call_log_tokens ctx original_inp
$2 = "[ '<s>':1, ' What':1724, ' is':338, ' Lo':4309, 'RA':4717, '?':29973 ]
(gdb) call_log_tokens ctx_guidance guidance_inp
$4 = "[ '<s>':1 ]"
void llama_sample_apply_guidance(
struct llama_context * ctx,
float * logits,
float * logits_guidance,
float scale) {
GGML_ASSERT(ctx);
const auto t_start_sample_us = ggml_time_us();
const auto n_vocab = llama_n_vocab(llama_get_model(ctx));
llama_log_softmax(logits, n_vocab);
llama_log_softmax(logits_guidance, n_vocab);
for (int i = 0; i < n_vocab; ++i) {
auto & l = logits[i];
const auto & g = logits_guidance[i];
l = scale * (l - g) + g;
}
ctx->t_sample_us += ggml_time_us() - t_start_sample_us;
}
So both logit vectors are passed through the log-softmax function and then for each entry in the logits the following is done:
l = scale * (l - g) + g;
So this is updating each original logit with the difference between the original logit and the guidance logit (think of this as a distance in the embedding space). This difference is scaled by scale and then the guidance logit is added back to it.
l = [ 2.0 1.5 0.5 ]
g = [ 1.0 2.0 0.0 ]
scale = 0.5
l[0] = 0.5 * (2.0 - 1.0) + 1.0 = 1.5
l[1] = 0.5 * (1.5 - 2.0) + 2.0 = 1.75
l[2] = 0.5 * (0.5 - 0.0) + 0.0 = 0.25
scale = 1.0
l[0] = 1.0 * (2.0 - 1.0) + 1.0 = 2.0
l[1] = 1.0 * (1.5 - 2.0) + 2.0 = 1.5
l[2] = 1.0 * (0.5 - 0.0) + 0.0 = 0.5
scale = 0.0
l[0] = 0.0 * (2.0 - 1.0) + 1.0 = 1.0
l[1] = 0.0 * (1.5 - 2.0) + 2.0 = 2.0
l[2] = 0.0 * (0.5 - 0.0) + 0.0 = 0.0
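The same blending as a small standalone sketch (a naive log-softmax plus the update loop, using plain std::vector instead of the llama.cpp buffers; the values here are just illustrative):
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Naive log-softmax over a small vector.
static void log_softmax(std::vector<float>& v) {
    float max = v[0];
    for (float x : v) max = std::max(max, x);
    float sum = 0.0f;
    for (float x : v) sum += std::exp(x - max);
    float log_sum = std::log(sum) + max;
    for (float& x : v) x -= log_sum;
}

int main() {
    std::vector<float> logits          = {2.0f, 1.5f, 0.5f}; // from the original prompt
    std::vector<float> logits_guidance = {1.0f, 2.0f, 0.0f}; // from the negative prompt
    const float scale = 1.5f;                                // cfg_scale > 1 enables guidance

    log_softmax(logits);
    log_softmax(logits_guidance);

    for (size_t i = 0; i < logits.size(); ++i) {
        // Same update as in llama_sample_apply_guidance.
        logits[i] = scale * (logits[i] - logits_guidance[i]) + logits_guidance[i];
    }
    for (float l : logits) printf("%f\n", l);
    return 0;
}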
Alright so I think one way to think about this is that I might have an input prompt like "What is LoRA?", and a negative prompt something like "Don't include information about Lora in the context of large language models." Now, both of these prompts will be decoded by llama.cpp and the logits for each will be returned both will have a size of the vocabulary size (32000). Taking the difference between a feature in the logits and the corresponding feature in the guidance logits will give a measure of how much the feature in the input prompt should be "moved" towards the negative prompt.
<---|---|---|---|---|---|---|---|---|---|--->
0 1.0 2.0
(g) (l)
2.0 - 1.0 = 1.0
So the feature in the input prompt is 1.0 away from the desired value when taking the negative prompt into account. And the scale determines how much we should adjust in input prompt towards the negative prompt.
We figure out how much we should adjust the input prompt by scaling:
1.0
0.5 * |----| = 0.5 [--]
And then we add this to the guidance logit:
<---|---|---|---|---|---|---|---|---|---|--->
0 ↑ |
1.0 |
(g) |
(1.0 + 0.5)
And that will be the value for this particular feature.
Recall that we are currently in llama_sampling_prepare_impl, so this is preparing for doing the actual sampling, and next the penalties are applied.
// apply penalties
const auto& penalty_tokens = params.use_penalty_prompt_tokens ? params.penalty_prompt_tokens : prev;
const int penalty_tokens_used_size = std::min((int)penalty_tokens.size(), penalty_last_n);
if (penalty_tokens_used_size) {
const float nl_logit = logits[llama_token_nl(llama_get_model(ctx_main))];
llama_sample_repetition_penalties(ctx_main, &cur_p,
penalty_tokens.data() + penalty_tokens.size() - penalty_tokens_used_size,
penalty_tokens_used_size, penalty_repeat, penalty_freq, penalty_present);
if (!penalize_nl) {
for (size_t idx = 0; idx < cur_p.size; idx++) {
if (cur_p.data[idx].id == llama_token_nl(llama_get_model(ctx_main))) {
cur_p.data[idx].logit = nl_logit;
break;
}
}
}
}
One thing I noticed is that the new line token is first extracted from the model and used to set the nl_logit. The same call is then performed in the for loop. Perhaps this could be extracted to a separate variable and reused.
TODO: take a closer look at how penalties are applied.
And following that we have grammar:
// apply grammar checks before sampling logic
if (apply_grammar && ctx_sampling->grammar != NULL) {
llama_sample_grammar(ctx_main, &cur_p, ctx_sampling->grammar);
}
TODO: take a closer look at how grammar is applied.
After this the function will return control to llama_sampling_sample_impl. After all sampling is done we are back in main.cpp.
set print elements 0
(gdb) call_log_tokens ctx ctx_sampling->prev
$100 = "[ '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '<s>':1, ' What':1724, ' is':338, ' Lo':4309, 'RA':4717, '?':29973, '':13 ]"
The last token is the token predicted.
wip
A grammar is something that can be specified/configured and is part of the sampling process, that is, after the model has decoded the input and produced some output tokens. There are different types of sampling configurations like greedy, top-k, etc.
struct llama_sampling_context {
// parameters that will be used for sampling
llama_sampling_params params;
// mirostat sampler state
float mirostat_mu;
llama_grammar * grammar;
// internal
grammar_parser::parse_state parsed_grammar;
// TODO: replace with ring-buffer
std::vector<llama_token> prev;
std::vector<llama_token_data> cur;
std::mt19937 rng;
};
Control vectors are vectors, per layer, that can be applied to the result of the model layers during inference. This is something that is added at inference time to control the model's behaviour. This is done without any changes to the prompt or any fine tuning of the model. More notes can be found in control_vectors.md.
A llama_context has two properties named n_eval and n_p_eval (number of prompt evals) which are used to keep track of the number of tokens used for evaluation.
The n_eval property is used to keep track of the number of tokens used for evaluations other than the initial prompt; it is used when the batch size is 1.
struct llama_context {
...
int32_t n_sample = 0; // number of tokens sampled
int32_t n_p_eval = 0; // number of tokens in eval calls for the prompt (with batch size > 1)
int32_t n_eval = 0; // number of eval calls
int64_t n_queued_tokens = 0;
In llama_decode_internal, batch.n_tokens will be used to update the n_queued_tokens property:
lctx.n_queued_tokens += n_tokens_all;
I found the naming a little confusing and perhaps this could be changed to simply n_tokens instead of n_tokens_all.
static int llama_decode_internal(
llama_context & lctx,
llama_batch batch_all, // TODO: rename back to batch
std::string& error) {
Now, decode is strictly involved with performing the forward pass of the model and this will generate logits for the tokens in the batch in this case.
For example, the first call to decode might be the warmup run (if warmup is set to true), in which case only two tokens will be passed.
(gdb) p batch_all.token[0]
$1 = 1
(gdb) p batch_all.token[1]
$2 = 2
(gdb) p batch_all.n_tokens
$3 = 2
These are the bos and eos tokens only:
std::vector<llama_token> tmp = { llama_token_bos(model), llama_token_eos(model), };
The warmup will also call llama_synchronize which will update n_p_eval, but this will also be reset with a call to llama_reset_timings.
As mentioned, n_eval keeps track of tokens used for evaluations other than the initial prompt and is updated when the batch size is 1. This is done in:
void llama_synchronize(struct llama_context * ctx) {
ggml_backend_sched_synchronize(ctx->sched);
// add the evaluation to the stats
if (ctx->n_queued_tokens == 1) {
ctx->t_eval_us += ggml_time_us() - ctx->t_compute_start_us;
ctx->n_eval++;
} else if (ctx->n_queued_tokens > 1) {
ctx->t_p_eval_us += ggml_time_us() - ctx->t_compute_start_us;
ctx->n_p_eval += ctx->n_queued_tokens;
}
...
ctx->n_queued_tokens = 0;
}
For the initial prompt the number of tokens in the batch will be more than 1, unless the prompt is empty, in which case it will be 1 as it will only contain the start of sequence token.
The property n_p_eval contains the number of tokens in the first batch (p for prompt).
There is also a property n_sample which is the number of samples taken, and this is updated for each sampling function:
llama_sample_token_greedy
llama_sample_token_with_rng
llama_beam_search
What about the other sampling functions?
It turns out that they call
llama_token X = llama_sample_token(ctx, candidates);
Which calls:
llama_token llama_sample_token(struct llama_context * ctx, llama_token_data_array * candidates) {
return llama_sample_token_with_rng(ctx, candidates, ctx->rng);
}
This is a function declared in sampling.h:
void llama_sampling_accept(
struct llama_sampling_context* ctx_sampling,
struct llama_context* ctx_main,
llama_token id,
bool apply_grammar);
And is defined in sampling.cpp:
void llama_sampling_accept(
struct llama_sampling_context * ctx_sampling,
struct llama_context * ctx_main,
llama_token id,
bool apply_grammar) {
ctx_sampling->prev.erase(ctx_sampling->prev.begin());
ctx_sampling->prev.push_back(id);
if (ctx_sampling->grammar != NULL && apply_grammar) {
llama_grammar_accept_token(ctx_main, ctx_sampling->grammar, id);
}
}
struct llama_sampling_context {
// parameters that will be used for sampling
llama_sampling_params params;
// mirostat sampler state
float mirostat_mu;
llama_grammar * grammar;
// internal
grammar_parser::parse_state parsed_grammar;
// TODO: replace with ring-buffer
std::vector<llama_token> prev;
std::vector<llama_token_data> cur;
std::mt19937 rng;
};
So prev is just a vector of llama_tokens. But cur is a vector of llama_token_data:
typedef struct llama_token_data {
llama_token id; // token id
float logit; // log-odds of the token
float p; // probability of the token
} llama_token_data;
This also contains the llama_token which is just an integer, and we can think of it as an index into the model's vocabulary. The logit is the log-odds of the token and p is the probability of the token.
ctx_sampling->prev.erase(ctx_sampling->prev.begin());
ctx_sampling->prev.push_back(id);
The first line above will erase/remove the first element of the vector, and the second inserts the new token id at the end of the vector.
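A tiny standalone sketch of this fixed-size sliding window of previous tokens (the window size and token ids here are just illustrative):
#include <cstdio>
#include <vector>

int main() {
    // Pretend prev holds the last 4 accepted token ids, initialised to 0.
    std::vector<int> prev = {0, 0, 0, 0};

    std::vector<int> new_ids = {1, 1724, 338};   // newly sampled/accepted token ids
    for (int id : new_ids) {
        prev.erase(prev.begin());                // drop the oldest entry
        prev.push_back(id);                      // append the new token id
    }
    for (int t : prev) printf("%d ", t);         // prints: 0 1 1724 338
    printf("\n");
    return 0;
}
This is also why the TODO in the struct suggests replacing prev with a proper ring buffer: erase on a vector shifts all remaining elements.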
This is a feature of the main example in llama.cpp and can be enabled by passing --prompt-cache to main, which should be a file; if the file does not exist it will be populated with data:
$ ./main -m models/tinyllama-1.1b-1t-openorca.Q2_K.gguf -p 'Hello world' --n-predict 20 --prompt-cache danbev-prompt-cache.txt
Thread 1 "main" hit Breakpoint 1, main (argc=9, argv=0x7fffffffe018) at examples/main/main.cpp:225
225 std::string path_session = params.path_prompt_cache;
(gdb) p
The history is empty.
(gdb) n
226 std::vector<llama_token> session_tokens;
(gdb) p path_session
$1 = "danbev-prompt-cache.txt"
After the decoding (the final thing in the loop):
std::string path_session = params.path_prompt_cache;
std::vector<llama_token> session_tokens;
...
while ((n_remain != 0 && !is_antiprompt) || params.interactive) {
// predict
if (!embd.empty()) {
...
if (!embd.empty() && !path_session.empty()) {
session_tokens.insert(session_tokens.end(), embd.begin(), embd.end());
n_session_consumed = session_tokens.size();
}
}
(gdb) p embd
$4 = std::vector of length 3, capacity 4 = {1, 15043, 3186}
(gdb) p path_session
$5 = "danbev-prompt-cache.txt"
(gdb) p session_tokens
$6 = std::vector of length 0, capacity 0
(gdb) p session_tokens
$7 = std::vector of length 3, capacity 3 = {1, 15043, 3186}
The next time we start with this same command line argument this file will be loaded:
if (!path_session.empty()) {
LOG_TEE("%s: attempting to load saved session from '%s'\n", __func__, path_session.c_str());
if (!file_exists(path_session)) {
LOG_TEE("%s: session file does not exist, will create.\n", __func__);
} else if (file_is_empty(path_session)) {
LOG_TEE("%s: The session file is empty. A new session will be initialized.\n", __func__);
} else {
// The file exists and is not empty
session_tokens.resize(n_ctx);
size_t n_token_count_out = 0;
if (!llama_state_load_file(ctx,
path_session.c_str(),
session_tokens.data(),
session_tokens.capacity(),
&n_token_count_out)) {
LOG_TEE("%s: error: failed to load session file '%s'\n", __func__, path_session.c_str());
return 1;
}
session_tokens.resize(n_token_count_out);
LOG_TEE("%s: loaded a session with prompt size of %d tokens\n", __func__, (int)session_tokens.size());
}
}
And later we have a check for whether the session_tokens is empty:
if (params.interactive_first || params.instruct || params.chatml || !params.prompt.empty() || session_tokens.empty()) {
LOG("tokenize the prompt\n");
if (params.chatml) {
params.prompt = "<|im_start|>system\n" + params.prompt + "<|im_end|>";
}
embd_inp = ::llama_tokenize(ctx, params.prompt, true, true);
} else {
LOG("use session tokens\n");
embd_inp = session_tokens;
}
Notice that the prompt must still be specified but can be empty.
And notice that embd_inp is set to the session tokens loaded.
Alright, so upon the next startup if we pass in an empty prompt the session cache will be checked and used.
size_t n_matching_session_tokens = 0;
if (!session_tokens.empty()) {
for (llama_token id : session_tokens) {
if (n_matching_session_tokens >= embd_inp.size() || id != embd_inp[n_matching_session_tokens]) {
break;
}
n_matching_session_tokens++;
}
if (params.prompt.empty() && n_matching_session_tokens == embd_inp.size()) {
LOG_TEE("%s: using full prompt from session file\n", __func__);
} else if (n_matching_session_tokens >= embd_inp.size()) {
LOG_TEE("%s: session file has exact match for prompt!\n", __func__);
} else if (n_matching_session_tokens < (embd_inp.size() / 2)) {
LOG_TEE("%s: warning: session file has low similarity to prompt (%zu / %zu tokens); will mostly be reevaluated\n",
__func__, n_matching_session_tokens, embd_inp.size());
} else {
LOG_TEE("%s: session file matches %zu / %zu tokens of prompt\n",
__func__, n_matching_session_tokens, embd_inp.size());
}
// remove any "future" tokens that we might have inherited from the previous session
llama_kv_cache_seq_rm(ctx, -1, n_matching_session_tokens, -1);
}
And if we pass a prompt which is similar:
main: attempting to load saved session from 'danbev-prompt-cache.txt'
main: loaded a session with prompt size of 3 tokens
main: session file matches 2 / 5 tokens of prompt
Initially if we pass in a prompt to this program, for example:
gdb --args ./main -m models/tinyllama-1.1b-1t-openorca.Q2_K.gguf --prompt 'The answer to 1 + 1 is' -n 5 --verbose-prompt
The prompt passed in will be stored in embd_inp:
std::vector<llama_token> embd_inp;
And this will be populated using:
embd_inp = ::llama_tokenize(ctx, params.prompt, true, true);
The two last arguments are for add_special and parse_special.
(gdb) p embd_inp
$34 = std::vector of length 10, capacity 24 = {1, 450, 1234, 304, 29871, 29896, 718, 29871, 29896, 338}
(gdb) call_log_tokens ctx embd_inp
$35 = "[ '<s>':1, ' The':450, ' answer':1234, ' to':304, ' ':29871, '1':29896, ' +':718, ' ':29871, '1':29896, ' is':338 ]"
And a bit further down in the file we have embd declared as:
std::vector<llama_token> embd;
The main loop looks like this:
int n_remain = params.n_predict;
int n_consumed = 0;
int n_past = 0;
...
while ((n_remain != 0 && !is_antiprompt) || params.interactive) {
...
}
params.n_predict is set to 5 which is done by the -n command line argument.
Now, the first block of this loop is guarded by
if (!embd.empty()) {
...
}
embd.clear();
The first time through, embd is empty so this will not be entered on the first iteration. And notice that embd is always cleared at this point.
The next if block is the following:
if ((int) embd_inp.size() <= n_consumed && !is_interacting) {
}
We know that the embd_inp size is 10 and n_consumed is 0, so this will not be entered; instead the else block of this statement will be:
} else {
while ((int) embd_inp.size() > n_consumed) {
embd.push_back(embd_inp[n_consumed]);
llama_sampling_accept(ctx_sampling, ctx, embd_inp[n_consumed], false);
++n_consumed;
if ((int) embd.size() >= params.n_batch) {
break;
}
}
}
Here we are looping over embd_inp, the tokens of our prompt, and adding them to embd.
After this there is some printing but I'll skip the details of that here.
The next if-block is the following, and now since we have added all the tokens to embd it will have a size of 10, and n_consumed will also be 10:
if ((int) embd_inp.size() <= n_consumed) {
if (!params.antiprompt.empty()) {
...
}
if (llama_token_is_eog(model, llama_sampling_last(ctx_sampling))) {
...
}
if (n_past > 0 && is_interacting) {
...
}
}
I'll skip the antiprompt block as it will not be entered during this session.
(gdb) p llama_token_is_eog(model, (int) llama_sampling_last(ctx_sampling))
$49 = false
After that we have the following block:
if (!embd.empty() && llama_token_is_eog(model, embd.back()) && !(params.instruct || params.interactive || params.chatml)) {
...
}
if (params.interactive && n_remain <= 0 && params.n_predict >= 0) {
...
}
And then we continue at the beginning of the while loop, and this time we have a populated embd vector of tokens, n_remain is still 5, and the other variables are as before.
So this time we will enter the if block that we skipped on the first iteration:
while ((n_remain != 0 && !is_antiprompt) || params.interactive) {
// predict
if (!embd.empty()) {
// Note: (n_ctx - 4) here is to match the logic for commandline prompt handling via
// --prompt or --file which uses the same value.
int max_embd_size = n_ctx - 4;
This n_ctx - 4 is bothering me and I've yet to be able to track down why this is subtracted.
Next is a check if the size of the embd vector is larger than max_embd_size, in which case embd is resized to be that size, truncating the prompt.
if ((int) embd.size() > max_embd_size) {
...
}
Next we have a check of the number of groups for the grouped attention, and if this is 1 it is like having normal multi-head attention.
if (ga_n == 1) {
if (n_past + (int) embd.size() + std::max<int>(0, guidance_offset) >= n_ctx) {
..
}
}
if (n_session_consumed < (int) session_tokens.size()) {
...
}
if (ctx_guidance) {
...
}
Finally we will iterate over the tokens in embd:
for (int i = 0; i < (int) embd.size(); i += params.n_batch) {
int n_eval = (int) embd.size() - i;
if (n_eval > params.n_batch) {
n_eval = params.n_batch;
}
LOG("eval: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, embd).c_str());
if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval, n_past, 0))) {
LOG_TEE("%s : failed to eval\n", __func__);
return 1;
}
n_past += n_eval;
}
Notice that n_eval is 10 so we will be passing the complete prompt to decode. And n_past is 0 at this point. n_past specifies the number of key and value vectors from the kv-cache. This is the first run so this is 0, but notice that it gets updated after the decode operation finishes.
And that will have performed the forward pass of the model with those tokens.
This will take us out of the first if block and we again have what we went through above where we clear embd:
embd.clear();
This time around we will enter the following block:
if ((int) embd_inp.size() <= n_consumed && !is_interacting) {
const llama_token id = llama_sampling_sample(ctx_sampling, ctx, ctx_guidance);
llama_sampling_accept(ctx_sampling, ctx, id, true);
LOG("last: %s\n", LOG_TOKENS_TOSTR_PRETTY(ctx, ctx_sampling->prev).c_str());
embd.push_back(id);
// echo this to console
input_echo = true;
// decrement remaining sampling budget
--n_remain;
}
This will sample from the context and then add that sampled token id to the embd vector of tokens (only one token this time). And notice that n_remain is decremented by one.
Notice that at this stage n_eval and n_p_eval have the following values:
(gdb) p ctx->n_eval
$62 = 0
(gdb) p ctx->n_p_eval
$63 = 10
(gdb) p id
$66 = 29871
And we can inspect the ctx_sampling->prev:
(gdb) set print elements 0
(gdb) call_log_tokens ctx ctx_sampling->prev
$65 = "[ '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '':0, '<s>':1, ' The':450, ' answer':1234, ' to':304, ' ':29871, '1':29896, ' +':718, ' ':29871, '1':29896, ' is':338, ' ':29871 ]"
(gdb) p embd
$67 = std::vector of length 1, capacity 16 = {29871}
(gdb) p n_remain
$68 = 4
(gdb) p n_past
$73 = 10
And then we will again return to the beginning of the while loop. This time it will be processing a single token, the one that was predicted in the previous iteration. We pass that token to decode and later sample from the context again and get a new token id for that prediction. That is added to embd, and n_remain is decremented by one again. This continues until n_remain is zero or some other condition is met.
This section will take a closer look at the llama_decode_internal function in llama.cpp.
static int llama_decode_internal(
llama_context & lctx,
llama_batch batch_all) { // TODO: rename back to batch
const uint32_t n_tokens_all = batch_all.n_tokens;
}
Let's set a breakpoint on this function:
gdb --args ./main -m models/tinyllama-1.1b-1t-openorca.Q2_K.gguf --prompt 'Hello world' -n 3 --verbose-prompt --dump-kv-cache
(gdb) br llama_decode_internal(llama_context&, llama_batch)
Breakpoint 1 at 0xace11: file llama.cpp, line 11298.
(gdb) r
Breakpoint 1, llama_decode_internal (lctx=..., batch_all=...) at llama.cpp:11298
11298 llama_batch batch_all) { // TODO: rename back to batc
So let's start by inspecting the batch_all:
(gdb) p batch_all
$1 = {n_tokens = 2, token = 0x555555dd7260, embd = 0x0, pos = 0x0, n_seq_id = 0x0, seq_id = 0x0, logits = 0x0, all_pos_0 = 0,
all_pos_1 = 1, all_seq_id = 0}
Now this might be a little surprising as n_tokens is 2 and clearly our prompt above is longer than two tokens. But this is a warmup run and can be disabled using --warmup=false. This is done in common/common.cpp:
std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(gpt_params & params) {
...
if (params.warmup) {
LOG("warming up the model with an empty run\n");
std::vector<llama_token> tmp = { llama_token_bos(model), llama_token_eos(model), };
llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch), 0, 0));
llama_kv_cache_clear(lctx);
llama_synchronize(lctx);
llama_reset_timings(lctx);
}
return std::make_tuple(model, lctx);
}
We can verify this:
(gdb) p batch_all.token[0]
$8 = 1
(gdb) p batch_all.token[1]
$9 = 2
(gdb) p lctx.model.vocab.special_bos_id
$11 = 1
(gdb) p lctx.model.vocab.special_eos_id
$12 = 2
(gdb) p lctx.model.vocab.id_to_token[1]
$4 = {text = "<s>", score = 0, type = LLAMA_TOKEN_TYPE_CONTROL}
(gdb) p lctx.model.vocab.id_to_token[2]
$10 = {text = "</s>", score = 0, type = LLAMA_TOKEN_TYPE_CONTROL}
So let's continue until this breakpoint is hit again:
<s> Hello world
Thread 1 "main" hit Breakpoint 1, llama_decode_internal (lctx=..., batch_all=...) at llama.cpp:11298
11298 llama_batch batch_all) { // TODO: rename back to batch
(gdb) p batch_all
$2 = {n_tokens = 3, token = 0x555555dd7230, embd = 0x0, pos = 0x0, n_seq_id = 0x0, seq_id = 0x0, logits = 0x0, all_pos_0 = 0,
all_pos_1 = 1, all_seq_id = 0}
(gdb) p lctx.model.vocab.id_to_token[15043]
$5 = {text = "▁Hello", score = -14784, type = LLAMA_TOKEN_TYPE_NORMAL}
(gdb) p lctx.model.vocab.id_to_token[3186]
$7 = {text = "▁world", score = -2927, type = LLAMA_TOKEN_TYPE_NORMAL}
Alright, that was some orientation about the input so let's step through this and see what we can learn.
There are some convenience variables created and some asserts that I'm skipping, but following those we have:
lctx.n_queued_tokens += n_tokens_all;
So we will update the n_queued_tokens property of the context with the number of tokens in the batch, which is 3 in our case:
11320 lctx.n_queued_tokens += n_tokens_all;
(gdb) p lctx.n_queued_tokens
$10 = 0
(gdb) s
(gdb) p lctx.n_queued_tokens
$11 = 3
Next we have a reference to the kv_self property of the context:
auto & kv_self = lctx.kv_self;
(gdb) p kv_self.n
$13 = 256
(gdb) p kv_self.cells.size()
$14 = 512
(gdb) p kv_self.total_size()
$16 = 11534368
(gdb) p kv_self.head
$17 = 0
(gdb) p kv_self.k_l
$18 = std::vector of length 22, capacity 22 = {0x555555a42080, 0x555555a423a0, 0x555555a426c0, 0x555555a429e0, 0x555555a42d00,
0x555555a43020, 0x555555a43340, 0x555555a43660, 0x555555a43980, 0x555555a43ca0, 0x555555a43fc0, 0x555555a442e0, 0x555555a44600,
0x555555a44920, 0x555555a44c40, 0x555555a44f60, 0x555555a45280, 0x555555a455a0, 0x555555a458c0, 0x555555a45be0, 0x555555a45f00,
0x555555a46220}
(gdb) p kv_self.used
$19 = 0
const int64_t n_embd = hparams.n_embd;
const int64_t n_vocab = hparams.n_vocab;
uint32_t n_outputs = 0;
uint32_t n_outputs_prev = 0;
const auto n_ubatch = cparams.n_ubatch;
(gdb) p n_embd
$20 = 2048
(gdb) p n_vocab
$21 = 32000
(gdb) p n_ubatch
$22 = 512
TODO: fill in the middle part of the function.
Let's take a look at the following loop which is going to iterate over the 3 tokens in our batch:
for (uint32_t cur_token = 0; cur_token < n_tokens_all; cur_token += n_ubatch) {
const uint32_t n_tokens = std::min(n_ubatch, n_tokens_all - cur_token);
}
(gdb) p n_ubatch
$29 = 512
(gdb) p n_tokens_all
$30 = 3
(gdb) p cur_token
$31 = 0
Then a new llama_batch will be created:
llama_batch u_batch = {
/* .n_tokens = */ (int32_t) n_tokens,
/* .token = */ batch_all.token ? batch_all.token + cur_token : nullptr,
/* .embd = */ batch_all.embd ? batch_all.embd + cur_token*n_embd : nullptr,
/* .pos = */ batch_all.pos ? batch_all.pos + cur_token : nullptr,
/* .n_seq_id = */ batch_all.n_seq_id ? batch_all.n_seq_id + cur_token : nullptr,
/* .seq_id = */ batch_all.seq_id ? batch_all.seq_id + cur_token : nullptr,
/* .logits = */ batch_all.logits ? batch_all.logits + cur_token : nullptr,
/* .all_pos_0 = */ batch_all.all_pos_0 + (llama_pos) cur_token*batch_all.all_pos_1,
/* .all_pos_1 = */ batch_all.all_pos_1,
/* .all_seq_id = */ batch_all.all_seq_id,
};
(gdb) p u_batch
$33 = {n_tokens = 3, token = 0x555555dd7230, embd = 0x0, pos = 0x0, n_seq_id = 0x0, seq_id = 0x0, logits = 0x0, all_pos_0 = 0,
all_pos_1 = 1, all_seq_id = 0}
if (hparams.causal_attn) {
llama_kv_cache_update(&lctx);
This actually does nothing in our case.
if (kv_self.head > kv_self.used + 2*n_tokens) {
kv_self.head = 0;
}
That will boil down to 0 > 6.
if (!llama_kv_cache_find_slot(kv_self, u_batch)) {
return 1;
}
Let's take a closer look at llama_kv_cache_find_slot, and let's just look at the type of kv_self first:
(gdb) ptype kv_self
type = struct llama_kv_cache {
bool has_shift;
bool do_defrag;
bool do_copy;
bool recurrent;
bool v_trans;
uint32_t head;
uint32_t size;
uint32_t used;
uint32_t n;
ggml_type type_k;
ggml_type type_v;
std::vector<llama_kv_cell> cells;
std::vector<ggml_tensor*> k_l;
std::vector<ggml_tensor*> v_l;
std::vector<ggml_context*> ctxs;
std::vector<ggml_backend_buffer*> bufs;
public:
size_t total_size(void) const;
~llama_kv_cache();
} &
static bool llama_kv_cache_find_slot(struct llama_kv_cache & cache,
const struct llama_batch & batch) {
bool found = true;
for (uint32_t i = 0; i < n_tokens; i++) {
if (cache.cells[cache.head + i].pos >= 0) {
found = false;
cache.head += i + 1;
n_tested += i + 1;
break;
}
}
(gdb) p i
$62 = 0
(gdb) p cache.cells.size()
$63 = 512
(gdb) p cache.cells[0]
$66 = {pos = -1, delta = 0, src = 0, seq_id = std::set with 0 elements}
(gdb) ptype cache.cells[0]
type = struct llama_kv_cell {
llama_pos pos;
llama_pos delta;
int32_t src;
std::set<int> seq_id;
public:
bool has_seq_id(const llama_seq_id &) const;
bool is_empty(void) const;
bool is_same_seq(const llama_kv_cell &) const;
}
In our case pos is -1 for all of the cells checked 0, 1, 2, so found will remain true.
if (found) {
break;
}
Then we will iterate over the 3 tokens:
for (uint32_t i = 0; i < n_tokens; i++) {
cache.cells[cache.head + i].pos = batch.pos[i];
for (int32_t j = 0; j < batch.n_seq_id[i]; j++) {
cache.cells[cache.head + i].seq_id.insert(batch.seq_id[i][j]);
}
}
So we are setting the first three cells positions to 0, 1, 2.
(gdb) p batch.pos[0]
$75 = 0
(gdb) p batch.pos[1]
$76 = 1
(gdb) p batch.pos[2]
$77 = 2
The cache.cells sequence ids will also be populated using the batch's sequence ids.
And finally cache.used is set to 3 (incremented by the number of tokens):
cache.used += n_tokens;
return true;
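To make the slot search easier to follow, here is a small self-contained sketch (not the actual llama.cpp code) that mimics the scan over the cells and the marking of the three cells for our prompt; the cell struct and find_slot function are simplified stand-ins:
#include <cstdint>
#include <cstdio>
#include <set>
#include <vector>

// Simplified stand-in for llama_kv_cell: a cell is free when pos == -1.
struct cell { int32_t pos = -1; std::set<int32_t> seq_id; };

// Starting at `head`, look for n_tokens consecutive free cells, wrapping around if needed.
static bool find_slot(std::vector<cell> & cells, uint32_t & head, uint32_t n_tokens) {
    uint32_t n_tested = 0;
    while (true) {
        if (head + n_tokens > cells.size()) {   // not enough room at the end, wrap around
            n_tested += cells.size() - head;
            head = 0;
            continue;
        }
        bool found = true;
        for (uint32_t i = 0; i < n_tokens; i++) {
            if (cells[head + i].pos >= 0) {     // occupied cell, skip past it
                found = false;
                head     += i + 1;
                n_tested += i + 1;
                break;
            }
        }
        if (found) break;
        if (n_tested >= cells.size()) return false;  // cache is full
    }
    for (uint32_t i = 0; i < n_tokens; i++) {
        cells[head + i].pos = (int32_t) i;           // positions 0, 1, 2 for our prompt
        cells[head + i].seq_id.insert(0);            // all tokens belong to sequence 0
    }
    return true;
}

int main() {
    std::vector<cell> cells(512);                    // same size as in the session above
    uint32_t head = 0;
    if (find_slot(cells, head, 3)) {
        printf("slot found at head=%u, cell[0].pos=%d\n", head, cells[0].pos);
    }
    return 0;
}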
static struct ggml_tensor * llm_build_inp_embd(
struct ggml_context * ctx,
struct llama_context & lctx,
const llama_hparams & hparams,
const llama_batch & batch,
struct ggml_tensor * tok_embd,
const llm_build_cb & cb) {
const int64_t n_embd = hparams.n_embd;
struct ggml_tensor * inpL;
if (batch.token) {
lctx.inp_tokens = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, batch.n_tokens);
cb(lctx.inp_tokens, "inp_tokens", -1);
ggml_set_input(lctx.inp_tokens);
inpL = ggml_get_rows(ctx, tok_embd, lctx.inp_tokens);
} else {
lctx.inp_embd = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, batch.n_tokens);
inpL = lctx.inp_embd;
ggml_set_input(lctx.inp_embd);
}
cb(inpL, "inp_embd", -1);
return inpL;
}
Notice that the 1d tensor does not contain any data at this point. The passed-in batch is only used to specify the number of tokens. Later, when these tokens have values, they will be used as indices into the tok_embd tensor via the ggml_get_rows operation. The data will be set in the call to llama_set_inputs, I think.
If we take a look at the batch:
(gdb) p batch
$149 = (const llama_batch &) @0x7fffffffbd10: {n_tokens = 3, token = 0x555555dd7230, embd = 0x0, pos = 0x555555db28c0,
n_seq_id = 0x555555dd7310, seq_id = 0x555555dd72f0, logits = 0x0, all_pos_0 = 0, all_pos_1 = 1, all_seq_id = 0}
Notice that this is our initial prompt of three tokens. The first thing that happens is that a 1d tensor is created. Recall that the tokens are just ids into the model's vocabulary. This 1d tensor, which we can think of as a list of ids, is then used as an indices tensor for ggml_get_rows, and the tensor that we want to extract rows from is the embeddings tensor. This is how we go from token ids to embeddings. This is setting up the operation that does this.
And the tok_embd
looks like this
(gdb) p *model.tok_embd
$154 = {type = GGML_TYPE_Q2_K, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x555555b87370, ne = {2048, 32000, 1, 1}, nb = {84, 672,
21504000, 21504000}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0,
data = 0x7fffda2b39e0, name = "token_embd.weight", '\000' <repeats 46 times>, extra = 0x0,
padding = "\000\000\000\000\000\000\000"}
So just to remind ourselves: we have an embedding length of 2048 and the total number of tokens in the vocabulary is 32000. Recall that in ggml the x dimension comes first, followed by y, and possibly other dimensions, which is why ne = {2048, 32000, 1, 1}. We can visualize this as:
y x
| ----------------->
|[0] [0 .... 2047]
|[1] [0 .... 2047]
|...
|[31998][0 .... 2047]
|[31999][0 .... 2047]
V
So what ggml_get_rows will do is use the values of the 1d tensor as indices into the tok_embd tensor and return the rows for those indices:
(gdb) p batch.token[0]
$161 = 1
(gdb) p batch.token[1]
$162 = 15043
(gdb) p batch.token[2]
$163 = 3186
[1] [0 .... 2047]
[15043] [0 .... 2047]
[3186] [0 .... 2047]
And again just to get used to the ggml axis ordering, x comes first which is 2048 (features), followed by y which is 3 (3 tokens):
(gdb) p *inpL
$165 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x0, ne = {2048, 3, 1, 1}, nb = {4, 8192, 24576, 24576},
op = GGML_OP_GET_ROWS, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0, src = {0x555555dda9a0, 0x7fffd9491180, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0, data = 0x0,
name = '\000' <repeats 63 times>, extra = 0x0, padding = "\000\000\000\000\000\000\000"}
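As a side note, here is a minimal standalone sketch of the same idea using the ggml API that llama.cpp bundles; the sizes and token ids are made-up toy values and error handling is omitted:
#include "ggml.h"   // assumes the ggml header that ships with llama.cpp is on the include path
#include <cstdio>

int main() {
    struct ggml_init_params params = { 16*1024*1024, NULL, false };
    struct ggml_context * ctx = ggml_init(params);

    const int n_embd  = 4;  // embedding dimension (2048 for the model above)
    const int n_vocab = 8;  // vocabulary size (32000 for the model above)

    // the "token embedding matrix", ne = {n_embd, n_vocab}, row r filled with the value r
    struct ggml_tensor * tok_embd = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_vocab);
    for (int r = 0; r < n_vocab; r++) {
        for (int c = 0; c < n_embd; c++) {
            ggml_set_f32_1d(tok_embd, r*n_embd + c, (float) r);
        }
    }

    // the "inp_tokens" tensor: three token ids used as row indices
    struct ggml_tensor * inp_tokens = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, 3);
    ggml_set_i32_1d(inp_tokens, 0, 1);
    ggml_set_i32_1d(inp_tokens, 1, 5);
    ggml_set_i32_1d(inp_tokens, 2, 3);

    // rows 1, 5 and 3 of tok_embd, i.e. the "embeddings" for the three tokens
    struct ggml_tensor * inpL = ggml_get_rows(ctx, tok_embd, inp_tokens);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, inpL);
    ggml_graph_compute_with_ctx(ctx, gf, 1);

    // inpL now has ne = {n_embd, 3}; the first element comes from row 1, so this prints 1.0
    printf("inpL[0][0] = %f\n", ggml_get_f32_1d(inpL, 0));

    ggml_free(ctx);
    return 0;
}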
// KQ_mask (mask for 1 head, it will be broadcasted to all heads)
struct ggml_tensor * KQ_mask = build_inp_KQ_mask();
struct ggml_tensor * build_inp_KQ_mask(bool causal = true) {
if (causal) {
lctx.inp_KQ_mask = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_kv, GGML_PAD(n_tokens, GGML_KQ_MASK_PAD));
} else {
lctx.inp_KQ_mask = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_tokens, GGML_PAD(n_tokens, GGML_KQ_MASK_PAD));
}
cb(lctx.inp_KQ_mask, "KQ_mask", -1);
ggml_set_input(lctx.inp_KQ_mask);
return flash_attn ? ggml_cast(ctx0, lctx.inp_KQ_mask, GGML_TYPE_F16) : lctx.inp_KQ_mask;
}
(gdb) p n_kv
$167 = 256
(gdb) p n_tokens
$168 = 3
#define GGML_KQ_MASK_PAD 32
#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))
GGML_PAD(n_tokens, GGML_KQ_MASK_PAD);
GGML_PAD(3, 32);
GGML_PAD(3, 32) (((3) + (32) - 1) & ~((32) - 1))
= 32
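Just to convince ourselves about the rounding, a tiny standalone check of the macro (the extra values are arbitrary):
#include <cstdio>

// Round x up to the nearest multiple of n (n must be a power of two).
#define GGML_PAD(x, n) (((x) + (n) - 1) & ~((n) - 1))

int main() {
    printf("%d %d %d\n", GGML_PAD(3, 32), GGML_PAD(32, 32), GGML_PAD(33, 32));
    // Prints: 32 32 64
    return 0;
}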
So this becomes:
lctx.inp_KQ_mask = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, 256, 32);
So this would look something like this:
0 [0 ... 255]
1 [0 ... 255]
...
30 [0 ... 255]
31 [0 ... 255]
And this mask is applied to the attention scores to prevent the model from looking at tokens "in the future".
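The following toy sketch (not llama.cpp code, and with a much smaller n_kv so the output stays readable) illustrates the shape of such a causal mask: 0.0f where a query token may attend, and -INFINITY where it may not:
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n_kv     = 8;   // shortened from 256 to keep the output readable
    const int n_tokens = 3;   // our three prompt tokens
    std::vector<float> mask(n_tokens * n_kv);
    // row = query token, column = KV cache position
    for (int q = 0; q < n_tokens; ++q) {
        for (int kv = 0; kv < n_kv; ++kv) {
            mask[q*n_kv + kv] = (kv <= q) ? 0.0f : -INFINITY;
        }
    }
    for (int q = 0; q < n_tokens; ++q) {
        for (int kv = 0; kv < n_kv; ++kv) {
            printf("%6s ", std::isinf(mask[q*n_kv + kv]) ? "-inf" : "0");
        }
        printf("\n");
    }
    return 0;
}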
for (int il = 0; il < n_layer; ++il) {
...
cur = llm_build_kv(ctx0, model, hparams, cparams, kv_self, gf,
model.layers[il].wo, model.layers[il].bo,
Kcur, Vcur, Qcur, KQ_mask, nullptr, n_tokens, kv_head, n_kv, 1.0f/sqrtf(float(n_embd_head)), cb, il);
}
static void llm_build_kv_store(
struct ggml_context * ctx,
const llama_hparams & hparams,
const llama_cparams & cparams,
const llama_kv_cache & kv,
struct ggml_cgraph * graph,
struct ggml_tensor * k_cur,
struct ggml_tensor * v_cur,
int32_t n_tokens,
int32_t kv_head,
const llm_build_cb & cb,
int64_t il) {
const int64_t n_ctx = cparams.n_ctx;
const int64_t n_embd_k_gqa = hparams.n_embd_k_gqa();
const int64_t n_embd_v_gqa = hparams.n_embd_v_gqa();
(gdb) p n_ctx
$175 = 512
(gdb) p n_embd_k_gqa
$176 = 256
(gdb) p n_embd_v_gqa
$177 = 256
struct ggml_tensor * k_cache_view = ggml_view_1d(ctx, kv.k_l[il], n_tokens*n_embd_k_gqa,
(ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa))*kv_head);
The arguments to ggml_view_1d are the context, the tensor that we want to create a view of (think slice), the number of elements that should be in our view, and the offset in bytes where we want to start from.
The source tensor of the view will be:
(gdb) p *kv.k_l[il]
$32 = {type = GGML_TYPE_F16, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x555555bf0730,
ne = {131072, 1, 1, 1}, nb = {2, 262144, 262144, 262144},
op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 0, grad = 0x0,
src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0,
perf_cycles = 0, perf_time_us = 0, view_src = 0x0, view_offs = 0,
data = 0x7fffd97b2020, name = "cache_k_l0", '\000' <repeats 53 times>,
extra = 0x0, padding = "\000\000\000\000\000\000\000"}
And the number of elements will be:
(gdb) p n_tokens * n_embd_k_gqa
$29 = 131072
And the offset to start from will be:
(gdb) p ggml_row_size(kv.k_l[il]->type, n_embd_k_gqa)
$33 = 512
(gdb) p kv_head
$34 = 0
So that will start at the beginning.
llama.cpp can be configured to compile in support for SIMD using the following options.
if (LLAMA_NATIVE)
set(INS_ENB OFF)
else()
set(INS_ENB ON)
endif()
option(LLAMA_AVX "llama: enable AVX" ${INS_ENB})
option(LLAMA_AVX2 "llama: enable AVX2" ${INS_ENB})
option(LLAMA_AVX512 "llama: enable AVX512" OFF)
option(LLAMA_AVX512_VBMI "llama: enable AVX512-VBMI" OFF)
option(LLAMA_AVX512_VNNI "llama: enable AVX512-VNNI" OFF)
option(LLAMA_FMA "llama: enable FMA" ${INS_ENB})
option(LLAMA_F16C "llama: enable F16C" ${INS_ENB})
If LLAMA_NATIVE is enabled then the following will be set:
if (LLAMA_NATIVE)
list(APPEND ARCH_FLAGS -march=native)
endif()
-march=native will cause the compiler to query/probe the current CPU for the features that it supports, using something like cpuid on x86. Depending on what is supported, the compiler will add flags to enable those features.
$ g++ -mno-f16c -mno-fma -mavx -mno-avx2 -O0 -g -E ggml.c -o ggml.pre
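One way to see what -march=native actually enables is to check the feature macros the compiler defines. This small standalone program (not part of llama.cpp) can be compiled with and without -march=native to compare:
#include <cstdio>

int main() {
    // These macros are predefined by gcc/clang when the corresponding x86 features are enabled.
#if defined(__AVX__)
    printf("__AVX__ defined\n");
#endif
#if defined(__AVX2__)
    printf("__AVX2__ defined\n");
#endif
#if defined(__FMA__)
    printf("__FMA__ defined\n");
#endif
#if defined(__F16C__)
    printf("__F16C__ defined\n");
#endif
#if defined(__AVX512F__)
    printf("__AVX512F__ defined\n");
#endif
    return 0;
}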
This section will use the batch
example to better understand the llama_batch
struct and how it is used in
llama_decode_internal
in llama.cpp.
$ make batch
$ gdb --args ./batch
(gdb) br llama_decode_internal
Breakpoint 1 at 0xfc208: file src/llama.cpp, line 17137.
The example will create a batch with two sequences (though currently they are completely separate sequences, that is they don't share any tokens between them, to simplify things a little bit).
(gdb) r
...
prompt1: What is LoRA?
prompt2: Dan loves ice cream
Creating new llama_batch with 2 sequences
Processing prompt 0, size = 2, batch_n_tokens: 0
idx: 0, token: 1, seq_id: 0
idx: 1, token: 1724, seq_id: 0
idx: 2, token: 338, seq_id: 0
idx: 3, token: 4309, seq_id: 0
idx: 4, token: 4717, seq_id: 0
idx: 5, token: 29973, seq_id: 0
Processing prompt 1, size = 2, batch_n_tokens: 6
idx: 6, token: 1, seq_id: 1
idx: 7, token: 3951, seq_id: 1
idx: 8, token: 12355, seq_id: 1
idx: 9, token: 267, seq_id: 1
idx: 10, token: 14890, seq_id: 1
idx: 11, token: 907, seq_id: 1
idx: 12, token: 314, seq_id: 1
batch.n_tokens: 13
batch.tokens: [1, 1724, 338, 4309, 4717, 29973, 1, 3951, 12355, 267, 14890, 907, 314, ]
Breakpoint 1, llama_decode_internal (lctx=..., batch_all=...) at src/llama.cpp:17137
17137 llama_batch batch_all) { // TODO: rename back to batch
llama_decode
takes a llama_batch
as input:
int32_t llama_decode(
struct llama_context * ctx,
struct llama_batch batch) {
const int ret = llama_decode_internal(*ctx, batch);
if (ret < 0) {
LLAMA_LOG_ERROR("%s: failed to decode, ret = %d\n", __func__, ret);
}
return ret;
}
static int llama_decode_internal(
llama_context & lctx,
llama_batch batch_all) { // TODO: rename back to batch
...
const uint32_t n_tokens_all = batch_all.n_tokens;
Note that batch_all
will be renamed by ggerganov/llama.cpp#8881
so I've updated this for clarity above.
(gdb) p n_tokens_all
$7 = 13
So we have a total of 13 tokens in the batch, and there will be a check that the tokens in this batch are less than the maximum value in the model's vocabulary, but I'm not showing that for loop.
Next we have:
lctx.n_queued_tokens += n_tokens_all;
auto & kv_self = lctx.kv_self;
const int64_t n_embd = hparams.n_embd;
const int64_t n_vocab = hparams.n_vocab;
uint32_t n_outputs = 0;
uint32_t n_outputs_prev = 0;
const auto n_ubatch = cparams.n_ubatch;
(gdb) p n_embd
$9 = 4096
(gdb) p n_vocab
$10 = 32000
(gdb) p cparams.n_ubatch
$8 = 512
This ubatch value is used to enable pipeline parallelism. If a batch of size 4096 is passed to llama_decode then it will be split into 8 ubatches and evaluated as a pipeline in parallel between the available GPUs.
The default size of a ubatch is 512, but this is a context parameter which can be changed.
(gdb) p n_ubatch
$43 = 512
This compute parameter is set in
struct llama_context * llama_new_context_with_model(
struct llama_model * model,
struct llama_context_params params) {
...
cparams.n_ubatch = std::min(cparams.n_batch, params.n_ubatch == 0 ? params.n_batch : params.n_ubatch);
..
}
The default value is set in llama_context_default_params:
struct llama_context_params llama_context_default_params() {
struct llama_context_params result = {
/*.n_ctx =*/ 512,
/*.n_batch =*/ 2048,
/*.n_ubatch =*/ 512,
/*.n_seq_max =*/ 1,
/*.n_threads =*/ GGML_DEFAULT_N_THREADS, // TODO: better default
/*.n_threads_batch =*/ GGML_DEFAULT_N_THREADS,
/*.rope_scaling_type =*/ LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED,
/*.pooling_type =*/ LLAMA_POOLING_TYPE_UNSPECIFIED,
/*.attention_type =*/ LLAMA_ATTENTION_TYPE_UNSPECIFIED,
/*.rope_freq_base =*/ 0.0f,
/*.rope_freq_scale =*/ 0.0f,
/*.yarn_ext_factor =*/ -1.0f,
/*.yarn_attn_factor =*/ 1.0f,
/*.yarn_beta_fast =*/ 32.0f,
/*.yarn_beta_slow =*/ 1.0f,
/*.yarn_orig_ctx =*/ 0,
/*.defrag_thold =*/ -1.0f,
/*.cb_eval =*/ nullptr,
/*.cb_eval_user_data =*/ nullptr,
/*.type_k =*/ GGML_TYPE_F16,
/*.type_v =*/ GGML_TYPE_F16,
/*.logits_all =*/ false,
/*.embeddings =*/ false,
/*.offload_kqv =*/ true,
/*.flash_attn =*/ false,
/*.no_perf =*/ true,
/*.abort_callback =*/ nullptr,
/*.abort_callback_data =*/ nullptr,
};
return result;
}
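As a hedged sketch (not the actual example source), overriding these defaults before creating a context could look something like the following; the n_ubatch = 4 value matches what is used later in this walkthrough, and the function assumes a model has already been loaded:
#include "llama.h"

llama_context * create_ctx(llama_model * model) {
    llama_context_params ctx_params = llama_context_default_params();
    ctx_params.n_ctx    = 512;  // assumption: small context for the example
    ctx_params.n_batch  = 512;
    ctx_params.n_ubatch = 4;    // split decoding into very small ubatches
    return llama_new_context_with_model(model, ctx_params);
}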
Next in llama_decode_internal
we have:
// count outputs
if (batch_all.logits && !embd_pooled) {
for (uint32_t i = 0; i < n_tokens_all; ++i) {
n_outputs += batch_all.logits[i] != 0;
}
} else if (lctx.logits_all || embd_pooled) {
n_outputs = n_tokens_all;
} else {
// keep last output only
n_outputs = 1;
}
So this will (in our case) iterate over all the n_tokens (which is 13) and increment n_outputs by one only if that token's logits entry is not zero. The value is the result of the expression batch_all.logits[i] != 0, which is then cast to an int. Originally I had only set the logits flag for the last token in the batch, but since I really want to see how multiple sequences are handled, the last token in each sequence should have its logits flag set. I've updated the example so that the last token in each sequence now has its logits boolean entry set to true. So let's find out how this works when decoding.
So n_outputs
will be 2 in our case.
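For reference, here is a hedged sketch (not the exact example source) of how such a batch is presumably built, using llama_batch_init from llama.h and the llama_batch_add helper from llama.cpp's common library; only the last token of each sequence has its logits flag set:
#include "llama.h"
#include "common.h"
#include <vector>

llama_batch make_batch(const std::vector<llama_token> & seq0,
                       const std::vector<llama_token> & seq1) {
    llama_batch batch = llama_batch_init((int32_t) (seq0.size() + seq1.size()), 0, 2);
    for (size_t i = 0; i < seq0.size(); i++) {
        // token, position, sequence ids, request logits for the last token only
        llama_batch_add(batch, seq0[i], (llama_pos) i, { 0 }, i == seq0.size() - 1);
    }
    for (size_t i = 0; i < seq1.size(); i++) {
        llama_batch_add(batch, seq1[i], (llama_pos) i, { 1 }, i == seq1.size() - 1);
    }
    return batch;
}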
Next we have the following code:
...
lctx.sbatch.from_batch(batch, n_embd,
/* simple_split */ !kv_self.recurrent,
/* logits_all */ n_outputs == n_tokens_all);
An sbatch is a sequence-length-aware batch. from_batch is a function of the llama_sbatch struct, so let's get an overview of it since this function will set fields/members of this struct:
// sequence-length-aware batch splitting
struct llama_sbatch {
// tokens left in this batch
size_t n_tokens;
size_t n_embd;
bool logits_all; // TODO: remove once lctx.logits_all is removed too
// sorted indices into the batch
std::vector<size_t> ids;
// batch indices of the output
std::vector<size_t> out_ids;
std::vector<llama_sbatch_seq> seq;
const llama_batch * batch = nullptr;
// buffers for the ubatch
std::vector<llama_token> ubatch_token;
std::vector<float> ubatch_embd;
std::vector<llama_pos> ubatch_pos;
std::vector<int32_t> ubatch_n_seq_id;
std::vector<llama_seq_id *> ubatch_seq_id;
std::vector<int8_t> ubatch_output;
...
When we call from_batch we are setting the simple_split argument to true as the model we are using is not recurrent. And n_outputs is only 2 in our case while n_tokens_all is 13, so logits_all will be false.
So let's now look closer at from_batch:
void from_batch(const llama_batch & batch,
const size_t n_embd,
const bool simple_split = false,
const bool logits_all = false) {
GGML_ASSERT(batch.n_tokens >= 0);
this->batch = &batch;
this->n_embd = n_embd;
this->logits_all = logits_all;
So this is just setting the members; this is used because these fields have the same names as the parameters that this function takes.
Notice that the sbatch stores a pointer to the original batch in its batch field, and this happens for each decode call.
If this is the first time we call llama_decode_internal then the sbatch will have been default-initialized as part of the llama_context struct.
int ret = llama_decode_internal(llama_context & lctx, llama_batch batch);
...
lctx.sbatch.from_batch(batch, n_embd,
/* simple_split */ !kv_self.recurrent,
/* logits_all */ n_outputs == n_tokens_all);
So the sbatch will get updated with a reference to the original batch:
lctx.sbatch llama_batch
+-----------------+ +----------------+
| batch | ------------>| |
| | +----------------+
| | +----------------------------------+
| seq | ------------>| std::vector<llama_sbatch_seq> |
+-----------------+ +----------------------------------+
| n_seq_id |
| seq_id* |
| offset |
| length |
+----------------------------------+
Now, let's try a prompt with two sequences and see how the sbatch is updated. While the sbatch is a member of the context, we can think of it as being local to the llama_decode_internal function as its values are reset for each new call.
GGML_ASSERT(batch.n_tokens >= 0);
this->batch = &batch;
this->n_embd = n_embd;
this->logits_all = logits_all;
n_tokens = batch.n_tokens;
ids.resize(n_tokens);
out_ids.clear();
// TODO: reserve out_ids and seq
for (size_t i = 0; i < n_tokens; ++i) {
ids[i] = i;
}
For all calls to decode if the model is non-recursive then this will be a "simple split":
if (simple_split) {
seq.resize(1);
llama_sbatch_seq & s = seq[0];
s.n_seq_id = 0;
s.seq_id = nullptr;
s.offset = 0;
s.length = n_tokens;
return;
}
So the seq vector will be resized to 1 in this case.
lctx.sbatch llama_batch
+-----------------+ +----------------+
| batch | ------------>| |
| | +----------------+
| | +----------------------------------+
| seq | ------------>| std::vector<llama_sbatch_seq> |
+-----------------+ +----------------------------------+
| n_seq_id = 0 |
| seq_id* = nullptr |
| offset = 0 |
| length = 13 |
+----------------------------------+
(this is a simplified view of the sbatch, just looking at the seq field)
The offset is an offset into the tokens of the batch, and the length is the number of tokens for this sbatch. After this returns, the number of tokens, the embedding size/dimension, the ids, and the original llama_batch have been stored on the sbatch struct.
Now, we will iterate over all the tokens in the sbatch:
while (lctx.sbatch.n_tokens > 0) {
llama_ubatch ubatch;
...
ubatch = lctx.sbatch.split_simple(n_ubatch);
The above call will produce an update batch (ubatch), which is what will be used to build the computation graph and also to set the inputs to the model (llama_set_inputs).
Now, in this case I've set ctx_params.n_ubatch = 4:
llama_ubatch split_simple(size_t n_ubatch) {
n_ubatch = n_tokens < n_ubatch ? n_tokens : n_ubatch;
llama_ubatch ubatch = reserve_ubatch(n_ubatch, /* has_embd */ batch->embd != nullptr);
ubatch.equal_seqs = false;
if (!seq.empty()) {
llama_sbatch_seq & s = seq[0];
size_t length = s.length < n_ubatch ? s.length : n_ubatch;
GGML_ASSERT(seq.size() == 1 && s.n_seq_id == 0); // don't mix with other splits
add_seq_to_ubatch(ubatch, s, length);
}
return ubatch;
}
llama_ubatch reserve_ubatch(size_t n_ubatch, bool has_embd = false) {
// clear empty sequences
// the previous ubatch is assumed to be gone,
// so nothing should refer to values in these sequences anymore.
for (size_t i = seq.size(); i-- > 0;) {
if (seq[i].length == 0) {
seq.pop_back();
} else {
break;
}
}
ubatch_token.resize(!has_embd ? n_ubatch : 0);
ubatch_embd.resize(has_embd ? n_embd * n_ubatch : 0);
ubatch_pos.resize(n_ubatch);
ubatch_n_seq_id.resize(n_ubatch);
ubatch_seq_id.resize(n_ubatch);
ubatch_output.resize(n_ubatch);
So all of the above vectors will be resized to 4 in this case. Next,
llama_ubatch ubatch = {
/*equal_seqs =*/ true,
/*n_tokens =*/ 0,
/*n_seq_tokens =*/ 0,
/*n_seqs =*/ 0,
/*token =*/ !has_embd ? ubatch_token.data() : nullptr,
/*embd =*/ has_embd ? ubatch_embd.data() : nullptr,
/*pos =*/ ubatch_pos.data(),
/*n_seq_id =*/ ubatch_n_seq_id.data(),
/*seq_id =*/ ubatch_seq_id.data(),
/*output =*/ ubatch_output.data(),
};
return ubatch;
Notice that the ubatch fields that are pointers point to the data of the vectors that were just resized. After that we have:
ubatch.equal_seqs = false;
if (!seq.empty()) {
llama_sbatch_seq & s = seq[0];
size_t length = s.length < n_ubatch ? s.length : n_ubatch;
GGML_ASSERT(seq.size() == 1 && s.n_seq_id == 0); // don't mix with other splits
add_seq_to_ubatch(ubatch, s, length);
}
return ubatch;
void add_seq_to_ubatch(llama_ubatch & ubatch, llama_sbatch_seq & seq, size_t length) {
...
ubatch.token = batch->token + seq.offset;
...
ubatch.n_tokens += length;
ubatch.n_seqs += ubatch.equal_seqs ? 1 : length; // virtual sequences for simple splits
seq.offset += length;
seq.length -= length;
We can see that the seq offset is incremented by the length of the ubatch and the total length is decremented (0->4, and 13->9), and n_tokens is also decremented from 13 to 9.
So this is the actual batch that will be processed by the model, used to build the computation graph and to set the actual input to the model before computing the graph.
After the graph has been computed we will enter the while loop again, and this time the seq.offset will be 4.
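The bookkeeping can be summarized with the following toy loop (plain C++, not llama.cpp code), using the 13 tokens and the hypothetical n_ubatch = 4 from above:
#include <algorithm>
#include <cstdio>

int main() {
    size_t n_tokens = 13;       // sbatch.n_tokens
    size_t offset   = 0;        // seq.offset
    size_t length   = 13;       // seq.length
    const size_t n_ubatch = 4;  // the value set in the example above
    while (n_tokens > 0) {
        // each iteration takes min(remaining, n_ubatch) tokens and advances the offset
        size_t take = std::min(length, n_ubatch);
        printf("ubatch: offset=%zu length=%zu\n", offset, take);
        offset   += take;
        length   -= take;
        n_tokens -= take;
    }
    // Prints offsets 0, 4, 8, 12 with lengths 4, 4, 4, 1.
    return 0;
}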
Next the vector of ids (token indices not sequence ids) is resized to the number of tokens (so 13 in our case):
n_tokens = batch.n_tokens;
ids.resize(n_tokens);
std::vector<size_t> out_ids;
std::vector<llama_sbatch_seq> seq;
Now, seq
is a vector of sequences. A batch can have tokens that belong to
different sequences. So for each sequence in the batch there will be an entry in
the seq
vector of type:
struct llama_sbatch_seq {
int32_t n_seq_id;
llama_seq_id * seq_id;
size_t offset;
size_t length;
// helper for smoother batch API transition -- can be deprecated in the future
llama_seq_id all_seq_id; // used if seq_id == NULL
};
After that the ids vector will be populated with the indices of the tokens in the batch:
for (size_t i = 0; i < n_tokens; ++i) {
ids[i] = i;
}
(gdb) p ids
$12 = std::vector of length 13, capacity 13 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}
And by the way the llama_batch
in this case looks like this:
(gdb) p batch
$13 = (const llama_batch &) @0x7fffffffd640: {n_tokens = 13, token = 0x555555b329f0, embd = 0x0, pos = 0x555555b2cff0,
n_seq_id = 0x555555b2d800, seq_id = 0x555555dab230, logits = 0x5555559d30d0 "", all_pos_0 = 0, all_pos_1 = 0,
all_seq_id = 0}
Next we have the following, and in this case simple_split
is true:
if (simple_split) {
seq.resize(1);
llama_sbatch_seq & s = seq[0];
s.n_seq_id = 0;
s.seq_id = nullptr;
s.offset = 0;
s.length = n_tokens;
s.all_seq_id = batch.all_seq_id;
return;
}
So this will resize the seq vector to 1, and then populate that first entry with
the llama_sbatch_seq
struct:
(gdb) p s
$17 = (llama_sbatch_seq &) @0x5555559d46d0: {n_seq_id = 0, seq_id = 0x0, offset = 0, length = 13, all_seq_id = 0}
Following this we have:
// reserve output buffer
if (llama_output_reserve(lctx, n_outputs) < n_outputs) {
LLAMA_LOG_ERROR("%s: could not reserve space for batch with %u outputs\n", __func__, n_outputs);
return -2;
};
TODO: link to llama_output_reserve
or explain this in a separate section.
Next we will iterate over the number of tokens in the sequence aware batch:
while (lctx.sbatch.n_tokens > 0) {
llama_ubatch ubatch;
if (kv_self.recurrent) {
if (embd_pooled) {
// Pooled embeddings cannot be split across ubatches (yet)
ubatch = lctx.sbatch.split_seq(n_ubatch);
} else {
// recurrent model architectures are easier to implement
// with equal-length sequences
ubatch = lctx.sbatch.split_equal(n_ubatch);
}
} else {
ubatch = lctx.sbatch.split_simple(n_ubatch);
}
llama_ubatch
is a struct similar to the llama_batch
struct:
struct llama_batch {
int32_t n_tokens;
llama_token * token;
float * embd;
llama_pos * pos;
int32_t * n_seq_id;
llama_seq_id ** seq_id;
int8_t * logits; // TODO: rename this to "output"
// NOTE: helpers for smooth API transition - can be deprecated in the future
// for future-proof code, use the above fields instead and ignore everything below
//
// pos[i] = all_pos_0 + i*all_pos_1
//
llama_pos all_pos_0; // used if pos == NULL
llama_pos all_pos_1; // used if pos == NULL
llama_seq_id all_seq_id; // used if seq_id == NULL
}
struct llama_ubatch {
bool equal_seqs;
// TODO: whole_seqs for embeddings?
uint32_t n_tokens; // total tokens (n_seq_tokens * n_seqs)
uint32_t n_seq_tokens; // tokens per sequence
uint32_t n_seqs;
llama_token * token; // [n_tokens]
float * embd; // [n_embd, n_tokens]
llama_pos * pos; // [n_tokens]
int32_t * n_seq_id; // [n_seqs]
llama_seq_id ** seq_id; // [n_seqs]
int8_t * output; // [n_tokens]
};
Like we mentioned earlier an ubatch is used to enable pipeline parallelism. If
a batch of size 4096 is passed to llama_decode
then it will be split into 8
batches and evaluated as a pipeline in parallel between the available GPUs.
In this session we are not using a recurrent model (like Mamba, RWKV, etc.) so let's look closer at split_simple, which I've not looked at before:
// simple split, unknown number of sequences of unequal lengths
llama_ubatch split_simple(size_t n_ubatch) {
n_ubatch = n_tokens < n_ubatch ? n_tokens : n_ubatch;
llama_ubatch ubatch = reserve_ubatch(n_ubatch, /* has_embd */ batch->embd != nullptr);
ubatch.equal_seqs = false;
if (!seq.empty()) {
llama_sbatch_seq & s = seq[0];
size_t length = s.length < n_ubatch ? s.length : n_ubatch;
GGML_ASSERT(seq.size() == 1 && s.n_seq_id == 0); // don't mix with other splits
add_seq_to_ubatch(ubatch, s, length);
}
return ubatch;
}
n_ubatch will be set to 13 as n_tokens is less than n_ubatch, which is 512 in this case. Then a ubatch will be reserved, so let's take a closer look at what that does. Just to clarify, this is a function of the llama_sbatch struct. Also keep in mind that the seq vector will have one entry in our case (with a length of 13), as this was set in from_batch earlier:
llama_ubatch reserve_ubatch(size_t n_ubatch, bool has_embd = false) {
// clear empty sequences
// the previous ubatch is assumed to be gone,
// so nothing should refer to values in these sequences anymore.
for (size_t i = seq.size(); i-- > 0;) {
if (seq[i].length == 0) {
seq.pop_back();
} else {
break;
}
}
ubatch_token.resize(!has_embd ? n_ubatch : 0);
ubatch_embd.resize(has_embd ? n_embd * n_ubatch : 0);
ubatch_pos.resize(n_ubatch);
ubatch_n_seq_id.resize(n_ubatch);
ubatch_seq_id.resize(n_ubatch);
ubatch_output.resize(n_ubatch);
llama_ubatch ubatch = {
/*equal_seqs =*/ true,
/*n_tokens =*/ 0,
/*n_seq_tokens =*/ 0,
/*n_seqs =*/ 0,
/*token =*/ !has_embd ? ubatch_token.data() : nullptr,
/*embd =*/ has_embd ? ubatch_embd.data() : nullptr,
/*pos =*/ ubatch_pos.data(),
/*n_seq_id =*/ ubatch_n_seq_id.data(),
/*seq_id =*/ ubatch_seq_id.data(),
/*output =*/ ubatch_output.data(),
};
return ubatch;
}
So this will resize the vectors to the size of the ubatch (I'm still not 100% sure what the u stands for, update perhaps?). Notice that if we are generating embeddings, the size is set to the embedding dimension times the number of tokens in the ubatch. And note that output and logits are the same thing in this context. Then a ubatch struct is created and populated with pointers to the data sections of the sbatch's vectors.
Lets inspect these values:
(gdb) p ubatch
$32 = {equal_seqs = true, n_tokens = 0, n_seq_tokens = 0, n_seqs = 0, token = 0x5555559c6fd0, embd = 0x0,
pos = 0x555555d7c5f0, n_seq_id = 0x555555d7c630, seq_id = 0x5555559d3300, output = 0x555555da76e0 ""}
(gdb) p ubatch_token
$33 = std::vector of length 13, capacity 13 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
(gdb) p ubatch_embd
$34 = std::vector of length 0, capacity 0
(gdb) p ubatch_pos
$35 = std::vector of length 13, capacity 13 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
(gdb) p ubatch_n_seq_id
$36 = std::vector of length 13, capacity 13 = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}
(gdb) p ubatch_seq_id
$37 = std::vector of length 13, capacity 13 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}
(gdb) p ubatch_output
$38 = std::vector of length 13, capacity 13 = {0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000',
0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000', 0 '\000'}
So the sizes have been set for the vectors but nothing has been populated at
this point.
Back in simple_split
we then have:
ubatch.equal_seqs = false;
if (!seq.empty()) {
llama_sbatch_seq & s = seq[0];
size_t length = s.length < n_ubatch ? s.length : n_ubatch;
GGML_ASSERT(seq.size() == 1 && s.n_seq_id == 0); // don't mix with other splits
add_seq_to_ubatch(ubatch, s, length);
}
In our case seq
is not empty and we have one sequence:
(gdb) p s
$45 = (llama_sbatch_seq &) @0x5555559d46d0:
{n_seq_id = 0, seq_id = 0x0, offset = 0, length = 13, all_seq_id = 0}
Now, let's take a look at add_seq_to_ubatch; length here will be 13:
void add_seq_to_ubatch(llama_ubatch & ubatch, llama_sbatch_seq & seq, size_t length) {
GGML_ASSERT(batch != nullptr);
GGML_ASSERT(length <= seq.length);
// Can only add sequences of equal lengths to a batch,
// otherwise it isn't clear to which sequence a token belongs
GGML_ASSERT(seq.n_seq_id == 0 || ubatch.n_seqs == 0 || length == (size_t) ubatch.n_tokens / ubatch.n_seqs);
GGML_ASSERT((seq.n_seq_id != 0) == ubatch.equal_seqs);
// NOTE: loops are separated for cache-friendliness
if (batch->token) {
if (ubatch.equal_seqs) {
for (size_t i = 0; i < length; ++i) {
ubatch.token[ubatch.n_tokens + i] = batch->token[ids[seq.offset + i]];
}
} else {
// simple split
ubatch.token = batch->token + seq.offset;
}
} else {
ubatch.token = nullptr;
}
Recall that batch in this case is the llama_batch that was passed to from_batch earlier.
In this case ubatch.equal_seqs is false, so the else block will be executed, and notice that this is setting ubatch.token to point to the token data in the batch (the offset is 0 in this case). So this has populated the token data for the ubatch:
(gdb) p ubatch.token
$56 = (llama_token *) 0x555555b329f0
(gdb) p ubatch.token[0]
$57 = 1
(gdb) p ubatch.token[1]
$58 = 1724
(gdb) p ubatch.token[2]
$59 = 338
(gdb) p ubatch.token[3]
$60 = 4309
(gdb) up
#1 0x00005555556789cc in llama_sbatch::split_simple (this=0x5555559c5d10, n_ubatch=13) at src/llama.cpp:3134
3134 add_seq_to_ubatch(ubatch, s, length);
(gdb) p lctx
No symbol "lctx" in current context.
(gdb) up
#2 0x000055555565074e in llama_decode_internal (lctx=..., batch_all=...) at src/llama.cpp:17220
17220 ubatch = lctx.sbatch.split_simple(n_ubatch);
(gdb) p lctx.model.vocab.id_to_token[1724]
$61 = {text = "▁What", score = -1465, attr = LLAMA_TOKEN_ATTR_NORMAL}
(gdb) p lctx.model.vocab.id_to_token[338]
$62 = {text = "▁is", score = -79, attr = LLAMA_TOKEN_ATTR_NORMAL}
(gdb) p lctx.model.vocab.id_to_token[4309]
$63 = {text = "▁Lo", score = -4050, attr = LLAMA_TOKEN_ATTR_NORMAL}
Following that we have a block for the case where embeddings are passed in (batch->embd is NULL in our case so this block will be skipped):
if (batch->embd) {
if (ubatch.equal_seqs) {
for (size_t i = 0; i < length; ++i) {
memcpy(
ubatch.embd + n_embd * (ubatch.n_tokens + i),
batch->embd + n_embd * ids[seq.offset + i],
n_embd * sizeof(float)
);
}
} else {
// simple split
ubatch.embd = batch->embd + (n_embd * seq.offset);
}
} else {
ubatch.embd = nullptr;
}
// from here on, the else branches are deprecated;
// they are helpers for smoother batch API transition
if (batch->pos) {
if (ubatch.equal_seqs) {
for (size_t i = 0; i < length; ++i) {
ubatch.pos[ubatch.n_tokens + i] = batch->pos[ids[seq.offset + i]];
}
} else {
// simple split
ubatch.pos = batch->pos + seq.offset;
}
} else {
for (size_t i = 0; i < length; ++i) {
llama_pos bi = ids[seq.offset + i];
ubatch.pos[ubatch.n_tokens + i] = batch->all_pos_0 + (bi * batch->all_pos_1);
}
}
In our case the above will update ubatch.pos to point to the batch's (llama_batch) position data.
if (ubatch.equal_seqs) {
ubatch.n_seq_id[ubatch.n_seqs] = seq.n_seq_id;
if (seq.seq_id) {
ubatch.seq_id[ubatch.n_seqs] = seq.seq_id;
} else {
GGML_ASSERT(seq.n_seq_id == 1);
ubatch.seq_id[ubatch.n_seqs] = &seq.all_seq_id;
}
} else {
// simple split
if (batch->n_seq_id) {
ubatch.n_seq_id = batch->n_seq_id + seq.offset;
} else {
for (size_t i = 0; i < length; ++i) {
ubatch.n_seq_id[ubatch.n_seqs + i] = 1;
}
}
if (batch->seq_id) {
ubatch.seq_id = batch->seq_id + seq.offset;
} else {
for (size_t i = 0; i < length; ++i) {
ubatch.seq_id[ubatch.n_seqs + i] = &seq.all_seq_id;
}
}
}
This will point ubatch.n_seq_id at the batch's n_seq_id data, applying the offset if there is one. And likewise for the actual seq_id pointers.
Next we have the logits:
if (logits_all) {
for (size_t i = 0; i < length; ++i) {
ubatch.output[ubatch.n_tokens + i] = 1;
out_ids.push_back(ids[seq.offset + i]);
}
} else if (batch->logits) {
if (ubatch.equal_seqs) {
for (size_t i = 0; i < length; ++i) {
size_t id = ids[seq.offset + i];
int8_t is_output = batch->logits[id];
ubatch.output[ubatch.n_tokens + i] = is_output;
if (is_output) { out_ids.push_back(id); }
}
} else {
// simple split
ubatch.output = batch->logits + seq.offset;
for (size_t i = 0; i < length; ++i) {
if (ubatch.output[i] != 0) { out_ids.push_back(seq.offset + i); }
}
}
} else {
// only get last output
for (size_t i = 0; i < length; ++i) {
size_t id = ids[seq.offset + i];
int8_t is_last = id == ids.size() - 1;
ubatch.output[ubatch.n_tokens + i] = is_last;
if (is_last) { out_ids.push_back(id); }
}
}
In our case the batch has logits, so the else if block will be taken, and since this is a simple split the inner else branch will be taken. This will update the ubatch logits/output pointer.
This is followed by a for loop over all the tokens, and any tokens that have their output/logits flag set will be added to the out_ids vector. In our case there should be two added:
(gdb) p out_ids
$87 = std::vector of length 2, capacity 2 = {5, 12}
if (ubatch.n_tokens == 0 && ubatch.n_seqs == 0) {
ubatch.n_seq_tokens = ubatch.equal_seqs ? length : 1;
}
In this case ubatch.n_seq_tokens
will be set to 1.
ubatch.n_tokens += length;
ubatch.n_seqs += ubatch.equal_seqs ? 1 : length; // virtual sequences for simple splits
seq.offset += length;
seq.length -= length;
n_tokens -= length;
GGML_ASSERT(ubatch.n_tokens == ubatch.n_seq_tokens * ubatch.n_seqs);
}
Here we can see that the offset is incremented by the number of tokens, seq.length is decremented by the number of tokens, and n_tokens is also decremented by the number of tokens. TODO: create an example with larger batches, or perhaps lower the batch sizes, to show how the offset works.
That will return the ubatch back to llama_decode_internal:
const uint32_t n_tokens = ubatch.n_tokens;
// count the outputs in this u_batch
{
int32_t n_outputs_new = 0;
if (n_outputs == n_tokens_all) {
n_outputs_new = n_tokens;
} else {
GGML_ASSERT(ubatch.output);
for (uint32_t i = 0; i < n_tokens; i++) {
n_outputs_new += (int32_t) (ubatch.output[i] != 0);
}
}
// needs to happen before the graph is built
lctx.n_outputs = n_outputs_new;
}
In our case n_outputs is 2, so the above will update n_outputs_new:
(gdb) p n_outputs_new
$103 = 2
(gdb) p lctx.n_outputs
$105 = 2
And notice that this will also update the context's n_outputs field.
After this we have the KV-cache handling, which I've already covered.
ggml_backend_sched_reset(lctx.sched);
ggml_backend_sched_set_eval_callback(lctx.sched, lctx.cparams.cb_eval, lctx.cparams.cb_eval_user_data);
ggml_cgraph * gf = llama_build_graph(lctx, ubatch, false);
// the output is always the last tensor in the graph
struct ggml_tensor * res = ggml_graph_node(gf, -1);
struct ggml_tensor * embd = ggml_graph_node(gf, -2);
if (lctx.n_outputs == 0) {
// no output
res = nullptr;
embd = nullptr;
} else if (cparams.embeddings) {
res = nullptr; // do not extract logits for embedding case
embd = nullptr;
for (int i = ggml_graph_n_nodes(gf) - 1; i >= 0; --i) {
if (strcmp(ggml_graph_node(gf, i)->name, "result_embd_pooled") == 0) {
embd = ggml_graph_node(gf, i);
break;
}
}
GGML_ASSERT(embd != nullptr && "missing embeddings tensor");
} else {
embd = nullptr; // do not extract embeddings when not needed
GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor");
}
I'm trying to understand what a use case might be for not having any outputs, neither logits nor embeddings.
Perhaps for a warmup phase (warming up the caches) this might be an option.
Another might be when measuring the performance of the model, where you want to measure the computation time without the overhead of producing and storing outputs.
Also, I noticed that if the n_ubatch size is smaller than a sequence then the first ubatches will not have any outputs; for example, if the n_ubatch size is 32 and we have a sequence of 69 tokens then the first two ubatches will not have any outputs.
ggml_backend_sched_alloc_graph(lctx.sched, gf);
llama_set_inputs(lctx, ubatch);
Now, here we can see that the ubatch
is being passed in to llama_set_inputs
:
static void llama_set_inputs(llama_context & lctx, const llama_ubatch & batch) {
const auto & hparams = lctx.model.hparams;
const auto & cparams = lctx.cparams;
const auto & kv_self = lctx.kv_self;
if (batch.token) {
const int64_t n_tokens = batch.n_tokens;
ggml_backend_tensor_set(lctx.inp_tokens, batch.token, 0, n_tokens*ggml_element_size(lctx.inp_tokens));
}
Here we can see that the tensor lctx.inp_tokens is being set to the token data in the batch (hmm, I wonder if it would be alright to rename this parameter to ubatch for clarity, so as not to confuse it with the llama_batch struct parameter passed to llama_decode_internal). The offset is 0, and the number of bytes to copy is:
(gdb) p n_tokens
$108 = 13
(gdb) p ggml_element_size(lctx.inp_tokens)
$109 = 4
(gdb) p ggml_element_size(lctx.inp_tokens) * n_tokens
$110 = 52
(gdb) p *lctx.inp_tokens
$112 = {type = GGML_TYPE_I32, backend = GGML_BACKEND_TYPE_CPU,
buffer = 0x555555a655e0,
ne = {13, 1, 1, 1}, nb = {4, 52, 52, 52}, op = GGML_OP_NONE,
op_params = {0 <repeats 16 times>}, flags = 1, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x7ffeda400040,
name = "inp_tokens", '\000' <repeats 53 times>, extra = 0x0}
The actual tensor was built by llama_build_graph, but I've been through this function before. If we were using a GPU backend this would cause the data to be copied from the host to the device.
One thing I notice in this function is that many of the blocks set the n_tokens variable to the number of tokens in the ubatch. I'm curious why this is not a variable set once before the if statements instead of repeating this.
Next we have something similar but for embeddings (which is not the case in our example):
if (batch.embd) {
const int64_t n_embd = hparams.n_embd;
const int64_t n_tokens = batch.n_tokens;
ggml_backend_tensor_set(lctx.inp_embd, batch.embd, 0, n_tokens*n_embd*ggml_element_size(lctx.inp_embd));
}
Following that we have the token positions:
if (batch.pos && lctx.inp_pos) {
const int64_t n_tokens = batch.n_tokens;
ggml_backend_tensor_set(lctx.inp_pos, batch.pos, 0, n_tokens*ggml_element_size(lctx.inp_pos));
}
(gdb) p batch.pos[0]
$117 = 0
(gdb) p batch.pos[1]
$118 = 1
(gdb) p batch.pos[2]
$119 = 2
(gdb) p batch.pos[3]
$120 = 3
(gdb) p *lctx.inp_pos
$116 = {type = GGML_TYPE_I32, backend = GGML_BACKEND_TYPE_CPU, buffer = 0x555555a655e0, ne = {13, 1, 1, 1}, nb = {4, 52,
52, 52}, op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 1, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x7ffeda400840,
name = "inp_pos", '\000' <repeats 56 times>, extra = 0x0}
Next we have (and causal_attn is true):
if (hparams.causal_attn || cparams.pooling_type == LLAMA_POOLING_TYPE_NONE) {
GGML_ASSERT(lctx.inp_out_ids && "every model that can must skip unused outputs");
const int64_t n_tokens = batch.n_tokens;
GGML_ASSERT(ggml_backend_buffer_is_host(lctx.inp_out_ids->buffer));
int32_t * data = (int32_t *) lctx.inp_out_ids->data;
if (lctx.n_outputs == n_tokens) {
for (int i = 0; i < n_tokens; ++i) {
data[i] = i;
}
} else if (batch.output) {
int32_t n_outputs = 0;
for (int i = 0; i < n_tokens; ++i) {
if (batch.output[i]) {
data[n_outputs++] = i;
}
}
// the graph needs to have been passed the correct number of outputs
GGML_ASSERT(lctx.n_outputs == n_outputs);
} else if (lctx.n_outputs == 1) {
// only keep last output
data[0] = n_tokens - 1;
} else {
GGML_ASSERT(lctx.n_outputs == 0);
}
}
(gdb) p *lctx.inp_out_ids
$123 = {type = GGML_TYPE_I32, backend = GGML_BACKEND_TYPE_CPU,
buffer = 0x555555a655e0, ne = {2, 1, 1, 1}, nb = {4, 8, 8, 8},
op = GGML_OP_NONE, op_params = {0 <repeats 16 times>}, flags = 1, grad = 0x0, src = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0}, view_src = 0x0, view_offs = 0, data = 0x7ffeda601040,
name = "inp_out_ids", '\000' <repeats 52 times>, extra = 0x0}
Notice there is a check to make sure that the buffer is on the host, which I think makes sense as it only stores the indices of the tokens that are outputs. So the above will set the tensor data to the indices of the output tokens.
I've updated the example so that the initial batch consists of two sequences, one that is 69 tokens long and a second one that is 6 tokens long. And I've set the following llama_context_params:
ctx_params.n_batch = 80;
ctx_params.n_ubatch = 32;
There is a lower limit for these values which can be seen in
llama_new_context_with_model
which is why I chose the values (and also it
makes for some interesting paths in the code):
// the batch has to be at least GGML_KQ_MASK_PAD because we will be padding the KQ_mask
// this is required by GPU kernels in order to avoid out-of-bounds accesses (e.g. ggml_flash_attn_ext)
// ref: https://github.com/ggerganov/llama.cpp/pull/5021
if (cparams.n_batch < GGML_KQ_MASK_PAD) {
LLAMA_LOG_WARN("%s: n_batch is less than GGML_KQ_MASK_PAD - increasing to %d\n", __func__, GGML_KQ_MASK_PAD);
cparams.n_batch = GGML_KQ_MASK_PAD;
}
cparams.n_ubatch = std::min(cparams.n_batch, params.n_ubatch == 0 ? params.n_batch : params.n_ubatch);
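Plugging the chosen values into that logic (a standalone toy check, not llama.cpp code) shows why n_batch stays at 80 while n_ubatch becomes 32:
#include <algorithm>
#include <cstdint>
#include <cstdio>

#define GGML_KQ_MASK_PAD 32

int main() {
    uint32_t n_batch         = 80;  // ctx_params.n_batch from above
    uint32_t params_n_ubatch = 32;  // ctx_params.n_ubatch from above
    if (n_batch < GGML_KQ_MASK_PAD) {
        n_batch = GGML_KQ_MASK_PAD;  // would only trigger for n_batch < 32
    }
    uint32_t n_ubatch = std::min(n_batch, params_n_ubatch == 0 ? n_batch : params_n_ubatch);
    printf("n_batch=%u n_ubatch=%u\n", n_batch, n_ubatch);  // prints: n_batch=80 n_ubatch=32
    return 0;
}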
With that set, let's inspect the batch that is passed to llama_decode:
$31 = {n_tokens = 75, token = 0x5555559da860, embd = 0x0, pos = 0x5555559d1c40, n_seq_id = 0x555555b3cbe0,
seq_id = 0x5555559db7b0, logits = 0x5555559db500 "", all_pos_0 = 0, all_pos_1 = 0, all_seq_id = 0}
The last token in each sequence has its logits element set to true.
We know that this llama_batch will be used to create/initialize a llama_sbatch:
lctx.sbatch.from_batch(batch_all, n_embd,
/* simple_split */ !kv_self.recurrent,
/* logits_all */ n_outputs == n_tokens_all);
Now, the llama_batch will be stored as a member of the llama_sbatch so it has access to the original batch data:
this->batch = &batch;
And the sbatch's n_tokens will be initialized to the same field of the batch:
n_tokens = batch.n_tokens;
(gdb) p n_tokens
$7 = 73
This is done because this will be updated as we will see later.
ids.resize(n_tokens);
out_ids.clear();
So ids will be of size 73 in this case. And then all the ids will be populated using a for loop over all the 73 tokens.
for (size_t i = 0; i < n_tokens; ++i) {
ids[i] = i;
}
if (simple_split) {
seq.resize(1);
llama_sbatch_seq & s = seq[0];
s.n_seq_id = 0;
s.seq_id = nullptr;
s.offset = 0;
s.length = n_tokens;
s.all_seq_id = batch.all_seq_id;
return;
}
This is where seq is resized to 1. So there is one llama_sbatch_seq sequence, which is how I'm thinking of this, not that we have two sequences in the llama_batch. And the length of the sequence is set to 73 initially, with the rest of the fields initialized to 0 or null.
Back in llama_decode_internal
there is the following while loop which is using
the n_tokens
field of the llama_sbatch
which was set to 73 above:
while (lctx.sbatch.n_tokens > 0) {
llama_ubatch ubatch;
...
ubatch = lctx.sbatch.split_simple(n_ubatch);
}
Note that a new ubatch is created for each iteration of this loop!
And notice that split_simple is a function/method on the sbatch, so let's take a look at the state of this instance:
(gdb) ptype *this
type = struct llama_sbatch {
size_t n_tokens;
size_t n_embd;
bool logits_all;
std::vector<unsigned long> ids;
std::vector<unsigned long> out_ids;
std::vector<llama_sbatch_seq> seq;
const llama_batch *batch;
std::vector<int> ubatch_token;
std::vector<float> ubatch_embd;
std::vector<int> ubatch_pos;
std::vector<int> ubatch_n_seq_id;
std::vector<int*> ubatch_seq_id;
std::vector<signed char> ubatch_output;
llama_ubatch reserve_ubatch(size_t, bool);
void add_seq_to_ubatch(llama_ubatch &, llama_sbatch_seq &, size_t);
llama_ubatch split_simple(size_t);
llama_ubatch split_equal(size_t);
llama_ubatch split_seq(size_t);
void from_batch(const llama_batch &, size_t, bool, bool);
}
(gdb) p *this
$20 = {
n_tokens = 73,
n_embd = 4096,
logits_all = false,
ids = std::vector of length 73, capacity 73 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72},
out_ids = std::vector of length 0, capacity 0,
seq = std::vector of length 1, capacity 1 = {{ n_seq_id = 0, seq_id = 0x0, offset = 0, length = 73, all_seq_id = 0}},
batch = 0x7fffffffd670,
ubatch_token = std::vector of length 0, capacity 0,
ubatch_embd = std::vector of length 0, capacity 0,
ubatch_pos = std::vector of length 0, capacity 0,
ubatch_n_seq_id = std::vector of length 0, capacity 0,
ubatch_seq_id = std::vector of length 0, capacity 0,
ubatch_output = std::vector of length 0, capacity 0}
So an sbatch is initialized in llama_decode_internal using information from the batch that was passed in. It will keep track of the number of tokens that have been processed and will be used to create ubatches.
In this case n_ubatch is 32.
// simple split, unknown number of sequences of unequal lengths
llama_ubatch split_simple(size_t n_ubatch) {
n_ubatch = n_tokens < n_ubatch ? n_tokens : n_ubatch;
llama_ubatch ubatch = reserve_ubatch(n_ubatch, /* has_embd */ batch->embd != nullptr);
ubatch.equal_seqs = false;
if (!seq.empty()) {
llama_sbatch_seq & s = seq[0];
size_t length = s.length < n_ubatch ? s.length : n_ubatch;
GGML_ASSERT(seq.size() == 1 && s.n_seq_id == 0); // don't mix with other splits
add_seq_to_ubatch(ubatch, s, length);
}
return ubatch;
}
This time (compared to the previous example where the sequences/prompts were much smaller) the local n_ubatch variable is not reduced; it stays at 32 since n_tokens (73) is larger.
(gdb) p n_ubatch
$18 = 32
In add_seq_to_ubatch
we then have:
ubatch.token = batch->token + seq.offset;
This is using the llama_batch
token data and the offset is 0 in this case.
...
if (ubatch.n_tokens == 0 && ubatch.n_seqs == 0) {
ubatch.n_seq_tokens = ubatch.equal_seqs ? length : 1;
}
ubatch.n_tokens += length;
ubatch.n_seqs += ubatch.equal_seqs ? 1 : length; // virtual sequences for simple splits
seq.offset += length;
seq.length -= length;
n_tokens -= length;
So the ubatch n_tokens will be updated to 32, and notice that seq, which is a reference passed in, will be updated to keep track of the number of tokens that have been added to the ubatch by incrementing the offset and decrementing the length (currently 73).
Also notice that the sbatch n_tokens will also be decremented, and recall that this is the field that is used in the while loop in llama_decode_internal.
struct llama_sbatch_seq {
int32_t n_seq_id;
llama_seq_id * seq_id;
size_t offset;
size_t length;
// helper for smoother batch API transition -- can be deprecated in the future
llama_seq_id all_seq_id; // used if seq_id == NULL
};
Now, let's take a look at the second iteration of the while loop.
ubatch = lctx.sbatch.split_simple(n_ubatch);
This time around n_tokens
is 41:
(gdb) p n_tokens
$33 = 41
(gdb) p length
$34 = 32
(gdb) p seq.offset
$67 = 32
So this time when the updates take place in add_seq_to_ubatch
the offset will
be an offset into the original llama_batch
:
ubatch.token = batch->token + seq.offset;
The last things to happen in add_seq_to_ubatch
are:
ubatch.n_tokens += length;
ubatch.n_seqs += ubatch.equal_seqs ? 1 : length; // virtual sequences for simple splits
(gdb) p ubatch.n_tokens
$75 = 32
(gdb) p ubatch.n_seqs
$74 = 32
For some reason I'm not understanding what n_seqs is supposed to represent, and this might be because I've only stepped through an example that uses the simple split. TODO: try a recurrent model with the same settings and see if this makes more sense then. To me this field seems to specify the number of tokens, but there already is a field for that (n_tokens), so I'm not sure what this field is for.
Then we have the updating of the seq
reference, and n_tokens
field of
sbatch
:
seq.offset += length;
seq.length -= length;
n_tokens -= length;
GGML_ASSERT(ubatch.n_tokens == ubatch.n_seq_tokens * ubatch.n_seqs);
The value of seq.offset
before the update is:
(gdb) p seq.offset
$40 = 32
(gdb) p seq.length
$79 = 41
(gdb) p this.n_tokens
$81 = 41
And after the updates:
(gdb) p seq.offset
$44 = 64
(gdb) p seq.length
$84 = 9
(gdb) p this.n_tokens
$83 = 9
Now, let's try the same example but using a Mamba model, which is a recurrent model, to see the difference and perhaps gain a better understanding of the sbatch/ubatch fields in the process.
Using mamba-1.4b-f16.gguf we can see that simple_split is now false when calling llama_sbatch::from_batch. This will then cause the following block to be executed. Let's first inspect ids, which will be used in this block:
(gdb) p ids
$5 = std::vector of length 64, capacity 64 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21,
22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
std::sort(ids.begin(), ids.end(),
[&batch](size_t a, size_t b) {
int32_t n_seq_a = batch.n_seq_id ? batch.n_seq_id[a] : 1;
int32_t n_seq_b = batch.n_seq_id ? batch.n_seq_id[b] : 1;
// sort by seq_id, then by pos
if (n_seq_a == n_seq_b) {
if (batch.seq_id) {
for (int32_t i = 0; i < n_seq_a; ++i) {
llama_seq_id seq_id_a = batch.seq_id[a][i];
llama_seq_id seq_id_b = batch.seq_id[b][i];
// smaller seq_ids go first
if (seq_id_a != seq_id_b) {
return seq_id_a < seq_id_b;
}
}
}
// when all else is equal, sort by pos
if (batch.pos) {
return batch.pos[a] < batch.pos[b];
}
// no pos, sort by id (assuming batch.all_pos_1 is positive)
return a < b;
}
// shared prompts go first
return n_seq_a > n_seq_b;
}
);
So this will sort the ids vector using the custom lambda function, passing in a reference to the llama_batch. It first checks whether the two tokens belong to the same number of sequences; if not, the token that belongs to more sequences (a shared prompt) comes first, and the lambda returns true in that case. If both tokens belong to the same number of sequences, the lambda iterates over all the sequences that a belongs to, and if the sequence ids differ, the token with the smaller sequence id comes first. Otherwise we sort by the position of the tokens in the batch, and if there are no positions we sort by token id.
Note that this is an in-place sort, so the ids vector will be sorted after this call.
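Here is a self-contained toy version of that comparator (not llama.cpp code, with made-up tokens) to show the resulting order:
#include <algorithm>
#include <cstdio>
#include <vector>

// Each toy token has a number of sequence ids, a first seq_id, and a position.
struct tok { int n_seq_id; int seq_id; int pos; };

int main() {
    std::vector<tok> toks = {
        {1, 1, 0}, {1, 0, 1}, {2, 0, 0}, {1, 0, 0},
    };
    std::vector<size_t> ids = {0, 1, 2, 3};
    std::sort(ids.begin(), ids.end(), [&toks](size_t a, size_t b) {
        if (toks[a].n_seq_id == toks[b].n_seq_id) {
            if (toks[a].seq_id != toks[b].seq_id) {
                return toks[a].seq_id < toks[b].seq_id;  // smaller seq_ids go first
            }
            return toks[a].pos < toks[b].pos;            // then sort by position
        }
        return toks[a].n_seq_id > toks[b].n_seq_id;      // shared prompts go first
    });
    for (size_t id : ids) {
        printf("token %zu (n_seq_id=%d seq_id=%d pos=%d)\n",
               id, toks[id].n_seq_id, toks[id].seq_id, toks[id].pos);
    }
    // Resulting order: 2 (shared), 3 and 1 (seq 0, by pos), 0 (seq 1).
    return 0;
}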
This will do nothing for us as we have not set/configured any tokens that belong to multiple sequences:
(gdb) p ids
$23 = std::vector of length 64, capacity 64 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63}
Let's try updating the batch example to see how this may work.
// init seq
llama_sbatch_seq * last_seq = nullptr;
if (batch.n_seq_id != nullptr && batch.seq_id != nullptr) {
for (size_t i = 0; i < n_tokens; ++i) {
const size_t bi = ids[i];
const int32_t n_seqs = batch.n_seq_id[bi];
llama_seq_id * seq_ids = batch.seq_id[bi];
if (last_seq != nullptr) {
bool same = n_seqs == last_seq->n_seq_id;
for (int32_t j = 0; same && j < n_seqs; ++j) {
if (seq_ids[j] != last_seq->seq_id[j]) {
same = false;
}
}
if (same) {
last_seq->length += 1;
continue;
}
}
llama_sbatch_seq new_seq = {n_seqs, seq_ids, i, 1, batch.all_seq_id};
seq.push_back(new_seq);
last_seq = &seq.back();
}
} else {
llama_sbatch_seq new_seq = {1, nullptr, 0, n_tokens, batch.all_seq_id};
seq.push_back(new_seq);
}
Let's go through the first token in this case:
(lldb) p ids[0]
(std::vector<unsigned long>::value_type) 1
(lldb) p batch.n_seq_id[1]
(int32_t) 2
(lldb) p last_seq
(llama_sbatch_seq *) nullptr
So in this case there is no last_seq yet, and a new one will be created by:
llama_sbatch_seq new_seq = {n_seqs, seq_ids, i, 1, batch.all_seq_id};
seq.push_back(new_seq);
last_seq = &seq.back();
This will create a new llama_sbatch_seq with 2 as the number of sequence ids, the pointer to the sequence ids that this token has, and a length of 1 (initially):
(lldb) p new_seq
(llama_sbatch_seq) {
n_seq_id = 2
seq_id = 0x00006000032257d0
offset = 0
length = 1
all_seq_id = 0
}
And then the new sequence is added to the seq vector and last_seq is set to point to the last element in the vector.
struct llama_sbatch_seq {
int32_t n_seq_id;
llama_seq_id * seq_id;
size_t offset;
size_t length;
// helper for smoother batch API transition -- can be deprecated in the future
llama_seq_id all_seq_id; // used if seq_id == NULL
};
wip
So a batch, or rather a llama_batch, is what we pass into the llama_decode function, and it is really the only thing that an external caller knows anything about (well, that is not entirely true, as they can set a ubatch value, but only as a command line argument, -ub/--ubatch-size).
The internal decode operation is/can be split into smaller units called ubatches.
The internal decode will create an sbatch, a sequence-aware batch, that manages the ubatches. I was not sure what this actually meant (the sequence-aware batch), and there was a discussion where this was also asked, where I replied with the following:
sbatch stands for sequence-aware batch which can be found from this comment. My understanding of what this means is that for recurrent models they can benefit from processing batches with sequences of equal length. The way this works is that if there are multiple sequences in a batch (llama_batch that is passed to llama_decode) they will be split into ubatches of equal sequence length.
The sbatch will manage this (hence the name sequence aware batch) and produce an ubatch which will then be used to build the computation graph, set the inputs, and then compute the graph (the forward pass). So the ubatch is what is actually passed to the model and I think it stands for micro batch.
This process will then continue by handling the remaining tokens for the sequences, again making sure that they are of equal size.
There are some more details in the following comment which also contains an example:
#7531 (comment)