llama : refactor sampling v2 #9294

ggerganov · 2024-09-03T09:10:47Z

alt: #8643

ref: #5214

Overview

Replace llama_sampling_ and llama_grammar_ with new llama_sampler_
Overhaul common: struct llama_sampling_context in common: struct gpt_sampler
Support for user-defined samplers via struct llama_sampler_i interface

API Changes

Add struct llama_sampler and struct llama_sampler_i
Add llama_sampler_ API
Add llama_sampler_chain_ API for chaining multiple samplers
Remove LLAMA_API_INTERNAL
Remove Classifier-Free Guidance related stuff
Remove Prompt Penalty support
Add llama_perf_ API and remove old llama_print_timings and llama_reset_timings

Implementation details

Move common/grammar-parser in src/llama-grammar
The llama_context no longer comes with a built-in sampling context. The user code is responsible for creating, using, saving and loading the llama_sampler objects as needed. As a consequence, the llama_state no longer serializes the RNG
The grammar code has been refactored, hopefully it is a bit easier to read. No functional changes.
The samplers implemented in llama-sampling.cpp can be used as examples for implementing custom samplers in user code

Example

Comparison of user sampling code before and after:

before

// decoding loop:
auto   n_vocab = llama_n_vocab(model);
auto * logits  = llama_get_logits_ith(ctx, i_batch[i]);

std::vector<llama_token_data> candidates;
candidates.reserve(n_vocab);

for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
    candidates.emplace_back(llama_token_data{ token_id, logits[token_id], 0.0f });
}

llama_token_data_array candidates_p = { candidates.data(), candidates.size(), false };

llama_sample_top_k(ctx, &candidates_p, top_k, 1);
llama_sample_top_p(ctx, &candidates_p, top_p, 1);
llama_sample_temp (ctx, &candidates_p, temp);

const llama_token new_token_id = llama_sample_token(ctx, &candidates_p);

after

// prepare the sampling chain at the start
auto sparams = llama_sampler_chain_default_params();

llama_sampler * smpl = llama_sampler_chain_init(sparams);

llama_sampler_chain_add(smpl, llama_sampler_init_top_k(top_k));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(top_p, 1));
llama_sampler_chain_add(smpl, llama_sampler_init_temp (temp));
llama_sampler_chain_add(smpl, llama_sampler_init_dist (seed));

...

// decoding loop:
const llama_token new_token_id = llama_sampler_sample(smpl, ctx, i_batch[i]);

Future plan and ideas

Attach samplers to the decoding graphs for better performance (e.g. llama_decode_with_sampler())
Extend llama_decode_ API to support multiple decoding runs (e.g. llama_decode_n())
Existing samplers implementation in llama-sampling.cpp could be split into separate source files
Expose struct llama_vocab through the public API and change calls that currently use struct llama_model to use it when appropriate
Deduplicate the ring_buffer code by implementing ggml_ring_buffer for fixed-size objects
Measure and report the performance of the grammar
Simplify llama_token_data llama : refactor sampling v2 #9294 (review)

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

ExtReMLapin · 2024-09-03T15:41:42Z

Not sure it's the right place or time to talk about this but in another issue a guy had the idea of, "if the grammar says the character/word can only be "xxx" and nothing else, don't bother asking the LLM what to say for the X next tokens".

As there is a refactoring going on, maybe it's the right time to implement it.

ggerganov · 2024-09-03T17:50:55Z

It's not in the scope of this change, but also the grammar never fits only one token. For example, all three tokens "x", "xx" and "xxx" would fit the grammar in that case. One way would be to use the longest token. Another way would be to tokenize "xxx" and use the resulting tokens. Not sure

ggerganov · 2024-09-04T14:41:39Z

This is getting close to ready. Later today will add detailed description of the changes, some comments in the code and do a bit more testing.

@slaren PTAL - any comments and suggestions are welcome.

include/llama.h

src/llama-sampling.cpp

slaren · 2024-09-04T19:11:17Z

include/llama.h

+    LLAMA_API struct llama_constraint * llama_constraint_init_top_k      (int32_t k, int32_t min_keep);
+    LLAMA_API struct llama_constraint * llama_constraint_init_top_p      (float   p, int32_t min_keep);
+    LLAMA_API struct llama_constraint * llama_constraint_init_min_p      (float   p, int32_t min_keep);
+    LLAMA_API struct llama_constraint * llama_constraint_init_tail_free  (float   z, int32_t min_keep);
+    LLAMA_API struct llama_constraint * llama_constraint_init_typical    (float   p, int32_t min_keep);


I don't know what's the history of the min_keep parameter in all of these samplers, from what I can tell parameter is not used in the examples except by the server, but it seems very suspect to me.

Edit: looks like it has been there since the beginning (#1126), and there was never any explanation of why it is needed.

Removed min_keep from the top_k sampler as it didn't make sense. For the p-based samplers, I think it makes sense to guarantee minimum number of candidate results, regardless of the p value.

include/llama.h

common/sampling.cpp

ggerganov · 2024-09-04T20:16:20Z

Thanks for the review. I got side tracked a bit with a bug in the speculative example (should be fixed now). Will apply the review tomorrow and prepare this for merging.

clarismiranda · 2024-09-06T00:14:14Z

src/llama-grammar.h

+// be positioned at a character range (see `llama_grammar_advance_stack`), and
+// produces the N possible stacks if the given char is accepted at those
+// positions
+llama_grammar_stacks llama_grammar_accept(


Hello! Why does llama_grammar_accept return the stacks? It was previously passed by reference

Thanks for noticing. I changed it because I thought it improves the signature of the function, but I missed that it would lead to extra memory allocations. So I restored the original signature. f9762c6

Edit: though after one more look, I think it does not matter since we move stacks_new either way.

include/llama.h

slaren

Looks good.

In the future we should probably simplify llama_token_data (or remove it entirely) to keep only one value per token and add a flag to llama_token_data_array to indicate whether the current values are probabilities (ie. normalized to sum 1) or not, so that samplers that can only operate with probabilities can know if they need to call softmax. Having two values per token is very confusing to me because some samplers operate on one or other, and this can lead to situations where a sampler modifies the probabilities, and the next one calls softmax which discards all the changes to the probabilities and computes them from the logits again. I cannot tell if there are already situations like that which end with some samplers that operate on probabilities being ignored.

ggml-ci

ggerganov · 2024-09-08T10:44:50Z

Sure, feel free to update it. Exposing some common functions through the API also sounds good.

giladgd · 2024-09-09T01:40:01Z

Is there a way to provide llama_sampler_init_penalties with a pointer to where to look for the last tokens?
In the previous implementation, I customized the repeat penalty sampling with an array of last tokens that weren't always the same as the last context tokens.

mudler · 2024-09-12T06:32:08Z

Somehow after this change I see breakage in llava sampling - I'm still OOO so I didn't deep dive yet, see mudler/LocalAI#3497 for reference on how it breaks in LocalAI.

Would be really appreciated if anyone have an idea/pointers of what's going on ! Thank you!


10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stderr /home/mudler/_git/LocalAI/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml.c:13835: GGML_ASSERT(i01 >= 0 &
& i01 < ne01) failed


10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout [Thread debugging using libthread_db enabled]
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout Using host libthread_db library "/lib64/libthread_db.so.1".
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout 0x00007f989b8e94a3 in ?? () from /lib64/libgomp.so.1
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #0  0x00007f989b8e94a3 in ?? () from /lib64/libgomp.so.1
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #1  0x00000000008222e5 in ggml_graph_compute_thread.isra ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #2  0x00007f989b8dcd16 in GOMP_parallel () from /lib64/libgomp.so.1
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #3  0x0000000000825a2a in ggml_graph_compute ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #4  0x0000000000834010 in ggml_backend_cpu_graph_compute ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #5  0x000000000083784c in ggml_backend_graph_compute ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #6  0x0000000000652b63 in clip_image_batch_encode.constprop ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #7  0x0000000000653553 in clip_image_encode ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #8  0x0000000000657ac8 in llava_image_embed_make_with_clip_img ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #9  0x00000000004e2c09 in llama_server_context::update_slots() [clone .isra.0] ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #10 0x00000000004d7629 in llama_server_queue::start_loop() ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout #11 0x000000000048b040 in main ()
10:25PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:42747): stdout [Inferior 1 (process 13029) detached]

ggerganov · 2024-09-12T06:35:00Z

Which commit are you using? I think you need to update to 1b28061

mudler · 2024-09-12T07:41:25Z

Which commit are you using? I think you need to update to 1b28061

Thanks for the quick reply! I was at daa9623, I'll try with that and let you know ASAP

mudler · 2024-09-12T16:25:53Z

mmh, tried with the latest commit (e6b7801) but still crashing with:

6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stderr /home/mudler/_git/LocalAI/backend/cpp/llama-avx2/llama.cpp/ggml/src/ggml.c:13853: GGML_ASSERT(i01 >= 0 && i01 < ne01) failed
...
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout [Thread debugging using libthread_db enabled]
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout Using host libthread_db library "/lib64/libthread_db.so.1".
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout 0x00007fd8a45ee4a3 in ?? () from /lib64/libgomp.so.1
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #0  0x00007fd8a45ee4a3 in ?? () from /lib64/libgomp.so.1
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #1  0x00000000007dd4b5 in ggml_graph_compute_thread.isra ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #2  0x00007fd8a45e1d16 in GOMP_parallel () from /lib64/libgomp.so.1
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #3  0x00000000007e0cca in ggml_graph_compute ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #4  0x00000000007ef340 in ggml_backend_cpu_graph_compute ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #5  0x00000000007f2b7c in ggml_backend_graph_compute ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #6  0x000000000060d8b3 in clip_image_batch_encode.constprop ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #7  0x000000000060e2a3 in clip_image_encode ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #8  0x0000000000612818 in llava_image_embed_make_with_clip_img ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #9  0x00000000004dd269 in llama_server_context::update_slots() [clone .isra.0] ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #10 0x00000000004d1ce9 in llama_server_queue::start_loop() ()
6:24PM DBG GRPC(moondream2-text-model-f16.gguf-127.0.0.1:44339): stdout #11 0x0000000000486a10 in main ()

slaren · 2024-09-12T16:30:11Z

That's not related to the sampling changes, the only difference is that get_rows operations are bound-checked in all builds now, while previously it was only checked in debug builds. The clip implementation is broken and needs to be fixed: #9066 (comment)

mudler · 2024-09-12T16:41:15Z

That's not related to the sampling changes, the only difference is that get_rows operations are bound-checked in all builds now, while previously it was only checked in debug builds. The clip implementation is broken and needs to be fixed: #9066 (comment)

Thanks for that bit, totally missed it. What it's weird is that now for me it's a 100% hit since I started pinning new version of llama.cpp - as I have test suites running vision tests this never popped up until now. It's not sporadic at all, but really consistent and can't get a single time passing

Commit still working here: 815b1fb
Commit which is not working: e6b7801 (which includes #9082 )

slaren · 2024-09-12T16:45:22Z

You would need to run the test suite with a debug build to be able to hit the assert. If you want the previous behavior you can revert #9354 in your build, but that still does not make it any less broken, it just hides the issue.

mudler · 2024-09-12T16:49:49Z

You would need to run the test suite with a debug build to be able to hit the assert. If you want the previous behavior you can revert #9354 in your build, but that still does not make it any less broken, it just hides the issue.

Thanks for the hints @slaren - I actually tried to comment the assert at all as well to double check - but as you suggested it "hides" it, and crashes in the same way.

another datapoint from my side - seems the suggestion in the comment, to edit

diff --git a/examples/llava/clip.cpp b/examples/llava/clip.cpp
index 342042ff..224db9b5 100644
--- a/examples/llava/clip.cpp
+++ b/examples/llava/clip.cpp
@@ -2419,7 +2419,7 @@ bool clip_image_batch_encode(clip_ctx * ctx, const int n_threads, const clip_ima
             struct ggml_tensor * patches = ggml_graph_get_tensor(gf, "patches");
             int* patches_data = (int*)malloc(ggml_nbytes(patches));
             for (int i = 0; i < num_patches; i++) {
-                patches_data[i] = i + 1;
+                patches_data[i] = i;
             }
             ggml_backend_tensor_set(patches, patches_data, 0, ggml_nbytes(patches));
             free(patches_data);

actually "fixes" the issue here - probably it's not ideal, but at least seems indeed something is off in the clip implementation when loading images. Sorry for making noise here - is there already an issue open for this issue or shall I open one? I can't find one for this specific issue

slaren · 2024-09-12T16:55:50Z

I don't think there is an open issue about this currently, I know that it was briefly discussed in #9066, but that one is already closed.

ngxson · 2024-09-24T09:03:21Z

Maybe worth to notice: signature of llama_sampling_sample is slightly different from the new gpt_sampler_sample:

llama_token llama_sampling_sample(
        struct llama_sampling_context * ctx_sampling,
        struct llama_context * ctx_main,
        struct llama_context * ctx_cfg,
        int idx = -1);

llama_token gpt_sampler_sample(
    struct gpt_sampler * gsmpl,
    struct llama_context * ctx,
    int idx,
    bool grammar_first = false);

Compiler may not though any error or warning because values set to llama_sampling_sample can also be cast to new values required by gpt_sampler_sample

ggerganov · 2024-09-24T09:48:34Z

Compiler may not though any error or warning because values set to llama_sampling_sample can also be cast to new values required by gpt_sampler_sample

Hm, it will generate an error. Don't think there is an issue.

zhaoyinglia · 2024-10-10T07:54:51Z

@ggerganov Hi, I'm a bit confused. Why was Classifier-Free Guidance removed? Are there any issues with it?

ggerganov · 2024-10-10T18:03:39Z

The implementation does not need to be part of libllama because it works on the logits and is trivial to implement in user code. I removed it also from the examples, because it was much simpler this way and refactoring the sampling was with higher priority than keeping CFG functional. We can reintroduce this functionality in one of the examples or in a new dedicated example if there is interest. PRs welcome.

The changes here reflect the changes made in the big llama.cpp sampling PR ggerganov/llama.cpp#9294 The sampling functionality is now broken into the base interface (llama_sampler) and the generation implementation (gpt_sampler). The changes here reflect that. Since the sampling.h/sampling.cpp code uses c++ STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow go to access a pure-C interface. Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]>

* fix(ext_server): Port llama.cpp sampling refactors to ext_server This was a fairly large changeset. I closely followed the changes here: ggerganov/llama.cpp@df270ef Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(server.cpp): Refactor server.cpp logging for llama.cpp overhaul Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * feat: Bump llama.cpp to the latest master with `granite` support This does not yet have granite MoE support, but that can come in a follow up PR Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(patches): Update all patches (except solar-pro) to work with bumped llama.cpp Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(solar): Update solar patch for llama.cpp bump Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * feat(llama.cpp): Bump llama.cpp for granitemoe support Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * feat(llama.cpp): Bump llama.cpp for granitemoe support Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(solar): Update the solar-pro patch for latest llama.cpp bump Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * feat(llama.cpp): Bump to the latest master of llama.cpp Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(patches): Update all patches for latest bump Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * feat(llama): Always run sync.sh from the right directory Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(llama/patches): Update llama patches Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * feat(llama)!: Rough sync with llama.cpp submodule There are a number of changes that will need to be propagated to llama.go before any of this works! Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(llama/patches): Add a patch and update for missing ggml-impl.h include This include is where the ggml_cgraph struct is defined. It is included in many of the .c files to define the forward declartion in ggml.h. It seems that with the subset of code included here, the import was somehow lost (or out-of-order) when building, so adding this include to llama.cpp fixes the missing definition. Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(llama/sync): Add missing ggml-cpu-impl.h copy-over in sync.sh Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(llama): Add missing log.cpp This was added as part of the logging overhaul done in llama.cpp Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(llama): Overhaul use of sampling module for llama.cpp changes The changes here reflect the changes made in the big llama.cpp sampling PR ggerganov/llama.cpp#9294 The sampling functionality is now broken into the base interface (llama_sampler) and the generation implementation (gpt_sampler). The changes here reflect that. Since the sampling.h/sampling.cpp code uses c++ STL headers, the sampling_ext.[h|cpp] wrapper is maintained to allow go to access a pure-C interface. Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(llama): Fix the impl of SampleTokenGreedy for new sampling I don't think this method is currently used, so it could probably just be removed so that all sampling goes through the GPT interface, but in the interest of doing no harm, this should keep the method working as expected. Branch: IBMGraniteArchitectureSupport * fix(llama): Remove unused SampleTokenGreedy Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(sync): Remove bash-specific change to sync.sh Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * chore(gofumpt): Format on llama.go to pass linting Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(llm): Fix missing <thread> include in ext_server Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(llama): Remove TODO about grammar_first This feature was not used/needed previously so should be fine without plumbing it through now. Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(llama): Better naming for sampling wrapper and args Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(llama): Fix patch 05 to use new wrapper api and re-sync Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * runner: Flush pending responses before returning If there are any pending reponses (such as from potential stop tokens) then we should send them back before ending the sequence. Otherwise, we can be missing tokens at the end of a response. Fixes #6707 * fix(llama/sampling): Use gpt_sampler with a forward declaration Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(llama): Remove unnecessary patch for gguf impl header This was caused by an earlier mistake in the embeddings patch that was dereferencing the pointer instead of using the wrapper API. Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> * fix(llm): Remove use of deprecated --log-disable flag Branch: IBMGraniteArchitectureSupport Signed-off-by: Gabe Goodhart <[email protected]> --------- Signed-off-by: Gabe Goodhart <[email protected]>

- Add `struct llama_sampler` and `struct llama_sampler_i` - Add `llama_sampler_` API - Add `llama_sampler_chain_` API for chaining multiple samplers - Remove `LLAMA_API_INTERNAL` - Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`

github-actions bot added the examples label Sep 4, 2024

bviksoe mentioned this pull request Sep 4, 2024

Bug: uncached prompt is not used for penalty #8971

Closed

github-actions bot added testing Everything test related server android Issues specific to Android labels Sep 4, 2024

ggerganov changed the base branch from gg/llama-refactor-sampling to master September 4, 2024 12:38

ggerganov force-pushed the gg/llama-refactor-sampling-v2 branch from 762e955 to 3c46719 Compare September 4, 2024 14:26

slaren reviewed Sep 4, 2024

View reviewed changes

ggerganov force-pushed the gg/llama-refactor-sampling-v2 branch from 8307e96 to 11c2e46 Compare September 5, 2024 15:13

clarismiranda reviewed Sep 6, 2024

View reviewed changes

slaren reviewed Sep 6, 2024

View reviewed changes

include/llama.h Outdated Show resolved Hide resolved

ggerganov marked this pull request as ready for review September 6, 2024 12:32

ggerganov requested a review from slaren September 6, 2024 12:32

slaren approved these changes Sep 6, 2024

View reviewed changes

ggerganov added 12 commits September 7, 2024 12:29

llama : add llama_sampling API + move grammar in libllama

ab545c8

ggml-ci

llama : sketching new sampling API

86b07cc

cont : add llama_constraint_i [no ci]

5116b36

cont : initial implementation sketch [no ci]

cf4dd10

cont : fixes, naming [no ci]

1b07dc5

cont : add rest of the existing samplers [no ci]

71293a6

cont : fix [no ci]

0daebc6

cont : add penalties and logit-bias constraints [no ci]

a2ce91c

cont : add comments [no ci]

09ceb68

cont : leaner constraint initialization [no ci]

1e8e26c

cont : common/sampling use the new API [no ci]

91cbb40

cont : add n_prev to llama_sampler_params

437376e

RobinMoRi mentioned this pull request Sep 9, 2024

Feature: Migrate to nextjs RobinMoRi/portfolio-site#33

Closed

tc-wolf mentioned this pull request Sep 11, 2024

LLamaDiskCache: needs a RO / 'static' disk cache for RAG use cases abetlen/llama-cpp-python#1737

Open

ddh0 mentioned this pull request Sep 11, 2024

added implementation of DRY sampler #6839

Closed

mudler mentioned this pull request Sep 12, 2024

Bug: loading llava models fails #9455

Open

ggerganov mentioned this pull request Sep 13, 2024

changelog : libllama API #9289

Open

cebtenzzre mentioned this pull request Sep 26, 2024

backend: rebase llama.cpp on upstream as of Sep 26th nomic-ai/gpt4all#2998

Merged

gabe-l-hart mentioned this pull request Oct 14, 2024

IBM granite/granitemoe architecture support ollama/ollama#6760

Merged

2 tasks

eastriverlee mentioned this pull request Oct 19, 2024

sampling v2 update eastriverlee/LLM.swift#37

Merged

eugenehp added a commit to eugenehp/bitnet-cpp-rs that referenced this pull request Oct 26, 2024

https://github.com/ggerganov/llama.cpp/pull/9294

0de8fbd

HanClinto mentioned this pull request Nov 1, 2024

llama : speed-up grammar sampling #4218

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llama : refactor sampling v2 #9294

llama : refactor sampling v2 #9294

ggerganov commented Sep 3, 2024 •

edited

Loading

ExtReMLapin commented Sep 3, 2024

ggerganov commented Sep 3, 2024

ggerganov commented Sep 4, 2024

slaren Sep 4, 2024 •

edited

Loading

ggerganov Sep 5, 2024

ggerganov commented Sep 4, 2024

clarismiranda Sep 6, 2024

ggerganov Sep 6, 2024 •

edited

Loading

slaren left a comment •

edited

Loading

ggerganov commented Sep 8, 2024

giladgd commented Sep 9, 2024

mudler commented Sep 12, 2024 •

edited

Loading

ggerganov commented Sep 12, 2024

mudler commented Sep 12, 2024

mudler commented Sep 12, 2024 •

edited

Loading

slaren commented Sep 12, 2024 •

edited

Loading

mudler commented Sep 12, 2024 •

edited

Loading

slaren commented Sep 12, 2024

mudler commented Sep 12, 2024 •

edited

Loading

slaren commented Sep 12, 2024

ngxson commented Sep 24, 2024

ggerganov commented Sep 24, 2024

zhaoyinglia commented Oct 10, 2024 •

edited

Loading

ggerganov commented Oct 10, 2024

llama : refactor sampling v2 #9294

llama : refactor sampling v2 #9294

Conversation

ggerganov commented Sep 3, 2024 • edited Loading

Overview

API Changes

Implementation details

Example

Future plan and ideas

ExtReMLapin commented Sep 3, 2024

ggerganov commented Sep 3, 2024

ggerganov commented Sep 4, 2024

slaren Sep 4, 2024 • edited Loading

Choose a reason for hiding this comment

ggerganov Sep 5, 2024

Choose a reason for hiding this comment

ggerganov commented Sep 4, 2024

clarismiranda Sep 6, 2024

Choose a reason for hiding this comment

ggerganov Sep 6, 2024 • edited Loading

Choose a reason for hiding this comment

slaren left a comment • edited Loading

Choose a reason for hiding this comment

ggerganov commented Sep 8, 2024

giladgd commented Sep 9, 2024

mudler commented Sep 12, 2024 • edited Loading

ggerganov commented Sep 12, 2024

mudler commented Sep 12, 2024

mudler commented Sep 12, 2024 • edited Loading

slaren commented Sep 12, 2024 • edited Loading

mudler commented Sep 12, 2024 • edited Loading

slaren commented Sep 12, 2024

mudler commented Sep 12, 2024 • edited Loading

slaren commented Sep 12, 2024

ngxson commented Sep 24, 2024

ggerganov commented Sep 24, 2024

zhaoyinglia commented Oct 10, 2024 • edited Loading

ggerganov commented Oct 10, 2024

ggerganov commented Sep 3, 2024 •

edited

Loading

slaren Sep 4, 2024 •

edited

Loading

ggerganov Sep 6, 2024 •

edited

Loading

slaren left a comment •

edited

Loading

mudler commented Sep 12, 2024 •

edited

Loading

mudler commented Sep 12, 2024 •

edited

Loading

slaren commented Sep 12, 2024 •

edited

Loading

mudler commented Sep 12, 2024 •

edited

Loading

mudler commented Sep 12, 2024 •

edited

Loading

zhaoyinglia commented Oct 10, 2024 •

edited

Loading