
Commit f27cd40

ikawrakow and Iwan Kawrakow authored
Enable faster prompt processing with mainline llama.cpp GGUFs (#409)
* Enable MLA-3 in crippled GGUFs: WIP
* Enable MLA-3 in crippled GGUFs: seems to work
* Add newly created tensors to model.tensors_by_name, else they don't get run-time repacked

Co-authored-by: Iwan Kawrakow <[email protected]>
1 parent 465569d commit f27cd40
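A minimal C++ sketch of the last bullet in the commit message above: a tensor that is created at load time (for example the split-out pieces of a mainline GGUF's combined MLA weights) must also be appended to model.tensors_by_name, otherwise the run-time repacking pass never sees it. The helper name register_created_tensor and the surrounding structure are illustrative, not code from the commit.

#include <string>
#include <utility>
#include <vector>

#include "ggml.h"

// Run-time repacking walks the (name, tensor) pairs collected in
// tensors_by_name, so a tensor synthesized after the GGUF has been read
// has to be registered explicitly.
static void register_created_tensor(
        std::vector<std::pair<std::string, struct ggml_tensor *>> & tensors_by_name,
        struct ggml_tensor * t) {
    tensors_by_name.emplace_back(ggml_get_name(t), t);
}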

File tree

3 files changed: +294 -140 lines changed


common/common.cpp

Lines changed: 1 addition & 0 deletions
@@ -2334,6 +2334,7 @@ struct llama_model_params llama_model_params_from_gpt_params(const gpt_params &
     if (params.n_gpu_layers != -1) {
         mparams.n_gpu_layers = params.n_gpu_layers;
     }
+    mparams.mla          = params.mla_attn;
     mparams.rpc_servers  = params.rpc_servers.c_str();
     mparams.main_gpu     = params.main_gpu;
     mparams.split_mode   = params.split_mode;
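A minimal sketch of how this plumbing is consumed, assuming the usual common.h / llama.h entry points from this repository (gpt_params_parse, llama_model_params_from_gpt_params, llama_load_model_from_file); the value of params.mla_attn would come from the command-line parser.

#include "common.h"
#include "llama.h"

int main(int argc, char ** argv) {
    gpt_params params;
    if (!gpt_params_parse(argc, argv, params)) {
        return 1;
    }

    llama_backend_init();

    // The new line above copies the CLI choice into the model parameters,
    // so the loader can pick the requested MLA implementation.
    llama_model_params mparams = llama_model_params_from_gpt_params(params);

    llama_model * model = llama_load_model_from_file(params.model.c_str(), mparams);
    if (model == nullptr) {
        return 1;
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}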

include/llama.h

Lines changed: 1 addition & 0 deletions
@@ -325,6 +325,7 @@ extern "C" {
 
     struct llama_model_params {
         int32_t n_gpu_layers; // number of layers to store in VRAM
+        int32_t mla;          // MLA implementation to use (only applicable to DeepSeek models at this point)
         enum llama_split_mode split_mode; // how to split the model across multiple GPUs
 
         // main_gpu interpretation depends on split_mode:
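A hedged sketch against the header change above, for callers that use the llama.h API directly rather than going through common.cpp: the MLA implementation is now selectable through llama_model_params. The value 3 mirrors the "MLA-3" wording in the commit message, and the model path is a placeholder.

#include "llama.h"

int main(void) {
    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;
    mparams.mla          = 3;   // only meaningful for DeepSeek-style models

    struct llama_model * model = llama_load_model_from_file("deepseek.gguf", mparams);
    if (model == NULL) {
        return 1;
    }

    llama_free_model(model);
    return 0;
}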
