Update llama.cpp to latest #6

boxbeam · 2024-02-29T22:33:10Z

No description provided.

Flake lock file updates:

• Updated input 'flake-parts':
    'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
  → 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
• Updated input 'flake-parts/nixpkgs-lib':
    'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
  → 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
• Updated input 'nixpkgs':
    'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
  → 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)

We get slightly better PPL and cut quantization time nearly in half. The trick is to first quantize without forcing points onto the E8 lattice. We can then use a narrower search range around the block scale obtained that way.
Co-authored-by: Iwan Kawrakow <[email protected]>

…ov#5325)
* Avoid duplicating function calls when using MIN/MAX macros.
  Since these copy "a" and "b", they ask the compiler to evaluate one of them twice. The compiler has no problem removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice. By evaluating the expression explicitly first, we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer: https://godbolt.org/z/Ee4KMrvKh
  Code behaves exactly the same.
* Update ggml.c
---------
Co-authored-by: Georgi Gerganov <[email protected]>
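
The double evaluation described in this commit message can be reproduced in a few lines. The sketch below is not the code from ggml.c: `MAX_MACRO`, `expensive`, and the call counter are hypothetical names chosen for illustration. It only shows that a function-like macro evaluates the winning operand a second time, while hoisting the call into a local variable evaluates it once.

```cpp
#include <cassert>

// Function-like macro: arguments are pasted in textually, so the operand
// selected by the conditional is evaluated a second time.
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

// Hypothetical stand-in for a costly call; counts how often it runs.
static int g_calls = 0;

static int expensive(int x) {
    ++g_calls;
    return x * 2;
}

// Call placed directly in the macro: expensive() runs once for the comparison
// and once more when its result is the one selected.
int max_with_macro(int x) {
    return MAX_MACRO(0, expensive(x));
}

// Call hoisted into a local first: expensive() runs exactly once.
int max_with_local(int x) {
    const int v = expensive(x);
    return MAX_MACRO(0, v);
}

// Small self-check comparing the call counts of both variants.
void self_check() {
    g_calls = 0;
    assert(max_with_macro(3) == 6 && g_calls == 2);
    g_calls = 0;
    assert(max_with_local(3) == 6 && g_calls == 1);
}
```

Both variants return the same value, which matches the commit's note that behavior is unchanged; only the number of calls differs.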

* include total "num_slots" in default_generation_settings_for_props
* cleanup total_slots return value in /props endpoint
* update /props endpoint docs with total_slots
* remove num_slots from default_generation_settings_for_props
* update /props endpoint section

* support minicpm arch.
* fix tab/space typo.
* convert minicpm model via convert-hf-gguf.py
* try to make tokenizer work
* fix bug for quantize minicpm
* fix for flake8 lint
* remove convert-minicpm.py
* fix for editorconfig
* correct minicpm model type (size)
* constants expanded for minicpm
* Minor change of the constant names for minicpm

* Initial Vulkan multi-gpu implementation
  Move most global variables into backend context
* Add names to backend device functions
* Add further missing cleanup code
* Reduce code duplication in tensor split layer assignment
* generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h
* Only do device info print in the beginning and initialize one backend for cpu assist
  Add missing cleanup code
* Rework backend memory management to make sure devices and buffers get properly allocated and freed
* Rename cpu assist free function
---------
Co-authored-by: slaren <[email protected]>

…nov#5760)
* WIP: make i-quants work for QK_K = 64
* iq2_xs: attempt to fix AVX dot product for QK_K = 64
  Tests pass, but I get gibberish.
* QK_K = 64 tests pass on ARM_NEON and Metal
  Sadly, that does not mean it actually works.
* Make CUDA compile with QK_K = 64
  Tests don't pass, plus we get misaligned access
* Q2_K: fixed bug in imatrix quantization for QK_K = 64
* iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work)
---------
Co-authored-by: Iwan Kawrakow <[email protected]>

* implement nfd for stripping accents in wpm tokenizer
* sort nfd map; reuse iterator
* use builtin tolower
* add locale include
* Simplify to_lower cases
  Co-authored-by: Jared Van Bortel <[email protected]>
---------
Co-authored-by: Jared Van Bortel <[email protected]>

* server: twice ctrl+C to exit
* std::atomic_flag
* sigint: message
* sigint: stderr
* Update examples/server/server.cpp
  Co-authored-by: Jared Van Bortel <[email protected]>
---------
Co-authored-by: Jared Van Bortel <[email protected]>
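
The "twice ctrl+C to exit" behavior with `std::atomic_flag` can be sketched roughly as follows. This is an illustration under assumptions, not the code in examples/server/server.cpp: the names `already_interrupted`, `handle_sigint`, `g_stop_requested`, and the exit code 130 are hypothetical. The message printing mentioned in the commit is left out here because stdio calls are not async-signal-safe.

```cpp
#include <atomic>
#include <cassert>
#include <csignal>
#include <cstdlib>

// Returns true if `flag` was already set, i.e. this is a repeated interrupt.
// test_and_set() atomically sets the flag and reports its previous value.
bool already_interrupted(std::atomic_flag & flag) {
    return flag.test_and_set();
}

static std::atomic_flag g_sigint_seen = ATOMIC_FLAG_INIT;

// Hypothetical stop request that a main loop would poll.
static volatile std::sig_atomic_t g_stop_requested = 0;

bool stop_requested() {
    return g_stop_requested != 0;
}

extern "C" void handle_sigint(int /*signo*/) {
    if (already_interrupted(g_sigint_seen)) {
        std::_Exit(130); // second Ctrl+C: terminate immediately
    }
    g_stop_requested = 1; // first Ctrl+C: ask for a graceful shutdown
}

void install_sigint_handler() {
    std::signal(SIGINT, handle_sigint);
}

// Self-check on a local flag, so the global handler state stays untouched.
void self_check() {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    assert(!already_interrupted(flag)); // first interrupt
    assert(already_interrupted(flag));  // second interrupt
}
```

A program would call `install_sigint_handler()` at startup and check `stop_requested()` in its main loop; a second Ctrl+C during a slow shutdown then exits right away.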

* Introduce backend GUIDs
  Initial proposed implementation of backend GUIDs (Discussed in ggerganov/ggml#741)
  Hardcoded CPU backend GUID (for now)
  Change ggml_backend_is_cpu logic to use GUID
* Remove redundant functions
  Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion
* Add spaces to match style
  Co-authored-by: slaren <[email protected]>
* Fix brace style to match
  Co-authored-by: slaren <[email protected]>
* Add void to () in function signature
  Co-authored-by: slaren <[email protected]>
* Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid
* add guids to all backends
  ggml-ci
---------
Co-authored-by: slaren <[email protected]>
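
As a rough illustration of the idea in this commit (identifying a backend by a fixed GUID rather than by name, with the CPU GUID held in a function-local static), here is a minimal sketch. The types and functions `backend_guid`, `backend`, `cpu_guid`, `guid_matches`, and `backend_is_cpu` are hypothetical stand-ins, and the byte values are arbitrary; none of this is the actual ggml API.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical 16-byte backend identifier.
using backend_guid = std::array<std::uint8_t, 16>;

// Hypothetical backend handle carrying a pointer to its GUID.
struct backend {
    const backend_guid * guid;
    const char *         name;
};

// The CPU GUID lives in a function-local static, so there is one instance
// and it is initialized on first use. The bytes are arbitrary.
const backend_guid * cpu_guid() {
    static const backend_guid guid = {{
        0x10, 0x32, 0x54, 0x76, 0x98, 0xba, 0xdc, 0xfe,
        0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef,
    }};
    return &guid;
}

// Byte-wise comparison of two identifiers.
bool guid_matches(const backend_guid * a, const backend_guid * b) {
    return *a == *b;
}

// A backend counts as the CPU backend if its GUID equals the CPU GUID.
bool backend_is_cpu(const backend * b) {
    return b != nullptr && guid_matches(b->guid, cpu_guid());
}

// Self-check with one CPU backend and one backend using a different GUID.
void self_check() {
    const backend_guid other = {{0xff}};
    const backend cpu = {cpu_guid(), "CPU"};
    const backend gpu = {&other, "GPU"};
    assert(backend_is_cpu(&cpu));
    assert(!backend_is_cpu(&gpu));
}
```

Comparing identifiers instead of names keeps the check independent of display strings, which is one plausible reason for the change described above.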

* add magika inference example
* ggml : fix unaligned accesses in custom ops
* ggml : fix FP32 GELU for values that exceed the FP16 range
* use ggml_pool_1d
* add README
* Update README.md
* pad inputs if the files are too small
* cleanup
  ggml-ci

* Add support for starcoder2
* handle rope type
* skip rope freq and rotary embeddings from being serialized
* resolve comments
* Update llama.cpp
* remove redundant changes
* handle `rope-theta`
* llama : change starcoder2 rope type
* address comment
---------
Co-authored-by: Georgi Gerganov <[email protected]>

github-actions bot and others added 30 commits February 4, 2024 08:45

[SYCL] Fix cpy with dims of 3 (ggerganov#5289)

4833ac2

* Fix cpy with dims of 3
* rm asserts
---------
Co-authored-by: Abhilash Majumder <[email protected]>

readme : add CodeShell models to the supported models list (ggerganov#5330)

5d55b0c

scripts : add non-interactive server-llm.sh (ggerganov#5303)

4be04c8

* Update server-llm.sh
  Add flag --non-interactive that allows run script without asking a permission
* Update scripts/server-llm.sh
---------
Co-authored-by: Georgi Gerganov <[email protected]>

scripts : fix typos, cleanup (ggerganov#5303)

30679d4

common : add dynamic temperature parameters to main example cli (ggerganov#5295)

e6f8177

* added dynamic temp params in main
* added help text

server : allow to get default generation settings for completion (ggerganov#5307)

a2d60c9

py : fix internlm2-hf convert to gguf (ggerganov#5305)

7e1ae37

* py : fix internlm2-hf convert to gguf
* ggml-ci

iq3_xxs: quards for the no-imatrix situation (ggerganov#5334)

89503dc

Co-authored-by: Iwan Kawrakow <[email protected]>

ggml : make use of ggml-quants.h possible in C++ code (ggerganov#5338)

c6b3955

* Make use of ggml-quants.h possible in C++ code
* One cannot possibly be defining static_assert in a C++ compilation
---------
Co-authored-by: Iwan Kawrakow <[email protected]>

README: updated introduction (ggerganov#5343)

78b00dd

* README: updated introduction
* readme : update
---------
Co-authored-by: Georgi Gerganov <[email protected]>

make: Use ccache for faster compilation (ggerganov#5318)

098f6d7

* make: Use ccache for faster compilation

py : handle byte tokens in get_token_type (ggerganov#5341)

906cff5

* py : handle byte tokens in `get_token_type`
* py : fix empty bytes arg

server : various fixes for the prompt field in /completion (ggerganov#5300)

4ffc7a1

server : fix deadlock when prompt array contains strings and numbers
server : removed an unnecessary generation when generating multi-prompts
server : removed an unnecessary assert

server : add dynatemp_range and dynatemp_exponent (ggerganov#5352)

31e7903

* server: added `dynatemp_range` and `dynatemp_exponent`
* Update README.md
---------
Co-authored-by: Michael Coppola <[email protected]>

server : include total "num_slots" in props endpoint (ggerganov#5349)

8a79c59

CUDA: mul_mat_vec_q for batch sizes > 1 (ggerganov#5351)

2c51661

readme : add phi, orion 14b, internlm2, and yi-VL to readme (ggerganov#5362)

2e9c0bd

Slight quantization improvement for Q4_K and Q5_K (ggerganov#5361)

f57fadc

* Q4_K: slightly better quantization
* Q5_K: slightly better quantization
---------
Co-authored-by: Iwan Kawrakow <[email protected]>

Update README.md (ggerganov#5366)

b08f22c

Add some links to quantization related PRs

CUDA: mul_mat_vec_q max. batch size 8 -> 4 (ggerganov#5370)

17c97fb

server : remove model.json endpoint (ggerganov#5371)

213d143

convert : fix TypeError on GPT-2 vocab.json (ggerganov#5288)

f68664a

readme : update ui list (ggerganov#5354)

9a697d8

readme : modernize (ggerganov#5379)

ed0bf32

* first cleanup, update everything to Llama 2 and remove outdated content
* Delete SHA256SUMS
* make build instructions generic
* recommend Q4_K_M quantization method
* Update README.md

ikawrakow and others added 28 commits February 28, 2024 10:37

server : add "/chat/completions" alias for "/v1/...` (ggerganov#5722)

efc7225

* Add "/chat/completions" as alias for "/v1/chat/completions"
* merge to upstream master
* minor : fix trailing whitespace
---------
Co-authored-by: Georgi Gerganov <[email protected]>

readme : add link to LLaVA 1.6 models (ggerganov#5758)

6c44168

Signed-off-by: Daniel Bevenius <[email protected]>

llama : fix non-quantization of expert gating tensors (ggerganov#5754)

adcb12a

This reverts a single line from ggerganov#5475

sync : ggml

8c0e8f4

awq-py : remove (ggerganov#5768)

78aacf3

llama : remove deprecated API (ggerganov#5770)

08c5ee8

ggml-ci

make portability_enumeration_ext apple only (ggerganov#5757)

317709b

ci : reduce 3b ppl chunks to 1 to avoid timeout (ggerganov#5771)

87c91c0

ggml-ci

llama : constified llama_set_state_data's src (ggerganov#5774)

d5ab297

Server: normalize naming (ggerganov#5779)

052051d

* server: normalize naming
* fix spacing

Merge tag 'b2297' of https://github.com/ggerganov/llama.cpp

e841b7a

[SYCL] Use batched mul_mat pathway (ggerganov#5591)

38d1521

* Use batched mul_mat pathway
* rm extra line
* Explicitly state scaled data type
---------
Co-authored-by: Abhilash Majumder <[email protected]>

server : fix newlines in help (ggerganov#5785)

f105471

ci : add Ubuntu 22 Vulkan CI run (ggerganov#5789)

6ea0f01

server: allow to override threads server pool with --threads-http (ggerganov#5794)

5cb02b4

unicode : switch to multimap based nfd_map (ggerganov#5799)

9600d59

* switch to multimap based nfd_map due to compile time issues
* simplify multimap keys
* dont construct new locale every time
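
To give a rough idea of what a multimap-based NFD table for accent stripping can look like, here is a small sketch. It assumes a layout where each code point maps to the code points of its canonical decomposition; the three-entry table, `k_nfd_map`, `is_combining_mark`, and `strip_accents` are hypothetical, and the table actually used in llama.cpp may be organized differently.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Tiny illustrative table: code point -> code points of its decomposition.
// Entries with the same key keep their insertion order in a std::multimap.
static const std::multimap<std::uint32_t, std::uint32_t> k_nfd_map = {
    {0x00E9, 0x0065}, {0x00E9, 0x0301}, // e with acute    -> e + combining acute
    {0x00F1, 0x006E}, {0x00F1, 0x0303}, // n with tilde    -> n + combining tilde
    {0x00FC, 0x0075}, {0x00FC, 0x0308}, // u with diaeresis -> u + combining diaeresis
};

// Only the Combining Diacritical Marks block is treated as an accent here.
static bool is_combining_mark(std::uint32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Decompose each code point through the table, then drop combining marks.
std::u32string strip_accents(const std::u32string & text) {
    std::u32string out;
    for (char32_t cp : text) {
        const auto range = k_nfd_map.equal_range(cp);
        if (range.first == range.second) {
            // No decomposition known: keep the code point unless it is a mark.
            if (!is_combining_mark(cp)) {
                out.push_back(cp);
            }
            continue;
        }
        for (auto it = range.first; it != range.second; ++it) {
            if (!is_combining_mark(it->second)) {
                out.push_back(static_cast<char32_t>(it->second));
            }
        }
    }
    return out;
}

// Self-check on two short words containing accented letters.
void self_check() {
    assert(strip_accents(U"caf\u00E9") == U"cafe");
    assert(strip_accents(U"ni\u00F1o") == U"nino");
}
```

A lookup via `equal_range` returns all decomposition entries for one code point, so variable-length decompositions need no fixed-size array per entry.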

llama : cleanup unused mmq flags (ggerganov#5772)

3ab8b3a

* cleanup unused --no-mul-mat-q,-nommq, -mmq, --mul-mat-q, mul_mat_q
* remove: mul_mat_q in compare llama bench and usage
* update llama-bench
---------
Co-authored-by: slaren <[email protected]>

common : fix flag --logits-all to --all-logits (ggerganov#5805)

f49a535

gemma : fix bfloat16 -> float16 conversion issue (ggerganov#5810)

e743386

ggml-vulkan: fix VULKAN_CHECK_RESULTS flag, which was previously broken (ggerganov#5813)

c2224f0

server : remove api_like_OAI.py proxy script (ggerganov#5808)

38d16b1

Merge branch 'master' of https://github.com/ggerganov/llama.cpp into b2297

c504a54

boxbeam changed the title ~~B2297~~ Update llama.cpp to latest Mar 1, 2024

wsxiaoys merged commit 807ee66 into master Mar 1, 2024
44 of 64 checks passed
