forked from ggerganov/llama.cpp
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update llama.cpp to latest #6
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Flake lock file updates: • Updated input 'flake-parts': 'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11) → 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01) • Updated input 'flake-parts/nixpkgs-lib': 'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30) → 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29) • Updated input 'nixpkgs': 'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25) → 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
* Fix cpy with dims of 3 * rm asserts --------- Co-authored-by: Abhilash Majumder <[email protected]>
* Update server-llm.sh Add flag --non-interactive that allows run script without asking a permission * Update scripts/server-llm.sh --------- Co-authored-by: Georgi Gerganov <[email protected]>
…ganov#5295) * added dynamic temp params in main * added help text
We get slightly better PPL, and we cut quantization time in nearly half. The trick is to 1st quantize without forcing points onto the E8-lattice. We can then use a narrower search range around the block scale that we got that way. Co-authored-by: Iwan Kawrakow <[email protected]>
* py : fix internlm2-hf convert to gguf * ggml-ci
Co-authored-by: Iwan Kawrakow <[email protected]>
…ov#5325) * Avoid duplicating function calls when using MIN/MAX macros. Since these copy "a" and "b" they ask the compiler to evaluate one of them twice. The compiler doesn't have a problem with removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls just happen twice. By explicitly evaluating at the expression we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer: https://godbolt.org/z/Ee4KMrvKh Code behaves exactly the same. * Update ggml.c --------- Co-authored-by: Georgi Gerganov <[email protected]>
* Make use of ggml-quants.h possible in C++ code * One cannot possibly be defining static_assert in a C++ compilation --------- Co-authored-by: Iwan Kawrakow <[email protected]>
* README: updated introduction * readme : update --------- Co-authored-by: Georgi Gerganov <[email protected]>
* make: Use ccache for faster compilation
* py : handle byte tokens in `get_token_type` * py : fix empty bytes arg
…#5300) server : fix deadlock when prompt array contains strings and numbers server : removed an unnecessary generation when generating multi-prompts server : removed an unnecessary assert
* server: added `dynatemp_range` and `dynatemp_exponent` * Update README.md --------- Co-authored-by: Michael Coppola <[email protected]>
* Q4_K: slightly better quantization * Q5_K: slightly better quantization --------- Co-authored-by: Iwan Kawrakow <[email protected]>
Add some links to quantization related PRs
* include total "num_slots" in default_generation_settings_for_props * cleanup total_slots return value in /props endpoint * update /props endpoint docs with total_slots * remove num_slots from default_generation_settings_for_props * update /props endpoint section
* support minicpm arch. * fix tab/space typo. * convert minicpm model via convert-hf-gguf.py * try to make tokenizer work * fix bug for quantize minicpm * fix for flake8 lint * remove convert-minicpm.py * fix for editorconfig * correct minicpm model type (size) * constants expanded for minicpm * Minor change of the constant names for minicpm
* first cleanup, update everything to Llama 2 and remove outdated content * Delete SHA256SUMS * make build instructions generic * recommend Q4_K_M quantization method * Update README.md
* Initial Vulkan multi-gpu implementation Move most global variables into backend context * Add names to backend device functions * Add further missing cleanup code * Reduce code duplication in tensor split layer assignment * generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h * Only do device info print in the beginning and initialize one backend for cpu assist Add missing cleanup code * Rework backend memory management to make sure devices and buffers get properly allocated and freed * Rename cpu assist free function --------- Co-authored-by: slaren <[email protected]>
…nov#5760) * WIP: make i-quants work for QK_K = 64 * iq2_xs: attempt to fix AVX dot product for QK_K = 64 Tests pass, but I get gibberish. * QK_K = 64 tests pass on ARM_NEON and Metal Sadly, that does not mean it actually works. * Make CUDA compile with QK_K = 64 Tests don't pass, plus we get misaligned access * Q2_K: fixed bug in imatrix quantization for QK_K = 64 * iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work) --------- Co-authored-by: Iwan Kawrakow <[email protected]>
* Add "/chat/completions" as alias for "/v1/chat/completions" * merge to upstream master * minor : fix trailing whitespace --------- Co-authored-by: Georgi Gerganov <[email protected]>
Signed-off-by: Daniel Bevenius <[email protected]>
* implement nfd for stripping accents in wpm tokenizer * sort nfd map; reuse iterator * use builtin tolower * add locale include * Simplify to_lower cases Co-authored-by: Jared Van Bortel <[email protected]> --------- Co-authored-by: Jared Van Bortel <[email protected]>
This reverts a single line from ggerganov#5475
* server: twice ctrl+C to exit * std::atomic_flag * sigint: message * sigint: stderr * Update examples/server/server.cpp Co-authored-by: Jared Van Bortel <[email protected]> --------- Co-authored-by: Jared Van Bortel <[email protected]>
* Introduce backend GUIDs Initial proposed implementation of backend GUIDs (Discussed in ggerganov/ggml#741) Hardcoded CPU backend GUID (for now) Change ggml_backend_is_cpu logic to use GUID * Remove redundant functions Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion * Add spaces to match style Co-authored-by: slaren <[email protected]> * Fix brace style to match Co-authored-by: slaren <[email protected]> * Add void to () in function signature Co-authored-by: slaren <[email protected]> * Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid * add guids to all backends ggml-ci --------- Co-authored-by: slaren <[email protected]>
* add magika inference example * ggml : fix unaligned accesses in custom ops * ggml : fix FP32 GELU for values that exceed the FP16 range * use ggml_pool_1d * add README * Update README.md * pad inputs if the files are too small * cleanup ggml-ci
* server: normalize naming * fix spacing
* Use batched mul_mat pathway * rm extra line * Explicitly state scaled data type --------- Co-authored-by: Abhilash Majumder <[email protected]>
* switch to multimap based nfd_map due to compile time issues * simplify multimap keys * dont construct new locale every time
* cleanup unused --no-mul-mat-q,-nommq, -mmq, --mul-mat-q, mul_mat_q * remove: mul_mat_q in compare llama bench and usage * update llama-bench --------- Co-authored-by: slaren <[email protected]>
* Add support for starcoder2 * handle rope type * skip rope freq and rotary embeddings from being serialized * resolve comments * Update llama.cpp * remove redundant changes * handle `rope-theta` * llama : change starcoder2 rope type * address comment --------- Co-authored-by: Georgi Gerganov <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.