Update llama.cpp to latest #6

Merged: 343 commits (Mar 1, 2024)
This pull request is big! We’re only showing the most recent 250 commits.

Commits on Feb 4, 2024

  1. flake.lock: Update

    Flake lock file updates:
    
    • Updated input 'flake-parts':
        'github:hercules-ci/flake-parts/07f6395285469419cf9d078f59b5b49993198c00' (2024-01-11)
      → 'github:hercules-ci/flake-parts/b253292d9c0a5ead9bc98c4e9a26c6312e27d69f' (2024-02-01)
    • Updated input 'flake-parts/nixpkgs-lib':
        'github:NixOS/nixpkgs/b0d36bd0a420ecee3bc916c91886caca87c894e9?dir=lib' (2023-12-30)
      → 'github:NixOS/nixpkgs/97b17f32362e475016f942bbdfda4a4a72a8a652?dir=lib' (2024-01-29)
    • Updated input 'nixpkgs':
        'github:NixOS/nixpkgs/ae5c332cbb5827f6b1f02572496b141021de335f' (2024-01-25)
      → 'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
    github-actions[bot] authored and philiptaron committed Feb 4, 2024
    Commit: 9392ebd

Commits on Feb 5, 2024

  1. [SYCL] Fix cpy with dims of 3 (ggerganov#5289)

    * Fix cpy with dims of 3
    
    * rm asserts
    
    ---------
    
    Co-authored-by: Abhilash Majumder <[email protected]>
    AidanBeltonS and abhilash1910 authored Feb 5, 2024
    Commit: 4833ac2
  2. Commit 5d55b0c
  3. scripts : add non-interactive server-llm.sh (ggerganov#5303)

    * Update server-llm.sh
    
    Add flag --non-interactive that allows running the script without asking for permission
    
    * Update scripts/server-llm.sh
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    garrnizon and ggerganov authored Feb 5, 2024
    Commit: 4be04c8
  4. Commit 30679d4
  5. common : add dynamic temperature parameters to main example cli (gger…

    …ganov#5295)
    
    * added dynamic temp params in main
    
    * added help text
    l3utterfly authored Feb 5, 2024
    Commit: e6f8177
  6. Commit a2d60c9
  7. iq2_xxs: tune quantization (ggerganov#5320)

    We get slightly better PPL, and we cut quantization time nearly in half.
    
    The trick is to first quantize without forcing points onto the E8 lattice.
    We can then use a narrower search range around the block scale obtained
    that way.
    
    Co-authored-by: Iwan Kawrakow <[email protected]>
    ikawrakow and Kawrakow authored Feb 5, 2024
    Commit: 6fdfa2e
  8. py : fix internlm2-hf convert to gguf (ggerganov#5305)

    * py : fix internlm2-hf convert to gguf
    
    * ggml-ci
    SolenoidWGT authored Feb 5, 2024
    Commit: 7e1ae37
  9. Commit 89503dc
  10. ggml : avoid duplicating function calls using MIN/MAX macros (ggergan…

    …ov#5325)
    
    * Avoid duplicating function calls when using MIN/MAX macros.
    
    Since these macros copy "a" and "b", they ask the compiler to evaluate one of them twice. The compiler has no problem removing the duplication in something like MAX(0, x + 2), but in some cases we're calling functions, and those calls simply happen twice.
    By evaluating the expression into a local first, we get smaller and faster code without duplicate calls. See ggml_rope_yarn_corr_dims in Compiler Explorer:
    
    https://godbolt.org/z/Ee4KMrvKh
    
    Code behaves exactly the same.
    
    * Update ggml.c
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    tom7 and ggerganov authored Feb 5, 2024
    Commit: abb6194 (a short sketch of the macro pitfall follows this list)
  11. ggml : make use of ggml-quants.h possible in C++ code (ggerganov#5338)

    * Make use of ggml-quants.h possible in C++ code
    
    * One cannot possibly be defining static_assert in a C++ compilation
    
    ---------
    
    Co-authored-by: Iwan Kawrakow <[email protected]>
    ikawrakow and Kawrakow authored Feb 5, 2024
    Commit: c6b3955
  12. README: updated introduction (ggerganov#5343)

    * README: updated introduction
    
    * readme : update
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    JohannesGaessler and ggerganov authored Feb 5, 2024
    Commit: 78b00dd
  13. make: Use ccache for faster compilation (ggerganov#5318)

    * make: Use ccache for faster compilation
    JohannesGaessler authored Feb 5, 2024
    Commit: 098f6d7

Commits on Feb 6, 2024

  1. py : handle byte tokens in get_token_type (ggerganov#5341)

    * py : handle byte tokens in `get_token_type`
    
    * py : fix empty bytes arg
    ggerganov authored Feb 6, 2024
    Commit: 906cff5
  2. server : various fixes for the prompt field in /completion (ggerganov…

    …#5300)
    
    server : fix deadlock when prompt array contains strings and numbers
    
    server : removed an unnecessary generation when generating multi-prompts
    
    server : removed an unnecessary assert
    Niall- authored Feb 6, 2024
    Commit: 4ffc7a1
  3. server : add dynatemp_range and dynatemp_exponent (ggerganov#5352)

    * server: added `dynatemp_range` and `dynatemp_exponent`
    
    * Update README.md
    
    ---------
    
    Co-authored-by: Michael Coppola <[email protected]>
    m18coppola and Michael Coppola authored Feb 6, 2024
    Commit: 31e7903 (an illustrative dynamic-temperature sketch follows this list)
  4. Commit 8a79c59
  5. Commit 2c51661
  6. Commit 2e9c0bd
  7. Slight quantization improvement for Q4_K and Q5_K (ggerganov#5361)

    * Q4_K: slightly better quantization
    
    * Q5_K: slightly better quantization
    
    ---------
    
    Co-authored-by: Iwan Kawrakow <[email protected]>
    ikawrakow and Kawrakow authored Feb 6, 2024
    Commit: f57fadc
  8. Update README.md (ggerganov#5366)

    Add some links to quantization related PRs
    ikawrakow authored Feb 6, 2024
    Commit: b08f22c
  9. Commit 17c97fb
  10. Commit 213d143
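
For item 3 above (`dynatemp_range` / `dynatemp_exponent`), here is one common formulation of entropy-based dynamic temperature, shown as an illustrative sketch. The function name and the exact mapping are assumptions, not necessarily the repository's implementation.

```c
/* Illustrative sketch: one common formulation of entropy-based dynamic
 * temperature. The mapping below (scale between t - range and t + range by
 * normalized entropy raised to `exponent`) is an assumption, not necessarily
 * the repository's exact implementation. */
#include <math.h>

static float dynamic_temperature(const float *probs, int n,
                                 float t, float range, float exponent) {
    if (range <= 0.0f || n <= 1) {
        return t;                                    /* dynamic scaling disabled */
    }
    float entropy = 0.0f;
    for (int i = 0; i < n; i++) {
        if (probs[i] > 0.0f) {
            entropy -= probs[i] * logf(probs[i]);
        }
    }
    const float max_entropy = logf((float) n);       /* entropy of a uniform distribution */
    const float norm        = entropy / max_entropy; /* in [0, 1] */
    const float min_t       = fmaxf(0.0f, t - range);
    const float max_t       = t + range;
    return min_t + (max_t - min_t) * powf(norm, exponent);
}
```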

Commits on Feb 7, 2024

  1. Commit f68664a
  2. server : update /props with "total_slots" value (ggerganov#5373)

    * include total "num_slots" in default_generation_settings_for_props
    
    * cleanup total_slots return value in /props endpoint
    
    * update /props endpoint docs with total_slots
    
    * remove num_slots from default_generation_settings_for_props
    
    * update /props endpoint section
    jparkerweb authored Feb 7, 2024
    Commit: f3e2b4f
  3. llama : add MiniCPM support (ggerganov#5346)

    * support minicpm arch.
    
    * fix tab/space typo.
    
    * convert minicpm model via convert-hf-gguf.py
    
    * try to make tokenizer work
    
    * fix bug for quantize minicpm
    
    * fix for flake8 lint
    
    * remove convert-minicpm.py
    
    * fix for editorconfig
    
    * correct minicpm model type (size)
    
    * constants expanded for minicpm
    
    * Minor change of the constant names for minicpm
    runfuture authored Feb 7, 2024
    Commit: 316c7fa
  4. Commit 9a697d8
  5. readme : modernize (ggerganov#5379)

    * first cleanup, update everything to Llama 2 and remove outdated content
    
    * Delete SHA256SUMS
    
    * make build instructions generic
    
    * recommend Q4_K_M quantization method
    
    * Update README.md
    netrunnereve authored Feb 7, 2024
    Commit: ed0bf32
  6. Basic Vulkan Multi-GPU implementation (ggerganov#5321)

    * Initial Vulkan multi-gpu implementation
    
    Move most global variables into backend context
    
    * Add names to backend device functions
    
    * Add further missing cleanup code
    
    * Reduce code duplication in tensor split layer assignment
    
    * generalize LLAMA_SPLIT_LAYER for all backends, do not expose device count and memory in llama.h
    
    * Only do device info print in the beginning and initialize one backend for cpu assist
    
    Add missing cleanup code
    
    * Rework backend memory management to make sure devices and buffers get properly allocated and freed
    
    * Rename cpu assist free function
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    0cc4m and slaren authored Feb 7, 2024
    Commit: ee1628b
  7. llava-cli : always tokenize special tokens (ggerganov#5382)

    * llava-cli: tokenize special tokens in prompt
    
    * llava-cli: use the escape CLI argument, remove incomplete separate escaping process
    jxy authored Feb 7, 2024
    Commit: 0ef46da
  8. Commit 10afa6f
  9. Commit aa7ab99
  10. Commit b906596
  11. fix typo in readme (ggerganov#5399)

    Co-authored-by: Ebey Abraham <[email protected]>
    ebeyabraham and eabraham-1 authored Feb 7, 2024
    Commit: 8c933b7
  12. Commit c4fbb67

Commits on Feb 8, 2024

  1. Commit 8504d2d
  2. sampling: fix top_k <= 0 (ggerganov#5388)

    * sampling: fix top_k <= 0
    
    * Update llama.cpp
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    JohannesGaessler and ggerganov authored Feb 8, 2024
    Commit: 26d4efd
  3. llava: fix typo/formatting in README.md (ggerganov#5405)

    This commit fixes a typo in the README.md file for the llava example
    which is causing the formatting to look a little off:
    
    Clone llava-v15-7b`` and clip-vit-large-patch14-336`` locally
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    danbev authored Feb 8, 2024
    Commit: a6e514a
  4. llama : fix MiniCPM (ggerganov#5392)

    * fix bug for norm_rms_eps missing
    
    * to align with the same order as convert.py for model write
    
    * fix: undo HF models permute tensor
    
    * update for flake8 lint
    runfuture authored Feb 8, 2024
    Commit: 4aa43fa
  5. Commit b7b74ce
  6. llava : add missing .py, and fix paths in README.md (ggerganov#5414)

    This commit adds the missing .py extension to the convert-image-encoder-to-gguf
    script. It also fixes the paths for the `model` and `mmproj` options in the
    example llava-cli command.
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    danbev authored Feb 8, 2024
    Commit: ff4ff05
  7. Fix f16_sycl cpy call from Arc (ggerganov#5411)

    * fix f16_sycl cpy call
    
    * rm old logic
    
    * add fp16 build CI
    
    * use macro
    
    * format fix
    abhilash1910 authored Feb 8, 2024
    Commit: 6e99f2a
  8. Commit 41f308f
  9. Commit 8e6a9d2

Commits on Feb 9, 2024

  1. Fix Vulkan crash on APUs with very little device memory (ggerganov#5424)

    * Fix Vulkan crash on APUs with very little device memory
    
    * Fix debug output function names
    0cc4m authored Feb 9, 2024
    Commit: 44fbe34
  2. Commit b2f87cb
  3. Commit e4124c2
  4. llama : do not cap thread count when MoE on CPU (ggerganov#5419)

    * Not capping thread count when MoE inference is running on CPU
    
    * Whitespace
    ptsochantaris authored Feb 9, 2024
    Commit: e5ca393
  5. Commit 7c777fc
  6. llava : add requirements.txt and update README.md (ggerganov#5428)

    * llava: add requirements.txt and update README.md
    
    This commit adds a `requirements.txt` file to the `examples/llava`
    directory. This file contains the required Python packages to run the
    scripts in the `examples/llava` directory.
    
    The motivation for this is to make it easier for users to run the scripts in
    `examples/llava`, and to avoid them running into missing-package issues if
    the packages are not installed on their system.
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    
    * llava: fix typo in llava-surgery.py output
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    
    ---------
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    danbev authored Feb 9, 2024
    Commit: e00d2a6
  7. vulkan: Set limit for task concurrency (ggerganov#5427)

    A common default for the maximum number of open files is 256, which can
    lead to `asyncio.gather(*tasks)` failing with Too many open files.
    
        $ python ggml_vk_generate_shaders.py --glslc=$ANDROID_NDK_PATH/shader-tools/darwin-x86_64/glslc
        ggml_vulkan: Generating and compiling shaders to SPIR-V
        Traceback (most recent call last):
          File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2326, in <module>
            asyncio.run(main())
          File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/runners.py", line 44, in run
            return loop.run_until_complete(main)
          File "/Users/neuman/Code.noindex/miniforge3/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
            return future.result()
          File "/Users/neuman/Code.noindex/github/llama.cpp/ggml_vk_generate_shaders.py", line 2294, in main
            await asyncio.gather(*tasks)
        [...snip...]
        OSError: [Errno 24] Too many open files
    
    This change sets a reasonable concurrency limit for tasks (and therefore
    open files), without significant impact on run time.
    luciferous authored Feb 9, 2024
    Commit: 4b7b38b

Commits on Feb 10, 2024

  1. ggml : add abort_callback for cpu backend (ggml/725)

    * a way to use abort_callback with the cpu backend
    
    * whisper update
    Xarbirus authored and ggerganov committed Feb 10, 2024
    Commit: 4633d93
  2. sync : ggml

    ggerganov committed Feb 10, 2024
    Commit: 43b65f5
  3. Commit cd9aea6
  4. metal : use autoreleasepool to avoid memory leaks (ggerganov#5437)

    There appears to be a known memory leak when using the
    `MTLCommandBuffer`. It is suggested to use `@autoreleasepool` in
    [1,2]
    
    [1] https://developer.apple.com/forums/thread/662721
    [2] https://forums.developer.apple.com/forums/thread/120931
    
    This change-set wraps the `ggml_metal_graph_compute` in a
    `@autoreleasepool`.
    
    This commit addresses ggerganov#5436
    irbull authored Feb 10, 2024
    Commit: f026f81

Commits on Feb 11, 2024

  1. server : add llama2 chat template (ggerganov#5425)

    * server: add mistral chat template
    
    * server: fix typo
    
    * server: rename template mistral to llama2
    
    * server: format_llama2: remove BOS
    
    * server: validate "--chat-template" argument
    
    * server: clean up using_chatml variable
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    
    ---------
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    ngxson and cebtenzzre authored Feb 11, 2024
    Commit: 907e08c
  2. Commit e4640d8
  3. ggml : add mmla kernels for quantized GEMM (ggerganov#4966)

    * ggml: aarch64: implement smmla kernel for q8_0_q8_0 quantized gemm
    
    armv8.2-a and above support MMLA instructions that have higher
    throughput than DOT. This commit adds an mmla kernel for
    q8_0_q8_0 gemm. The feature is enabled if the platform supports
    "__ARM_FEATURE_MATMUL_INT8".
    
    On AWS Graviton3 processors this kernel resulted in up to a 1.5x
    improvement in prompt evaluation throughput compared to the
    default sdot kernel.
    
    * ggml: aarch64: implement smmla kernel for q4_0_q8_0 quantized gemm
    
    armv8.2-a and above support MMLA instructions that have higher
    throughput than DOT. This commit adds an mmla kernel for
    q4_0_q8_0 gemm. The feature is enabled if the platform supports
    "__ARM_FEATURE_MATMUL_INT8".
    
    On AWS Graviton3 processors this kernel resulted in up to a 1.5x
    improvement in prompt evaluation throughput compared to the
    default sdot kernel.
    
    * ggml: aarch64: implement smmla kernel for q4_1_q8_1 quantized gemm
    
    armv8.2-a and above support MMLA instructions that have higher
    throughput than DOT. This commit adds an mmla kernel for
    q4_1_q8_1 gemm. The feature is enabled if the platform supports
    "__ARM_FEATURE_MATMUL_INT8".
    
    On AWS Graviton3 processors this kernel resulted in up to a 1.5x
    improvement in prompt evaluation throughput compared to the
    default sdot kernel.
    
    * ggml: update unit tests for the new vec_dot interface
    
    * llama.cpp: add MATMUL_INT8 capability to system_info
    snadampal authored Feb 11, 2024
    Commit: a07d0fe (a small SMMLA intrinsics sketch follows this list)
  4. Commit 0f2411f
  5. Commit 139b62a
  6. Commit 85910c5
  7. server : allow to specify tokens as strings in logit_bias (ggerganov#…

    …5003)
    
    * server: allow to specify tokens as strings in logit_bias
    
    * Apply suggestions from code review
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    z80maniac and ggerganov authored Feb 11, 2024
    Commit: 6847801
  8. common : use enums for sampler types (ggerganov#5418)

    * common: use enums for sampler types
    
    * Apply suggestions from code review
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * minor : spaces
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    z80maniac and ggerganov authored Feb 11, 2024
    Commit: a803333
  9. vulkan: only use M-sized matmul on Apple GPUs (ggerganov#5412)

    * vulkan: refactor guess_matmul_pipeline for vendor
    
    Refactor ggml_vk_guess_matmul_pipeline to simplify adding per-vendor
    conditionals.
    
    Signed-off-by: Sergio Lopez <[email protected]>
    
    * vulkan: only use M-sized matmul on Apple GPUs
    
    L-sized and S-sized matmuls are broken on Apple GPUs, force using
    M-size with this vendor.
    
    Signed-off-by: Sergio Lopez <[email protected]>
    
    ---------
    
    Signed-off-by: Sergio Lopez <[email protected]>
    slp authored Feb 11, 2024
    Commit: c88c74f
  10. flake.lock: Update

    Flake lock file updates:
    
    • Updated input 'nixpkgs':
        'github:NixOS/nixpkgs/b8b232ae7b8b144397fdb12d20f592e5e7c1a64d' (2024-01-31)
      → 'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
    github-actions[bot] authored and philiptaron committed Feb 11, 2024
    Commit: 97a3365
  11. Add support for BERT embedding models (ggerganov#5423)

    * BERT model graph construction (build_bert)
    * WordPiece tokenizer (llm_tokenize_wpm)
    * Add flag for non-causal attention models
    * Allow for models that only output embeddings
    * Support conversion of BERT models to GGUF
    * Based on prior work by @xyzhang626 and @skeskinen
    
    ---------
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    Co-authored-by: Jared Van Bortel <[email protected]>
    Co-authored-by: Georgi Gerganov <[email protected]>
    4 people authored Feb 11, 2024
    Commit: 2891c8a
  12. CUDA: mul_mat_vec_q tiling, refactor mul mat logic (ggerganov#5434)

    * CUDA: mul_mat_vec_q tiling, refactor mul mat logic
    
    Co-authored-by: slaren <[email protected]>
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    JohannesGaessler and slaren authored Feb 11, 2024
    Commit: 3bdc4cd
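
To make item 3 above more concrete, the sketch below shows the building block that commit relies on: with `__ARM_FEATURE_MATMUL_INT8`, a single `vmmlaq_s32` (SMMLA) accumulates a 2x2 block of int32 dot products from two 2x8 tiles of int8 values. The helper and its layout are illustrative, not the actual ggml kernels.

```c
/* Illustrative sketch, not the ggml kernels: with __ARM_FEATURE_MATMUL_INT8,
 * one vmmlaq_s32 (SMMLA) accumulates a 2x2 block of int32 dot products from
 * two 2x8 tiles of int8 values. */
#include <arm_neon.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MATMUL_INT8)
/* Hypothetical helper: dot products of rows a0,a1 against rows b0,b1,
 * each row holding n int8 values, n a multiple of 8.
 * out = { a0.b0, a0.b1, a1.b0, a1.b1 } */
static void dot2x2_i8(const int8_t *a0, const int8_t *a1,
                      const int8_t *b0, const int8_t *b1,
                      int n, int32_t out[4]) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i < n; i += 8) {
        /* Pack two 8-byte rows into one 16-byte register (a 2x8 int8 tile). */
        int8x16_t va = vcombine_s8(vld1_s8(a0 + i), vld1_s8(a1 + i));
        int8x16_t vb = vcombine_s8(vld1_s8(b0 + i), vld1_s8(b1 + i));
        /* SMMLA: acc += (2x8 tile va) * (2x8 tile vb)^T */
        acc = vmmlaq_s32(acc, va, vb);
    }
    vst1q_s32(out, acc);
}
#endif
```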

Commits on Feb 12, 2024

  1. sync : ggml (ggerganov#5452)

    * ggml-alloc : v3 (ggml/727)
    
    * ggml-alloc v3
    
    ggml-ci
    
    * fix ci
    
    ggml-ci
    
    * whisper : check for backend buffer allocation failures
    
    * whisper : avoid leaks when initialization fails
    
    * cleanup
    
    ggml-ci
    
    * style fixes
    
    ggml-ci
    
    * sync : ggml
    
    * update llama.cpp, clip.cpp, export-lora.cpp
    
    * update finetune.cpp, train-text-from-scratch.cpp
    
    ggml-ci
    
    * ggml-backend : reduce alignment to 32 to match gguf and fix mmap
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    ggerganov and slaren authored Feb 12, 2024
    Commit: 3b16944
  2. llava : remove prog parameter from ArgumentParser (ggerganov#5457)

    * llava: remove prog parameter from ArgumentParser
    
    This commit removes the `prog` parameter from `ArgumentParser`
    so that it uses the default value which is the name of the script.
    
    The motivation for this change is that currently the usage output looks
    like this:
    ```console
    $ python examples/llava/convert-image-encoder-to-gguf.py --help
    usage: convert_hf_to_gguf.py [-h] ...
    ```
    And with this change it will look like this:
    ```console
    $ python examples/llava/convert-image-encoder-to-gguf.py --help
    usage: convert-image-encoder-to-gguf.py [-h] ...
    ```
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    
    * ci: add W503 to flake8 ignore list
    
    This commit adds W503 to the ignore list for flake8. This is done to
    avoid the following error:
    W503 line break before binary operator
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    
    ---------
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    danbev authored Feb 12, 2024
    Commit: 4a46d2b
  3. ggml-sycl: Replace 3d ops with macro (ggerganov#5458)

    * use macro
    
    * use macro
    
    * fix format
    abhilash1910 authored Feb 12, 2024
    Commit: 43fe07c
  4. py : fix persimmon n_rot conversion (ggerganov#5460)

    * convert : fix persimmon official weight conversion to write correct n_rot.
    
    * Update convert-persimmon-to-gguf.py
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    lx200916 and ggerganov authored Feb 12, 2024
    Commit: dbd8828
  5. swift : package no longer use ggml dependency (ggerganov#5465)

    * Revert "swift : update Package.swift to use ggml as dependency (ggerganov#4691)"
    
    This reverts commit ece9a45.
    
    * spm : add ggml headers
    ggerganov authored Feb 12, 2024
    Commit: df334a1
  6. Commit 099afc6

Commits on Feb 13, 2024

  1. Commit 895407f
  2. Commit 99b8b43
  3. bert : add tests + fix quantization (ggerganov#5475)

    * llama : do not quantize pos embd and token type tensors
    
    * ci : add BERT tests
    
    ggml-ci
    
    * ci : do not do BERT tests on low-perf nodes
    
    ggml-ci
    ggerganov authored Feb 13, 2024
    Commit: 49cc1f7
  4. make: add error message for bad CUDA version (ggerganov#5444)

    * make: add error message for bad CUDA version
    
    * Update Makefile
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    
    ---------
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    JohannesGaessler and cebtenzzre authored Feb 13, 2024
    Commit: ad014bb
  5. llama : support batched embeddings (ggerganov#5466)

    * batched embedding: pool outputs by sequence id. updated embedding example
    
    * bring back non-causal attention
    
    * embd : minor improvements
    
    * llama : minor
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    iamlemec and ggerganov authored Feb 13, 2024
    Commit: 03bf161 (an illustrative pooling sketch follows this list)
  6. tests : multi-thread the tokenizer tests (ggerganov#5474)

    * tests : multi-thread the tokenizer tests
    
    ggml-ci
    
    * unicode : fix data race for unidentified codepoints
    
    ggml-ci
    
    * unicode : minor style fixes
    
    ggml-ci
    ggerganov authored Feb 13, 2024
    Commit: cf45252
  7. finetune : rename feed-forward tensors (w1/w2/w3) (ggerganov#4839)

    * finetune: rename feed-forward tensors (w1/w2/w3)
    
    This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate,
    ffn_down and ffn_up respectively.
    
    The motivation for this change is to make it easier to understand the
    purpose of the tensors. This also seems to be in line with the names
    used in the llama_layer struct in llama.cpp.
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    
    * train-text-from-scratch: rename ff tensors
    
    This commit renames the feed-forward tensors w1, w2 and w3 to ffn_gate,
    ffn_down and ffn_up respectively.
    
    The motivation for this change is to make it easier to understand the
    purpose of the tensors. This also seems to be in line with the names
    used in the llama_layer struct in llama.cpp
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    
    ---------
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    danbev authored Feb 13, 2024
    Commit: 2639789
  8. llama : make load error reporting more granular (ggerganov#5477)

    Makes it easier to pinpoint where e.g. `unordered_map::at: key not found` comes from.
    akx authored Feb 13, 2024
    Commit: 037259b
  9. llama : allow raw byte in SPM vocabs; don't crash on nl 404 (ggergano…

    …v#5478)
    
    * common : don't crash if newline token is not found
    
    * common : llama_byte_to_token: allow falling back to finding just the token byte in SPM vocabs
    akx authored Feb 13, 2024
    Commit: c4e6dd5
  10. Commit ea9c8e1
  11. gguf : add python reader example (ggerganov#5216)

    * Update CMakeLists.txt
    
    * Create reader.py
    
    * Update reader.py
    
    * Update reader.py
    
    another whitespace :|
    
    * Update reader.py
    
    * lintlintlint
    cmp-nct authored Feb 13, 2024
    Commit: 6c00a06
  12. Early return for zero size calls to get_tensor. (ggerganov#5482)

    * Early return for zero size calls to get_tensor.
    
    Signed-off-by: Adam Treat <[email protected]>
    
    * Update ggml-kompute.cpp
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * Update ggml-kompute.cpp
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * Add an early return to the get/set tensor when the size is null.
    
    Signed-off-by: Adam Treat <[email protected]>
    
    * Early return after the assertions.
    
    Signed-off-by: Adam Treat <[email protected]>
    
    * Since we do the early return in the generic backend now, there is no reason to do so here as well.
    
    Signed-off-by: Adam Treat <[email protected]>
    
    ---------
    
    Signed-off-by: Adam Treat <[email protected]>
    Co-authored-by: Georgi Gerganov <[email protected]>
    manyoso and ggerganov authored Feb 13, 2024
    Commit: f5ca054
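
For item 5 above ("pool outputs by sequence id"), the sketch below shows the general idea of mean-pooling per-token embeddings into one vector per sequence. It is illustrative only; the function name and buffer layout are assumptions, not the llama.cpp code.

```c
/* Illustrative sketch only (not the llama.cpp implementation): mean-pool
 * per-token embeddings into one vector per sequence id.
 *   embd   : n_tokens x n_embd per-token embeddings (row-major)
 *   seq_id : sequence id of each token, in [0, n_seq)
 *   out    : n_seq x n_embd pooled embeddings (overwritten) */
#include <stdlib.h>
#include <string.h>

static int mean_pool_by_seq(const float *embd, const int *seq_id,
                            int n_tokens, int n_embd, int n_seq, float *out) {
    int *counts = calloc((size_t) n_seq, sizeof(int));
    if (counts == NULL) {
        return -1;
    }
    memset(out, 0, (size_t) n_seq * (size_t) n_embd * sizeof(float));
    for (int t = 0; t < n_tokens; t++) {
        const int s = seq_id[t];
        counts[s]++;
        for (int j = 0; j < n_embd; j++) {
            out[s * n_embd + j] += embd[t * n_embd + j];
        }
    }
    for (int s = 0; s < n_seq; s++) {
        for (int j = 0; counts[s] > 0 && j < n_embd; j++) {
            out[s * n_embd + j] /= (float) counts[s];
        }
    }
    free(counts);
    return 0;
}
```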

Commits on Feb 14, 2024

  1. llava : support v1.6 (ggerganov#5267)

    * Create llava-survery-v2.py
    
    * Update convert-image-encoder-to-gguf.py
    
    * Update convert-image-encoder-to-gguf.py
    
    * Rename llava-survery-v2.py to llava-surgery-v2.py
    
    * Update convert-image-encoder-to-gguf.py
    
    will now search for projector
    
    * Update convert-image-encoder-to-gguf.py
    
    whoops
    
    * Update llava-surgery-v2.py
    
    * Clip: Bugfix for normalization (it did not load the 3 std and mean values)
    Clip: bicubic resize function
    Clip: added save-to-bmp/pil for debugging and conversion from/to 32/8 images
    Clip: added normalization with FP16 precision simulation (image tensors match HF implementation, can be switched off, only used for llava-1.6)
    Clip: added newline tensor, mergetype kv, image-grid kv, new resize-pad function with resolution from gridpoints
    Clip: clip_image_preprocess now returns a float * vector instead of float, this way llava 1.5 and 1.6 is supported
    llava: added ggml cpu graph for embedding patching, added spatial_unpad preliminary support, added a lot of comments that need to be cleaned when all is final
    convert-image-encoder: fixed image-grid flattening
    
    * whitespace corrections
    
    * ws
    
    * Tensors are now properly permuted.
    Before, the embeddings were inserted 1:1; now they are split into the 24x24 patches as in the reference.
    
    * ws
    
    * added verbose_prompt support into cli
    added stopwords for llava-1.6 into cli
    
    * moved llava functions to llava.cpp, made clip.h C compatible API, replaced vector style functions with pointers, added a debug define to remove functions from compilation while not needed
    
    * ws
    
    * convert : skip unknown tensors (need for LLaVA)
    
    * llava : update readme
    
    * llava : fix compile warnings
    
    * llava : style
    
    * convert : add --skip-unknown CLI arg
    
    * server : remove clip structs
    
    * bugfix for non llava-1.6
    
    It should now work with llava-1.5 as well
    
    * clip : minor code rearrange
    
    * llava : update readme a bit
    
    ---------
    
    Co-authored-by: John <[email protected]>
    Co-authored-by: Georgi Gerganov <[email protected]>
    3 people authored Feb 14, 2024
    Commit: aa23412
  2. Commit 8084d55
  3. llava : update README.md (ggerganov#5489)

    * Update README.md
    
    * Update README.md
    
    * Update examples/llava/README.md
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    cmp-nct and ggerganov authored Feb 14, 2024
    Commit: ccbb277
  4. readme : fix typo (ggerganov#5490)

    executabhle -> executable
    Rune-AI authored Feb 14, 2024
    Commit: 594fca3

Commits on Feb 15, 2024

  1. Commit 704359e
  2. llava : hotfix for llava-1.6 image number (ggerganov#5495)

    Co-authored-by: John <[email protected]>
    cmp-nct and John authored Feb 15, 2024
    Commit: 7930a8a
  3. llava : fix memory management bug (ggerganov#5491)

    * Fix memory management in llava and server code
    
    Fixes this error:
    
    llama_new_context_with_model: graph splits (measure): 3
    Available slots:
     -> Slot 0 - max context: 6000
    {"timestamp":1707926446,"level":"INFO","function":"main","line":2623,"message":"model loaded"}
    all slots are idle and system prompt is empty, clear the KV cache
    slot 0 - loaded image
    slot 0 is processing [task id: 0]
    slot 0 : kv cache rm - [0, end)
    slot 0 - encoding image [id: 1]
    munmap_chunk(): invalid pointer
    Aborted
    
    * Make it cleaner by checking size in batch free wrapper
    Elbios authored Feb 15, 2024
    Commit: 0d41771
  4. fix(gguf-py): special tokens are no longer skipped when add_<token>_t…

    …oken is set to false (ggerganov#5487)
    
    * fix(gguf-py): special tokens are no longer skipped when add_<token>_token is set to false
    
    * fix(gguf-py): added missing cls and mask token ids to the gguf metadata
    vriesdemichael authored Feb 15, 2024
    Commit: 7312247
  5. scripts : add hf.sh helper script (ggerganov#5501)

    * scripts : add hf.sh helper scripts
    
    * hf : add error logs
    
    * hf : add support for --repo and --file
    ggerganov authored Feb 15, 2024
    Commit: 9350a1c
  6. cuda : print message when initialization fails (ggerganov#5512)

    * cuda : print message when initialization fails
    
    * use CUDA_NAME both times
    slaren authored Feb 15, 2024
    Commit: 9060a1e
  7. Commit c06e45d
  8. Use correct type of pooling for embedding models (ggerganov#5500)

    Use correct type of pooling for embedding models
    iamlemec authored Feb 15, 2024
    Commit: 4524290

Commits on Feb 16, 2024

  1. Commit 594845a
  2. llava : fix clip-model-is-vision flag in README.md (ggerganov#5509)

    * llava: fix clip-model-is-vision flag in README.md
    
    This commit fixes the flag `--clip_model_is_vision` in README.md which
    does not match the actual flag:
    ```console
    $ python convert-image-encoder-to-gguf.py --help
    ...
      --clip-model-is-vision
                            The clip model is a pure vision model
                            (ShareGPT4V vision extract for example)
    ```
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    
    * llava: update link to vit config in README.md
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    
    ---------
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    danbev authored Feb 16, 2024
    Commit: 60ed04c
  3. ggml : add numa options (ggerganov#5377)

    * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
    
    * Reverted Makefile
    
    * Fixed include
    
    * Removed sched.h from ggml.h, moved ggml_get_numa_affinity into ggml.c, removed trailing whitespace and fixed up a few inconsistent variables
    
    * removed trailing whitespace
    
    * Added numa options to allow finer grained control as well as plumbing for a new mirror mode that will require numa.h
    
    * Reverting Makefile
    
    * Fixed a number of issues with the move from BOOL to ggml_numa_strategies. Added a note about mirror mode not being implemented yet
    
    * Removing MIRROR_MODE code for this PR
    
    * Removing last bit of MIRROR_MODE code for this PR
    
    * Removing unneeded branch in server.cpp example and moving get_numa_affinity and making it static
    
    * Fixed lingering init_llama_backend() bool calls in tests and examples
    
    * Removed enum llama_numa_strategies
    
    * Revert bad merge with dynatemp flags
    
    * add missing enum ggml_numa_strategies declaration and revert sync problem with master
    
    * add missing enum ggml_numa_strategies declaration
    
    * fixed ggml_init_numa variable
    
    * Update ggml.h
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    
    * Update READMEs with info about numa flags, change INTERLEAVE strategy name to DISTRIBUTE everywhere, implement the improved distribution strategy from @rankaiyx, fix a spelling mistake and un-merge some bad merges
    
    * split numa init out from llama_backend_init and created llama_numa_init. Updated all code paths and samples
    
    * Fix up some boolean vs enum comparisons
    
    * Added #ifdefs for non-Linux OS that don't have cpu_set_t datatype
    
    * Update ggml.h
    
    Align enum values
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * Update ggml.c
    
    Remove whitespace
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * Update ggml.c
    
    align parameters
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * Update examples/server/server.cpp
    
    remove whitespace and align brace
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * Update common/common.cpp
    
    Remove whitespace and align brace
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * unified ggml_numa_strategy enum and fixed text alignment in server.cpp example
    
    * Update ggml.c
    
    simplified return for platforms without NUMA support
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    
    * removed redundant else from cli argument processing of --numa
    
    * whitespace
    
    ---------
    
    Co-authored-by: root <[email protected]>
    Co-authored-by: Jared Van Bortel <[email protected]>
    Co-authored-by: Georgi Gerganov <[email protected]>
    Co-authored-by: Jared Van Bortel <[email protected]>
    5 people authored Feb 16, 2024
    Commit: f486f6e
  4. Commit 5f5808c
  5. Commit 6dcc02d
  6. Commit 65085c7
  7. Commit 4cb0727
  8. scripts : add helpers script for bench comparing commits (ggerganov#5521

    )
    
    * scripts : add helpers script for bench comparing commits
    
    * scripts : detect CUDA
    
    * set flags after checking the command line
    
    * fix make flags
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    ggerganov and slaren authored Feb 16, 2024
    Commit: d2819d5
  9. cmake : fix VULKAN and ROCm builds (ggerganov#5525)

    * cmake : fix VULKAN and ROCm builds
    
    * cmake : fix (cont)
    
    * vulkan : fix compile warnings
    
    ggml-ci
    
    * cmake : fix
    
    ggml-ci
    
    * cmake : minor
    
    ggml-ci
    ggerganov authored Feb 16, 2024
    Commit: 5bf2b94

Commits on Feb 17, 2024

  1. Commit d250c9d
  2. ci : add an option to fail on compile warning (ggerganov#3952)

    * feat(ci): add an option to fail on compile warning
    
    * Update CMakeLists.txt
    
    * minor : fix compile warnings
    
    ggml-ci
    
    * ggml : fix unreachable code warnings
    
    ggml-ci
    
    * ci : disable fatal warnings for windows, ios and tvos
    
    * ggml : fix strncpy warning
    
    * ci : disable fatal warnings for MPI build
    
    * ci : add fatal warnings to ggml-ci
    
    ggml-ci
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    ananta and ggerganov authored Feb 17, 2024
    Commit: 6e4e973
  3. ggml : add ALiBi support for ggml_soft_max_ext (ggerganov#5488)

    * ggml : avoid recomputing alibi slopes (CPU)
    
    * llama : reuse hparams.f_max_alibi_bias in all cases
    
    ggml-ci
    
    * ggml : support alibi bias in ggml_soft_max_ext (CPU + Metal)
    
    ggml-ci
    
    * ggml : handle all SRCs (do not break on first null)
    
    ggml-ci
    
    * tests : do not use slope for large soft_max
    
    accumulates too much error
    
    ggml-ci
    
    * ggml : alternative ALiBi without extra tensor
    
    We compute the slopes in the kernel
    
    ggml-ci
    
    * cuda : add ALiBi support in ggml_soft_max_ext
    
    ggml-ci
    
    * ggml : deprecate ggml_alibi
    
    * ggml : support multi-sequence ALiBi (Metal)
    
    ggml-ci
    
    * cuda : add multi-seq ALiBi + remove F16 soft_max
    
    ggml-ci
    
    * ggml : update deprecation message
    
    * ggml : fix pos ptr when no ALiBi
    
    ggml-ci
    
    * cuda : fix performance (pow -> powf)
    
    * cuda : precompute ALiBi constants
    
    * metal : pre-compute ALiBi slopes
    
    ggml-ci
    
    * llama : init kq_pos only if needed
    
    ggml-ci
    
    * test-backend-ops : add null pos test to soft_max
    
    test-backend-ops : replace soft_max tests
    
    ggml-ci
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    ggerganov and slaren authored Feb 17, 2024
    Commit: 8f1be0d (a rough ALiBi-bias sketch follows this list)
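
A rough sketch of the ALiBi mechanics behind item 3 above, under the standard assumptions from the ALiBi paper (per-head slopes 2^(-8*(h+1)/n_head) for a power-of-two head count, and a bias of -slope * distance added to the attention logits before softmax). This is not the ggml_soft_max_ext kernel itself.

```c
/* Rough sketch of the ALiBi idea (per the original paper), not the
 * ggml_soft_max_ext kernel: each head gets a fixed slope and the bias
 * -slope * distance is added to the attention logits before softmax,
 * so no separate bias tensor has to be materialized. */
#include <math.h>

static float alibi_slope(int h, int n_head) {
    /* Standard ALiBi slopes for a power-of-two head count: 2^(-8*(h+1)/n_head). */
    return powf(2.0f, -8.0f * (float) (h + 1) / (float) n_head);
}

/* scores: attention logits for query position q_pos over key positions 0..q_pos */
static void apply_alibi_bias(float *scores, int q_pos, float slope) {
    for (int k_pos = 0; k_pos <= q_pos; k_pos++) {
        scores[k_pos] -= slope * (float) (q_pos - k_pos); /* older keys get a larger penalty */
    }
}
```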

Commits on Feb 18, 2024

  1. flake.lock: Update

    Flake lock file updates:
    
    • Updated input 'nixpkgs':
        'github:NixOS/nixpkgs/f8e2ebd66d097614d51a56a755450d4ae1632df1' (2024-02-07)
      → 'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
    github-actions[bot] authored and philiptaron committed Feb 18, 2024
    Commit: c8e0d7e
  2. 1.5 bit quantization (ggerganov#5453)

    * iq1_s: WIP basics
    
    * iq1_s: CUDA is working
    
    * iq1_s: scalar CPU dot product
    
    * iq1_s: WIP AVX2 dot product - something is not right
    
    * Fix tests
    
    * Fix shadow warnings
    
    * Fix after merge with latest master
    
    * iq1_s: AVX2 finally works
    
    * iq1_s: ARM_NEON dot product. Works, but not very fast
    
    * iq1_s: better grid
    
    * iq1_s: use IQ2_XXS for attn_output
    
    At a cost of 0.04 extra bpw this gives a big improvement in PPL.
    
    * iq1_s: Metal basics
    
    Dequantize works, but not dot product
    
    * iq1_s: Metal works, but quite slow
    
    As usual, Apple Silicon does not like the code I write.
    
    * iq1_s: Tests
    
    * iq1_s: slightly faster dot product
    
    ---------
    
    Co-authored-by: Iwan Kawrakow <[email protected]>
    ikawrakow and Kawrakow authored Feb 18, 2024
    Commit: bd2d4e3
  3. llava : update surgery script to not remove tensors (ggerganov#5536)

    This commit updates the surgery script to not remove the tensors from the
    model file. For this to work the `--skip-unknown` flag is added as an
    argument to the convert.py script in README.md.
    
    The motivation for this change is that the surgery script currently
    removes the projector tensors from the model file. If the model was
    checked out from a repository, the model file will have been updated
    and has to be checked out again to reset this effect. If this can be
    avoided I think it would be preferable.
    
    I did not perform this change for BakLLaVA models as I am not sure
    how that part works.
    danbev authored Feb 18, 2024
    Commit: fc0c8d2
  4. Commit 5d3de51
  5. Commit 1dcc3fd
  6. server : graceful server shutdown (ggerganov#5244)

    This updates the server queue to support graceful shutdown of the server on signals.
    dhiltgen authored Feb 18, 2024
    Commit: 66c1968
  7. server : --n-predict option document and cap to max value (ggerganov#…

    …5549)
    
    * server: document --n-predict
    
    * server: ensure client request cannot override n_predict if set
    
    * server: fix print usage LF in new --n-predict option
    phymbert authored Feb 18, 2024
    Commit: 36376ab
  8. server : enhanced health endpoint (ggerganov#5548)

    * server: enrich health endpoint with available slots, return 503 if no slots are available
    
    * server: document new status no slot available in the README.md
    phymbert authored Feb 18, 2024
    Commit: e75c627
  9. Commit f3f28c5
  10. Commit 689a091
  11. Commit c145f8a
  12. common, server : surface min_keep as its own parameter (ggerganov#5567)

    * Feature - surface min_keep as its own parameter
    
    * Updated README with min_keep param
    robeyh authored Feb 18, 2024
    Commit: 5ee99c3
  13. Commit 7ad554f
  14. Commit b1de968
  15. Commit 14278f5
  16. build : pass all warning flags to nvcc via -Xcompiler (ggerganov#5570)

    * build : pass all warning flags to nvcc via -Xcompiler
    * make : fix apparent mis-merge from ggerganov#3952
    * make : fix incorrect GF_CC_VER for CUDA host compiler
    cebtenzzre authored Feb 18, 2024
    Commit: a0c2dad

Commits on Feb 19, 2024

  1. ggml : android and old glibc NUMA incompatibility bugfixes (ggerganov…

    …#5557)
    
    * #ifdef out some NUMA code blocks for Android due to lack of support
    
    * added in some __ANDROID__ #ifdef gates around numa code and forced GLIBC prior to 2.29 to use a syscall for getcpu instead of the wrapper
    
    * Changed gates on numa platform specific stuff to __gnu_linux__ to skip any platforms without glibc
    
    * harmonizing #if defined blocks for numa code to __gnu_linux__ since that's the only model that's being followed anyways
    
    ---------
    
    Co-authored-by: root <[email protected]>
    bmtwl and root authored Feb 19, 2024
    Commit: f0d1faf
  2. readme : update (ggerganov#5572)

    Added 1.5-bit quantization to README.md
    Mirko185 authored Feb 19, 2024
    Commit: 769a716
  3. cuda, metal : fix nans in soft_max (ggerganov#5574)

    * cuda : fix nans in soft_max
    
    * metal : fix nans in soft_max
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    slaren and ggerganov authored Feb 19, 2024
    Commit 3a9cb4c (see the sketch below)
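
For context, a generic illustration (not the actual CUDA/Metal kernels) of the standard way to keep `soft_max` free of NaNs and overflow: subtract the row maximum before exponentiating.

```cpp
// Numerically stable softmax: exp(x - max) instead of exp(x).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static void softmax_stable(std::vector<float> & x) {
    const float max_val = *std::max_element(x.begin(), x.end());
    float sum = 0.0f;
    for (float & v : x) {
        v = std::exp(v - max_val);   // largest exponent is 0, so no overflow to inf
        sum += v;
    }
    for (float & v : x) {
        v /= sum;                    // sum >= 1 because exp(0) is included
    }
}

int main() {
    std::vector<float> logits = { 1000.0f, 999.0f, 0.0f };  // would overflow a naive exp()
    softmax_stable(logits);
    for (float p : logits) std::printf("%f\n", p);
}
```
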
  4. llama : add llama_chat_apply_template() (ggerganov#5538)

    * llama: add llama_chat_apply_template
    
    * test-chat-template: remove redundant vector
    
    * chat_template: do not use std::string for buffer
    
    * add clarification for llama_chat_apply_template
    
    * llama_chat_apply_template: add zephyr template
    
    * llama_chat_apply_template: correct docs
    
    * llama_chat_apply_template: use term "chat" everywhere
    
    * llama_chat_apply_template: change variable name to "tmpl"
    ngxson authored Feb 19, 2024
    Commit 11b12de (see the usage sketch below)
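
A hedged usage sketch of the new C API. The signature below matches what this PR adds (double-check against `llama.h` in your checkout); the ChatML-style template string and the retry-on-small-buffer pattern are assumptions for illustration, since the function reports the required size when the buffer is too small and pattern-matches known template markers rather than executing Jinja.

```cpp
#include <cstdio>
#include <vector>
#include "llama.h"

int main() {
    std::vector<llama_chat_message> chat = {
        { "system", "You are a helpful assistant." },
        { "user",   "Hello!"                       },
    };

    // ChatML-style template; detection is assumed to key off markers such as "<|im_start|>"
    const char * tmpl = "{% for m in messages %}<|im_start|>{{m['role']}}\n{{m['content']}}<|im_end|>\n{% endfor %}";

    std::vector<char> buf(1024);
    int32_t n = llama_chat_apply_template(
        /*model  =*/ nullptr,            // with an explicit template the model is not consulted
        /*tmpl   =*/ tmpl,
        chat.data(), chat.size(),
        /*add_ass=*/ true,               // append the assistant prefix for generation
        buf.data(), (int32_t) buf.size());

    if (n > (int32_t) buf.size()) {      // buffer too small: resize to the reported size and retry
        buf.resize(n);
        n = llama_chat_apply_template(nullptr, tmpl, chat.data(), chat.size(), true, buf.data(), n);
    }
    if (n >= 0) {                        // negative means the template was not recognized
        std::printf("%.*s\n", (int) n, buf.data());
    }
    return 0;
}
```
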
  5. baby-llama : allocate graphs in ggml_context (ggerganov#5573)

    * Fixed the baby-llama issue (see issue ggerganov#4830)
    
    * minor : fix whitespaces
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    NawafAlansari and ggerganov authored Feb 19, 2024
    Commit 4480542
  6. llava : avoid changing the original BakLLaVA model (ggerganov#5577)

    This is a follow-up of Commit fc0c8d2
    ("llava : update surgery script to not remove tensors") but this time
    the change is to the BakLLaVA specific part of the surgery script.
    
    I've been able to test this using SkunkworksAI/BakLLaVA-1 and it works
    as expected using the instructions in README.md.
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    danbev authored Feb 19, 2024
    Commit 7084755
  7. Commit f53119c
  8. cmake : remove obsolete sycl compile flags (ggerganov#5581)

    * rm unwanted sycl compile options
    
    * fix bug
    
    * fix bug
    
    * format fix
    abhilash1910 authored Feb 19, 2024
    Commit 13e2c77
  9. Commit 70d45af
  10. Commit 68a6b98
  11. ci : enable -Werror for CUDA builds (ggerganov#5579)

    * cmake : pass -Werror through -Xcompiler
    
    ggml-ci
    
    * make, cmake : enable CUDA errors on warnings
    
    ggml-ci
    ggerganov authored Feb 19, 2024
    Commit d0e3ce5
  12. metal : option to embed MSL source into compiled binary (whisper/1842)

    * ggml : embed Metal library source (ggml-metal.metal) into binary
    
    enable by setting WHISPER_EMBED_METAL_LIBRARY
    
    * rename the build option
    
    * rename the preprocessor directive
    
    * generate Metal library embedding assembly on the fly during the build process
    didzis authored and ggerganov committed Feb 19, 2024
    Commit 890559a
  13. ggml-alloc : apply ggml/731

    ggerganov committed Feb 19, 2024
    Commit a3145bd
  14. sync : ggml

    ggml-ci
    ggerganov committed Feb 19, 2024
    Commit 337c9cb
  15. llava : replace ggml_cpy with ggml_cont

    slaren authored and ggerganov committed Feb 19, 2024
    Commit 6fd4137
  16. Commit 1387cf6
  17. examples : support minItems/maxItems in JSON grammar converter (ggerg…

    …anov#5039)
    
    * support minLength and maxLength in JSON schema grammar converter
    
    * Update examples/json-schema-to-grammar.py
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    nopperl and ggerganov authored Feb 19, 2024
    Commit 9d679f0
  18. Commit f24ed14
  19. cuda : ignore peer access already enabled errors (ggerganov#5597)

    * cuda : ignore peer access already enabled errors
    
    * fix hip
    slaren authored Feb 19, 2024
    Commit 40c3a6c
  20. Allow for Vulkan build with Accelerate.

    dokterbob authored and philiptaron committed Feb 19, 2024
    Commit 5dde540
  21. Commit 42f664a
  22. Commit d8c0545
  23. Commit f50db6a
  24. Refactor validation and enumeration platform checks into functions to…

    … clean up ggml_vk_instance_init()
    0cc4m authored and philiptaron committed Feb 19, 2024
    Commit bb9dcd5
  25. Enable Vulkan MacOS CI

    0cc4m authored and philiptaron committed Feb 19, 2024
    Commit 22f83f0
  26. nix: now that we can do so, allow MacOS to build Vulkan binaries

    Author:    Philip Taron <[email protected]>
    Date:      Tue Feb 13 20:28:02 2024 +0000
    dokterbob authored and philiptaron committed Feb 19, 2024
    Commit 633782b

Commits on Feb 20, 2024

  1. Update ggml_sycl_op_mul_mat_vec_q (ggerganov#5502)

    * Update ggml_sycl_op_mul_mat_vec_q
    
    * Apply suggestions from code review
    
    Co-authored-by: Abhilash Majumder <[email protected]>
    
    * revert suggestion on macro
    
    * fix bug
    
    * Add quant type GGML_TYPE_IQ1_S to unsupported
    
    * fix format
    
    ---------
    
    Co-authored-by: Abhilash Majumder <[email protected]>
    AidanBeltonS and abhilash1910 authored Feb 20, 2024
    Commit b9111bd
  2. Commit c0a8c6d
  3. metal : add build system support for embedded metal library (ggergano…

    …v#5604)
    
    * add build support for embedded metal library
    
    * Update Makefile
    
    ---------
    
    Co-authored-by: Haoxiang Fei <[email protected]>
    Co-authored-by: Georgi Gerganov <[email protected]>
    3 people authored Feb 20, 2024
    Commit 8dbbd75
  4. readme : update UI list (ggerganov#5605)

    * Add maid to ui list
    
    * Specify licence
    danemadsen authored Feb 20, 2024
    Commit 5207b3f
  5. Server: use llama_chat_apply_template (ggerganov#5593)

    * server: use llama_chat_apply_template
    
    * server: remove trailing space
    
    * server: fix format_chat
    
    * server: fix help message
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * server: fix formatted_chat
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    ngxson and ggerganov authored Feb 20, 2024
    Commit 9c405c9
  6. llava : add explicit instructions for llava-1.6 (ggerganov#5611)

    This commit contains a suggestion for the README.md in the llava
    example. The suggestion adds explicit instructions for how to convert
    a llava-1.6 model and run it using llava-cli.
    
    The motivation for this is that having explicit instructions similar to
    the 1.5 instructions will make it easier for users to try this out.
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    danbev authored Feb 20, 2024
    Commit 4ed8e4f
  7. Commit 06bf2cf
  8. server : support llava 1.6 (ggerganov#5553)

    * server: init working 1.6
    
    * move clip_image to header
    
    * remove commented code
    
    * remove c++ style from header
    
    * remove todo
    
    * expose llava_image_embed_make_with_clip_img
    
    * fix zig build
    cjpais authored Feb 20, 2024
    Commit 6560bed

Commits on Feb 21, 2024

  1. IQ4_NL: 4-bit non-linear quants with blocks of 32 (ggerganov#5590)

    * iq4_nl: squash commits for easier rebase
    
    * Basics (quantize, dequantize)
    * CUDA dequantize and dot product
    * Slightly faster CUDA dot product (120 t/s)
    * Switch to 6-bit scales
    * Scalar dot product
    * AVX2 dot product
    * ARM_NEON dot product
    * Works on metal, but still slow
    * Slightly better Metal dot product
    * Another small Metal improvement
    * Metal dot product is getting there
    * Faster CUDA dot product
    * Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
    * Report the actual bpw
    * Add _xs mix that is 4.05 bpw for non-MoE models
    * Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
    * AVX2 dot product uses Q8_0 instead of Q8_K
    * Add to test-backend-ops
    * Minor fix
    * Also use Q5_K for attn_output in MoE models
    * Fixes after merging latest master
    * Switching to blocks of 32
    * AVX2 for blocks of 32
    * Scalar dot product for blocks of 32
    * ARM_NEON dot product for blocks of 32
    * Metal kernels for blocks of 32
    * Slightly faster Metal kernels
    
    * iq4_nl: Fix after merging with master
    
    * iq4_nl: another fix after merging with master
    
    * Use IQ4_NL instead of Q4_K when using k-quants is not possible
    
    * Fix typo that makes several tests fail
    
    * It was the ggml_vdotq thing missed inside the brackets
    
    ---------
    
    Co-authored-by: Iwan Kawrakow <[email protected]>
    ikawrakow and Kawrakow authored Feb 21, 2024
    Commit a14679c (see the sketch below)
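
A deliberately simplified, hypothetical sketch of the idea behind a non-linear 4-bit format: each block of 32 weights stores one scale plus 32 4-bit indices into a fixed non-uniform 16-value codebook. The codebook values below are illustrative, not the real `kvalues_iq4nl` table.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

static const int8_t kvalues[16] = {   // non-uniform grid, denser near zero (illustrative)
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113
};

struct block_nl4 {
    float   d;          // per-block scale
    uint8_t qs[16];     // 32 x 4-bit codebook indices, two per byte
};

static block_nl4 quantize_block(const float * x) {
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    block_nl4 b{};
    b.d = amax / 127.0f;                       // map the block into the codebook's range
    const float id = b.d > 0 ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < 32; ++i) {
        const float v = x[i] * id;
        int best = 0; float best_err = 1e30f;  // nearest codebook entry
        for (int j = 0; j < 16; ++j) {
            const float err = std::fabs(v - kvalues[j]);
            if (err < best_err) { best_err = err; best = j; }
        }
        b.qs[i / 2] |= uint8_t(best) << (4 * (i % 2));
    }
    return b;
}

static void dequantize_block(const block_nl4 & b, float * y) {
    for (int i = 0; i < 32; ++i) {
        const int q = (b.qs[i / 2] >> (4 * (i % 2))) & 0x0F;
        y[i] = b.d * kvalues[q];
    }
}

int main() {
    float x[32], y[32];
    for (int i = 0; i < 32; ++i) x[i] = std::sin(0.3f * i);
    dequantize_block(quantize_block(x), y);
    std::printf("x[5]=%f  y[5]=%f\n", x[5], y[5]);
}
```
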
  2. [SYCL] context add name (ggerganov#5624)

    * [SYCL] context add name
    
    * name should start with SYCL*
    airMeng authored Feb 21, 2024
    Commit 88c46cb
  3. llama : add gemma model (ggerganov#5631)

    There are a couple of things in this architecture:
    
    1. Shared input and output embedding parameters.
    2. Key length and value length are not derived from `n_embd`.
    
    More information about the models can be found at
    https://ai.google.dev/gemma. GGUFs can be downloaded from
    https://huggingface.co/google.
    postmasters authored Feb 21, 2024
    Commit 580111d
  4. llava : add --skip-unknown to 1.6 convert.py (ggerganov#5632)

    This commit adds the `--skip-unknown` option to the convert.py script
    and removes the saving of the updated checkpoints to avoid updating
    possibly checked out files.
    
    The motivation for this change is that this was done for 1.5
    in Commit fc0c8d2 ("llava :
    update surgery script to not remove tensors") and makes the examples
    more consistent.
    
    Signed-off-by: Daniel Bevenius <[email protected]>
    danbev authored Feb 21, 2024
    Commit cc6cac0
  5. readme : update hot topics

    ggerganov authored Feb 21, 2024
    Commit c14f72d
  6. sync : ggml (ggerganov#5633)

    * ggml : fix conv_2d batch mode (ggml/737)
    
    Co-authored-by: bssrdf <[email protected]>
    
    * ggml : compute forward no longer pass src tensors (ggml/729)
    
    * sync : ggml
    
    ggml-ci
    
    ---------
    
    Co-authored-by: bssrdf <[email protected]>
    Co-authored-by: bssrdf <[email protected]>
    3 people authored Feb 21, 2024
    Commit eccd7a2
  7. Commit a00a35c
  8. server: health: fix race condition on slots data using tasks queue (g…

    …gerganov#5634)
    
    * server: health: fix race condition on slots data using tasks queue
    
    * server: health:
        * include_slots only if slots_endpoint
        * fix compile warning task.target_id not initialized.
    phymbert authored Feb 21, 2024
    Commit 1ecea25
  9. sync : ggml

    ggerganov committed Feb 21, 2024
    Commit 5022cf2
  10. Commit 89febfe
  11. Commit ba2135c
  12. Commit 7fe4678
  13. Add docs for llama_chat_apply_template (ggerganov#5645)

    * add docs for llama_chat_apply_template
    
    * fix typo
    ngxson authored Feb 21, 2024
    Commit 7c8bcc1
  14. Commit 973053d

Commits on Feb 22, 2024

  1. mpt : add optional bias tensors (ggerganov#5638)

    Update for MPT with optional bias parameters: to work with PhoGPT and SEA-LION models that were pre-trained with 'bias'.
    datquocnguyen authored Feb 22, 2024
    Commit 4ef245a
  2. Commit c5688c6
  3. server : fallback to chatml, add AlphaMonarch chat template (ggergano…

    …v#5628)
    
    * server: fallback to chatml
    
    * add new chat template
    
    * server: add AlphaMonarch to test chat template
    
    * server: only check model template if there is no custom tmpl
    
    * remove TODO
    ngxson authored Feb 22, 2024
    Commit a46f507
  4. readme : update hot topics

    ggerganov authored Feb 22, 2024
    Commit 56d03d9
  5. Commit 3a03541
  6. workflows: nix: hardcode cachix ids, build unconditionally (ggerganov…

    …#5663)
    
    GitHub does not expose environment and repository variables to PRs coming from forks, which means we've effectively been disabling the Nix CI actions for most PRs.
    
    The `if:` also didn't make much sense, because we can always pull from cachix, and there's no point (albeit no risk either) in pushing cache for the untrusted code.
    SomeoneSerge authored Feb 22, 2024
    Commit 4cb4d8b
  7. Add Gemma chat template (ggerganov#5665)

    * add gemma chat template
    
    * gemma: only apply system_prompt on non-model message
    ngxson authored Feb 22, 2024
    Commit 373ee3f
  8. Commit 5a9e2f6
  9. nix: init singularity and docker images (ggerganov#5056)

    Exposes a few attributes demonstrating how to build [singularity](https://docs.sylabs.io/guides/latest/user-guide/)/[apptainer](https://apptainer.org/) and Docker images re-using llama.cpp's Nix expression.
    
    Built locally on `x86_64-linux` with `nix build github:someoneserge/llama.cpp/feat/nix/images#llamaPackages.{docker,docker-min,sif,llama-cpp}` and it's fast and effective.
    SomeoneSerge authored Feb 22, 2024
    Commit 201294a
  10. ggml : 32-bit arm compat (whisper/1891)

    * ggml : 32-bit arm compat
    
    * ggml : add ggml_vqtbl1q_s8 impl
    
    * ggml : cont
    ggerganov committed Feb 22, 2024
    Commit efd56b1
  11. sync : ggml

    ggerganov committed Feb 22, 2024
    Commit 334f76f
  12. ggml : always define ggml_fp16_t as uint16_t (ggerganov#5666)

    * ggml : always define ggml_fp16_t as uint16_t
    
    ggml-ci
    
    * ggml : cont
    
    ggml-ci
    
    * ggml : cont
    
    * ggml : cont
    
    ggml-ci
    
    * ggml : cont
    
    ggml-ci
    
    * cuda : no longer ggml headers last
    
    ggml-ci
    
    * ggml : fix q6_K FP16 -> FP32 conversion
    
    ggml-ci
    
    * ggml : more FP16 -> FP32 conversion fixes
    
    ggml-ci
    ggerganov authored Feb 22, 2024
    Commit 7e4f339 (see the sketch below)
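
A hedged illustration of why a plain `uint16_t` works as the storage type for `ggml_fp16_t`: FP16 values are just 16 raw bits, and FP32 conversion can be done with explicit bit manipulation. This is a simplified sketch (no rounding, subnormals flushed, NaN payloads collapsed), not the ggml implementation.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

using fp16_t = uint16_t;  // storage only; the bit pattern carries the meaning

static fp16_t fp32_to_fp16(float f) {
    uint32_t x; std::memcpy(&x, &f, 4);
    const uint32_t sign = (x >> 16) & 0x8000;
    const int32_t  exp  = int32_t((x >> 23) & 0xFF) - 127 + 15;
    const uint32_t mant = (x >> 13) & 0x3FF;
    if (exp <= 0)    return fp16_t(sign);              // flush tiny values to zero
    if (exp >= 0x1F) return fp16_t(sign | 0x7C00);     // overflow/inf/NaN -> inf
    return fp16_t(sign | (uint32_t(exp) << 10) | mant);
}

static float fp16_to_fp32(fp16_t h) {
    const uint32_t sign = uint32_t(h & 0x8000) << 16;
    const uint32_t exp  = (h >> 10) & 0x1F;
    const uint32_t mant = h & 0x3FF;
    uint32_t x;
    if (exp == 0)       x = sign;                                   // zero/denormal -> zero (simplified)
    else if (exp == 31) x = sign | 0x7F800000 | (mant << 13);       // inf / NaN
    else                x = sign | ((exp - 15 + 127) << 23) | (mant << 13);
    float f; std::memcpy(&f, &x, 4);
    return f;
}

int main() {
    const float v = 3.14159f;
    std::printf("%f -> 0x%04x -> %f\n", v, fp32_to_fp16(v), fp16_to_fp32(fp32_to_fp16(v)));
}
```
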
  13. py : add Gemma conversion from HF models (ggerganov#5647)

    * py : add gemma conversion from HF models
    
    * Update convert-hf-to-gguf.py
    
    Co-authored-by: Aarni Koskela <[email protected]>
    
    * Update convert-hf-to-gguf.py
    
    Co-authored-by: Aarni Koskela <[email protected]>
    
    * Update convert-hf-to-gguf.py
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    
    ---------
    
    Co-authored-by: Aarni Koskela <[email protected]>
    Co-authored-by: Jared Van Bortel <[email protected]>
    3 people authored Feb 22, 2024
    Commit 847eedb
  14. gemma : use more bits for the token_embd.weight tensor (ggerganov#5650)

    * gemma : use Q8_0 for the token_embd.weight tensor
    
    * llama : quantize token_embd.weight using output type
    ggerganov authored Feb 22, 2024
    Commit 96633ee
  15. Commit 15499eb

Commits on Feb 23, 2024

  1. Commit 54fbcd2
  2. Commit fd43d66

Commits on Feb 24, 2024

  1. server: init functional tests (ggerganov#5566)

    * server: tests: init scenarios
     - health and slots endpoints
     - completion endpoint
     - OAI compatible chat completion requests w/ and without streaming
     - completion multi users scenario
     - multi users scenario on OAI compatible endpoint with streaming
     - multi users with total number of tokens to predict exceeds the KV Cache size
     - server wrong usage scenario, like in Infinite loop of "context shift" ggerganov#3969
     - slots shifting
     - continuous batching
     - embeddings endpoint
     - multi users embedding endpoint: Segmentation fault ggerganov#5655
     - OpenAI-compatible embeddings API
     - tokenize endpoint
     - CORS and api key scenario
    
    * server: CI GitHub workflow
    
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    phymbert and ggerganov authored Feb 24, 2024
    Commit 525213d
  2. IQ3_S: a much better alternative to Q3_K (ggerganov#5676)

    * iq4_nl: squash commits for easier rebase
    
    * Basics (quantize, dequantize)
    * CUDA dequantize and dot product
    * Slightly faster CUDA dot product (120 t/s)
    * Switch to 6-bit scales
    * Scalar dot product
    * AVX2 dot product
    * ARM_NEON dot product
    * Works on metal, but still slow
    * Slightly better Metal dot product
    * Another small Metal improvement
    * Metal dot product is getting there
    * Faster CUDA dot product
    * Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
    * Report the actual bpw
    * Add _xs mix that is 4.05 bpw for non-MoE models
    * Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
    * AVX2 dot product uses Q8_0 instead of Q8_K
    * Add to test-backend-ops
    * Minor fix
    * Also use Q5_K for attn_output in MoE models
    * Fixes after merging latest master
    * Switching to blocks of 32
    * AVX2 for blocks of 32
    * Scalar dot product for blocks of 32
    * ARM_NEON dot product for blocks of 32
    * Metal kernels for blocks of 32
    * Slightly faster Metal kernels
    
    * Resurrecting iq3_xs
    
    After all the experimentation, nothing was better than this.
    
    * Minor PPL improvement via a block scale fudge factor
    
    * Minor improvement via 3 neighbours
    
    * iq3_xs: working scalar and AVX2 dot products
    
    * iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)
    
    * iq3_xs: working Metal implementation
    
    * Adding IQ3_M - IQ3_XS mix with mostly Q4_K
    
    * iq3_xs: a 3.4375 bpw variant
    
    * iq3_xs: make CUDA work for new version
    
    * iq3_xs: make scalar and AVX2 work for new version
    
    * iq3_s: make ARM_NEON work with new version
    
    * iq3_xs: make new version work on metal
    
    Performance is very similar to Q3_K_S
    
    * iq3_xs: tiny Metal speed improvement
    
    * iq3_xs: tiny Metal speed improvement
    
    * Fix stupid warning
    
    * Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS
    
    * iq3_xs: rename to iq3_s
    
    * iq3_s: make tests pass
    
    * Move Q3_K_XS mix to 3.25 bpw
    
    * Attempt to fix failing tests
    
    * Another attempt to fix the Windows builds
    
    * Attempt to fix ROCm
    
    * ROCm again
    
    * iq3_s: partial fix for QK_K = 64
    
    * iq3_s: make it work on metal for QK_K = 64
    
    Pleasant surprise: the coding was super-block size independent,
    so all it took was to delete some QK_K == 256 guards.
    
    * Will this fix ROCm?
    
    ---------
    
    Co-authored-by: Iwan Kawrakow <[email protected]>
    ikawrakow and Kawrakow authored Feb 24, 2024
    Commit 4c4cb30
  3. server: continue to update other slots on embedding concurrent request (

    ggerganov#5699)
    
    * server: ggerganov#5655 - continue to update other slots on embedding concurrent request.
    
    * server: tests: add multi users embeddings as fixed
    
    * server: tests: adding OAI compatible embedding concurrent endpoint
    
    * server: tests: adding OAI compatible embedding with multiple inputs
    phymbert authored Feb 24, 2024
    Commit 9e359a4

Commits on Feb 25, 2024

  1. py : fix StableLM conversion after config.json changes (ggerganov#5703)

    * Fix issues during StableLM models conversion
    
    * Fix hard coded layer_norm_eps
    
    * Support layer_norm_eps for LlavaStableLM
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    
    * Add missing parenthesis
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    
    * Support rotary_factor for LlavaStableLM
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    
    * fix typo
    
    * Add StableLMEpochForCausalLM for safety
    
    Co-authored-by: compilade <[email protected]>
    
    * Add StableLMEpochForCausalLM for safety 2
    
    Co-authored-by: compilade <[email protected]>
    
    ---------
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    Co-authored-by: Jared Van Bortel <[email protected]>
    Co-authored-by: compilade <[email protected]>
    4 people authored Feb 25, 2024
    Commit 69917df
  2. code : normalize enum names (ggerganov#5697)

    * code : normalize enum names
    
    ggml-ci
    
    * code : cont
    
    * code : cont
    ggerganov authored Feb 25, 2024
    Commit ab336a9
  3. Commit 1289408
  4. server: concurrency fix + monitoring - add /metrics prometheus compat…

    …ible endpoint (ggerganov#5708)
    
    * server: monitoring - add /metrics prometheus compatible endpoint
    
    * server: concurrency issue, when 2 task are waiting for results, only one call thread is notified
    
    * server: metrics - move to a dedicated struct
    phymbert authored Feb 25, 2024
    Commit d52d781
  5. server: logs - unified format and --log-format option (ggerganov#5700)

    * server: logs - always use JSON logger, add thread_id in message, log task_id and slot_id
    
    * server : skip GH copilot requests from logging
    
    * server : change message format of server_log()
    
    * server : no need to repeat log in comment
    
    * server : log style consistency
    
    * server : fix compile warning
    
    * server : fix tests regex patterns on M2 Ultra
    
    * server: logs: PR feedback on log level
    
    * server: logs: allow to choose log format in json or plain text
    
    * server: tests: output server logs in text
    
    * server: logs switch init logs to server logs macro
    
    * server: logs ensure json value does not raise an error
    
    * server: logs reduce level VERBOSE to VERB to max 4 chars
    
    * server: logs lower case as other log messages
    
    * server: logs avoid static in general
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    phymbert and ggerganov authored Feb 25, 2024
    Commit 930b178 (see the sketch below)
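
A small hypothetical sketch of the two log formats described above, the same entry rendered either as one JSON object per line or as `LEVEL [function_name] message | key=value`. Not the server's actual logger; values are assumed not to need JSON escaping.

```cpp
#include <cstdio>
#include <map>
#include <string>

enum class log_format { json, text };

static void server_log(log_format fmt, const char * level, const char * function,
                       const std::string & message,
                       const std::map<std::string, std::string> & extra) {
    std::string out;
    if (fmt == log_format::json) {
        out = std::string("{\"level\":\"") + level + "\",\"function\":\"" + function +
              "\",\"msg\":\"" + message + "\"";
        for (const auto & kv : extra) {
            out += ",\"" + kv.first + "\":\"" + kv.second + "\"";
        }
        out += "}";
    } else {
        out = std::string(level) + " [" + function + "] " + message;
        if (!extra.empty()) {
            out += " |";
            for (const auto & kv : extra) out += " " + kv.first + "=" + kv.second;
        }
    }
    std::printf("%s\n", out.c_str());
}

int main() {
    server_log(log_format::text, "INFO", "update_slots", "slot released", {{"slot_id", "0"}, {"task_id", "42"}});
    server_log(log_format::json, "INFO", "update_slots", "slot released", {{"slot_id", "0"}, {"task_id", "42"}});
}
```
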
  6. Commit 7d548a1
  7. make : fix empty nvcc version (ggerganov#5713)

    fix empty nvcc version
    kwin1412 authored Feb 25, 2024
    Commit f1a98c5
  8. ggml-quants : provide ggml_vqtbl1q_u8 for 64bit compatibility (ggerga…

    …nov#5711)
    
    * [ggml-quants] Provide ggml_vqtbl1q_u8 for 64bit compatibility
    
    vqtbl1q_u8 is not part of arm v7 neon library
    
    * [android-example] Remove abi filter after arm v7a fix
    
    * [github-workflows] Do not skip Android armeabi-v7a build
    rgryta authored Feb 25, 2024
    Commit abbabc5 (see the sketch below)
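
A hedged sketch of the kind of fallback this commit describes: `vqtbl1q_u8` is an AArch64-only intrinsic, so a 32-bit ARMv7 NEON build can emulate the 16-byte table lookup with two `vtbl2_u8` calls over the low/high halves. The helper name below is made up; see `ggml-quants` for the real one.

```cpp
#include <cstdio>
#if defined(__ARM_NEON)
#include <arm_neon.h>

#if !defined(__aarch64__)
static inline uint8x16_t compat_vqtbl1q_u8(uint8x16_t t, uint8x16_t idx) {
    uint8x8x2_t tab = { { vget_low_u8(t), vget_high_u8(t) } };    // 16-byte table as two halves
    return vcombine_u8(vtbl2_u8(tab, vget_low_u8(idx)),           // lookup for the low 8 indices
                       vtbl2_u8(tab, vget_high_u8(idx)));         // lookup for the high 8 indices
}
#else
static inline uint8x16_t compat_vqtbl1q_u8(uint8x16_t t, uint8x16_t idx) {
    return vqtbl1q_u8(t, idx);                                     // native on AArch64
}
#endif
#endif // __ARM_NEON

int main() {
#if defined(__ARM_NEON)
    uint8_t table[16], index[16], out[16];
    for (int i = 0; i < 16; ++i) { table[i] = uint8_t(100 + i); index[i] = uint8_t(15 - i); }
    uint8x16_t r = compat_vqtbl1q_u8(vld1q_u8(table), vld1q_u8(index));
    vst1q_u8(out, r);
    std::printf("out[0]=%d (expect 115)\n", out[0]);
#endif
    return 0;
}
```
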
  9. server : fix crash when system prompt is bigger than batch size (gger…

    …ganov#5714)
    
    The system prompt is now decoded in batches.
    
    * server : fix off-by-one n_past when start of prompt matches whole cache
    
    The tokens right after the matching part would otherwise skip a pos value.
    compilade authored Feb 25, 2024
    Commit f762501 (see the sketch below)
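
A generic sketch of the idea behind the fix above: decode a long prompt in `n_batch`-sized chunks instead of one oversized batch. `decode_chunk()` is a stand-in for the actual `llama_decode` call; the names and signature here are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// stand-in for submitting one batch of tokens starting at position n_past
static bool decode_chunk(const int32_t * tokens, int n_tokens, int n_past) {
    (void) tokens;
    std::printf("decode %d tokens at n_past=%d\n", n_tokens, n_past);
    return true;
}

static bool decode_prompt(const std::vector<int32_t> & prompt, int n_batch) {
    int n_past = 0;
    while (n_past < (int) prompt.size()) {
        const int n_eval = std::min((int) prompt.size() - n_past, n_batch);  // never exceed n_batch
        if (!decode_chunk(prompt.data() + n_past, n_eval, n_past)) {
            return false;
        }
        n_past += n_eval;   // positions advance by exactly the number of tokens decoded
    }
    return true;
}

int main() {
    std::vector<int32_t> system_prompt(1000, 1);  // longer than the batch size below
    decode_prompt(system_prompt, 512);
}
```
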
  10. llama : refactor k-shift implementation + KV defragmentation (ggergan…

    …ov#5691)
    
    * llama : refactor k-shift implementation
    
    ggml-ci
    
    * llama : rename llama_kv_cache_seq_shift to llama_kv_cache_seq_add
    
    * llama : cont k-shift refactoring + normalize type names
    
    ggml-ci
    
    * minor : fix MPI builds
    
    * llama : reuse n_rot from the build context
    
    ggml-ci
    
    * llama : revert enum name changes from this PR
    
    ggml-ci
    
    * llama : update llama_rope_type
    
    * llama : add comment about rope values
    
    * llama : fix build
    
    * passkey : apply kv cache updates explicitly
    
    ggml-ci
    
    * llama : change name to llama_kv_cache_update()
    
    * llama : add llama_kv_cache_seq_pos_max()
    
    * passkey : fix llama_kv_cache_seq_pos_max() usage
    
    * llama : some llama_kv_cell simplifications
    
    * llama : add llama_kv_cache_compress (EXPERIMENTAL)
    
    * llama : add alternative KV cache merging (EXPERIMENTAL)
    
    * llama : add llama_kv_cache_defrag
    
    * llama : comments
    
    * llama : remove llama_kv_cache_compress
    
    will add in a separate PR
    
    ggml-ci
    
    * llama : defragment via non-overlapping moves
    
    * llama : ggml_graph based defrag implementation
    
    ggml-ci
    
    * llama : switch the loop order in build_defrag
    
    * llama : add comments
    ggerganov authored Feb 25, 2024
    Commit bf08e00 (see the sketch below)
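
A toy illustration of the defragmentation idea mentioned above: occupied cells are copied into earlier free slots, so every move is non-overlapping. This mirrors the concept only, not the actual llama.cpp KV-cache code.

```cpp
#include <cstdio>
#include <vector>

struct cell { int seq_id; int value; };   // seq_id < 0 means the cell is free

static int defrag(std::vector<cell> & cells) {
    int dst = 0;
    for (int src = 0; src < (int) cells.size(); ++src) {
        if (cells[src].seq_id < 0) continue;     // skip holes
        if (src != dst) {
            cells[dst] = cells[src];             // dst < src and cells[dst] is free -> no overlap
            cells[src].seq_id = -1;
        }
        ++dst;
    }
    return dst;                                   // number of occupied cells after compaction
}

int main() {
    std::vector<cell> kv = { {0, 10}, {-1, 0}, {0, 11}, {-1, 0}, {1, 20}, {-1, 0} };
    const int used = defrag(kv);
    std::printf("used=%d, first values: %d %d %d\n", used, kv[0].value, kv[1].value, kv[2].value);
}
```
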
  11. server: docs - refresh and tease a little bit more the http server (g…

    …gerganov#5718)
    
    * server: docs - refresh and tease a little bit more the http server
    
    * Rephrase README.md server doc
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * Update examples/server/README.md
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * Update examples/server/README.md
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    * Update README.md
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    phymbert and ggerganov authored Feb 25, 2024
    Commit 8b35035
  12. server: tests - slow inference causes timeout on the CI (ggerganov#5715)

    * server: tests - longer inference timeout for CI
    phymbert authored Feb 25, 2024
    Commit e3965cf
  13. flake.lock: Update

    Flake lock file updates:
    
    • Updated input 'nixpkgs':
        'github:NixOS/nixpkgs/5863c27340ba4de8f83e7e3c023b9599c3cb3c80' (2024-02-16)
      → 'github:NixOS/nixpkgs/cbc4211f0afffe6dfd2478a62615dd5175a13f9a' (2024-02-23)
    github-actions[bot] authored and SomeoneSerge committed Feb 25, 2024
    Commit c393733

Commits on Feb 26, 2024

  1. Commit 269de86
  2. Commit 8a533f0
  3. Commit 4804215
  4. Commit 67fd331
  5. [SYCL] Add support for soft_max ALiBi (ggerganov#5639)

    * Add support for bias
    
    * Update pre-processor
    
    * rm commented code
    
    * fix format
    
    * fix CI
    
    ---------
    
    Co-authored-by: Abhilash Majumder <[email protected]>
    AidanBeltonS and abhilash1910 authored Feb 26, 2024
    Commit e849078
  6. readme : update ui list (ggerganov#5731)

    * Add LLMFarm (ui for iOS) to list
    guinmoon authored Feb 26, 2024
    Commit c4d7f81
  7. Commit 47bb7b4
  8. Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantizati…

    …on range (ggerganov#5721)
    
    * Adding IQ2_S and IQ2_M as a single cumulative commit
    
    * Update examples/quantize/quantize.cpp
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    
    ---------
    
    Co-authored-by: Iwan Kawrakow <[email protected]>
    Co-authored-by: Georgi Gerganov <[email protected]>
    3 people authored Feb 26, 2024
    Commit a33e6a0
  9. Commit b11a93d

Commits on Feb 27, 2024

  1. Makefile: use variables for cublas (ggerganov#5689)

    * make: use arch variable for cublas
    
    * fix UNAME_M
    
    * check opt first
    
    ---------
    
    Co-authored-by: lindeer <[email protected]>
    lindeer and lindeer authored Feb 27, 2024
    Commit cbbd1ef
  2. llama : fix defrag bugs + add parameter (ggerganov#5735)

    * llama : fix defrag bugs + enable by default
    
    ggml-ci
    
    * llama : add defrag_thold parameter
    
    ggml-ci
    
    * llama : cont
    
    * llama : disable log message
    
    ggml-ci
    
    * llama : fix graph size check during defrag
    ggerganov authored Feb 27, 2024
    Commit 9d533a7
  3. Commit 1f30b7a
  4. Commit c24a2a6
  5. IQ4_XS: a 4.25 bpw quantization (ggerganov#5747)

    * Try IQ4_NL with blocks of 64 - does not look good
    
    * iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32
    
    * iq4_xs: CUDA works - 133.2 t/s
    
    * iq4_xs: AVX2 dot product
    
    * iq4_xs: ARM_NEON dot product
    
    * iq4_nl: Metal implementation
    
    As usual, Metal / Apple Silicon don't like my quants.
    
    * iq3_xs: minor fix
    
    * iq4_xs: shrink by using IQ3_S for attn_k and attn_q
    
    * iq4_xs: revert using IQ3_S for attn_k and attn_v
    
    PPL vs size is good, but CPU performance suffers: on M2 Max
    TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X
    to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when
    using IQ3_S vs 133 t/s with pure IQ4_XS.
    
    * Fix CI
    
    * iq4_xs: Added forgotten check for 256 divisibility
    
    ---------
    
    Co-authored-by: Iwan Kawrakow <[email protected]>
    ikawrakow and Kawrakow authored Feb 27, 2024
    Commit 0becb22
  6. Attempt to fix android build (ggerganov#5752)

    Co-authored-by: Iwan Kawrakow <[email protected]>
    ikawrakow and Kawrakow authored Feb 27, 2024
    Commit cb49e0f

Commits on Feb 28, 2024

  1. ggml : make i-quants work with super-blocks of 64 (CPU,Metal) (ggerga…

    …nov#5760)
    
    * WIP: make i-quants work for QK_K = 64
    
    * iq2_xs: attempt to fix AVX dot product for QK_K = 64
    
    Tests pass, but I get gibberish.
    
    * QK_K = 64 tests pass on ARM_NEON and Metal
    
    Sadly, that does not mean it actually works.
    
    * Make CUDA compile with QK_K = 64
    
    Tests don't pass, plus we get misaligned access
    
    * Q2_K: fixed bug in imatrix quantization for QK_K = 64
    
    * iq1_s: turn off SIMD implementation for QK_K = 64 (it does not work)
    
    ---------
    
    Co-authored-by: Iwan Kawrakow <[email protected]>
    ikawrakow and Kawrakow authored Feb 28, 2024
    Commit 7c4263d
  2. server : add "/chat/completions" alias for "/v1/...` (ggerganov#5722)

    * Add "/chat/completions" as alias for "/v1/chat/completions"
    
    * merge to upstream master
    
    * minor : fix trailing whitespace
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    jorgealias and ggerganov authored Feb 28, 2024
    Commit efc7225
  3. readme : add link to LLaVA 1.6 models (ggerganov#5758)

    Signed-off-by: Daniel Bevenius <[email protected]>
    danbev authored Feb 28, 2024
    Commit 6c44168
  4. llama : improve BERT tokenization (ggerganov#5740)

    * implement nfd for stripping accents in wpm tokenizer
    
    * sort nfd map; reuse iterator
    
    * use builtin tolower
    
    * add locale include
    
    * Simplify to_lower cases
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    
    ---------
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    iamlemec and cebtenzzre authored Feb 28, 2024
    Commit 177628b (see the sketch below)
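
A hedged sketch of accent stripping via canonical decomposition, the idea used by the WordPiece tokenizer change above: map a precomposed codepoint to its base character and drop the combining mark. The tiny map below is illustrative; the real implementation uses full Unicode NFD data.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

static const std::map<uint32_t, uint32_t> nfd_base = {
    { 0x00E9, 'e' },   // é -> e
    { 0x00FC, 'u' },   // ü -> u
    { 0x00E7, 'c' },   // ç -> c
};

static std::vector<uint32_t> strip_accents(const std::vector<uint32_t> & cps) {
    std::vector<uint32_t> out;
    out.reserve(cps.size());
    for (uint32_t cp : cps) {
        auto it = nfd_base.find(cp);
        out.push_back(it != nfd_base.end() ? it->second : cp);  // keep the base character only
    }
    return out;
}

int main() {
    const std::vector<uint32_t> word = { 0x00E9, 't', 0x00E9 };  // "été"
    for (uint32_t cp : strip_accents(word)) std::printf("%c", (char) cp);
    std::printf("\n");   // prints "ete"
}
```
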
  5. Commit adcb12a
  6. server : hit Ctrl+C twice to exit (ggerganov#5734)

    * server: twice ctrl+C to exit
    
    * std::atomic_flag
    
    * sigint: message
    
    * sigint: stderr
    
    * Update examples/server/server.cpp
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    
    ---------
    
    Co-authored-by: Jared Van Bortel <[email protected]>
    ngxson and cebtenzzre authored Feb 28, 2024
    Commit a693bea (see the sketch below)
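
A hedged sketch of the "press Ctrl+C twice to exit" behaviour: the first SIGINT only records the request so in-flight work can wind down, the second one terminates. The real handler uses `std::atomic_flag`; this portable sketch uses a lock-free `std::atomic<int>` counter instead.

```cpp
#include <atomic>
#include <chrono>
#include <csignal>
#include <cstdio>
#include <cstdlib>
#include <thread>

static std::atomic<int> sigint_count{0};

static void sigint_handler(int) {
    if (sigint_count.fetch_add(1) >= 1) {   // second Ctrl+C: hard exit
        std::_Exit(130);
    }
    // first Ctrl+C: just record it and let the main loop shut down cleanly
}

int main() {
    std::signal(SIGINT, sigint_handler);
    std::printf("running, press Ctrl+C once to stop, twice to force exit\n");
    while (sigint_count.load() == 0) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));  // pretend to serve requests
    }
    std::printf("shutting down cleanly\n");
    return 0;
}
```
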
  7. Introduce backend GUIDs (ggml/743)

    * Introduce backend GUIDs
    
    Initial proposed implementation of backend GUIDs
    (Discussed in ggerganov/ggml#741)
    
    Hardcoded CPU backend GUID (for now)
    Change ggml_backend_is_cpu logic to use GUID
    
    * Remove redundant functions
    
    Remove redundant functions `ggml_backend_i::get_name` and `ggml_backend_guid` which are not desired for future expansion
    
    * Add spaces to match style
    
    Co-authored-by: slaren <[email protected]>
    
    * Fix brace style to match
    
    Co-authored-by: slaren <[email protected]>
    
    * Add void to () in function signature
    
    Co-authored-by: slaren <[email protected]>
    
    * Add back ggml_backend_guid and make CPU_GUID a local static in ggml_backend_cpu_guid
    
    * add guids to all backends
    
    ggml-ci
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    2 people authored and ggerganov committed Feb 28, 2024
    Commit 5f70671 (see the sketch below)
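
A hedged sketch of the backend-GUID idea: each backend exposes a fixed 16-byte identifier and identity checks compare GUIDs instead of names. The GUID bytes and helper names below are made up for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

using backend_guid = uint8_t[16];

struct backend {
    const uint8_t * guid;   // points at the backend's static GUID
    const char    * name;
};

static bool guid_matches(const uint8_t * a, const uint8_t * b) {
    return std::memcmp(a, b, 16) == 0;
}

// a static GUID owned by the "cpu" backend (illustrative bytes)
static const backend_guid CPU_GUID = {
    0xaa, 0x12, 0x34, 0x56, 0x78, 0x9a, 0xbc, 0xde,
    0xf0, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
};

static bool backend_is_cpu(const backend & b) {
    return guid_matches(b.guid, CPU_GUID);   // identity by GUID, not by string name
}

int main() {
    backend cpu = { CPU_GUID, "CPU" };
    std::printf("is cpu backend: %s\n", backend_is_cpu(cpu) ? "yes" : "no");
}
```
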
  8. add google magika inference example (ggml/748)

    * add magika inference example
    
    * ggml : fix unaligned accesses in custom ops
    
    * ggml : fix FP32 GELU for values that exceed the FP16 range
    
    * use ggml_pool_1d
    
    * add README
    
    * Update README.md
    
    * pad inputs if the files are too small
    
    * cleanup
    
    ggml-ci
    slaren authored and ggerganov committed Feb 28, 2024
    Commit 2774b0c
  9. sync : ggml

    ggerganov committed Feb 28, 2024
    Commit 8c0e8f4
  10. Commit 78aacf3
  11. Commit 08c5ee8
  12. Commit 317709b
  13. Commit 87c91c0

Commits on Feb 29, 2024

  1. Commit d5ab297
  2. Server: normalize naming (ggerganov#5779)

    * server: normalize naming
    
    * fix spacing
    ngxson authored Feb 29, 2024
    Commit 052051d
  3. Commit e841b7a

Commits on Mar 1, 2024

  1. [SYCL] Use batched mul_mat pathway (ggerganov#5591)

    * Use batched mul_mat pathway
    
    * rm extra line
    
    * Explicitly state scaled data type
    
    ---------
    
    Co-authored-by: Abhilash Majumder <[email protected]>
    AidanBeltonS and abhilash1910 authored Mar 1, 2024
    Commit 38d1521
  2. Commit f105471
  3. Commit 6ea0f01
  4. Commit 5cb02b4
  5. unicode : switch to multimap based nfd_map (ggerganov#5799)

    * switch to multimap based nfd_map due to compile time issues
    
    * simplify multimap keys
    
    * don't construct a new locale every time
    iamlemec authored Mar 1, 2024
    Commit 9600d59
  6. llama : cleanup unused mmq flags (ggerganov#5772)

    * cleanup unused --no-mul-mat-q,-nommq, -mmq, --mul-mat-q, mul_mat_q
    
    * remove: mul_mat_q in compare llama bench and usage
    
    * update llama-bench
    
    ---------
    
    Co-authored-by: slaren <[email protected]>
    phymbert and slaren authored Mar 1, 2024
    Commit 3ab8b3a
  7. Commit f49a535
  8. Commit e743386
  9. Commit c2224f0
  10. Commit 38d16b1
  11. llama : add StarCoder2 support (ggerganov#5795)

    * Add support for starcoder2
    
    * handle rope type
    
    * skip rope freq and rotary embeddings from being serialized
    
    * resolve comments
    
    * Update llama.cpp
    
    * remove redundant changes
    
    * handle `rope-theta`
    
    * llama : change starcoder2 rope type
    
    * address comment
    
    ---------
    
    Co-authored-by: Georgi Gerganov <[email protected]>
    pacman100 and ggerganov authored Mar 1, 2024
    Commit c29af7e
  12. Commit c504a54