ggml-quants : ternary packing for TriLMs and BitNet b1.58 #8151

Merged (33 commits, Sep 6, 2024)

Commits on Jun 27, 2024

  1. bd80749
  2. ggml-quants : faster 1.625 bpw AVX2 vec_dot

    Not using a lookup table anymore makes it match q4_0 speed
    (the table-free extraction idea is sketched below).

    * gguf-py : fix formatting

    * llama : remove spaces on empty line

    compilade committed Jun 27, 2024 (7ef4254)
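    A minimal scalar sketch of the table-free idea, assuming the fixed-point
    base-3 packing this PR describes (five trits per byte, scaled by 256/243
    and packed with a ceiling); an illustration, not the actual AVX2 kernel:

    ```c
    #include <stdint.h>

    // Decode the n-th of five ternary digits packed into one byte, without a
    // lookup table. Assumes q = ceil(v * 256/243) where
    // v = t0*3^4 + t1*3^3 + t2*3^2 + t3*3 + t4 and each t is in {0, 1, 2}.
    static inline uint8_t trit_at(uint8_t q, int n) {
        static const uint8_t pow3[5] = {1, 3, 9, 27, 81};
        uint8_t shifted = (uint8_t)(q * pow3[n]);       // digit n moves to the top (mod 256)
        return (uint8_t)(((uint16_t)shifted * 3) >> 8); // extract it: 0, 1, or 2
    }
    ```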
  3. ggml-quants : subtract 1 when back in epi8

    This makes the 1.625 bpw type go faster than q4_0, though still not the
    fastest (the offset step is sketched below).

    compilade committed Jun 27, 2024 (48b73b8)
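    A hedged AVX2 illustration of the "subtract 1" step (hypothetical helper
    name, not the PR's kernel): once trits are unpacked into unsigned bytes in
    {0, 1, 2}, one subtraction maps them to signed {-1, 0, +1} in the epi8
    domain before the int8 dot product.

    ```c
    #include <immintrin.h>

    // Map 32 unsigned trits {0, 1, 2} to signed {-1, 0, +1} in one instruction.
    static inline __m256i trits_to_signed(__m256i trits_u8) {
        return _mm256_sub_epi8(trits_u8, _mm256_set1_epi8(1));
    }
    ```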
  4. ef1e345
  5. 638ad52
  6. 9465ec6
  7. 89dc3b2
  8. convert-hf : simplify BitNet pre-quantization

    This still results in exactly the same tensor weights and scales,
    but it reveals some weirdness in the current algorithm
    (the underlying absmean ternarization is sketched below).

    compilade committed Jun 27, 2024 (961e293)
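    For reference, a sketch of the absmean ternarization from the BitNet b1.58
    paper, which is what this conversion step reproduces; a reference version,
    not the convert-hf code itself:

    ```c
    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    // scale = mean(|W|); W_q = clamp(round(W / scale), -1, +1)
    static void bitnet_ternarize(const float * w, int8_t * q, float * scale, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i) sum += fabsf(w[i]);
        float s = (float)(sum / n) + 1e-8f;  // small epsilon as in the paper
        for (size_t i = 0; i < n; ++i) {
            float v = roundf(w[i] / s);
            q[i] = (int8_t)(v < -1.0f ? -1.0f : v > 1.0f ? 1.0f : v);
        }
        *scale = s;
    }
    ```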
  9. convert-hf : allow converting the weird BitNet 1.3B

    Its FFN size is 5460, which is not convenient (see the note below).
    The offending tensors are kept in F16,
    which makes the final model 5.01 bpw.

    compilade committed Jun 27, 2024 (0996149)
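    Why 5460 is inconvenient: block quantization requires each row length to be
    a multiple of the type's block size, and 5460 = 2*2*3*5*7*13 is divisible
    by 4 but not by 8, so every power-of-two block size of 8 or more rejects it
    (5460 % 64 == 20, 5460 % 256 == 84). Hypothetical helper for the check:

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    // A row can only be block-quantized if its length divides evenly into blocks.
    static bool can_block_quantize(int64_t row_size, int64_t block_size) {
        return row_size % block_size == 0;
    }
    ```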

Commits on Jun 29, 2024

  1. bfd2f21
  2. ec50944
  3. 8fbd593

Commits on Jul 29, 2024

  1. dd3e62a
  2. 79a278e

Commits on Jul 30, 2024

  1. 77b8f84

Commits on Jul 31, 2024

  1. ggml : even faster TQ2_0

    compilade committed Jul 31, 2024 (560873f)
  2. ggml : also faster TQ1_0

    Same optimization as for TQ2_0: offset the sum instead of the weights
    (the identity is spelled out below).
    This makes TQ1_0 almost as fast as Q8_0 on AVX2.

    compilade committed Jul 31, 2024 (e971957)
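    The identity behind offsetting the sum instead of the weights: with trits
    stored unsigned as u = w + 1 (u in {0, 1, 2}, w in {-1, 0, +1}),
    dot(w, y) = sum(u[i]*y[i]) - sum(y[i]), so the kernel can use cheap
    unsigned multiplies and subtract the activation sum once. Scalar sketch:

    ```c
    #include <stdint.h>

    static int32_t tq_dot_offset(const uint8_t * u, const int8_t * y, int n) {
        int32_t dot = 0, ysum = 0;
        for (int i = 0; i < n; ++i) {
            dot  += (int32_t)u[i] * y[i];  // unsigned trit times int8 activation
            ysum += y[i];                  // in practice precomputed per block
        }
        return dot - ysum;                 // equals sum((u[i] - 1) * y[i])
    }
    ```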

Commits on Aug 1, 2024

  1. a6dd699
  2. 5417089
  3. ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat

    The compiler seems smart enough to emit the same instruction
    when vget_high_s8 is used instead (see the sketch below).

    compilade committed Aug 1, 2024 (45719a2)
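    The replacement in question, sketched: the AArch64-only intrinsic
    vmlal_high_s8(acc, a, b) can be written portably with vget_high_s8, and a
    good compiler still emits a single smlal2 on AArch64.

    ```c
    #include <arm_neon.h>

    // Portable form, also valid on 32-bit ARM NEON.
    static inline int16x8_t mlal_high_portable(int16x8_t acc, int8x16_t a, int8x16_t b) {
        return vmlal_s8(acc, vget_high_s8(a), vget_high_s8(b));
    }
    ```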

Commits on Aug 3, 2024

  1. ggml : remove q1_3 and q2_2

    * llama : remove the separate scale tensors of BitNet b1.58

    They are no longer needed, since the remaining ternary quant types
    have built-in scales (illustrated below).

    compilade committed Aug 3, 2024 (04eec58)
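    An illustrative contrast (field sizes made up, not the real TQ1_0/TQ2_0
    layouts): the removed types relied on separate scale tensors on the llama
    side, while the remaining ternary types carry a per-block scale inline,
    like the other ggml block types.

    ```c
    #include <stdint.h>

    typedef uint16_t ggml_half;  // assumption: IEEE fp16 stored as uint16_t

    typedef struct {
        uint8_t   qs[48];  // hypothetical packed ternary weights
        ggml_half d;       // built-in per-block scale: no separate tensor needed
    } block_ternary_example;
    ```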
  2. f034aa1

Commits on Aug 7, 2024

  1. ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot

    Not yet tested on hardware which supports it;
    it might not work or might not even compile, but then again it might.
    It should improve performance on recent ARM CPUs
    (a guarded-use sketch follows below).

    * ggml-quants : remove comment about possible format change of TQ2_0

    Making it slightly more convenient for AVX512
    but less convenient for everything else is not worth the trouble.

    compilade committed Aug 7, 2024 (96b3d41)
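    A sketch of the guarded use, assuming the usual feature macro: vdotq_s32
    needs ARMv8.2-A with the dotprod extension, so a vmull/vmlal-style fallback
    stays in place for older CPUs.

    ```c
    #include <arm_neon.h>

    static inline int32x4_t dot_i8(int32x4_t acc, int8x16_t a, int8x16_t b) {
    #if defined(__ARM_FEATURE_DOTPROD)
        return vdotq_s32(acc, a, b);  // 16 int8 multiply-accumulates at once
    #else
        // Widen each half separately to avoid int16 overflow, then accumulate.
        int32x4_t p = vpaddlq_s16(vmull_s8(vget_low_s8(a), vget_low_s8(b)));
        p = vpadalq_s16(p, vmull_s8(vget_high_s8(a), vget_high_s8(b)));
        return vaddq_s32(acc, p);
    #endif
    }
    ```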

Commits on Aug 11, 2024

  1. d911cd1

Commits on Aug 12, 2024

  1. gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0

    * ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0

    This does not change anything for ternary models, since their values
    should never land on halfway cases anyway (the difference between the
    two rounding functions is shown below).

    compilade committed Aug 12, 2024 (3a0bf17)
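    The two rounding functions differ only at halfway points: ggml's
    nearest_int uses the float magic-number trick, which inherits the FPU's
    round-half-to-even, while roundf rounds halfway cases away from zero.
    A small self-contained comparison:

    ```c
    #include <assert.h>
    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    // ggml-style nearest_int: adding 2^23 + 2^22 makes the FPU do the rounding.
    static inline int nearest_int(float fval) {
        assert(fval <= 4194303.f);
        float val = fval + 12582912.f;
        int i; memcpy(&i, &val, sizeof(int));
        return (i & 0x007fffff) - 0x00400000;
    }

    int main(void) {
        // Prints: 0.5 -> 0 vs 1 ; 1.5 -> 2 vs 2 ; 2.5 -> 2 vs 3
        for (float v = 0.5f; v <= 2.5f; v += 1.0f)
            printf("%.1f -> %d vs %d\n", v, nearest_int(v), (int)roundf(v));
        return 0;
    }
    ```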

Commits on Aug 13, 2024

  1. convert : allow direct conversion to TQ1_0 and TQ2_0

    The token embeddings and output tensors are kept in F16
    to allow quantizing them to Q4_K and Q6_K with llama-quantize.

    * llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0

    Q4_0 is not completely symmetric (so not lossless for ternary models;
    see the worked example below), but it should be good enough.

    compilade committed Aug 13, 2024 (895004f)
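    A worked example of the Q4_0 asymmetry: Q4_0 stores x ~= d * (q - 8) with
    q in [0, 15] and d = max / -8, where max is the block's signed
    largest-magnitude value. For a ternary block whose max-magnitude value is
    -1, the grid below contains an exact -1 but no exact +1 (it saturates at
    +0.875); with max = +1 the situation mirrors.

    ```c
    #include <stdio.h>

    int main(void) {
        const float d = -1.0f / -8.0f;  // max = -1  =>  d = 0.125
        for (int q = 0; q < 16; ++q)
            printf("%+.3f ", d * (q - 8));  // -1.000 -0.875 ... +0.875
        printf("\n");
        return 0;
    }
    ```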
  2. 69f7726
  3. 82b2404
  4. 35cc556

Commits on Aug 22, 2024

  1. cb6d996

Commits on Sep 4, 2024

  1. 7f3a619
  2. ggml : remove unused ggml_mul special case

    It would otherwise conflict with the more general
    optimization coming with Mamba-2.

    * ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators

    compilade committed Sep 4, 2024 (8d61607)
  3. test-backend-ops : add TQ1_0 and TQ2_0 comments for later

    Not yet adding them uncommented, because some backends like SYCL and
    Metal do not properly handle unknown types in supports_op for
    GGML_OP_MUL_MAT (and Metal also doesn't handle it for GGML_OP_GET_ROWS;
    see the sketch below).
    Support for TQ1_0 and TQ2_0 on backends other than CPU
    will be added in follow-up pull requests.

    compilade committed Sep 4, 2024 (75b3a09)
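    A hypothetical sketch of the supports_op pitfall (not any backend's real
    code): a backend must return false for tensor types it does not implement,
    including types added after the backend was written, rather than assuming
    its type list is exhaustive.

    ```c
    #include <stdbool.h>
    #include "ggml.h"

    static bool example_backend_supports_mul_mat(enum ggml_type type) {
        switch (type) {
            case GGML_TYPE_F32:
            case GGML_TYPE_F16:
            case GGML_TYPE_Q8_0:
                return true;   // types this example backend implements
            default:
                return false;  // TQ1_0, TQ2_0, and anything unknown land here
        }
    }
    ```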