Adding IQ2_TN for use with ternary models #13

Merged 11 commits on Aug 7, 2024

Commits on Aug 5, 2024

  1. iq2_tn: TriLM specific 2.0625 bpw quantization

    Quantize/dequantize/scalar dot product.

    I get 46 t/s for TriLM-3.9B without any SIMD!
    Finally, a compiler that does a decent job auto-vectorizing the
    scalar implementation.
    Kawrakow committed Aug 5, 2024
    1b41d79
  2. iq2_tn: AVX512

    Just reusing the k-quants template gets us to PP-512 = 376 t/s,
    TG-128 = 47.6 t/s for TriLM-3.9B.
    Kawrakow committed Aug 5, 2024
    dd0b08d
  3. iq2_tn: AVX512

    With this tweak we get to PP-512 = 431 t/s.
    Kawrakow committed Aug 5, 2024
    c063954
  4. iq2_tn: AVX512

    With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
    At 4 threads we saturate at 48.41 t/s, and then performance slowly
    degrades with increasing number of threads.
    Kawrakow committed Aug 5, 2024
    d0cc103
  5. iq2_tn: AVX2

    PP-512 = 440 t/s on the Ryzen-5975WX.
    We should be able to do better.
    Kawrakow committed Aug 5, 2024
    a63ba11
  6. 8102855
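
Note on the 2.0625 bpw figure in the first commit: the number is consistent with 2 bits per ternary weight plus one 16-bit scale shared by a block of 256 weights. The block size, the scale width, and the packing order below are assumptions made for illustration; this is a minimal sketch and not the actual iq2_tn memory layout or kernel code.

```python
QK_K = 256  # assumed number of weights per block


def bits_per_weight(block_size=QK_K, bits_per_value=2, scale_bits=16):
    """Average storage cost per weight for one block (assumed layout)."""
    return (block_size * bits_per_value + scale_bits) / block_size


def quantize_block(weights):
    """Pack ternary weights {-d, 0, +d} into 2-bit codes plus one scale d."""
    d = max(abs(w) for w in weights)
    # Map -d, 0, +d to the codes 0, 1, 2; an all-zero block uses code 1.
    codes = [1 if d == 0 else round(w / d) + 1 for w in weights]
    packed = bytearray(len(weights) // 4)
    for i, code in enumerate(codes):
        packed[i // 4] |= code << (2 * (i % 4))  # four codes per byte
    return d, bytes(packed)


def dequantize_block(d, packed):
    """Inverse of quantize_block: unpack the codes and rescale."""
    return [d * (((byte >> (2 * j)) & 3) - 1) for byte in packed for j in range(4)]


weights = [0.5, -0.5, 0.0, 0.5] * (QK_K // 4)
scale, packed = quantize_block(weights)
restored = dequantize_block(scale, packed)
print(bits_per_weight(), len(packed), restored == weights)
```

A real implementation would store the scale as a 16-bit float and would likely order the 2-bit codes to suit the SIMD kernels; the sketch keeps the scale as a Python float for simplicity.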

Commits on Aug 6, 2024

  1. iq2_tn: NEON

    For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
    TG-128 = 75.5 t/s. This is in line with what we have for
    iq2_bn and the 3.3B Bitnet.
    Kawrakow committed Aug 6, 2024
    e528505
  2. iq2_tn: Metal

    For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
    TG-128 = 98.5 t/s.
    Kawrakow committed Aug 6, 2024
    5d02f7f
  3. iq2_tn: CUDA

    For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
    TG-128 = 299.2 t/s.
    Kawrakow committed Aug 6, 2024
    2cc6338
  4. iq2_tn: AVX2 PP improvement

    We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
    We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn.
    Bitnet-3B is actually 3.4B and TriLM-3.9B is 3.99B, so we would
    expect 3.43/3.99 * 636 = 546 t/s. It seems we still have something
    that is not quite optimal in iq2_tn.
    Kawrakow committed Aug 6, 2024
    9780ac4
  5. iq2_tn: small NEON improvement

    For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.
    Kawrakow committed Aug 6, 2024
    8178075
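
Note on the AVX2 PP estimate above: the expected iq2_tn throughput is obtained by scaling the iq2_bn result by the ratio of model sizes. The sketch below repeats that arithmetic under the assumption that prompt-processing speed is inversely proportional to parameter count; the function name is hypothetical and the inputs are the figures quoted in the commit message. Using the unrounded 636.61 t/s gives a value about 1 t/s above the 546 t/s quoted there.

```python
def expected_pp(reference_tps, reference_params_b, target_params_b):
    """Scale a reference throughput by the ratio of parameter counts."""
    return reference_tps * reference_params_b / target_params_b


# Figures from the commit message: iq2_bn on Bitnet (3.43B) vs iq2_tn on TriLM (3.99B).
expected = expected_pp(636.61, 3.43, 3.99)
measured = 490.73
shortfall = 1.0 - measured / expected  # fraction of the expected speed still missing

print(f"expected ~{expected:.0f} t/s, measured {measured} t/s, shortfall {shortfall:.1%}")
```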