ggml-quants : ternary packing for TriLMs and BitNet b1.58 #8151

Merged (33 commits, Sep 6, 2024)

Commits on Jun 27, 2024

  1. bd80749
  2. ggml-quants : faster 1.625 bpw AVX2 vec_dot

    Not using a lookup table anymore makes it match q4_0 speed
    (the table-free extraction idea is sketched below).

    * gguf-py : fix formatting

    * llama : remove spaces on empty line

    compilade committed Jun 27, 2024 (7ef4254)
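    A minimal scalar sketch of the table-free idea, assuming the fixed-point
    base-3 packing this PR describes (five trits per byte, scaled by 256/243
    and packed with a ceiling); an illustration, not the actual AVX2 kernel:

    ```c
    #include <stdint.h>

    // Decode the n-th of five ternary digits packed into one byte, without a
    // lookup table. Assumes q = ceil(v * 256/243) where
    // v = t0*3^4 + t1*3^3 + t2*3^2 + t3*3 + t4 and each t is in {0, 1, 2}.
    static inline uint8_t trit_at(uint8_t q, int n) {
        static const uint8_t pow3[5] = {1, 3, 9, 27, 81};
        uint8_t shifted = (uint8_t)(q * pow3[n]);       // digit n moves to the top (mod 256)
        return (uint8_t)(((uint16_t)shifted * 3) >> 8); // extract it: 0, 1, or 2
    }
    ```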
  3. ggml-quants : subtract 1 when back in epi8

    This makes the 1.625 bpw type go faster than q4_0, though still not the
    fastest (the offset step is sketched below).

    compilade committed Jun 27, 2024 (48b73b8)
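    A hedged AVX2 illustration of the "subtract 1" step (hypothetical helper
    name, not the PR's kernel): once trits are unpacked into unsigned bytes in
    {0, 1, 2}, one subtraction maps them to signed {-1, 0, +1} in the epi8
    domain before the int8 dot product.

    ```c
    #include <immintrin.h>

    // Map 32 unsigned trits {0, 1, 2} to signed {-1, 0, +1} in one instruction.
    static inline __m256i trits_to_signed(__m256i trits_u8) {
        return _mm256_sub_epi8(trits_u8, _mm256_set1_epi8(1));
    }
    ```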
  4. ef1e345
  5. 638ad52
  6. 9465ec6
  7. 89dc3b2
  8. convert-hf : simplify BitNet pre-quantization

    This still results in exactly the same tensor weights and scales,
    but it reveals some weirdness in the current algorithm
    (the underlying absmean ternarization is sketched below).

    compilade committed Jun 27, 2024 (961e293)
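    For reference, a sketch of the absmean ternarization from the BitNet b1.58
    paper, which is what this conversion step reproduces; a reference version,
    not the convert-hf code itself:

    ```c
    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    // scale = mean(|W|); W_q = clamp(round(W / scale), -1, +1)
    static void bitnet_ternarize(const float * w, int8_t * q, float * scale, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i) sum += fabsf(w[i]);
        float s = (float)(sum / n) + 1e-8f;  // small epsilon as in the paper
        for (size_t i = 0; i < n; ++i) {
            float v = roundf(w[i] / s);
            q[i] = (int8_t)(v < -1.0f ? -1.0f : v > 1.0f ? 1.0f : v);
        }
        *scale = s;
    }
    ```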
  9. convert-hf : allow converting the weird BitNet 1.3B

    Its FFN size is 5460, which is not convenient (see the note below).
    The offending tensors are kept in F16,
    which makes the final model 5.01 bpw.

    compilade committed Jun 27, 2024 (0996149)
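    Why 5460 is inconvenient: block quantization requires each row length to be
    a multiple of the type's block size, and 5460 = 2*2*3*5*7*13 is divisible
    by 4 but not by 8, so every power-of-two block size of 8 or more rejects it
    (5460 % 64 == 20, 5460 % 256 == 84). Hypothetical helper for the check:

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    // A row can only be block-quantized if its length divides evenly into blocks.
    static bool can_block_quantize(int64_t row_size, int64_t block_size) {
        return row_size % block_size == 0;
    }
    ```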

Commits on Jun 29, 2024

  1. bfd2f21
  2. ec50944
  3. 8fbd593

Commits on Jul 29, 2024

  1. dd3e62a
  2. 79a278e

Commits on Jul 30, 2024

  1. 77b8f84

Commits on Jul 31, 2024

  1. ggml : even faster TQ2_0

    compilade committed Jul 31, 2024 (560873f)
  2. ggml : also faster TQ1_0

    Same optimization as for TQ2_0: offset the sum instead of the weights
    (the identity is spelled out below).
    This makes TQ1_0 almost as fast as Q8_0 on AVX2.

    compilade committed Jul 31, 2024 (e971957)
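    The identity behind offsetting the sum instead of the weights: with trits
    stored unsigned as u = w + 1 (u in {0, 1, 2}, w in {-1, 0, +1}),
    dot(w, y) = sum(u[i]*y[i]) - sum(y[i]), so the kernel can use cheap
    unsigned multiplies and subtract the activation sum once. Scalar sketch:

    ```c
    #include <stdint.h>

    static int32_t tq_dot_offset(const uint8_t * u, const int8_t * y, int n) {
        int32_t dot = 0, ysum = 0;
        for (int i = 0; i < n; ++i) {
            dot  += (int32_t)u[i] * y[i];  // unsigned trit times int8 activation
            ysum += y[i];                  // in practice precomputed per block
        }
        return dot - ysum;                 // equals sum((u[i] - 1) * y[i])
    }
    ```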

Commits on Aug 1, 2024

  1. a6dd699
  2. 5417089
  3. ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat

    The compiler seems smart enough to emit the same instruction
    when vget_high_s8 is used instead (see the sketch below).

    compilade committed Aug 1, 2024 (45719a2)
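    The replacement in question, sketched: the AArch64-only intrinsic
    vmlal_high_s8(acc, a, b) can be written portably with vget_high_s8, and a
    good compiler still emits a single smlal2 on AArch64.

    ```c
    #include <arm_neon.h>

    // Portable form, also valid on 32-bit ARM NEON.
    static inline int16x8_t mlal_high_portable(int16x8_t acc, int8x16_t a, int8x16_t b) {
        return vmlal_s8(acc, vget_high_s8(a), vget_high_s8(b));
    }
    ```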

Commits on Aug 3, 2024

  1. ggml : remove q1_3 and q2_2

    * llama : remove the separate scale tensors of BitNet b1.58

    They are no longer needed, since the remaining ternary quant types
    have built-in scales (illustrated below).

    compilade committed Aug 3, 2024 (04eec58)
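    An illustrative contrast (field sizes made up, not the real TQ1_0/TQ2_0
    layouts): the removed types relied on separate scale tensors on the llama
    side, while the remaining ternary types carry a per-block scale inline,
    like the other ggml block types.

    ```c
    #include <stdint.h>

    typedef uint16_t ggml_half;  // assumption: IEEE fp16 stored as uint16_t

    typedef struct {
        uint8_t   qs[48];  // hypothetical packed ternary weights
        ggml_half d;       // built-in per-block scale: no separate tensor needed
    } block_ternary_example;
    ```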
  2. f034aa1

Commits on Aug 7, 2024

  1. ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot

    Not yet tested on hardware which supports it;
    it might not work or might not even compile, but then again it might.
    It should improve performance on recent ARM CPUs
    (a guarded-use sketch follows below).

    * ggml-quants : remove comment about possible format change of TQ2_0

    Making it slightly more convenient for AVX512
    but less convenient for everything else is not worth the trouble.

    compilade committed Aug 7, 2024 (96b3d41)
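    A sketch of the guarded use, assuming the usual feature macro: vdotq_s32
    needs ARMv8.2-A with the dotprod extension, so a vmull/vmlal-style fallback
    stays in place for older CPUs.

    ```c
    #include <arm_neon.h>

    static inline int32x4_t dot_i8(int32x4_t acc, int8x16_t a, int8x16_t b) {
    #if defined(__ARM_FEATURE_DOTPROD)
        return vdotq_s32(acc, a, b);  // 16 int8 multiply-accumulates at once
    #else
        // Widen each half separately to avoid int16 overflow, then accumulate.
        int32x4_t p = vpaddlq_s16(vmull_s8(vget_low_s8(a), vget_low_s8(b)));
        p = vpadalq_s16(p, vmull_s8(vget_high_s8(a), vget_high_s8(b)));
        return vaddq_s32(acc, p);
    #endif
    }
    ```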

Commits on Aug 11, 2024

  1. d911cd1

Commits on Aug 12, 2024

  1. gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0

    * ggml-quants : use roundf instead of nearest_int for TQ1_0 and TQ2_0

    This does not change anything for ternary models, since their values
    should never land on halfway cases anyway (the difference between the
    two rounding functions is shown below).

    compilade committed Aug 12, 2024 (3a0bf17)
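    The two rounding functions differ only at halfway points: ggml's
    nearest_int uses the float magic-number trick, which inherits the FPU's
    round-half-to-even, while roundf rounds halfway cases away from zero.
    A small self-contained comparison:

    ```c
    #include <assert.h>
    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    // ggml-style nearest_int: adding 2^23 + 2^22 makes the FPU do the rounding.
    static inline int nearest_int(float fval) {
        assert(fval <= 4194303.f);
        float val = fval + 12582912.f;
        int i; memcpy(&i, &val, sizeof(int));
        return (i & 0x007fffff) - 0x00400000;
    }

    int main(void) {
        // Prints: 0.5 -> 0 vs 1 ; 1.5 -> 2 vs 2 ; 2.5 -> 2 vs 3
        for (float v = 0.5f; v <= 2.5f; v += 1.0f)
            printf("%.1f -> %d vs %d\n", v, nearest_int(v), (int)roundf(v));
        return 0;
    }
    ```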

Commits on Aug 13, 2024

  1. convert : allow direct conversion to TQ1_0 and TQ2_0

    The token embeddings and output tensors are kept in F16
    to allow quantizing them to Q4_K and Q6_K with llama-quantize.

    * llama : handle fallback for TQ1_0 and TQ2_0 with Q4_0

    Q4_0 is not completely symmetric (so not lossless for ternary models;
    see the worked example below), but it should be good enough.

    compilade committed Aug 13, 2024 (895004f)
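    A worked example of the Q4_0 asymmetry: Q4_0 stores x ~= d * (q - 8) with
    q in [0, 15] and d = max / -8, where max is the block's signed
    largest-magnitude value. For a ternary block whose max-magnitude value is
    -1, the grid below contains an exact -1 but no exact +1 (it saturates at
    +0.875); with max = +1 the situation mirrors.

    ```c
    #include <stdio.h>

    int main(void) {
        const float d = -1.0f / -8.0f;  // max = -1  =>  d = 0.125
        for (int q = 0; q < 16; ++q)
            printf("%+.3f ", d * (q - 8));  // -1.000 -0.875 ... +0.875
        printf("\n");
        return 0;
    }
    ```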
  2. 69f7726
  3. 82b2404
  4. 35cc556

Commits on Aug 22, 2024

  1. cb6d996

Commits on Sep 4, 2024

  1. 7f3a619
  2. ggml : remove unused ggml_mul special case

    It would otherwise conflict with the more general
    optimization coming with Mamba-2.

    * ggml : handle TQ1_0 and TQ2_0 in dequantization-based operators

    compilade committed Sep 4, 2024 (8d61607)
  3. test-backend-ops : add TQ1_0 and TQ2_0 comments for later

    Not yet adding them uncommented, because some backends like SYCL and
    Metal do not properly handle unknown types in supports_op for
    GGML_OP_MUL_MAT (and Metal also doesn't handle it for GGML_OP_GET_ROWS;
    see the sketch below).
    Support for TQ1_0 and TQ2_0 on backends other than CPU
    will be added in follow-up pull requests.

    compilade committed Sep 4, 2024 (75b3a09)
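    A hypothetical sketch of the supports_op pitfall (not any backend's real
    code): a backend must return false for tensor types it does not implement,
    including types added after the backend was written, rather than assuming
    its type list is exhaustive.

    ```c
    #include <stdbool.h>
    #include "ggml.h"

    static bool example_backend_supports_mul_mat(enum ggml_type type) {
        switch (type) {
            case GGML_TYPE_F32:
            case GGML_TYPE_F16:
            case GGML_TYPE_Q8_0:
                return true;   // types this example backend implements
            default:
                return false;  // TQ1_0, TQ2_0, and anything unknown land here
        }
    }
    ```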