ggml-quants : ternary packing for TriLMs and BitNet b1.58 #8151
Merged (+937 −35)
Commits (33, all by compilade)
bd80749  ggml-quants : 1.625 bpw ternary packing for BitNet 1.58b
7ef4254  ggml-quants : faster 1.625 bpw AVX2 vec_dot
48b73b8  ggml-quants : substract 1 when back in epi8
ef1e345  ggml-quants : Q2_2 now faster than Q4_K on with AVX2
638ad52  ggml-quants : cleanup Q1_3 code formatting
9465ec6  ggml-quants : ARM NEON vec_dot for q2_2 and q1_3
89dc3b2  ggml-quants : use ceiling division when quantizing q1_3
961e293  convert-hf : simplify BitNet pre-quantization
0996149  convert-hf : allow converting the weird BitNet 1.3B
bfd2f21  bitnet : replace 1.58b with b1.58, as in the paper
ec50944  ggml-quants : fix build failure on Windows
8fbd593  ggml-quants : attempt to fix Arm 32-bit support
dd3e62a  ggml : add some informative comments in q1_3 vec_dot
79a278e  Merge branch 'master' into compilade/bitnet-ternary
77b8f84  ggml : add TQ1_0 and TQ2_0 ternary quantization types
560873f  ggml : even faster TQ2_0
e971957  ggml : also faster TQ1_0
a6dd699  ggml : fix build issues in certain environments
5417089  ggml : add NEON vec_dot implementation for TQ1_0 and TQ2_0
45719a2  ggml : avoid directly using vmlal_high_s8, for 32-bit ARM compat
04eec58  ggml : remove q1_3 and q2_2
f034aa1  ggml-quants : rename fields of TQ1_0 and TQ2_0 structs for consistency
96b3d41  ggml-quants : allow using vdotq_s32 in TQ2_0 vec_dot
d911cd1  Merge branch 'master' into compilade/bitnet-ternary
3a0bf17  gguf-py : Numpy (de)quantization for TQ1_0 and TQ2_0
895004f  convert : allow direct conversion to TQ1_0 and TQ2_0
69f7726  ggml-quants : allow using ARM dot product instructions for TQ1_0
82b2404  Merge branch 'master' into compilade/bitnet-ternary
35cc556  ggml-quants : deduplicate TQ1_0 and TQ2_0 __ARM_FEATURE_DOTPROD support
cb6d996  Merge branch 'master' into compilade/bitnet-ternary
7f3a619  Merge branch 'master' into compilade/bitnet-ternary
8d61607  ggml ; remove unused ggml_mul special case
75b3a09  test-backend-ops : add TQ1_0 and TQ2_0 comments for later
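The recurring theme in these commits is packing ternary weights {-1, 0, 1} at or below 2 bits per weight. As a rough illustration of the simplest scheme, binary-coded ternary with 2 bits per value (the approach behind the Q2_2/TQ2_0 family of types), here is a minimal C sketch; the function names are made up for illustration, and this is not the actual ggml kernel code, which works on whole blocks with SIMD:

```c
#include <stdint.h>

// Pack 4 ternary weights {-1, 0, 1} into one byte, 2 bits each,
// stored biased as {0, 1, 2}. At 4 values per byte, a 32-weight
// block takes 8 bytes = 2.000 bpw (before any scale is counted).
static uint8_t pack4_ternary(const int8_t w[4]) {
    uint8_t q = 0;
    for (int i = 0; i < 4; ++i) {
        q |= (uint8_t)((w[i] + 1) << (2 * i));
    }
    return q;
}

static void unpack4_ternary(uint8_t q, int8_t w[4]) {
    for (int i = 0; i < 4; ++i) {
        w[i] = (int8_t)(((q >> (2 * i)) & 3) - 1); // back to {-1, 0, 1}
    }
}
```

The 1.625 bpw commits instead use base-3 "fixed-point" packing, sketched further below in the naming discussion.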
compilade commented:
Regarding the names of the new quant types, since these are quite specific to BitNet models, I was thinking to name them with something starting with QB, a bit like suggested in #5761 (comment). I'll first describe what I want from the naming scheme, then attempt to make it work.

The naming scheme should have room for the following (the first two layouts are sketched in code after the list):
- {-1, 0, 1}:
  - a 1.625 bpw quant with a block size of 64, with 13 bytes per block
    - Q8_0 as its vec_dot_type (for the activations)
    - a float16 scale in the leftover bits in the last byte of 16 consecutive blocks (this means 1024 elements minimum per row), although it can't really be extracted with SIMD
  - a 2.000 bpw quant with a block size of 32, with 8 bytes per block
    - Q8_0 as its vec_dot_type (for the activations)
  - a 2.000 bpw quant with a block size of 64, with 16 bytes per block, and a float16 scale
    - like the 1.625 bpw type, but with an extra byte and a row-wise float16 scale duplicated in each block
  - a 2.000 bpw quant with a block size of 4, with 1 byte per block
    - its vec_dot_type would be the 10 bpw activation type below
- {-1, 1}:
  - a 1 bpw type
- {0, 1}:
  - a 1 bpw type
- 8.5 bpw, like Q8_0 but all the scales of a row are the same
  - to allow float32 operations in the vec_dot of the above types
- 10 bpw, 5 bytes per block of 4 elements, with a weird layout which only uses blocks to get a big enough buffer, with a single float32 scale and some padding before all row elements, aligned and contiguous
  - needed by the 2.000 bpw type with a block size of 4, and also maybe the other ones for best performance
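For concreteness, the first two layouts in the list, which correspond to this PR's Q1_3 and Q2_2 types, could be declared like this. This is a sketch with illustrative names, not the PR's exact structs:

```c
#include <stdint.h>

// 1.625 bpw ternary (Q1_3-like): 64 weights in 13 bytes.
// Base-3 packing: 12 bytes x 5 trits + 1 byte x 4 trits = 64 values.
typedef struct {
    uint8_t q[12]; // 60 ternary values, 5 per byte
    uint8_t qs[1]; // the remaining 4 ternary values
} block_ternary_1_625;     // 13 bytes * 8 / 64 = 1.625 bpw

// 2.000 bpw ternary (Q2_2-like): 32 weights in 8 bytes.
typedef struct {
    uint8_t qs[8]; // 4 values per byte, 2 bits each
} block_ternary_2_000;     // 8 bytes * 8 / 32 = 2.000 bpw
```

Note that neither block carries its own scale: as the list says, the scale is row-wise (or tucked into leftover bits across 16 consecutive blocks), and the activations use Q8_0 as the vec_dot_type.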
So the naming scheme could be QB<x>_<y>, where:

- <x> is the floor of the expected bpw of the type
- <y> is:
  - 0 for a binary type, {0, 1} (except QB8_0, which is like Q8_0 but with a guaranteed duplicated row-wise scale)
  - 1 for a binary type, {-1, 1}
  - 2 for a ternary type using some kind of binary-coded ternary
  - 3 for a ternary type with fixed-point packed values
  - 4 for the weird type with a block size of 4

Which for the previously-mentioned possible BitNet types would mean:
- QB1_3: {-1, 0, 1} (the current Q1_3)
- QB2_2: {-2, -1, 0, 1} (the current Q2_2)
- QB2_3: {-1, 0, 1}, with an f16 scale
- QB2_4: {-2, -1, 0, 1}, block size of 4
- QB1_1: {-1, 1}
- QB1_0: {0, 1}
- QB8_0: [-127, 127], f16 scale
- QB8_4: [-127, 127], f32 scale, weird layout

I'm not saying these should all exist, though, only that the naming scheme should not be too limiting for possible future extensions (which might not exist anyway due to lack of time).

So I think I'll rename Q1_3 to QB1_3, and Q2_2 to QB2_2. Does anyone have comments on this? Or a better naming scheme for the new BitNet quant types?
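The "fixed-point packed values" of Q1_3 (and of the later TQ1_0, which replaced it) rely on 3^5 = 243 <= 256: five ternary digits fit in one byte, and storing the base-3 value scaled by 256/243 lets each digit be peeled off with one multiply and one shift, which is SIMD-friendly. A minimal sketch, with made-up function names, of how one such byte could be packed and unpacked:

```c
#include <stdint.h>

// Pack 5 ternary digits in {-1, 0, 1} into one byte: 3^5 = 243 <= 256.
// The base-3 value is stored scaled by 256/243 (rounded up), so each
// digit can later be extracted with one multiply and one shift.
static uint8_t pack5_ternary(const int8_t t[5]) {
    uint32_t p = 0;
    for (int i = 0; i < 5; ++i) {
        p = p * 3 + (uint32_t)(t[i] + 1); // base-3 digits in {0, 1, 2}
    }
    return (uint8_t)((p * 256 + 242) / 243); // ceil(p * 256 / 243)
}

static void unpack5_ternary(uint8_t q, int8_t t[5]) {
    uint32_t x = q;
    for (int i = 0; i < 5; ++i) {
        x *= 3;
        t[i] = (int8_t)((x >> 8) - 1); // top digit, most significant first
        x &= 0xFF;                     // keep the fractional part
    }
}
```

The ceiling in the scaling step keeps the stored value slightly above the exact base-3 number, so x >> 8 always recovers the exact next digit. Five values per byte is 1.6 bpw for those bytes; TQ1_0's 256-element blocks come out to 1.6875 bpw once the float16 scale and a few 4-trit bytes are included.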
Another commenter replied:
If it were me, considering this only works with BitNet models and nothing else, I'd want the designations to make it exceptionally clear that these types are different and shouldn't be used on just anything. "QB" is good, but I'd take it a step further and remove the Q entirely. Since BitNet is colloquially referred to as a "1-bit" model, B1 makes more sense. Considering the plausible range of weights, I'd cut the bpw off at tenths and drop the decimal point. This leaves plenty of room for variations while making the native bpw very clear. I feel this is superior to the arbitrary "_2" and "_3" subtypes.

So what I would propose is:
- 1.625 bpw = B1_16
- 2.000 bpw = B1_20