Adding IQ2_TN for use with ternary models #13
Merged
Commits on Aug 5, 2024
-
iq2_tn: TriLM-specific 2.0625 bpw quantization
Quantize, dequantize, and scalar dot product. I get 46 t/s for TriLM-3.9B without any SIMD! Finally a compiler doing a decent job auto-vectorizing the scalar implementation.
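The 2.0625 bpw figure is what you would get from storing each ternary weight in 2 bits plus one 16-bit scale per block of 256 weights: (256 * 2 + 16) / 256 = 2.0625. The sketch below illustrates that arithmetic with a scalar pack, unpack, and dot product. The block size, the fp16 scale, the bit order, and all function names are assumptions made for illustration; they are not taken from the commit.

```python
import struct

BLOCK = 256  # assumed number of weights per block


def quantize_block(w):
    """Pack BLOCK ternary weights (-d, 0, +d) into 2 bits each plus one fp16 scale."""
    assert len(w) == BLOCK
    d = max(abs(x) for x in w)  # per-block scale
    qs = bytearray(BLOCK // 4)  # four 2-bit values per byte
    for i, x in enumerate(w):
        q = 1 if d == 0 else round(x / d) + 1  # map {-1, 0, +1} to {0, 1, 2}
        qs[i // 4] |= q << (2 * (i % 4))
    return struct.pack("<e", d) + bytes(qs)  # 2 + 64 = 66 bytes


def unpack_ternary(qs, i):
    """Return the i-th ternary value (-1, 0, or +1) from the packed bytes."""
    return ((qs[i // 4] >> (2 * (i % 4))) & 3) - 1


def dequantize_block(blob):
    (d,) = struct.unpack("<e", blob[:2])
    qs = blob[2:]
    return [d * unpack_ternary(qs, i) for i in range(BLOCK)]


def dot_block(blob, y):
    """Scalar dot product of one quantized block with activations y."""
    (d,) = struct.unpack("<e", blob[:2])
    qs = blob[2:]
    acc = 0.0
    for i in range(BLOCK):
        acc += unpack_ternary(qs, i) * y[i]
    return d * acc  # apply the block scale once at the end


def bits_per_weight():
    return (BLOCK * 2 + 16) / BLOCK
```

Applying the scale once per block, after summing the ternary products, keeps the inner loop free of floating-point multiplies by the scale, which is the kind of simple loop a compiler can auto-vectorize.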
Commit 1b41d79
-
Just reusing the k-quants template gets us to PP-512 = 376 t/s, TG-128 = 47.6 t/s for TriLM-3.9B.
Commit dd0b08d
-
Commit c063954
-
With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads. At 4 threads we saturate at 48.41 t/s, and then performance slowly degrades with increasing number of threads.
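To put the thread-scaling numbers in perspective, the following sketch computes parallel efficiency relative to the single-thread result, using only the TG-128 figures quoted above. The function name is hypothetical. Efficiency comes out at roughly 90% for 2 threads and 62% for 4 threads.

```python
def scaling_efficiency(tps_by_threads):
    """Efficiency = measured t/s divided by (threads * single-thread t/s)."""
    base = tps_by_threads[1]
    return {n: tps / (n * base) for n, tps in tps_by_threads.items()}


# TG-128 results from the commit message, keyed by thread count
tg128 = {1: 19.58, 2: 35.18, 4: 48.41}

for n, eff in scaling_efficiency(tg128).items():
    print(f"{n} threads: {tg128[n]:.2f} t/s, efficiency {eff:.0%}")
```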
Commit d0cc103
-
PP-512 = 440 t/s on the Ryzen-5975WX. We should be able to do better.
Commit a63ba11
-
Commit 8102855
Commits on Aug 6, 2024
-
For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s, TG-128 = 75.5 t/s. This is in line with what we have for iq2_bn and the 3.3B Bitnet.
Commit e528505
-
For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s, TG-128 = 98.5 t/s.
Commit 5d02f7f
-
For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s, TG-128 = 299.2 t/s.
Commit 2cc6338
-
We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX. For comparison, Bitnet-3B quantized with iq2_bn gets PP-512 = 636.61 t/s. Bitnet-3B is actually 3.43B parameters and TriLM-3.9B is 3.99B, so we would expect 3.43/3.99 * 636 = 546 t/s. It seems we still have something that is not quite optimal in iq2_tn.
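The estimate above assumes prompt-processing throughput scales inversely with parameter count. A small sketch of that arithmetic, using only the numbers from the commit message (the function name is hypothetical), gives about 547 t/s with the unrounded 636.61 input, close to the 546 quoted, and a shortfall of roughly 10% for the measured iq2_tn result.

```python
def expected_tps(ref_tps, ref_params_b, target_params_b):
    """Scale a reference throughput by the ratio of parameter counts (in billions)."""
    return ref_tps * ref_params_b / target_params_b


# Reference: Bitnet-3B (3.43B params) with iq2_bn; target: TriLM-3.9B (3.99B params)
expected = expected_tps(636.61, 3.43, 3.99)
measured = 490.73  # PP-512 for TriLM-3.9B with iq2_tn

shortfall = 1 - measured / expected
print(f"expected {expected:.1f} t/s, measured {measured:.2f} t/s, shortfall {shortfall:.1%}")
```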
Commit 9780ac4
-
iq2_tn: small NEON improvement
For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.
Commit 8178075