Conversation

@ikawrakow
Owner

Just reuse the IQ1_BN implementation. The only twist is that the row scale is now stored at the beginning of the row, so the dot product template needs a small modification: a pointer to the row start is passed down to the dot product implementation.
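
A minimal sketch of that change, written as plain C++ rather than the actual CUDA kernels; every type and function name here (`ternary_row_header`, `block_ternary`, `vec_dot_ternary_q8`, `row_dot`) is a hypothetical stand-in, and the 2-bit block layout is only an assumption used to keep the example short. The point it illustrates is the template change: the row-level driver forwards a pointer to the start of the row, so the per-block dot product can read the row scale stored there.

```cpp
#include <cstdint>
#include <cstddef>

// Assumed row layout for this sketch: one float row scale, then the packed blocks.
struct ternary_row_header { float d; };

// Hypothetical block: 64 ternary values, 2 bits each (not the real IQ1_BN layout).
struct block_ternary { uint8_t qs[16]; };

// Per-block dot product: now also receives a pointer to the beginning of the row
// so it can pick up the row scale stored there.
static float vec_dot_ternary_q8(const block_ternary * b,
                                const void * row_begin,   // new parameter
                                const int8_t * q8) {
    const float d = ((const ternary_row_header *)row_begin)->d;  // row scale
    int sumi = 0;
    for (int i = 0; i < 16; ++i) {
        for (int j = 0; j < 4; ++j) {
            // 2-bit codes assumed to be in {0,1,2}, mapping to {-1,0,+1}
            const int t = ((b->qs[i] >> (2*j)) & 3) - 1;
            sumi += t * q8[4*i + j];
        }
    }
    return d * (float)sumi;
}

// Row-level driver: iterates over the blocks of one row and forwards row_begin.
static float row_dot(const void * row_begin, const int8_t * q8, int nblocks) {
    const block_ternary * blocks =
        (const block_ternary *)((const char *)row_begin + sizeof(ternary_row_header));
    float sum = 0.0f;
    for (int ib = 0; ib < nblocks; ++ib) {
        sum += vec_dot_ternary_q8(blocks + ib, row_begin, q8 + 64*ib);
    }
    return sum;
}
```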

It is slightly slower than IQ2_TN (305 t/s vs. 320 t/s for the 4B TriLM model on an RTX-4080), but this is to be expected given the bit twiddling needed to unpack the ternary values.
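
For context on where that overhead comes from: IQ2_TN (as the name suggests) gives each ternary value its own 2-bit code, so extraction is just a shift and a mask, whereas packing closer to the information-theoretic ~1.58 bits per ternary value means storing several values per byte in base 3 and digging them back out arithmetically. The snippet below is only a generic illustration of that kind of decode (five base-3 digits per byte), not the actual IQ1_BN packing; an optimized kernel would replace the division with multiply/shift tricks and SIMD, but the extra work relative to a plain 2-bit code remains, which is the likely source of the ~5% throughput gap quoted above.

```cpp
#include <cstdint>

// Generic illustration only: decode five ternary digits ("trits") from one byte
// that stores a base-3 number in [0, 242]. This is NOT the actual IQ1_BN layout.
static inline void unpack5_trits(uint8_t v, int8_t out[5]) {
    for (int i = 0; i < 5; ++i) {
        out[i] = (int8_t)(v % 3) - 1;  // map digit {0,1,2} -> weight {-1,0,+1}
        v /= 3;
    }
}

// Compare with a plain 2-bit code (IQ2_TN-style): extraction is just shift + mask.
static inline int8_t unpack_2bit(uint8_t byte, int idx) {
    return (int8_t)((byte >> (2 * idx)) & 3) - 1;  // codes assumed to be {0,1,2}
}
```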
