Introduce 8da4w quant for decoder-only text models #62
Conversation
@kimishpatel @metascroy @jerryzh168 for review.
Tagging @tarun292 for review as we start adding quantization recipes for native HF models.
Fixed the executorch version check issue on Linux, where the reported version string is '0.6.0+cpu', causing the version check to fail.
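For illustration, here is a minimal sketch of how a local version label like `+cpu` can be stripped before comparison, assuming the `packaging` library; this is not the PR's actual fix:

```python
# Illustrative only: normalize a local version label like "+cpu"
# before comparing versions, using the `packaging` library.
from packaging.version import Version

raw = "0.6.0+cpu"                  # what executorch reports on Linux
base = Version(raw).base_version   # "0.6.0" -- local label dropped
assert Version(base) >= Version("0.6.0")
```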
Quantization-related code LGTM.
Initial effort to introduce quantization for native Hugging Face models that are already supported in optimum-executorch, starting with decoder-only text models using "8da4w" (int8 dynamic activations, int4 weights) for linear layers and int8 for embeddings. The quantization configs were experimented with on the following models (a hedged torchao sketch follows the list):
- Qwen3-0.6B
- gemma-3-1b
- HuggingFaceTB/SmolLM2-135M
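As a hedged sketch of what such a recipe can look like with torchao's `quantize_` API; the config names and parameters follow recent torchao releases and are assumptions, not necessarily this PR's code:

```python
# A sketch, not the PR's exact recipe: int8 weight-only quantization for
# the embedding table and 8da4w (int8 dynamic activation, int4 weight)
# for linear layers, using torchao. Config names assume a recent torchao.
import torch
from torchao.quantization.granularity import PerAxis, PerGroup
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
    quantize_,
)

def quantize_decoder(model: torch.nn.Module) -> torch.nn.Module:
    # int8 weight-only on the embedding, quantized per output channel
    quantize_(
        model,
        IntxWeightOnlyConfig(weight_dtype=torch.int8, granularity=PerAxis(0)),
        lambda m, fqn: isinstance(m, torch.nn.Embedding),
    )
    # 8da4w on linear layers: 4-bit weights in groups of 32,
    # int8 dynamically quantized activations
    quantize_(
        model,
        Int8DynamicActivationIntxWeightConfig(
            weight_dtype=torch.int4,
            weight_granularity=PerGroup(32),
        ),
    )
    return model
```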
Example usage via `optimum-cli`:

optimum-cli export executorch --model Qwen/Qwen3-0.6B --task text-generation --recipe xnnpack --use_custom_sdpa --qlinear --qembedding --output_dir qwen3_8da4w_8we
Or use `ExecuTorchModelForCausalLM.from_pretrained`:

et_model = ExecuTorchModelForCausalLM.from_pretrained("./qwen3_8da4w_8we")
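A hedged end-to-end sketch of running generation on the exported model; the `text_generation` helper and its arguments follow optimum-executorch examples and may differ between versions:

```python
# A sketch: load the quantized .pte artifact and generate text.
# `text_generation` and its parameters are assumed from
# optimum-executorch examples, not taken from this PR.
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

et_model = ExecuTorchModelForCausalLM.from_pretrained("./qwen3_8da4w_8we")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

print(
    et_model.text_generation(
        tokenizer=tokenizer,
        prompt="Give me a short introduction to large language models.",
        max_seq_len=128,
    )
)
```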
`.pte` size comparison: