Introduce 8da4w quant for decoder-only text models #62
Conversation
@kimishpatel @metascroy @jerryzh168 for review.
Tagging @tarun292 for review as we start adding quantization recipes for native HF models.
Fixed the executorch version check issue on Linux, where the reported version string is '0.6.0+cpu', causing the version check to fail.
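For illustration, here is a minimal sketch of how a local version label like `+cpu` can be stripped before comparison, assuming the `packaging` library; this is not the PR's actual fix:

```python
# Illustrative only: normalize a local version label like "+cpu"
# before comparing versions, using the `packaging` library.
from packaging.version import Version

raw = "0.6.0+cpu"                  # what executorch reports on Linux
base = Version(raw).base_version   # "0.6.0" -- local label dropped
assert Version(base) >= Version("0.6.0")
```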
Quantization-related code LGTM.
Initial effort to introduce quantization for native Hugging Face models that are already supported in optimum-executorch, starting with decoder-only text models using "8da4w" (int8 dynamic activations, int4 weights) for linear layers and int8 for embeddings. The quantization configs were experimented with on the following models (a hedged torchao sketch follows the list):
- Qwen3-0.6B
- gemma-3-1b
- HuggingFaceTB/SmolLM2-135M
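As a hedged sketch of what such a recipe can look like with torchao's `quantize_` API; the config names and parameters follow recent torchao releases and are assumptions, not necessarily this PR's code:

```python
# A sketch, not the PR's exact recipe: int8 weight-only quantization for
# the embedding table and 8da4w (int8 dynamic activation, int4 weight)
# for linear layers, using torchao. Config names assume a recent torchao.
import torch
from torchao.quantization.granularity import PerAxis, PerGroup
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    IntxWeightOnlyConfig,
    quantize_,
)

def quantize_decoder(model: torch.nn.Module) -> torch.nn.Module:
    # int8 weight-only on the embedding, quantized per output channel
    quantize_(
        model,
        IntxWeightOnlyConfig(weight_dtype=torch.int8, granularity=PerAxis(0)),
        lambda m, fqn: isinstance(m, torch.nn.Embedding),
    )
    # 8da4w on linear layers: 4-bit weights in groups of 32,
    # int8 dynamically quantized activations
    quantize_(
        model,
        Int8DynamicActivationIntxWeightConfig(
            weight_dtype=torch.int4,
            weight_granularity=PerGroup(32),
        ),
    )
    return model
```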
Example usage via `optimum-cli`:

optimum-cli export executorch --model Qwen/Qwen3-0.6B --task text-generation --recipe xnnpack --use_custom_sdpa --qlinear --qembedding --output_dir qwen3_8da4w_8we
Or use `ExecuTorchModelForCausalLM.from_pretrained`:

et_model = ExecuTorchModelForCausalLM.from_pretrained("./qwen3_8da4w_8we")
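A hedged end-to-end sketch of running generation on the exported model; the `text_generation` helper and its arguments follow optimum-executorch examples and may differ between versions:

```python
# A sketch: load the quantized .pte artifact and generate text.
# `text_generation` and its parameters are assumed from
# optimum-executorch examples, not taken from this PR.
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

et_model = ExecuTorchModelForCausalLM.from_pretrained("./qwen3_8da4w_8we")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

print(
    et_model.text_generation(
        tokenizer=tokenizer,
        prompt="Give me a short introduction to large language models.",
        max_seq_len=128,
    )
)
```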
`.pte` size comparison: