
Conversation

guangy10
Collaborator

@guangy10 guangy10 commented May 1, 2025

Initial effort to introduce quantization for native Hugging Face models already supported in optimum-executorch, starting with decoder-only text models using "8da4w" (int8 dynamic activations, int4 weights) for linear layers and int8 weight-only quantization for embeddings.

Experimented with the quantization configs on the following models:

  • Qwen3-0.6B
  • gemma-3-1b
  • HuggingFaceTB/SmolLM2-135M
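The "8da4w" scheme quantizes weights to int4 (activations are quantized to int8 dynamically at runtime). As a rough illustration of the weight side only, here is a minimal NumPy sketch of symmetric per-output-channel int4 quantization; the helper names are hypothetical and the actual recipe uses torchao's quantization APIs, not this code:

```python
import numpy as np

def quantize_int4_per_channel(w):
    """Symmetric per-output-channel int4 weight quantization (sketch)."""
    # int4 symmetric range is [-8, 7]; divide by 7 so the max weight maps to 7
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float weight from int4 values and scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, scale = quantize_int4_per_channel(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step per channel
max_err = np.abs(w - w_hat).max()
```

Since the scale is chosen per output channel, each 4-bit value reconstructs its weight to within half a step (scale / 2), which is why int4 weight-only quantization preserves accuracy reasonably well for linear layers.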

Example usage

via optimum-cli
optimum-cli export executorch --model Qwen/Qwen3-0.6B --task text-generation --recipe xnnpack --use_custom_sdpa --qlinear --qembedding --output_dir qwen3_8da4w_8we

or use ExecuTorchModelForCausalLM.from_pretrained:

et_model = ExecuTorchModelForCausalLM.from_pretrained("./qwen3_8da4w_8we")


.pte size comparison:

qwen3_8da4w_8we:
total 1035336
-rw-r--r--  506M  May  1 13:09 model.pte

qwen3_8da4w:
total 1944584
-rw-r--r--  950M  May  1 13:16 model.pte

qwen3_float16:
total 2937408
-rw-r--r--  1.4G  May  1 13:22 model.pte

qwen3_float32:
total 5873128
-rw-r--r--  2.8G  May  1 13:26 model.pte
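The compression ratios implied by the listings above can be sanity-checked with quick arithmetic (this is back-of-the-envelope math on the sizes shown, not part of the PR; 1.4G/2.8G are converted at 1024 MB per GB):

```python
# .pte sizes in MB, taken from the listings above
sizes = {
    "float32": 2.8 * 1024,    # 2.8G baseline
    "float16": 1.4 * 1024,    # 1.4G
    "8da4w": 950,             # int4 linear weights only
    "8da4w_8we": 506,         # int4 linear weights + int8 embedding
}
baseline = sizes["float32"]
ratios = {name: baseline / mb for name, mb in sizes.items()}
# float16: 2.00x, 8da4w: 3.02x, 8da4w_8we: 5.67x smaller than float32
for name, r in ratios.items():
    print(f"{name}: {r:.2f}x smaller than float32")
```

Note the large drop from 8da4w (950M) to 8da4w_8we (506M): for a small model like Qwen3-0.6B the embedding table is a big fraction of the parameters, so quantizing it to int8 nearly halves the artifact size again.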

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@guangy10 guangy10 force-pushed the intro_8da4w_quant branch 12 times, most recently from 92541cc to 602420d Compare May 1, 2025 05:18
@guangy10 guangy10 marked this pull request as ready for review May 1, 2025 05:53
@guangy10 guangy10 force-pushed the intro_8da4w_quant branch 3 times, most recently from a7369e6 to c20bd3e Compare May 1, 2025 18:24
@guangy10
Collaborator Author

guangy10 commented May 1, 2025

@kimishpatel @metascroy @jerryzh168 for review.

@guangy10 guangy10 force-pushed the intro_8da4w_quant branch 2 times, most recently from 797a747 to c126e84 Compare May 1, 2025 20:28
@guangy10
Collaborator Author

guangy10 commented May 1, 2025

Tagging @tarun292 for review as we start adding quantization recipes for native HF models.

@guangy10 guangy10 force-pushed the intro_8da4w_quant branch from c126e84 to 2c461b4 Compare May 1, 2025 21:15
@guangy10 guangy10 force-pushed the intro_8da4w_quant branch 4 times, most recently from c650a1c to a071afe Compare May 2, 2025 01:37
@guangy10
Copy link
Collaborator Author

guangy10 commented May 2, 2025

Fixed the executorch version check issue on Linux, where executorch.__version__ returns '0.6.0+cpu', causing parse(executorch.__version__) > parse("0.6.0") to evaluate as True for executorch==0.6.0. This looks like a bug to me.
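The comparison behaves this way because of PEP 440: a local version like 0.6.0+cpu sorts strictly after the same public release 0.6.0. A minimal repro with packaging, plus one possible fix via base_version (a sketch of the issue, not necessarily the fix used in this PR):

```python
from packaging.version import parse

# Linux CPU wheels report a local version tag on executorch.__version__
v = "0.6.0+cpu"

# PEP 440 ordering: a local version is greater than the bare release,
# so a strict ">" check wrongly treats executorch==0.6.0 as newer than 0.6.0
print(parse(v) > parse("0.6.0"))  # True

# One fix: strip the local segment before comparing
print(parse(parse(v).base_version) == parse("0.6.0"))  # True
```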


@jerryzh168 jerryzh168 left a comment


quantization-related code LGTM

@guangy10 guangy10 force-pushed the intro_8da4w_quant branch 2 times, most recently from ea01b8c to aa83f06 Compare May 2, 2025 17:16
@guangy10 guangy10 force-pushed the intro_8da4w_quant branch 2 times, most recently from a2de3dc to 484eeb9 Compare May 2, 2025 21:34
@guangy10 guangy10 force-pushed the intro_8da4w_quant branch from 484eeb9 to d741f97 Compare May 5, 2025 20:00
@guangy10 guangy10 merged commit efecfc5 into huggingface:main May 5, 2025
107 checks passed
@guangy10 guangy10 deleted the intro_8da4w_quant branch May 5, 2025 23:20