[Feat] dnnl build for AVX2 W8A8 Int8#41318

Merged
bigPYJ1151 merged 17 commits into vllm-project:main from tianmu-li:feat/avx2_w8a8
May 6, 2026
Conversation

@tianmu-li
Contributor

@tianmu-li tianmu-li commented Apr 30, 2026

Purpose

The CPU backend's W8A8 INT8 quantization ops (static_scaled_int8_quant, dynamic_scaled_int8_quant, onednn_scaled_mm) were gated behind __AVX512F__ and completely absent from the _C_AVX2 shared library. Running a compressed-tensors W8A8 INT8 model on an AVX2-only host (e.g., Xeon 6 with E-cores) therefore failed with a missing-symbol error at runtime. This PR links _C_AVX2 against the existing dnnl_ext library and adds the AVX2 operators needed for quantization. INT8 quantization is especially beneficial on AVX2, since bf16/fp16 models run at fp32 rate there.
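For context, a scalar Python sketch of what a dynamic per-tensor INT8 quantization op computes (the function name mirrors the op above, but the scale handling and rounding mode here are illustrative, not vLLM's exact kernel):

```python
# Illustrative sketch only: per-tensor dynamic INT8 quantization.
# scale = max(|x|) / 127, then round-to-nearest and clamp to [-128, 127].
def dynamic_scaled_int8_quant(x: list[float]) -> tuple[list[int], float]:
    scale = max(abs(v) for v in x) / 127.0 or 1.0  # avoid div-by-zero on all-zero input
    q = [max(-128, min(127, round(v / scale))) for v in x]
    return q, scale

q, s = dynamic_scaled_int8_quant([0.4, -0.8, 2.0])
# q == [25, -51, 127]; the largest-magnitude value maps to 127
```

The real kernels vectorize this loop; on AVX2 the lane width is half that of AVX512, which is why the AVX2-specific operators added here matter.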

Note: dnnl_ext now compiles with -mavx2. oneDNN detects the host ISA and JIT-compiles its kernels at runtime, so I don't expect this to be a problem.
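A minimal sketch of the kind of CMake change described (the target names `_C_AVX2` and `dnnl_ext` come from the PR text; the exact layout of vLLM's cmake files is assumed, not quoted):

```cmake
# Hypothetical fragment: build the oneDNN extension with AVX2 codegen and
# link it into the AVX2 variant of the C extension, not only the AVX512 one.
target_compile_options(dnnl_ext PRIVATE -mavx2)
target_link_libraries(_C_AVX2 PRIVATE dnnl_ext)
```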

Test Plan

Test platform: an AVX2-enabled (non-AVX512) host
Server

export VLLM_CPU_KVCACHE_SPACE=20
export VLLM_CPU_OMP_THREADS_BIND=0-47
MODEL_NAME={RedHatAI/Meta-Llama-3.1-8B-Instruct-quantized.w8a8 or meta-llama/Llama-3.1-8B-Instruct}
vllm serve $MODEL_NAME \
  --served-model-name meta-llama/Llama-3.1-8B-Instruct \
  --port 8868 --host 0.0.0.0 \
  --no-enable-prefix-caching \
  --max-model-len=16384 \
  --max-num-batched-tokens=8192

Client

MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
vllm bench serve --model $MODEL_NAME \
  --num-prompts 50 \
  --port=8868 \
  --random-input-len 128 \
  --random-output-len 128

Test Result

Dtype   Before throughput (toks/s)   After throughput (toks/s)
bf16    69                           72.5
int8    DNR (did not run)            172.7

AI assistance

This PR was developed with Claude Code assistance. All changed lines have been reviewed by the submitting author.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
…d for both avx2 and avx512

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
@tianmu-li tianmu-li requested a review from bigPYJ1151 as a code owner April 30, 2026 02:16

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added ci/build cpu Related to CPU backends labels Apr 30, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request enables oneDNN support for AVX2 architectures by updating CMake configurations and providing AVX2-compatible implementations for vector operations, including masked stores, clamping, and reductions. It also fixes a loop increment bug in the dynamic quantization kernel where the index was being incremented by one instead of the vector element count. I have no feedback to provide.

Member

@bigPYJ1151 bigPYJ1151 left a comment


Thanks! LGTM :)

@bigPYJ1151 bigPYJ1151 added the ready ONLY add when PR is ready to merge/full CI is needed label Apr 30, 2026
@bigPYJ1151 bigPYJ1151 enabled auto-merge (squash) April 30, 2026 11:47
@mergify
Contributor

mergify Bot commented Apr 30, 2026

Hi @tianmu-li, the pre-commit checks have failed. Please run:

uv pip install 'pre-commit>=4.5.1'
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

1 similar comment

@tianmu-li tianmu-li marked this pull request as draft April 30, 2026 16:02
auto-merge was automatically disabled April 30, 2026 16:02

Pull request was converted to draft

@tianmu-li
Contributor Author

Found some issues in an Apple silicon smoke test (https://github.com/tianmu-li/vllm/actions/runs/25149522905/job/73716695058#logs); will need to merge/rebase after #41387. There are also some potential compilation issues on ARM with dnnl that need fixing.

@tianmu-li tianmu-li marked this pull request as ready for review April 30, 2026 20:55
@louie-tsai
Contributor

Looping in @louie-tsai

@bigPYJ1151 bigPYJ1151 merged commit e87e09a into vllm-project:main May 6, 2026
20 checks passed
chaojun-zhang pushed a commit to chaojun-zhang/vllm that referenced this pull request May 6, 2026
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
amd-mghanimi pushed a commit to amd-mghanimi/vllm that referenced this pull request May 6, 2026
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Signed-off-by: Mehdi Ghanimifard <mehdi.ghanimifard@amd.com>
ikaadil pushed a commit to ikaadil/vllm that referenced this pull request May 7, 2026
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
libinta pushed a commit to libinta/vllm that referenced this pull request May 8, 2026
Signed-off-by: Li, Tianmu <tianmu.li@intel.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Signed-off-by: Libin Tang <libin.tang@intel.com>

Labels

ci/build cpu Related to CPU backends ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants