
Enhance Autoround to support multiple cards tuning #2157

Merged

brian-dellabetta merged 34 commits into vllm-project:main from yiliu30:auto-device on Jan 6, 2026

Conversation

@yiliu30
Contributor

@yiliu30 yiliu30 commented Dec 19, 2025

AutoRound uses a block-level reconstruction loss to fine-tune quantization parameters, which requires running backward passes on each block. For large models such as Qwen3-235B, a single GPU often doesn't have enough memory to hold an entire block during the backward computation. To address this, we use HF Accelerate hooks to dispatch the block's submodules across multiple devices.
In this PR, we enable this feature on the LLMC side:

  • Add device_ids for tuning with multiple cards (a hedged usage sketch follows this list)
  • Map ignore to AutoRound's skipped layers
  • Add Qwen/Qwen3-235B-A22B as an example for multiple cards
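
A minimal usage sketch of the new multi-card flow. The modifier import path and the parameter names (device_map vs. device_ids, the ignore pattern) are assumptions based on this PR's description; see the merged qwen3_example.py for the exact API.

```python
# Hypothetical recipe; names marked "assumed" may differ in the merged code.
from transformers import AutoModelForCausalLM

from llmcompressor import oneshot
from llmcompressor.modifiers.autoround import AutoRoundModifier  # import path assumed

MODEL_ID = "Qwen/Qwen3-235B-A22B"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

recipe = AutoRoundModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],        # illustrative; maps to AutoRound's skipped layers
    device_map="0,1,2,3",      # assumed: spread each block across four cards
)

oneshot(
    model=model,
    dataset="open_platypus",   # any calibration dataset supported by oneshot
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
)
```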

Test plan

pytest -svv ./llmcompressor/transformers/autoround/test_autoround_oneshot.py -k test_oneshot_with_device_map

Example results

# vllm (pretrained=INC4AI/Qwen3-235B-A22B-W4A16-G128-AutoRound-ITERS1-LLMC-TEST-ONLY,tensor_parallel_size=2,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 128
# |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
# |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
# |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.928|±  |0.0082|
# |     |       |strict-match    |     5|exact_match|↑  |0.930|±  |0.0081|
  

# vllm (pretrained=INC4AI/Qwen3-235B-A22B-W4A16-G128-AutoRound-ITERS200-LLMC-TEST-ONLY,tensor_parallel_size=2,max_model_len=8192,max_num_batched_tokens=32768,max_num_seqs=128,add_bos_token=True,gpu_memory_utilization=0.8,dtype=bfloat16,max_gen_toks=2048,enable_prefix_caching=False), gen_kwargs: (None), limit: 1000.0, num_fewshot: None, batch_size: 128
# |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
# |-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
# |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.934|±  |0.0079|
# |     |       |strict-match    |     5|exact_match|↑  |0.915|±  |0.0088|

cc @hshen14 @thuang6 @wenhuach21

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @yiliu30, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AutoRoundModifier by introducing robust support for multi-GPU tuning, allowing the quantization process to efficiently distribute model layers across multiple devices. This is achieved through the addition of a device_map parameter and a mechanism to temporarily suspend Accelerate's device management hooks, ensuring compatibility and optimized resource utilization during quantization. The changes are validated with a new test case and demonstrated with a practical example for the Qwen3 model.

Highlights

  • Multi-GPU Support for AutoRound: The AutoRoundModifier now includes a device_map parameter, enabling the AutoRound quantization process to distribute model layers and leverage multiple GPUs for more efficient tuning.
  • Accelerate Hook Management: A new suspend_accelerate_hooks context manager has been introduced (sketched after this list). It temporarily detaches Accelerate's device offloading hooks during AutoRound's tuning phase, preventing conflicts and ensuring proper device management when using multiple GPUs.
  • Improved Unquantized Layer Handling: A get_unquantized_layer_names method was added, and the fp_layers parameter is now passed to the AutoRound constructor. This provides more precise control over which specific layers are excluded from the quantization process.
  • Qwen3 Example Added: A new example script (qwen3_example.py) has been added, demonstrating how to apply AutoRound quantization to the Qwen3-235B model using multiple A100 GPUs, showcasing the new multi-card tuning capability.
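
A minimal sketch of such a hook-suspension context manager, assuming Accelerate's public add_hook_to_module / remove_hook_from_module helpers; the PR's actual suspend_accelerate_hooks may be implemented differently.

```python
# Sketch only -- not the merged implementation.
from contextlib import contextmanager

from accelerate.hooks import add_hook_to_module, remove_hook_from_module
from torch import nn


@contextmanager
def suspend_accelerate_hooks(model: nn.Module):
    """Detach Accelerate's device hooks so AutoRound can place block
    submodules itself, then restore the original hooks afterwards."""
    saved = []
    for module in model.modules():
        hook = getattr(module, "_hf_hook", None)
        if hook is not None:
            saved.append((module, hook))
            remove_hook_from_module(module)
    try:
        yield
    finally:
        for module, hook in saved:
            add_hook_to_module(module, hook)
```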


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances AutoRoundModifier to support multi-GPU tuning by integrating auto_round's device_map functionality. This is primarily achieved by adding a device_map parameter to the modifier and introducing a new context manager, suspend_accelerate_hooks, to correctly handle models with Hugging Face Accelerate hooks. The changes are well-supported by a new example for a large model and a new test case for multi-GPU execution. The implementation is solid, but I've identified a potential edge case in the new suspend_accelerate_hooks function that could lead to a crash if a model has no parameters, for which I've provided a suggestion.

@yiliu30 yiliu30 marked this pull request as ready for review December 19, 2025 06:20
@dsikka dsikka added the autoround For any PR / issue related to autoround support label Dec 19, 2025
Collaborator

@brian-dellabetta brian-dellabetta left a comment


Hi AutoRound team, I think these changes make sense, though we are refactoring some things that overlap with these changes. Please see comments.

Can you point me to the logic in the auto round repo that handles the multi-gpu parallelization work? I'd like to see how you're handling it

@yiliu30
Contributor Author

yiliu30 commented Dec 20, 2025

Hi AutoRound team, I think these changes make sense, though we are refactoring some things that overlap with these changes. Please see comments.

Can you point me to the logic in the auto round repo that handles the multi-gpu parallelization work? I'd like to see how you're handling it

Hi @brian-dellabetta, here is the multi-GPU dispatch logic: https://github.com/intel/auto-round/blob/b53ead7d77746385d700152c7f00960f18fb9d85/auto_round/compressors/base.py#L1560-L1562.

We take a block, its input, and the list of available devices, then assign each submodule to one of those devices. Accelerate's AlignDevicesHook is later used to dispatch the submodules accordingly.

Inside set_auto_device_map_for_block_with_tuning, we estimate the block's memory requirements from its parameters, input, batch size, and a few heuristic factors. Using this estimate, we assign devices to the submodules so that memory usage stays as balanced as possible across all GPUs. The final mapping is then attached to each module as its tuning_device.
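
A toy illustration of that balancing idea, not the actual set_auto_device_map_for_block_with_tuning implementation: a greedy "least-loaded device" assignment keyed only on parameter size (the real heuristic also accounts for activations, batch size, and other factors).

```python
import torch


def assign_tuning_devices(block: torch.nn.Module, device_ids: list[int]) -> dict[str, str]:
    """Hypothetical helper: greedily place each submodule of a block on the
    currently least-loaded device, tracking planned memory in bytes."""
    load = {d: 0 for d in device_ids}
    mapping = {}
    for name, module in block.named_children():
        size = sum(p.numel() * p.element_size() for p in module.parameters())
        device = min(load, key=load.get)       # least-loaded card so far
        load[device] += size
        mapping[name] = f"cuda:{device}"
        module.tuning_device = mapping[name]   # mirrors the attribute described above
    return mapping
```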

kylesayrs
kylesayrs previously approved these changes Jan 4, 2026
Collaborator

@kylesayrs kylesayrs left a comment


Related to #2180. I've done some basic modeling, and it seems AutoRound could see improved performance from deeper integration and from using the DP strategy detailed in the RFC.

In the meantime, this PR looks great, thanks!

@dsikka dsikka requested a review from kylesayrs January 5, 2026 15:01
@dsikka dsikka added the ready When a PR is ready for review label Jan 5, 2026
@yiliu30
Contributor Author

yiliu30 commented Jan 6, 2026

Hi @dsikka, could you help retrigger the CI? Thanks!

Collaborator

@brian-dellabetta brian-dellabetta left a comment


If you guys agree to keep suspend_accelerate_hooks separate from our other implementation and to remove it in the future, I'm good with these changes.

Thanks for adding this!

@brian-dellabetta brian-dellabetta merged commit 0fa6368 into vllm-project:main Jan 6, 2026
11 checks passed

Labels

autoround For any PR / issue related to autoround support ready When a PR is ready for review
