Problem Description
LLM Compressor enables compression of very large language models, sometimes reaching trillions of parameters. However, compressing these models is slow, because they have more parameters than can fit in GPU memory.
In order to work within these constraints, the sequential pipeline was introduced, which allows one layer and one calibration sample to be onloaded at a time, leading to much lower memory usage at the cost of runtime. For a more detailed explanation, see vLLM Office Hours #23 - Deep Dive Into the LLM Compressor.
While this enables large models to be compressed, the runtime cost can be significant, sometimes stretching into multiple days for the largest models, while only utilizing one GPU at a time. Below I discuss some implementation options that would allow models to be compressed faster and in parallel.
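To make the tradeoff concrete, here is a minimal sketch of the layer-by-layer onloading idea. This is illustrative only: the `layers` list and the single-tensor layer interface are simplifying assumptions, not the actual LLM Compressor API.
```python
import torch

def sequential_calibration(layers, batches, device="cuda:0"):
    """Calibrate a chain of layers while keeping only one layer on the GPU."""
    for layer in layers:
        layer.to(device)  # onload a single layer
        with torch.no_grad():
            # run every calibration batch through this layer, keep outputs on CPU
            batches = [layer(batch.to(device)).cpu() for batch in batches]
        layer.to("cpu")   # offload before moving on to the next layer
    return batches
```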
Potential Solutions
There are a couple of ways you could implement a solution to this problem.
Option A: Tensor Parallelism
This option is the most ideal in terms of throughput and memory usage, and would involve sharding each weight across N GPUs. While this is the typical solution for multi-GPU dispatch during inference, it will likely never be practical for LLM Compressor, because most compression algorithms are implemented with the whole weight in mind and would be difficult and runtime-inefficient to implement with weight/activation sharding.
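For reference, a minimal sketch of what sharding a single linear weight across N GPUs would look like. The helpers `shard_linear_weight` and `sharded_linear_forward` are hypothetical and not part of LLM Compressor; the sketch also hints at why algorithms that need the whole weight are awkward to run on shards.
```python
import torch
import torch.nn.functional as F

def shard_linear_weight(weight: torch.Tensor, devices: list[str]) -> list[torch.Tensor]:
    # split the output dimension of a [out_features, in_features] weight across GPUs
    shards = torch.chunk(weight, len(devices), dim=0)
    return [shard.to(device) for shard, device in zip(shards, devices)]

def sharded_linear_forward(x: torch.Tensor, shards: list[torch.Tensor]) -> torch.Tensor:
    # each GPU computes a slice of the output features; results are gathered on the first GPU
    partials = [F.linear(x.to(shard.device), shard) for shard in shards]
    return torch.cat([p.to(shards[0].device) for p in partials], dim=-1)
```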

Option B: Modifier Optimization Parallelism
This option is the easiest to implement and is probably a good place to start. It parallelizes the compression step of GPTQ and other modifiers so that modules can be compressed in parallel, potentially across multiple GPUs.
LLM Compressor is split into two steps, calibration followed by optimization/quantization, both of which take roughly equal amounts of time in most cases. This method only reduces the runtime of the second step (quantization) and cannot help with the first step (calibration).
```python
# Current implementation: modules are quantized one at a time
class GPTQModifier:
    def compress_modules(...):
        for module in modules_to_quantize:
            gptq_quantize(module)

# Proposed implementation: quantization jobs are submitted to a thread pool
# and can run on multiple GPUs concurrently
class GPTQModifier:
    def compress_modules(...):
        with ThreadPoolExecutor(...) as executor:
            for module in modules_to_quantize:
                module.to(GPU)
                executor.submit(gptq_quantize, module)
                module.to(ORIGINAL_GPU)  # offload back once quantization finishes
```
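Extending this to multiple GPUs could look like the round-robin sketch below. This is illustrative only: `compress_modules_parallel` is a hypothetical free function, and `modules_to_quantize`/`gptq_quantize` stand in for the modifier's internals from the pseudocode above.
```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
import torch

def compress_modules_parallel(modules_to_quantize, gptq_quantize):
    devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())]

    def quantize_on(module, device):
        original_device = next(module.parameters()).device
        module.to(device)
        gptq_quantize(module)
        module.to(original_device)  # offload only after quantization finishes

    with ThreadPoolExecutor(max_workers=len(devices)) as executor:
        # assign modules to GPUs round-robin and quantize them concurrently
        for module, device in zip(modules_to_quantize, cycle(devices)):
            executor.submit(quantize_on, module, device)
```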
Option C: Asynchronous Pipeline Parallelism
This option allows parallelism across layers of the model, enabling both calibration parallelism and module compression/quantization parallelism. In short, each layer of the model lives on a separate GPU and is compressed asynchronously, so subsequent layers can be calibrated/compressed while previous layers are still compressing.
This implementation unfortunately sacrifices the step where the model is rerun to propagate quantization error, but it is very unlikely that this will lead to a noticeable accuracy loss.
```python
# Current implementation: calibrate each subgraph, compress it, then rerun it
# to propagate quantization error to subsequent subgraphs
class SequentialPipeline:
    def __call__(...):
        dispatch_for_sequential(model)
        for subgraph_index, subgraph in enumerate(subgraphs):
            with disable_offloading():
                # calibration pass: triggers modifier hooks
                for i, batch in enumerate(data_loader):
                    _ = subgraph.forward(model, **inputs)
                # this is where optimization/quantization happens
                LifecycleCallbacks.sequential_epoch_end()
                # propagation pass: rerun with the compressed weights
                with HooksMixin.disable_hooks():
                    for i, batch in enumerate(data_loader):
                        inputs[i] = subgraph.forward(model, **inputs)
```
```python
# Proposed implementation: compression is submitted asynchronously so that
# the next subgraph can be calibrated while previous subgraphs are still
# compressing; the error-propagation pass is dropped
class SequentialPipeline:
    def __call__(...):
        for subgraph_index, subgraph in enumerate(subgraphs):
            with disable_offloading(move_to=GPU), ThreadPoolExecutor(...) as executor:
                for i, batch in enumerate(data_loader):
                    inputs[i] = subgraph.forward(model, **inputs)
                executor.submit(LifecycleCallbacks.sequential_epoch_end)
```
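One way this could be structured so that compression of layer N genuinely overlaps with calibration of layer N+1 is to keep a single long-lived executor across subgraphs. A minimal sketch of that shape, using illustrative names (`layers`, `batches`, `compress_layer`) rather than the real pipeline API:
```python
from concurrent.futures import ThreadPoolExecutor

def asynchronous_pipeline(layers, batches, compress_layer):
    """Calibrate layer N+1 while layer N is still compressing in the background."""
    futures = []
    with ThreadPoolExecutor(max_workers=len(layers)) as executor:
        for layer in layers:
            # calibration: propagate activations through this (still uncompressed) layer
            batches = [layer(batch) for batch in batches]
            # queue compression; it overlaps with calibration of the next layer
            futures.append(executor.submit(compress_layer, layer))
    for future in futures:
        future.result()  # surface any exceptions raised during compression
    return batches
```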