
[Feature Request] [Help Wanted] Parallel Calibration and Model Optimization #1809

@kylesayrs

Description


Problem Description

LLM Compressor enables compression of very large language models, sometimes reaching trillions of parameters. However, compressing these very large models is slow, because they have more parameters than can fit in GPU memory at once.

In order to work within these constraints, the sequential pipeline was introduced, which onloads one layer and one calibration sample at a time, trading runtime for much lower memory usage. For a more detailed explanation, see vLLM Office Hours #23 - Deep Dive Into the LLM Compressor.
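As a rough illustration of the idea (a simplified sketch, not LLM Compressor's actual pipeline code; the decoder-only model.model.layers layout and the use of raw hidden states are assumptions), calibrating layer by layer while keeping only one layer on the GPU looks like this:

import torch

# Illustrative sketch: keep the whole model on CPU and onload a single decoder
# layer at a time, passing the cached calibration activations through it.
def calibrate_sequentially(model, calibration_batches, device="cuda:0"):
    hidden_states = [model.get_input_embeddings()(batch) for batch in calibration_batches]
    for layer in model.model.layers:  # assumes a decoder-only transformer layout
        layer.to(device)  # onload one layer
        for i, hidden in enumerate(hidden_states):
            with torch.no_grad():
                # real decoder layers also need attention masks / position ids, omitted here
                hidden_states[i] = layer(hidden.to(device))[0].cpu()
        layer.to("cpu")  # offload before moving to the next layer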

While this enables large models to be compressed, the runtime cost can be significant, sometimes stretching into multiple days for the largest models, and only one GPU is utilized at a time. Below I discuss some implementation options that would allow models to be compressed faster and in parallel.

Potential Solutions

There are a couple of ways a solution to this problem could be implemented.

Option A: Tensor Parallelism

This option is ideal in terms of throughput and memory usage, and would involve sharding each weight across N GPUs. While this is the typical solution for multi-GPU dispatch at inference time, it will likely never be practical for LLM Compressor, because most compression algorithms are implemented with the whole weight in mind and would be difficult and runtime-inefficient to implement with weight/activation sharding.
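For intuition, here is a minimal sketch (illustrative only; the shapes and device names are assumptions) of column-sharding a single Linear weight across two GPUs. The difficulty is that a whole-weight algorithm such as GPTQ operates on the complete [out_features, in_features] weight together with a full [in_features, in_features] input Hessian, so the shards would have to be gathered back onto one device anyway:

import torch

# Shard a Linear weight column-wise across two GPUs (tensor parallelism).
weight = torch.randn(4096, 4096)
shards = [shard.to(f"cuda:{i}") for i, shard in enumerate(torch.chunk(weight, 2, dim=1))]

# A whole-weight compression algorithm needs the complete weight (and Hessian)
# on a single device, so the shards must be gathered again before quantizing,
# which erases much of the memory benefit of sharding.
full_weight = torch.cat([shard.to("cuda:0") for shard in shards], dim=1)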


Option B: Modifier Optimization Parallelism

This option is the easiest to implement and probably a good place to start. It parallelizes the compression step of GPTQ and other modifiers so that modules can be quantized in parallel, potentially across multiple GPUs.

Note that this only reduces the runtime of optimization/quantization itself, not of calibration. LLM Compressor runs in two steps, calibration followed by optimization/quantization, which in most cases take roughly equal amounts of time; this method speeds up the second step (quantization) but cannot help with the first (calibration).

from concurrent.futures import ThreadPoolExecutor

# Current (serial) implementation: modules are quantized one at a time
class GPTQModifier:
    def compress_modules(...):
        for module in modules_to_quantize:
            gptq_quantize(module)

# Proposed (parallel) implementation: each module is onloaded, quantized, and
# offloaded by a worker thread, so it is only moved back after quantization
# finishes and multiple modules can be in flight at once
class GPTQModifier:
    def compress_modules(...):
        def quantize_module(module):
            module.to(GPU)
            gptq_quantize(module)
            module.to(ORIGINAL_GPU)

        with ThreadPoolExecutor(...) as executor:
            for module in modules_to_quantize:
                executor.submit(quantize_module, module)
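As a rough sketch of how the multi-GPU part might look (the round-robin helper below is hypothetical, not an existing LLM Compressor API), modules could be assigned to devices round-robin with one worker thread per GPU:

import torch
from concurrent.futures import ThreadPoolExecutor

def compress_modules_multi_gpu(modules_to_quantize, quantize_fn):
    # Hypothetical sketch: round-robin modules over the available GPUs and
    # quantize them in parallel, one worker thread per device.
    num_gpus = torch.cuda.device_count()

    def quantize_on(module, device):
        original_device = next(module.parameters()).device
        module.to(device)
        quantize_fn(module)
        module.to(original_device)

    with ThreadPoolExecutor(max_workers=num_gpus) as executor:
        futures = [
            executor.submit(quantize_on, module, f"cuda:{i % num_gpus}")
            for i, module in enumerate(modules_to_quantize)
        ]
        for future in futures:
            future.result()  # surface any worker exceptions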

Option C: Asynchronous Pipeline Parallelism

This option allows parallelism across layers of the model, enabling both calibration parallelism and module compression/quantization parallelism. In short, each layer of the model lives on its own GPU and is compressed asynchronously, so subsequent layers can be calibrated and compressed while previous layers are still compressing.

This implementation unfortunately sacrifices the step where the model is rerun to propagate quantization error, but it is very unlikely that this will lead to a noticeable accuracy loss.

# Current (serial) implementation: calibrate, quantize, then rerun the subgraph
# to propagate quantization error before moving on to the next subgraph
class SequentialPipeline:
    def __call__(...):
        dispatch_for_sequential(model)

        for subgraph_index, subgraph in enumerate(subgraphs):
            with disable_offloading():
                # calibration pass with modifier hooks enabled
                for i, batch in enumerate(data_loader):
                    subgraph.forward(model, **inputs[i])

                # this is where optimization/quantization happens
                LifecycleCallbacks.sequential_epoch_end()

                # rerun the quantized subgraph to propagate quantization error
                with HooksMixin.disable_hooks():
                    for i, batch in enumerate(data_loader):
                        inputs[i] = subgraph.forward(model, **inputs[i])

# Proposed (asynchronous) implementation: capture outputs during the calibration
# pass itself and submit optimization/quantization to a background worker, so the
# next subgraph (potentially on another GPU) can begin calibrating immediately
class SequentialPipeline:
    def __call__(...):
        for subgraph_index, subgraph in enumerate(subgraphs):
            with disable_offloading(move_to=GPU), ThreadPoolExecutor(...) as executor:
                for i, batch in enumerate(data_loader):
                    inputs[i] = subgraph.forward(model, **inputs[i])
                executor.submit(LifecycleCallbacks.sequential_epoch_end)
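To make the "each layer on a separate GPU" part concrete, here is a hedged sketch (the helper below is an assumption for illustration, not an existing LLM Compressor API) of spreading a decoder's layers across the available GPUs; with such a placement, the executor.submit(...) call above can quantize a finished layer on its own device while the next subgraph calibrates on a different one:

import torch

def dispatch_layers_round_robin(model):
    # Hypothetical sketch: place each decoder layer on its own GPU (round-robin
    # when there are more layers than GPUs), so quantizing one layer and
    # calibrating the next can proceed on different devices.
    num_gpus = torch.cuda.device_count()
    for index, layer in enumerate(model.model.layers):  # assumes a decoder-only layout
        layer.to(f"cuda:{index % num_gpus}")
    return model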
