Problem Description
LLM Compressor enables compression of very large language models, sometimes reaching trillions of parameters. However, compressing these models is slow, because they have more parameters than can fit in GPU memory.
In order to work within these constraints, the sequential pipeline was introduced, which allows one layer and one calibration sample to be onloaded at a time, leading to much lower memory usage at the cost of runtime. For a more detailed explanation, see vLLM Office Hours #23 - Deep Dive Into the LLM Compressor.
While this enables large models to be compressed, the runtime cost can be significant, sometimes stretching into multiple days for the largest models, while only utilizing one GPU at a time. Below I discuss some implementation options that would allow models to be compressed faster and in parallel.
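To make the tradeoff concrete, here is a minimal sketch of the layer-by-layer onloading idea. This is illustrative only: the `layers` list and the single-tensor layer interface are simplifying assumptions, not the actual LLM Compressor API.
```python
import torch

def sequential_calibration(layers, batches, device="cuda:0"):
    """Calibrate a chain of layers while keeping only one layer on the GPU."""
    for layer in layers:
        layer.to(device)  # onload a single layer
        with torch.no_grad():
            # run every calibration batch through this layer, keep outputs on CPU
            batches = [layer(batch.to(device)).cpu() for batch in batches]
        layer.to("cpu")   # offload before moving on to the next layer
    return batches
```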
Potential Solutions
There are a couple of ways you could implement a solution to this problem.
Option A: Tensor Parallelism
This option is the most ideal in terms of throughput and memory usage, and would involve sharding each weight across N GPUs. While this is the typical solution for multi-GPU dispatch during inference, it will likely never be practical for LLM Compressor, because most compression algorithms are implemented with the whole weight in mind and would be difficult and runtime-inefficient to implement with weight/activation sharding.
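For reference, a minimal sketch of what sharding a single linear weight across N GPUs would look like. The helpers `shard_linear_weight` and `sharded_linear_forward` are hypothetical and not part of LLM Compressor; the sketch also hints at why algorithms that need the whole weight are awkward to run on shards.
```python
import torch
import torch.nn.functional as F

def shard_linear_weight(weight: torch.Tensor, devices: list[str]) -> list[torch.Tensor]:
    # split the output dimension of a [out_features, in_features] weight across GPUs
    shards = torch.chunk(weight, len(devices), dim=0)
    return [shard.to(device) for shard, device in zip(shards, devices)]

def sharded_linear_forward(x: torch.Tensor, shards: list[torch.Tensor]) -> torch.Tensor:
    # each GPU computes a slice of the output features; results are gathered on the first GPU
    partials = [F.linear(x.to(shard.device), shard) for shard in shards]
    return torch.cat([p.to(shards[0].device) for p in partials], dim=-1)
```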

Option B: Modifier Optimization Parallelism
This option is the easiest to implement and is probably a good place to start. It parallelizes the compression step of GPTQ and other modifiers so that modules can be compressed in parallel, potentially across multiple GPUs.
LLM Compressor is split into two steps, calibration followed by optimization/quantization, both of which take roughly equal amounts of time in most cases. This method only reduces the runtime of the second step (quantization) and cannot help with the first step (calibration).
```python
# Current implementation: modules are quantized one at a time
class GPTQModifier:
    def compress_modules(...):
        for module in modules_to_quantize:
            gptq_quantize(module)

# Proposed implementation: quantization jobs are submitted to a thread pool
# and can run on multiple GPUs concurrently
class GPTQModifier:
    def compress_modules(...):
        with ThreadPoolExecutor(...) as executor:
            for module in modules_to_quantize:
                module.to(GPU)
                executor.submit(gptq_quantize, module)
                module.to(ORIGINAL_GPU)  # offload back once quantization finishes
```
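Extending this to multiple GPUs could look like the round-robin sketch below. This is illustrative only: `compress_modules_parallel` is a hypothetical free function, and `modules_to_quantize`/`gptq_quantize` stand in for the modifier's internals from the pseudocode above.
```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle
import torch

def compress_modules_parallel(modules_to_quantize, gptq_quantize):
    devices = [f"cuda:{i}" for i in range(torch.cuda.device_count())]

    def quantize_on(module, device):
        original_device = next(module.parameters()).device
        module.to(device)
        gptq_quantize(module)
        module.to(original_device)  # offload only after quantization finishes

    with ThreadPoolExecutor(max_workers=len(devices)) as executor:
        # assign modules to GPUs round-robin and quantize them concurrently
        for module, device in zip(modules_to_quantize, cycle(devices)):
            executor.submit(quantize_on, module, device)
```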
Option C: Asynchronous Pipeline Parallelism
This option allows parallelism across layers of the model, enabling both calibration parallelism and module compression/quantization parallelism. In short, each layer of the model lives on a separate GPU and is compressed asynchronously, so subsequent layers can be calibrated/compressed while previous layers are still compressing.
This implementation unfortunately sacrifices the step where the model is rerun to propagate quantization error, but it is very unlikely that this will lead to a noticeable accuracy loss.
```python
# Current implementation: calibrate each subgraph, compress it, then rerun it
# to propagate quantization error to subsequent subgraphs
class SequentialPipeline:
    def __call__(...):
        dispatch_for_sequential(model)
        for subgraph_index, subgraph in enumerate(subgraphs):
            with disable_offloading():
                # calibration pass: triggers modifier hooks
                for i, batch in enumerate(data_loader):
                    _ = subgraph.forward(model, **inputs)
                # this is where optimization/quantization happens
                LifecycleCallbacks.sequential_epoch_end()
                # propagation pass: rerun with the compressed weights
                with HooksMixin.disable_hooks():
                    for i, batch in enumerate(data_loader):
                        inputs[i] = subgraph.forward(model, **inputs)
```
```python
# Proposed implementation: compression is submitted asynchronously so that
# the next subgraph can be calibrated while previous subgraphs are still
# compressing; the error-propagation pass is dropped
class SequentialPipeline:
    def __call__(...):
        for subgraph_index, subgraph in enumerate(subgraphs):
            with disable_offloading(move_to=GPU), ThreadPoolExecutor(...) as executor:
                for i, batch in enumerate(data_loader):
                    inputs[i] = subgraph.forward(model, **inputs)
                executor.submit(LifecycleCallbacks.sequential_epoch_end)
```
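One way this could be structured so that compression of layer N genuinely overlaps with calibration of layer N+1 is to keep a single long-lived executor across subgraphs. A minimal sketch of that shape, using illustrative names (`layers`, `batches`, `compress_layer`) rather than the real pipeline API:
```python
from concurrent.futures import ThreadPoolExecutor

def asynchronous_pipeline(layers, batches, compress_layer):
    """Calibrate layer N+1 while layer N is still compressing in the background."""
    futures = []
    with ThreadPoolExecutor(max_workers=len(layers)) as executor:
        for layer in layers:
            # calibration: propagate activations through this (still uncompressed) layer
            batches = [layer(batch) for batch in batches]
            # queue compression; it overlaps with calibration of the next layer
            futures.append(executor.submit(compress_layer, layer))
    for future in futures:
        future.result()  # surface any exceptions raised during compression
    return batches
```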