Can't quantize kv cache: observer = self.k_observers[layer_idx] list index out of range #1295

@DreamGenX

Description

Describe the bug
Using this recipe:

quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head", "re:.*layers.0..*", "re:.*layers.79..*"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: channel
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: token
                        dynamic: true
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true

Running this recipe results in the error below. If I remove the kv_cache_scheme section, quantization succeeds.
The model I am quantizing is Llama 3.3 70B.
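For reference, here is roughly how the recipe is applied (a minimal sketch, assuming the standard llmcompressor oneshot entry point; the model id, calibration dataset, and sample counts are illustrative placeholders, not my exact setup):

from transformers import AutoModelForCausalLM
from llmcompressor import oneshot

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"  # placeholder model id

# device_map="auto" shards the 70B model across both GPUs via accelerate,
# consistent with the accelerate hooks visible in the traceback below.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

# recipe.yaml holds the quant_stage recipe shown above, kv_cache_scheme included.
oneshot(
    model=model,
    recipe="recipe.yaml",
    dataset="open_platypus",      # placeholder calibration dataset
    max_seq_length=2048,
    num_calibration_samples=512,
)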

Expected behavior
Quantization completes and produces a model with a quantized KV cache, just as the recipe runs to completion when kv_cache_scheme is removed.

Environment

  1. OS:
  2. Python version: 3.12 (per the traceback paths)
  3. LLM Compressor version or commit hash: 0.4.1
  4. ML framework version(s):
  5. Other Python package versions (vLLM, compressed-tensors, numpy, ONNX):
  6. Other relevant environment information (hardware, CUDA version): 2x H100 SXM

To Reproduce
Run the recipe above against Llama 3.3 70B.

Errors
Full traceback:

  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 112, in on_initialize
    self._calibrate_if_possible(module)
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 266, in _calibrate_if_possible
    self._calibrate(module)
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/quantization/base.py", line 314, in _calibrate
    run_calibration_forward(
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/utils/pytorch_helpers.py", line 82, in run_calibration_forward
    forward_fn(batch, module=model)
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/pytorch/utils/helpers.py", line 394, in tensors_module_forward
    return module(**tensors)
           ^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 853, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 601, in forward
    layer_outputs = decoder_layer(
                    ^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 343, in forward
    hidden_states, self_attn_weights = self.self_attn(
                                       ^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl
    return inner()
           ^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1793, in inner
    result = forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/accelerate/hooks.py", line 176, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 287, in forward
    key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/cache.py", line 93, in update
    q_key_states = self._quantize(
                   ^^^^^^^^^^^^^^^
  File "/root/venv/lib/python3.12/site-packages/llmcompressor/modifiers/quantization/cache.py", line 144, in _quantize
    observer = self.k_observers[layer_idx]
               ~~~~~~~~~~~~~~~~^^^^^^^^^^^
IndexError: list index out of range

Additional context
A guess at what is going wrong, in case it helps triage:
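This is speculation, not a confirmed diagnosis. The recipe ignores the first and last decoder layers (re:.*layers.0..*, re:.*layers.79..*), so the quantized KV cache may only allocate k_observers/v_observers for the 78 non-ignored layers, while transformers passes the absolute decoder layer index (0 through 79) into past_key_value.update(). Indexing a 78-entry list with an absolute index would then overflow exactly where the traceback shows. A toy illustration of that mismatch (every name below is mine, not llmcompressor's):

# Hypothetical sketch of the suspected mismatch; not actual llmcompressor code.
num_layers = 80                # Llama 3.3 70B has 80 decoder layers
ignored = {0, 79}              # layers excluded by the recipe's ignore list

# Suppose observers are created only for the 78 non-ignored layers...
k_observers = [f"observer_{i}" for i in range(num_layers) if i not in ignored]

# ...while the cache is indexed with the absolute layer index from transformers.
for layer_idx in range(num_layers):
    try:
        observer = k_observers[layer_idx]  # mirrors cache.py line 144
    except IndexError:
        # fires for layer_idx 78 and 79
        print(f"IndexError: list index out of range at layer_idx={layer_idx}")

If this is the cause, the fix would be either one observer slot per decoder layer regardless of the ignore list, or a mapping from absolute layer index to observer slot inside update().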

Labels: bug