Conversation
```python
# Cache loss_mask for each parent module, one mask per batch
_loss_masks: list[torch.Tensor | None] = PrivateAttr(
    default_factory=list
)
```
Should this be `None` if the user isn't using loss masks?
```python
num_elements = 0

# Compute the MSE loss for each batch
for fp16_batch, int_w_batch in zip(fp16_outputs, int_w_outputs):
```
See the changes in vllm-project#2188, which will land soon.
I suspect it will make more sense to apply the mask in run_samples and the concatenated fp16_output calculation, rather than in the loss calculation, if possible.
Sure, I'll make that change
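A minimal sketch of what that reordering might look like (hypothetical names; assumes per-batch outputs of shape `[batch, seq_len, hidden]` and masks of shape `[batch, seq_len]`): mask each batch's output before concatenation, so the loss computation itself needs no mask handling:

```python
import torch

def collect_masked_outputs(outputs, masks):
    # Keep only the masked positions from each batch's output, then
    # concatenate, so the downstream MSE loss sees only valid tokens.
    # outputs[i]: [batch, seq_len, hidden]; masks[i]: [batch, seq_len]
    return torch.cat([out[m.bool()] for out, m in zip(outputs, masks)])

# Usage with toy shapes: 2 batches of [2, 4, 8] outputs
outs = [torch.randn(2, 4, 8), torch.randn(2, 4, 8)]
masks = [torch.tensor([[1, 1, 0, 0], [0, 1, 1, 0]]),
         torch.tensor([[1, 0, 0, 0], [1, 1, 1, 1]])]
flat = collect_masked_outputs(outs, masks)  # shape: [9, 8] (9 valid tokens)
```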
```python
# Context variable to store the current batch's loss_mask for hooks to access
_current_loss_mask: ContextVar[torch.Tensor | None] = ContextVar(
    "_current_loss_mask", default=None
)
```
Do the loss masks change from sample to sample, or are they largely constant? I think this approach is fine if they change a lot, but if they tend to be constant we could potentially do something different, where we just alter the AWQ modifier to take the loss mask into account directly.
I assume the chat template is usually going to be pretty consistent, so that may make more sense.
Also wondering if the loss mask is usually a step function with a single edge; it may make more sense to store just the edge rather than the entire mask.
I think we'd like to generalize masks to be fully expressive, to include things like padding tokens. This shouldn't be too much memory, just num_samples * seq_len * bool ~= 1mb, or 8b if you don't want to offload and instead keep as a tensor.
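To make the memory trade-off concrete, here is a rough sketch (assumed sizes: 512 samples, 2048-token sequences) comparing full boolean masks against storing only a single edge index per sample, as suggested above:

```python
import torch

num_samples, seq_len = 512, 2048  # assumed calibration set size

# Full boolean masks: torch stores 1 byte per bool element
full_masks = torch.zeros(num_samples, seq_len, dtype=torch.bool)
bytes_full = full_masks.numel() * full_masks.element_size()
print(f"full boolean masks: {bytes_full / 2**20:.1f} MiB")  # 1.0 MiB

# If each mask were a single-edge step function, one index per sample
# would suffice (argmax returns int64, hence 8 bytes per sample)
edges = full_masks.int().argmax(dim=1)  # first nonzero position per sample
bytes_edges = edges.numel() * edges.element_size()
print(f"edge indices only: {bytes_edges / 2**10:.1f} KiB")  # 4.0 KiB
```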
```python
__all__ = ["SequentialPipeline", "_current_loss_mask"]


# Context variable to store the current batch's loss_mask for hooks to access
_current_loss_mask: ContextVar[torch.Tensor | None] = ContextVar(
    "_current_loss_mask", default=None
)
```
I think, to work better within the LLM Compressor framework, we should store this variable on the State.
```python
# Set loss_mask in context variable if enabled, so hooks can access it
if dataset_args.use_loss_mask:
    loss_mask_dict = activations.fetch(batch_idx, ["loss_mask"])
```
How did the "loss_mask" argument end up in the activations cache?
It's probably better if we implement a calculate_token_mask which gets called just once per batch with the model inputs (so it can use things like the attention mask). This way, we also don't need to continuously offload/onload the values.
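A sketch of what such a helper could look like (`calculate_token_mask` is the hypothetical name from the comment above; the batch keys are assumptions): compute the mask once per batch from the model inputs, combining it with the attention mask so padding tokens are excluded too:

```python
import torch

def calculate_token_mask(batch: dict) -> torch.Tensor:
    # Hypothetical one-shot mask builder: called once per batch with the
    # model inputs, so it can fold in the attention mask (padding tokens).
    attn = batch["attention_mask"].bool()  # [batch, seq_len]
    loss_mask = batch.get("loss_mask")
    if loss_mask is None:
        return attn
    return attn & loss_mask.bool()

# Usage: last token is padding, first token is a prompt token
batch = {
    "attention_mask": torch.tensor([[1, 1, 1, 0]]),
    "loss_mask": torch.tensor([[0, 1, 1, 1]]),
}
mask = calculate_token_mask(batch)  # [[False, True, True, False]]
```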
```python
activations = IntermediatesCache.from_dataloader(
    dataloader, model_device, offload_device=offload_device
)
```
will automatically save all the columns from the dataloader.
```python
# mask shape: [batch, seq_len]
# output shape: [batch, seq_len, hidden_dim]
# Flatten both to [batch * seq_len, hidden_dim] and [batch * seq_len]
fp16_flat = fp16_batch.flatten(0, -2)  # [batch * seq_len, hidden_dim]
```
Is all this logic actually required? I would assume that you don't need to do any flattening; instead, just use something like masked_scatter.
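For reference, boolean indexing already broadcasts over the hidden dimension, so the explicit flatten can likely be dropped; a minimal sketch with assumed shapes:

```python
import torch
import torch.nn.functional as F

fp16_batch = torch.randn(2, 5, 8)   # [batch, seq_len, hidden_dim]
int_w_batch = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 0, 0, 0],
                     [0, 1, 1, 1, 0]], dtype=torch.bool)  # [batch, seq_len]

# Indexing a [batch, seq_len, hidden] tensor with a [batch, seq_len] bool
# mask yields [num_masked_tokens, hidden] directly; no flatten required.
loss = F.mse_loss(fp16_batch[mask], int_w_batch[mask])
```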
My big points are
SUMMARY:
"please provide a brief summary"
TEST PLAN:
"please outline how the changes were tested"