diff --git a/.claude/skills/deployment/SKILL.md b/.claude/skills/deployment/SKILL.md index 45545397aa..e0189cb12e 100644 --- a/.claude/skills/deployment/SKILL.md +++ b/.claude/skills/deployment/SKILL.md @@ -222,6 +222,10 @@ For NEL-managed deployment (evaluation with self-deployment), use the evaluation | `Connection refused` on health check | Server still starting | Wait 30-60s for large models; check logs for errors | | `modelopt_fp4 not supported` | Framework doesn't support FP4 for this model | Check support matrix in `references/support-matrix.md` | +## Unsupported Models + +If the model is not in the validated support matrix (`references/support-matrix.md`), deployment may fail due to weight key mismatches, missing architecture mappings, or quantized/unquantized layer confusion. Read `references/unsupported-models.md` for the iterative debug loop: **run → read error → diagnose → patch framework source → re-run**. For kernel-level issues, escalate to the framework team rather than attempting fixes. + ## Success Criteria 1. Server process is running and healthy (`/health` returns 200) diff --git a/.claude/skills/deployment/references/unsupported-models.md b/.claude/skills/deployment/references/unsupported-models.md new file mode 100644 index 0000000000..5d90331c72 --- /dev/null +++ b/.claude/skills/deployment/references/unsupported-models.md @@ -0,0 +1,70 @@ +# Deploying Unsupported Models + +When deploying a model not in the validated support matrix (`support-matrix.md`), expect failures. This guide covers the iterative debug loop for getting unsupported models running on vLLM, SGLang, or TRT-LLM. + +## Step 1 — Run and collect the error + +Submit the deployment job. When it fails, read the full log — focus on the **first** error traceback (not "See root cause above" wrappers). Identify the file and line number in the framework source. + +## Step 2 — Diagnose the root cause + +Fetch the framework source at the failing line (use `gh api` for the tagged version, or `find` inside the container). Common error categories: + +| Category | Symptoms | Examples | +|----------|----------|----------| +| **Weight key mismatch** | `KeyError`, `Unexpected key`, `Missing key` during weight loading | Checkpoint uses `model.language_model.layers.*` but framework expects `model.layers.*`. See [vllm#39406](https://github.com/vllm-project/vllm/pull/39406) | +| **Quantized/unquantized layer confusion** | Wrong layer type loaded, dtype errors, shape mismatches | Framework tries to load unquantized layers with FP4 kernel due to overly broad `quantization_config.ignore` patterns or missing ignore entries. See [sglang#18937](https://github.com/sgl-project/sglang/pull/18937) | +| **Missing architecture support** | `NoneType is not iterable`, `KeyError` on model type, unknown architecture | Framework's model handler doesn't recognize the text backbone type (e.g., `ministral3` not handled in vLLM's `mistral3.py` init). Fix: extend the model type mapping | +| **Transformers version mismatch** | `ImportError`, `KeyError` on config fields | Framework ships with older transformers that doesn't know the model type. Fix: upgrade transformers after installing the framework | +| **Kernel-level issues** | CUDA errors, `triton` import failures, unsupported ops | Framework lacks kernel support for this model + quantization combo | + +## Step 3 — Apply a targeted fix + +Focus on **small, targeted patches** to the framework source. Do not modify `config.json` or the checkpoint — fix the framework's handling instead. 
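+
+Before patching, pull the exact file you are about to change and re-read the failing line in context. A minimal sketch, assuming `gh` is available; the repo, tag, and file path below are placeholders, so substitute the framework and version from the failing job:
+
+```bash
+# Hypothetical values -- use the framework, tagged version, and file path from the traceback
+REPO=vllm-project/vllm
+TAG=v0.8.4
+FILE=vllm/model_executor/models/mistral3.py
+# Fetch the file at that tag and print the region around the failing line
+gh api "repos/${REPO}/contents/${FILE}?ref=${TAG}" --jq '.content' | base64 -d | sed -n '1,80p'
+```
+
+Reading the tagged version rather than `main` keeps the diagnosis aligned with the code actually running inside the container.
+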
+ +### Weight key mismatches and architecture mapping gaps + +Patch the framework source in the run script using `sed` or a Python one-liner. Keep patches minimal — change only what's needed to unblock the current error. + +```bash +# Example: extend model type mapping in vLLM mistral3.py +FRAMEWORK_FILE=$(find /usr/local/lib -path "*/vllm/model_executor/models/mistral3.py" 2>/dev/null | head -1) +sed -i 's/old_pattern/new_pattern/' "${FRAMEWORK_FILE}" +``` + +> **Tip**: when locating framework source files inside containers, use `find` instead of Python import — some frameworks print log messages to stdout during import that can corrupt captured paths. + +### Speeding up debug iterations (vLLM) + +When iterating on fixes, use these flags to shorten the feedback loop: + +- **`--load-format dummy`** — skip loading actual model weights. Useful for testing whether the model initializes, config is parsed correctly, and weight keys match without waiting for the full checkpoint load. +- **`VLLM_USE_PRECOMPILED=1 pip install --editable .`** — when patching vLLM source directly (instead of `sed`), this rebuilds only Python code without recompiling C++/CUDA extensions. + +### Quantized/unquantized layer confusion + +Check `hf_quant_config.json` ignore patterns against the framework's weight loading logic. The framework may try to load layers listed in `ignore` with quantized kernels, or vice versa. Fix by adjusting the framework's layer filtering logic. + +### Kernel-level issues + +These require framework kernel team involvement. Do NOT attempt to patch kernels. Instead: + +1. Document the exact error (model, format, framework version, GPU type) +2. Inform the user: *"This model + quantization combination requires kernel support that isn't available in {framework} v{version}. I'd suggest reaching out to the {framework} kernel team or trying a different framework."* +3. Suggest trying an alternative framework (vLLM → SGLang → TRT-LLM) + +## Step 4 — Re-run and iterate + +After applying a fix, resubmit the job. Each iteration may reveal a new error (e.g., fixing the init error exposes a weight loading error). Continue the loop: **run → read error → diagnose → patch → re-run**. + +Typical iteration count: 1-3 for straightforward fixes, 3-5 for models requiring multiple patches. + +## Step 5 — Know when to stop + +**Stop patching and escalate** when: + +- The error is in compiled CUDA kernels or triton ops (not Python-level) +- The fix requires changes to core framework abstractions (not just model handlers) +- You've done 5+ iterations without the server starting + +In these cases, inform the user and suggest: trying a different framework, checking for a newer framework version, or filing an issue with the framework team. diff --git a/.claude/skills/ptq/SKILL.md b/.claude/skills/ptq/SKILL.md index 932f62ec2c..6849f8c94d 100644 --- a/.claude/skills/ptq/SKILL.md +++ b/.claude/skills/ptq/SKILL.md @@ -113,6 +113,10 @@ ls -lh / Report the path and size to the user. +### Post-quantization validation + +Validate the exported checkpoint's quantization pattern matches the recipe. Quantization config patterns can silently miss layers if the model uses non-standard naming (e.g., Gemma4 `experts.*` missed by `*mlp*` patterns) — this only surfaces later as deployment failures. Read `references/checkpoint-validation.md` for the validation script, expected patterns per recipe, and common pattern gaps. 
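+
+For a fast sanity check before running the full script, compare the number of scale tensors to the number of weight tensors in the export. A minimal sketch, assuming a sharded safetensors export with an index file; the checkpoint path is a placeholder:
+
+```bash
+CKPT=/path/to/exported-checkpoint   # placeholder -- set to the exported checkpoint directory
+python3 -c "
+import json
+keys = json.load(open('${CKPT}/model.safetensors.index.json'))['weight_map']
+print(sum(k.endswith('weight_scale') for k in keys), 'weight_scale tensors /',
+      sum(k.endswith('.weight') for k in keys), '.weight tensors')
+"
+```
+
+A near-zero scale count after a full-model recipe is an immediate red flag; the reference script does the per-layer analysis.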
+ ## Key API Rules - `mtq.register()` classes **must** define `_setup()` and call it from `__init__` @@ -137,6 +141,7 @@ Report the path and size to the user. | `references/launcher-guide.md` | Step 4B only (launcher path) | | `tools/launcher/CLAUDE.md` | Step 4B only, if you need more launcher detail | | `references/unsupported-models.md` | Step 4C only (unlisted model) | +| `references/checkpoint-validation.md` | Step 5: validate quantization pattern matches recipe | | `skills/common/remote-execution.md` | Step 4A/4C only, if target is remote | | `skills/common/slurm-setup.md` | Step 4A/4C only, if using SLURM manually (not launcher) | | `references/slurm-setup-ptq.md` | Step 4A/4C only, PTQ-specific SLURM (container, GPU sizing, FSDP2) | diff --git a/.claude/skills/ptq/references/checkpoint-validation.md b/.claude/skills/ptq/references/checkpoint-validation.md new file mode 100644 index 0000000000..68d1ddd075 --- /dev/null +++ b/.claude/skills/ptq/references/checkpoint-validation.md @@ -0,0 +1,86 @@ +# Post-Quantization Checkpoint Validation + +Verify the exported checkpoint's quantization pattern matches the recipe used. Quantization config patterns may silently miss layers if the model uses non-standard naming — this only surfaces later as deployment failures when the serving framework tries to load unquantized weights as quantized. + +## Expected quantization patterns by recipe + +| Recipe (`--qformat`) | What should be quantized | What should be excluded | +|----------------------|-------------------------|------------------------| +| `nvfp4` | All linear layers | lm_head, routers, norms, embeddings | +| `nvfp4_mlp_only` | MLP layers (including MoE experts) | Attention layers, lm_head, routers | +| `nvfp4_experts_only` | MoE expert layers only | Dense MLP, attention, lm_head, routers | +| `nvfp4_omlp_only` | MLP + o_proj layers | Other attention layers, lm_head, routers | +| `fp8` | All linear layers | lm_head, norms, embeddings | +| `int4_awq` | All linear layers | lm_head, norms, embeddings | + +## Validation script + +Run against the exported checkpoint to check every linear layer is either quantized (has scale params) or explicitly excluded: + +```bash +python3 -c " +import json, fnmatch + +output = '' +idx = json.load(open(f'{output}/model.safetensors.index.json')) +cfg = json.load(open(f'{output}/hf_quant_config.json')) +excludes = cfg['quantization']['exclude_modules'] + +all_keys = set(idx['weight_map'].keys()) +# Identify linear weight params (skip norms, embeddings, scalars, scales) +skip_suffixes = ('_scale', '_scale_2', 'layernorm', 'layer_norm', 'norm.weight', 'embed', 'scalar') +linear_weights = sorted(k for k in all_keys + if k.endswith('.weight') and not any(s in k.lower() for s in skip_suffixes)) + +# Check which have quantization scales +quantized, excluded, unexpected = [], [], [] +for w in linear_weights: + base = w.rsplit('.weight', 1)[0] + has_scales = any(f'{base}.{s}' in all_keys for s in ['weight_scale', 'input_scale']) + is_excluded = any(fnmatch.fnmatch(w, p) or fnmatch.fnmatch(base, p) for p in excludes) + + if has_scales: + quantized.append(w) + elif is_excluded: + excluded.append(w) + else: + unexpected.append(w) + +print(f'Quantized layers: {len(quantized)}') +print(f'Excluded layers (in exclude_modules): {len(excluded)}') +if unexpected: + print(f'\nWARNING: {len(unexpected)} layers have NO scales and are NOT in exclude list:') + # Group by module type for readability + groups = {} + for w in unexpected: + parts = w.split('.') + module_type = next((p 
for p in parts if p in + ('self_attn', 'mlp', 'experts', 'router', 'lm_head', 'embed_tokens', 'vision_tower')), 'other') + groups.setdefault(module_type, []).append(w) + for mtype, weights in sorted(groups.items()): + print(f' {mtype}: {len(weights)} weights (e.g., {weights[0]})') + print() + print('These layers were silently skipped during quantization.') + print('Likely cause: quantization config patterns did not match these module names.') + print('This WILL cause deployment failures (framework loads them as quantized but they are BF16).') + print('Fix: add missing patterns to the config, or add to exclude_modules if intentionally unquantized.') +else: + print('\nAll layers are either quantized or explicitly excluded. Checkpoint is consistent.') +" +``` + +## Common pattern gaps + +Layers silently skipped because the quantization config patterns don't match the model's naming: + +| Model | Module path | Missed by pattern | Fix | +|-------|-------------|-------------------|-----| +| Gemma4 MoE | `layers.N.experts.*` | `*mlp*`, `*block_sparse_moe*` | Add `*.experts.*` (PR #1219) | +| Custom MoE | `layers.N.moe_block.experts.*` | `*mlp*` | Add matching pattern | +| VLM projector | `multi_modal_projector.*` | — | Usually excluded; verify | + +## What to do when warnings appear + +- **Layers should have been quantized** (e.g., MoE experts with `nvfp4_mlp_only`): the quantization config patterns missed them. Fix by adding the missing pattern to the config and re-running PTQ. Check if ModelOpt already has a plugin for the model in `modelopt/torch/quantization/plugins/huggingface.py`. + +- **Layers are intentionally unquantized** (e.g., attention layers with `nvfp4_mlp_only`): they should be in the `exclude_modules` list but the export didn't add them. Add them manually to both `hf_quant_config.json` and `config.json` `quantization_config.ignore` in the checkpoint to prevent deployment failures. diff --git a/.claude/skills/ptq/references/unsupported-models.md b/.claude/skills/ptq/references/unsupported-models.md index ab59cbf2e4..1a198f3e88 100644 --- a/.claude/skills/ptq/references/unsupported-models.md +++ b/.claude/skills/ptq/references/unsupported-models.md @@ -347,4 +347,5 @@ tokenizer.save_pretrained(output_path) - **Check quantizer summary**: `mtq.print_quant_summary(model)` shows which quantizers are enabled/disabled - **Inspect dtypes**: After loading, iterate `model.named_parameters()` and check for unexpected FP8 tensors - **Watch for silent disabling**: A misconfigured wildcard pattern can silently disable quantizers — always verify the summary +- **Validate quantization pattern after export**: Run the validation script from SKILL.md Step 5 on the exported checkpoint. It checks every linear layer is either quantized (has scale params) or explicitly excluded. Layers that are neither were silently skipped — common for models with non-standard naming (e.g., Gemma4 `experts.*` missed by `*mlp*` patterns). This causes deployment failures when the framework tries to load BF16 weights as quantized - **Read pip errors carefully**: `ResolutionImpossible` means dependency conflict (try `--no-deps`), NOT network failure. Check for `Connection refused`/`Name resolution failed` before concluding network is down
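+- **Inspect the exclude list directly**: a quick check (assuming the export wrote `hf_quant_config.json` next to the weights): run `python3 -c "import json; print(json.load(open('hf_quant_config.json'))['quantization']['exclude_modules'])"` from the checkpoint directory, then compare the printed patterns against any layers the validation script flags as neither quantized nor excluded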