[studio] Fix VLM detection for transformers v5 by danielhanchen · Pull Request #17 · unslothai/unsloth-staging-1

danielhanchen · 2026-04-06T14:00:29Z

Staging mirror of unslothai#4868

Original PR: unslothai#4868
Author: Datta0

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.

Original description

Fixes: unslothai#4859

This PR contains code changes only (1 files). Test changes are in a separate PR.

Changed files:

studio/backend/utils/models/model_config.py

danielhanchen · 2026-04-06T14:01:06Z

/gemini review

danielhanchen · 2026-04-06T14:01:07Z

@codex review

gemini-code-assist

Code Review

This pull request enhances the Vision-Language Model (VLM) detection logic within the is_vision_model function. It introduces a more robust multi-stage detection process that includes in-process configuration checking, a version-specific subprocess check, and a final fallback to raw metadata inspection via _load_model_config_metadata. Additionally, the _is_vlm_config helper was added to unify the identification of VLM characteristics across different configuration formats, and error handling in the subprocess check was updated to return None to facilitate the new fallback mechanism. I have no feedback to provide.

chatgpt-codex-connector · 2026-04-06T14:05:17Z

Codex Review: Didn't find any major issues. More of your lovely PRs please.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

danielhanchen

Thank you for the PR! The goal of this PR is to fix VLM detection for models that require transformers v5 (like Qwen3.5 and Gemma4) in Unsloth Studio. As a summary, this PR refactors the vision detection logic by introducing a unified _is_vlm_config() helper, a raw config.json fallback via _load_model_config_metadata(), and changes _is_vision_model_subprocess() to return Optional[bool] (None on transient failures) to enable the multi-tier fallback chain.

Review Loop 1 - Consolidated Findings

Reviewers	Severity	Finding
7/7	High	Merge conflict resolution dropped main's vision-detection cache (`_vision_detection_cache`, `_token_fingerprint`, `_vision_cache_lock`, `_is_vision_model_uncached()`) from unslothai#4853/unslothai#4855. Every repeated `is_vision_model()` call now re-runs `load_model_config()` and may re-spawn the 60s subprocess.
2/7	Medium	For transformers-5-only models (Qwen3.5, Gemma4), the new flow tries the known-failing in-process `load_model_config()` before falling back to subprocess/raw-config. This adds avoidable latency and warning log noise for the exact models being fixed.
1/7	High	`_is_vlm_config()` false-positives on seq2seq models (T5, BART, mBART, Pegasus, etc.) because `ForConditionalGeneration` suffix matches both VLMs and seq2seq. Only `csm` and `whisper` are excluded.
1/7	Medium	`_VISION_CHECK_SCRIPT` subprocess still uses old suffix-only heuristic without audio-only exclusion, so subprocess and in-process detection can disagree (e.g. Whisper returns `is_vision: true` in subprocess).
1/7	Medium	`_load_model_config_metadata()` does not call `resolve_cached_repo_id_case()`, so different casings of the same repo can bypass HF cache and produce duplicate downloads.
1/7	Low	Type annotation `dict[str, Any]` (Python 3.9+) used in `_load_model_config_metadata` while rest of file uses `Dict[str, Any]` from typing.

Concrete suggestions for each finding below.

Fix 1: Restore caching layer (7/7 consensus)

The PR's new detection logic should be placed inside _is_vision_model_uncached() while keeping main's cache wrapper. This is the correct merge resolution.

Fix 2: Subprocess/fallback ordering for t5 models (2/7)

For needs_transformers_5 models, use subprocess first (as main did originally), then fall back to raw config.json. Do not try the known-failing in-process AutoConfig load.

Fix 3: Narrow `_is_vlm_config()` to exclude seq2seq (1/7)

Add _NON_VLM_CONDITIONAL_MODEL_TYPES set for t5, bart, mbart, mt5, pegasus, blenderbot to prevent false positives.

Fix 4: Update `_VISION_CHECK_SCRIPT` (1/7)

Add audio-only exclusion logic to the subprocess script to match _is_vlm_config().

Fix 5: Use `resolve_cached_repo_id_case()` in `_load_model_config_metadata` (1/7)

Fix 6: Use `Dict[str, Any]` instead of `dict[str, Any]` for consistency

Properly combines PR unslothai#4868's new detection logic with main's caching layer: - Add _is_vlm_config() helper for unified VLM detection across config types - Add _load_model_config_metadata() as raw config.json fallback when both AutoConfig and subprocess fail for transformers-5-only models - For needs_transformers_5 models: subprocess first, then raw config fallback - Add _NON_VLM_CONDITIONAL_MODEL_TYPES to prevent seq2seq false positives - Update _VISION_CHECK_SCRIPT with audio-only and seq2seq exclusion - Preserve vision detection cache and error classification from main

danielhanchen · 2026-04-06T14:16:50Z

/gemini review

danielhanchen · 2026-04-06T14:16:51Z

@codex review

chatgpt-codex-connector · 2026-04-06T14:17:02Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

gemini-code-assist

Code Review

This pull request refactors VLM detection logic by centralizing it into a helper function and adding a fallback to raw config metadata when subprocess checks fail. The reviewer identified logic duplication between the module and subprocess script that should be addressed to ensure consistency, and recommended restoring detailed logging in the main detection path to maintain observability.

gemini-code-assist · 2026-04-06T14:19:16Z

+    audio_only_types = {"csm", "whisper"}
+    non_vlm_seq2seq_types = {"bart", "mbart", "t5", "mt5", "pegasus",
+                              "blenderbot", "blenderbot-small"}
+
    is_vlm = False
-    if hasattr(config, "architectures"):
-        is_vlm = any(
-            x.endswith(("ForConditionalGeneration", "ForVisionText2Text"))
-            for x in config.architectures
-        )
-    if not is_vlm and hasattr(config, "vision_config"):
-        is_vlm = True
-    if not is_vlm and hasattr(config, "img_processor"):
-        is_vlm = True
-    if not is_vlm and hasattr(config, "image_token_index"):
-        is_vlm = True
-    if not is_vlm and hasattr(config, "model_type"):
-        vlm_types = {"phi3_v","llava","llava_next","llava_onevision",
-                      "internvl_chat","cogvlm2","minicpmv"}
-        if config.model_type in vlm_types:
+    if model_type not in audio_only_types and model_type not in non_vlm_seq2seq_types:
+        if hasattr(config, "architectures"):
+            is_vlm = any(
+                x.endswith(("ForConditionalGeneration", "ForVisionText2Text"))
+                for x in config.architectures
+            )
+        if not is_vlm and hasattr(config, "vision_config"):
+            is_vlm = True
+        if not is_vlm and hasattr(config, "img_processor"):
+            is_vlm = True
+        if not is_vlm and hasattr(config, "image_token_index"):
+            is_vlm = True
+        if not is_vlm and model_type in {
+            "phi3_v", "llava", "llava_next", "llava_onevision",
+            "internvl_chat", "cogvlm2", "minicpmv",
+        }:
            is_vlm = True


The VLM detection logic and model type sets are duplicated inside the _VISION_CHECK_SCRIPT string. This creates a maintenance risk where updates to the module-level constants (like _AUDIO_ONLY_MODEL_TYPES, _NON_VLM_CONDITIONAL_MODEL_TYPES, or _VLM_MODEL_TYPES) might not be reflected in the subprocess check. Consider using string formatting to inject these constants into the script template to ensure consistency across both detection paths.

gemini-code-assist · 2026-04-06T14:19:16Z


+    try:
+        config = load_model_config(model_name, use_auth = True, token = hf_token)
+        return _is_vlm_config(config)


The refactoring of _is_vision_model_uncached removed detailed logging that explained why a model was detected as a VLM. While the new _is_vlm_config helper improves code reuse, restoring logging here would maintain observability for the standard detection path, matching the logging added to the subprocess fallback path at line 796.

is_vlm = _is_vlm_config(config) if is_vlm: logger.info( "Model %s detected as VLM: model_type=%s architectures=%s", model_name, getattr(config, "model_type", None), getattr(config, "architectures", []), ) return is_vlm

…ion markers - Remove ForConditionalGeneration from VLM arch suffix heuristic. This caused false positives for seq2seq models (T5, BART, mBART, UMT5, M2M100, Switch Transformers, LongT5, Pegasus, Blenderbot) - Rely on explicit vision signals: vision_config, img_processor, image_token_index, ForVisionText2Text suffix, and known model types - Remove _NON_VLM_CONDITIONAL_MODEL_TYPES (no longer needed) - Add _classify_detection_error helper for permanent vs transient error classification, now used in both t5 fallback and non-t5 except paths - _load_model_config_metadata now returns (data, error) tuple so callers can classify permanent errors (repo not found, gated) for caching - Update _VISION_CHECK_SCRIPT subprocess to match new detection logic

danielhanchen

Review Loop 2 - Consolidated Findings

After loop 1 fixes, ran 7 reviewer.py reviews + 3 Sonnet subagents + gemini_review. Results: 4/7 APPROVE, 3/7 REQUEST_CHANGES.

Reviewers	Severity	Finding
1/7	High	ForConditionalGeneration heuristic still false-positives on UMT5, M2M100, Switch Transformers, LongT5 -- the exclusion set approach is fundamentally fragile
1/7	High	Seq2seq exclusion can suppress real multimodal configs that use a base model_type (e.g., T5-based VLMs with vision_config)
2/7	Medium	t5 raw-config fallback path returns None for permanent errors (RepositoryNotFoundError, GatedRepoError), preventing caching

Fixes Applied (committed)

Removed ForConditionalGeneration from VLM detection entirely. Instead of maintaining an ever-growing exclusion list, we now rely on explicit vision signals only:
- vision_config (most VLMs: LLaVA, Gemma-3/4, Qwen-VL, etc.)
- img_processor (Phi-3.5 Vision)
- image_token_index (common in VLMs)
- ForVisionText2Text architecture suffix (VLM-specific)
- Known _VLM_MODEL_TYPES set
This eliminates all seq2seq false positives while still detecting all real VLMs. Models with T5 base but vision capabilities (like Pix2Struct) are still correctly detected via their vision_config.
Removed _NON_VLM_CONDITIONAL_MODEL_TYPES -- no longer needed since we don't use ForConditionalGeneration.
Added _classify_detection_error() helper to classify permanent vs transient errors in both the t5 fallback and non-t5 paths. _load_model_config_metadata now returns (data, error) tuple so the t5 path can cache permanent failures.
Updated _VISION_CHECK_SCRIPT subprocess to match the new detection logic.

Test Results

33 simulation tests covering VLMs, audio-only, seq2seq, edge cases: all pass
8 subprocess script simulation tests: all pass
UMT5, M2M100, Switch Transformers, LongT5: all correctly classified as non-VLM
T5 with vision_config: correctly detected as VLM (vision signals take priority)

danielhanchen · 2026-04-06T14:30:03Z

/gemini review

danielhanchen · 2026-04-06T14:30:04Z

@codex review

chatgpt-codex-connector · 2026-04-06T14:30:16Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

…ration Since ForConditionalGeneration was removed as a VLM signal, known VLM model types that use it need to be in _VLM_MODEL_TYPES as a safety net. In practice, all these models expose vision_config in their real HF configs so they would be detected regardless, but the model_type list ensures compatibility with simplified mock configs (test_vision_cache.py) and edge cases. Added: gemma3, gemma4, qwen2_vl, qwen2_5_vl, qwen3_5, paligemma, pix2struct, video_llava, blip-2, blip_2, idefics2, idefics3, mllama, chameleon. Also updated subprocess script to match.

gemini-code-assist

Code Review

This pull request refactors VLM detection logic in model_config.py by adding a fallback to raw config metadata and implementing error classification for improved caching. Review feedback recommends simplifying the detection logic's boolean structure for better maintainability and restoring docstring details regarding the function's return values and transient failure handling.

I am having trouble creating individual review comments. Click here to see my feedback.

studio/backend/utils/models/model_config.py (524-543)

This logic for VLM detection can be improved for readability and maintainability.

The chain of if not is_vlm checks can be simplified into a single boolean expression.
The audio_only_types set and the VLM model types set are duplicates of module-level constants (_AUDIO_ONLY_MODEL_TYPES and _VLM_MODEL_TYPES). To avoid inconsistencies, consider making _VISION_CHECK_SCRIPT an f-string and injecting these constants.

Here's a suggestion that addresses the first point:

    audio_only_types = {"csm", "whisper"}

    is_vlm = False
    if model_type not in audio_only_types:
        is_vlm = (
            hasattr(config, "vision_config") or
            hasattr(config, "img_processor") or
            hasattr(config, "image_token_index") or
            (hasattr(config, "architectures") and any(
                x.endswith("ForVisionText2Text")
                for x in config.architectures
            )) or
            model_type in {
                "phi3_v", "llava", "llava_next", "llava_onevision",
                "internvl_chat", "cogvlm2", "minicpmv",
            }
        )

studio/backend/utils/models/model_config.py (560-563)

The removed part of the docstring explaining the return value (Optional[bool]) is important for understanding the function's contract. It clarifies that None indicates a transient failure, which is crucial for the caller's logic. Please consider adding this information back to maintain clarity.

- Remove FileNotFoundError from permanent error classification to avoid stale negative caching for local model dirs still being set up - Return (None, None) instead of (None, FileNotFoundError) for local paths missing config.json -- absence is unknown state, not an error - Add xgenmm and qwen3_vl to _VLM_MODEL_TYPES for completeness

danielhanchen

Review Loop 3 - Consolidated Findings

Ran 7 reviewer.py reviews + 3 Sonnet subagents + gemini_review. Results: 4/7 APPROVE, 3/7 REQUEST_CHANGES.

Reviewers	Severity	Finding
3/7	High	FileNotFoundError in _classify_detection_error was caching local path misses as permanent False -- stale for in-progress model dirs
2/7	High	_VLM_MODEL_TYPES missing xgenmm and qwen3_vl

Fixes Applied

Removed FileNotFoundError from permanent error classification. Local paths without config.json now return (None, None) instead of (None, FileNotFoundError), treating absence as unknown rather than a cacheable failure.
Added xgenmm and qwen3_vl to _VLM_MODEL_TYPES and subprocess script.

All 3 loops complete. Summary of changes from original PR:

Properly merged with main's caching layer (is_vision_model -> cache -> _is_vision_model_uncached)
Removed ForConditionalGeneration as a VLM signal (caused seq2seq false positives)
Added comprehensive _VLM_MODEL_TYPES as safety net for known VLMs
Added _is_vlm_config() for unified detection across config types (dict and object)
Added _load_model_config_metadata() as raw config.json fallback for t5 models
Added _classify_detection_error() for consistent permanent/transient error handling
Updated _VISION_CHECK_SCRIPT subprocess to match
Preserved all error classification and caching semantics from main

danielhanchen · 2026-04-06T14:48:20Z

/gemini review

danielhanchen · 2026-04-06T14:48:21Z

@codex review

chatgpt-codex-connector · 2026-04-06T14:48:31Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

gemini-code-assist

Code Review

This pull request refactors the VLM detection logic by centralizing it into a new _is_vlm_config helper and introducing a fallback mechanism that retrieves raw config.json metadata when AutoConfig fails. It also expands the list of supported VLM model types and improves error classification for caching purposes. Review feedback suggests removing qwen3_5 from the VLM list to prevent incorrect detection of standard LLMs and recommends using the centralized helper within the detection script to eliminate code duplication.

gemini-code-assist · 2026-04-06T14:53:54Z

+    "gemma4",
+    "qwen2_vl",
+    "qwen2_5_vl",
+    "qwen3_5",


qwen3_5 appears to be the model type for the Qwen 3.5 LLM family (e.g., the MoE models), not specifically a VLM. Including it in _VLM_MODEL_TYPES will cause standard LLMs to be incorrectly detected as vision models. The VLM variant is typically identified as qwen3_vl (which is already in the list).

Suggested change

"qwen3_5",

"paligemma",

gemini-code-assist · 2026-04-06T14:53:55Z

+    model_type = getattr(config, "model_type", None)
+    audio_only_types = {"csm", "whisper"}
+
    is_vlm = False
-    if hasattr(config, "architectures"):
-        is_vlm = any(
-            x.endswith(("ForConditionalGeneration", "ForVisionText2Text"))
-            for x in config.architectures
-        )
-    if not is_vlm and hasattr(config, "vision_config"):
-        is_vlm = True
-    if not is_vlm and hasattr(config, "img_processor"):
-        is_vlm = True
-    if not is_vlm and hasattr(config, "image_token_index"):
-        is_vlm = True
-    if not is_vlm and hasattr(config, "model_type"):
-        vlm_types = {"phi3_v","llava","llava_next","llava_onevision",
-                      "internvl_chat","cogvlm2","minicpmv"}
-        if config.model_type in vlm_types:
+    if model_type not in audio_only_types:
+        if hasattr(config, "vision_config"):
+            is_vlm = True
+        if not is_vlm and hasattr(config, "img_processor"):
+            is_vlm = True
+        if not is_vlm and hasattr(config, "image_token_index"):
+            is_vlm = True
+        if not is_vlm and hasattr(config, "architectures"):
+            is_vlm = any(
+                x.endswith("ForVisionText2Text")
+                for x in config.architectures
+            )
+        if not is_vlm and model_type in {
+            "phi3_v", "llava", "llava_next", "llava_onevision",
+            "internvl_chat", "cogvlm2", "minicpmv", "gemma3", "gemma4",
+            "qwen2_vl", "qwen2_5_vl", "qwen3_5", "paligemma",
+            "pix2struct", "video_llava", "blip-2", "blip_2",
+            "idefics2", "idefics3", "mllama", "chameleon",
+            "xgenmm", "qwen3_vl",
+        }:
            is_vlm = True


The VLM detection logic and model lists are duplicated inside _VISION_CHECK_SCRIPT. Since backend_dir is added to sys.path, the script can import and use the centralized _is_vlm_config helper. This improves maintainability and ensures consistency between in-process and subprocess detection.

Suggested change

model_type = getattr(config, "model_type", None)

audio_only_types = {"csm", "whisper"}

is_vlm = False

if hasattr(config, "architectures"):

is_vlm = any(

x.endswith(("ForConditionalGeneration", "ForVisionText2Text"))

for x in config.architectures

)

if not is_vlm and hasattr(config, "vision_config"):

is_vlm = True

if not is_vlm and hasattr(config, "img_processor"):

is_vlm = True

if not is_vlm and hasattr(config, "image_token_index"):

is_vlm = True

if not is_vlm and hasattr(config, "model_type"):

vlm_types = {"phi3_v","llava","llava_next","llava_onevision",

"internvl_chat","cogvlm2","minicpmv"}

if config.model_type in vlm_types:

if model_type not in audio_only_types:

if hasattr(config, "vision_config"):

is_vlm = True

if not is_vlm and hasattr(config, "img_processor"):

is_vlm = True

if not is_vlm and hasattr(config, "image_token_index"):

is_vlm = True

if not is_vlm and hasattr(config, "architectures"):

is_vlm = any(

x.endswith("ForVisionText2Text")

for x in config.architectures

)

if not is_vlm and model_type in {

"phi3_v", "llava", "llava_next", "llava_onevision",

"internvl_chat", "cogvlm2", "minicpmv", "gemma3", "gemma4",

"qwen2_vl", "qwen2_5_vl", "qwen3_5", "paligemma",

"pix2struct", "video_llava", "blip-2", "blip_2",

"idefics2", "idefics3", "mllama", "chameleon",

"xgenmm", "qwen3_vl",

}:

is_vlm = True

from utils.models.model_config import _is_vlm_config

is_vlm = _is_vlm_config(config)

model_type = getattr(config, "model_type", None)

- Add image_token_id as an explicit vision signal in _is_vlm_config and subprocess script (used by SmolVLM-style configs) - Add gemma3n, qwen3_vl_moe, smolvlm to _VLM_MODEL_TYPES - Keep subprocess script model type list in sync

danielhanchen

Review Loop 4 - Consolidated Findings

Ran 7 reviewer.py reviews + 3 Sonnet subagents + gemini_review. Results: 5/7 APPROVE, 2/7 REQUEST_CHANGES.

Reviewers	Severity	Finding
2/7	High	Missing VLM model types (gemma3n, qwen3_vl_moe, smolvlm) and missing image_token_id as a vision signal

Fixes Applied

Added image_token_id as an explicit vision signal in both _is_vlm_config() and _VISION_CHECK_SCRIPT. This is used by SmolVLM-style configs that have image_token_id instead of image_token_index.
Added gemma3n, qwen3_vl_moe, smolvlm to _VLM_MODEL_TYPES and subprocess script.

Verification

SmolVLM2 (model_type=smolvlm) has both vision_config and image_token_id -- detected via vision_config, image_token_id now also serves as a secondary signal.
All previously verified models (Qwen3.5, Gemma4, Whisper, T5, etc.) remain correctly classified.

These VLMs have no explicit vision signals (vision_config, img_processor, image_token_index, image_token_id) in their config.json, relying solely on model_type for detection.

danielhanchen · 2026-04-06T15:06:22Z

/gemini review

danielhanchen · 2026-04-06T15:06:23Z

@codex review

gemini-code-assist · 2026-04-06T15:06:26Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

chatgpt-codex-connector · 2026-04-06T15:06:35Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

…ation - _is_vlm_config now checks values are not None, not just key presence. Prevents false positives from configs with "vision_config": null - _VISION_CHECK_SCRIPT updated to use getattr with None checks - _classify_detection_error now treats EntryNotFoundError and RevisionNotFoundError as permanent (cacheable as False)

danielhanchen

Review Loop 5 - Consolidated Findings

Ran 7 reviewer.py reviews + 3 Sonnet subagents + gemini_review. Results: 5/7 APPROVE, 2/7 REQUEST_CHANGES.

Reviewers	Severity	Finding
1/7	High	Null placeholder vision fields (e.g. "vision_config": null) treated as positive VLM signal -- key presence check should verify value is not None
1/7	Medium	EntryNotFoundError/RevisionNotFoundError not in permanent error classification -- causes repeated probes for repos missing config.json

Fixes Applied

Null-safe vision field checks: _is_vlm_config() now retrieves field values and checks is not None instead of just key/attribute presence. Prevents false positives from configs with placeholder null fields. Updated both in-process and subprocess detection.
Added EntryNotFoundError and RevisionNotFoundError to _classify_detection_error() as permanent errors. Repos that exist but don't have config.json are now cached as False instead of retrying on every call.

Use getattr with None default instead of hasattr to prevent TypeError when config.architectures is None (hasattr returns True for None values, but iterating None raises TypeError).

danielhanchen · 2026-04-06T15:31:32Z

Fixes pushed to original PR unslothai#4868. Closing staging copy.

gemini-code-assist Bot reviewed Apr 6, 2026

View reviewed changes