[studio] Fix VLM detection for transformers v5#17
Conversation
|
/gemini review |
|
@codex review |
There was a problem hiding this comment.
Code Review
This pull request enhances the Vision-Language Model (VLM) detection logic within the is_vision_model function. It introduces a more robust multi-stage detection process that includes in-process configuration checking, a version-specific subprocess check, and a final fallback to raw metadata inspection via _load_model_config_metadata. Additionally, the _is_vlm_config helper was added to unify the identification of VLM characteristics across different configuration formats, and error handling in the subprocess check was updated to return None to facilitate the new fallback mechanism. I have no feedback to provide.
|
Codex Review: Didn't find any major issues. More of your lovely PRs please. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
danielhanchen
left a comment
There was a problem hiding this comment.
Thank you for the PR! The goal of this PR is to fix VLM detection for models that require transformers v5 (like Qwen3.5 and Gemma4) in Unsloth Studio. As a summary, this PR refactors the vision detection logic by introducing a unified _is_vlm_config() helper, a raw config.json fallback via _load_model_config_metadata(), and changes _is_vision_model_subprocess() to return Optional[bool] (None on transient failures) to enable the multi-tier fallback chain.
Review Loop 1 - Consolidated Findings
| Reviewers | Severity | Finding |
|---|---|---|
| 7/7 | High | Merge conflict resolution dropped main's vision-detection cache (_vision_detection_cache, _token_fingerprint, _vision_cache_lock, _is_vision_model_uncached()) from unslothai#4853/unslothai#4855. Every repeated is_vision_model() call now re-runs load_model_config() and may re-spawn the 60s subprocess. |
| 2/7 | Medium | For transformers-5-only models (Qwen3.5, Gemma4), the new flow tries the known-failing in-process load_model_config() before falling back to subprocess/raw-config. This adds avoidable latency and warning log noise for the exact models being fixed. |
| 1/7 | High | _is_vlm_config() false-positives on seq2seq models (T5, BART, mBART, Pegasus, etc.) because ForConditionalGeneration suffix matches both VLMs and seq2seq. Only csm and whisper are excluded. |
| 1/7 | Medium | _VISION_CHECK_SCRIPT subprocess still uses old suffix-only heuristic without audio-only exclusion, so subprocess and in-process detection can disagree (e.g. Whisper returns is_vision: true in subprocess). |
| 1/7 | Medium | _load_model_config_metadata() does not call resolve_cached_repo_id_case(), so different casings of the same repo can bypass HF cache and produce duplicate downloads. |
| 1/7 | Low | Type annotation dict[str, Any] (Python 3.9+) used in _load_model_config_metadata while rest of file uses Dict[str, Any] from typing. |
Concrete suggestions for each finding below.
Fix 1: Restore caching layer (7/7 consensus)
The PR's new detection logic should be placed inside _is_vision_model_uncached() while keeping main's cache wrapper. This is the correct merge resolution.
Fix 2: Subprocess/fallback ordering for t5 models (2/7)
For needs_transformers_5 models, use subprocess first (as main did originally), then fall back to raw config.json. Do not try the known-failing in-process AutoConfig load.
Fix 3: Narrow _is_vlm_config() to exclude seq2seq (1/7)
Add _NON_VLM_CONDITIONAL_MODEL_TYPES set for t5, bart, mbart, mt5, pegasus, blenderbot to prevent false positives.
Fix 4: Update _VISION_CHECK_SCRIPT (1/7)
Add audio-only exclusion logic to the subprocess script to match _is_vlm_config().
Fix 5: Use resolve_cached_repo_id_case() in _load_model_config_metadata (1/7)
Fix 6: Use Dict[str, Any] instead of dict[str, Any] for consistency
Properly combines PR unslothai#4868's new detection logic with main's caching layer: - Add _is_vlm_config() helper for unified VLM detection across config types - Add _load_model_config_metadata() as raw config.json fallback when both AutoConfig and subprocess fail for transformers-5-only models - For needs_transformers_5 models: subprocess first, then raw config fallback - Add _NON_VLM_CONDITIONAL_MODEL_TYPES to prevent seq2seq false positives - Update _VISION_CHECK_SCRIPT with audio-only and seq2seq exclusion - Preserve vision detection cache and error classification from main
d808bea to
d87c8b7
Compare
|
/gemini review |
|
@codex review |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
Code Review
This pull request refactors VLM detection logic by centralizing it into a helper function and adding a fallback to raw config metadata when subprocess checks fail. The reviewer identified logic duplication between the module and subprocess script that should be addressed to ensure consistency, and recommended restoring detailed logging in the main detection path to maintain observability.
| audio_only_types = {"csm", "whisper"} | ||
| non_vlm_seq2seq_types = {"bart", "mbart", "t5", "mt5", "pegasus", | ||
| "blenderbot", "blenderbot-small"} | ||
|
|
||
| is_vlm = False | ||
| if hasattr(config, "architectures"): | ||
| is_vlm = any( | ||
| x.endswith(("ForConditionalGeneration", "ForVisionText2Text")) | ||
| for x in config.architectures | ||
| ) | ||
| if not is_vlm and hasattr(config, "vision_config"): | ||
| is_vlm = True | ||
| if not is_vlm and hasattr(config, "img_processor"): | ||
| is_vlm = True | ||
| if not is_vlm and hasattr(config, "image_token_index"): | ||
| is_vlm = True | ||
| if not is_vlm and hasattr(config, "model_type"): | ||
| vlm_types = {"phi3_v","llava","llava_next","llava_onevision", | ||
| "internvl_chat","cogvlm2","minicpmv"} | ||
| if config.model_type in vlm_types: | ||
| if model_type not in audio_only_types and model_type not in non_vlm_seq2seq_types: | ||
| if hasattr(config, "architectures"): | ||
| is_vlm = any( | ||
| x.endswith(("ForConditionalGeneration", "ForVisionText2Text")) | ||
| for x in config.architectures | ||
| ) | ||
| if not is_vlm and hasattr(config, "vision_config"): | ||
| is_vlm = True | ||
| if not is_vlm and hasattr(config, "img_processor"): | ||
| is_vlm = True | ||
| if not is_vlm and hasattr(config, "image_token_index"): | ||
| is_vlm = True | ||
| if not is_vlm and model_type in { | ||
| "phi3_v", "llava", "llava_next", "llava_onevision", | ||
| "internvl_chat", "cogvlm2", "minicpmv", | ||
| }: | ||
| is_vlm = True |
There was a problem hiding this comment.
The VLM detection logic and model type sets are duplicated inside the _VISION_CHECK_SCRIPT string. This creates a maintenance risk where updates to the module-level constants (like _AUDIO_ONLY_MODEL_TYPES, _NON_VLM_CONDITIONAL_MODEL_TYPES, or _VLM_MODEL_TYPES) might not be reflected in the subprocess check. Consider using string formatting to inject these constants into the script template to ensure consistency across both detection paths.
|
|
||
| try: | ||
| config = load_model_config(model_name, use_auth = True, token = hf_token) | ||
| return _is_vlm_config(config) |
There was a problem hiding this comment.
The refactoring of _is_vision_model_uncached removed detailed logging that explained why a model was detected as a VLM. While the new _is_vlm_config helper improves code reuse, restoring logging here would maintain observability for the standard detection path, matching the logging added to the subprocess fallback path at line 796.
is_vlm = _is_vlm_config(config)
if is_vlm:
logger.info(
"Model %s detected as VLM: model_type=%s architectures=%s",
model_name,
getattr(config, "model_type", None),
getattr(config, "architectures", []),
)
return is_vlm…ion markers - Remove ForConditionalGeneration from VLM arch suffix heuristic. This caused false positives for seq2seq models (T5, BART, mBART, UMT5, M2M100, Switch Transformers, LongT5, Pegasus, Blenderbot) - Rely on explicit vision signals: vision_config, img_processor, image_token_index, ForVisionText2Text suffix, and known model types - Remove _NON_VLM_CONDITIONAL_MODEL_TYPES (no longer needed) - Add _classify_detection_error helper for permanent vs transient error classification, now used in both t5 fallback and non-t5 except paths - _load_model_config_metadata now returns (data, error) tuple so callers can classify permanent errors (repo not found, gated) for caching - Update _VISION_CHECK_SCRIPT subprocess to match new detection logic
danielhanchen
left a comment
There was a problem hiding this comment.
Review Loop 2 - Consolidated Findings
After loop 1 fixes, ran 7 reviewer.py reviews + 3 Sonnet subagents + gemini_review. Results: 4/7 APPROVE, 3/7 REQUEST_CHANGES.
| Reviewers | Severity | Finding |
|---|---|---|
| 1/7 | High | ForConditionalGeneration heuristic still false-positives on UMT5, M2M100, Switch Transformers, LongT5 -- the exclusion set approach is fundamentally fragile |
| 1/7 | High | Seq2seq exclusion can suppress real multimodal configs that use a base model_type (e.g., T5-based VLMs with vision_config) |
| 2/7 | Medium | t5 raw-config fallback path returns None for permanent errors (RepositoryNotFoundError, GatedRepoError), preventing caching |
Fixes Applied (committed)
-
Removed ForConditionalGeneration from VLM detection entirely. Instead of maintaining an ever-growing exclusion list, we now rely on explicit vision signals only:
vision_config(most VLMs: LLaVA, Gemma-3/4, Qwen-VL, etc.)img_processor(Phi-3.5 Vision)image_token_index(common in VLMs)ForVisionText2Textarchitecture suffix (VLM-specific)- Known
_VLM_MODEL_TYPESset
This eliminates all seq2seq false positives while still detecting all real VLMs. Models with T5 base but vision capabilities (like Pix2Struct) are still correctly detected via their
vision_config. -
Removed
_NON_VLM_CONDITIONAL_MODEL_TYPES-- no longer needed since we don't use ForConditionalGeneration. -
Added
_classify_detection_error()helper to classify permanent vs transient errors in both the t5 fallback and non-t5 paths._load_model_config_metadatanow returns(data, error)tuple so the t5 path can cache permanent failures. -
Updated
_VISION_CHECK_SCRIPTsubprocess to match the new detection logic.
Test Results
- 33 simulation tests covering VLMs, audio-only, seq2seq, edge cases: all pass
- 8 subprocess script simulation tests: all pass
- UMT5, M2M100, Switch Transformers, LongT5: all correctly classified as non-VLM
- T5 with vision_config: correctly detected as VLM (vision signals take priority)
|
/gemini review |
|
@codex review |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
…ration Since ForConditionalGeneration was removed as a VLM signal, known VLM model types that use it need to be in _VLM_MODEL_TYPES as a safety net. In practice, all these models expose vision_config in their real HF configs so they would be detected regardless, but the model_type list ensures compatibility with simplified mock configs (test_vision_cache.py) and edge cases. Added: gemma3, gemma4, qwen2_vl, qwen2_5_vl, qwen3_5, paligemma, pix2struct, video_llava, blip-2, blip_2, idefics2, idefics3, mllama, chameleon. Also updated subprocess script to match.
There was a problem hiding this comment.
Code Review
This pull request refactors VLM detection logic in model_config.py by adding a fallback to raw config metadata and implementing error classification for improved caching. Review feedback recommends simplifying the detection logic's boolean structure for better maintainability and restoring docstring details regarding the function's return values and transient failure handling.
I am having trouble creating individual review comments. Click here to see my feedback.
studio/backend/utils/models/model_config.py (524-543)
This logic for VLM detection can be improved for readability and maintainability.
- The chain of
if not is_vlmchecks can be simplified into a single boolean expression. - The
audio_only_typesset and the VLM model types set are duplicates of module-level constants (_AUDIO_ONLY_MODEL_TYPESand_VLM_MODEL_TYPES). To avoid inconsistencies, consider making_VISION_CHECK_SCRIPTan f-string and injecting these constants.
Here's a suggestion that addresses the first point:
audio_only_types = {"csm", "whisper"}
is_vlm = False
if model_type not in audio_only_types:
is_vlm = (
hasattr(config, "vision_config") or
hasattr(config, "img_processor") or
hasattr(config, "image_token_index") or
(hasattr(config, "architectures") and any(
x.endswith("ForVisionText2Text")
for x in config.architectures
)) or
model_type in {
"phi3_v", "llava", "llava_next", "llava_onevision",
"internvl_chat", "cogvlm2", "minicpmv",
}
)
studio/backend/utils/models/model_config.py (560-563)
The removed part of the docstring explaining the return value (Optional[bool]) is important for understanding the function's contract. It clarifies that None indicates a transient failure, which is crucial for the caller's logic. Please consider adding this information back to maintain clarity.
- Remove FileNotFoundError from permanent error classification to avoid stale negative caching for local model dirs still being set up - Return (None, None) instead of (None, FileNotFoundError) for local paths missing config.json -- absence is unknown state, not an error - Add xgenmm and qwen3_vl to _VLM_MODEL_TYPES for completeness
danielhanchen
left a comment
There was a problem hiding this comment.
Review Loop 3 - Consolidated Findings
Ran 7 reviewer.py reviews + 3 Sonnet subagents + gemini_review. Results: 4/7 APPROVE, 3/7 REQUEST_CHANGES.
| Reviewers | Severity | Finding |
|---|---|---|
| 3/7 | High | FileNotFoundError in _classify_detection_error was caching local path misses as permanent False -- stale for in-progress model dirs |
| 2/7 | High | _VLM_MODEL_TYPES missing xgenmm and qwen3_vl |
Fixes Applied
-
Removed FileNotFoundError from permanent error classification. Local paths without config.json now return (None, None) instead of (None, FileNotFoundError), treating absence as unknown rather than a cacheable failure.
-
Added xgenmm and qwen3_vl to _VLM_MODEL_TYPES and subprocess script.
All 3 loops complete. Summary of changes from original PR:
- Properly merged with main's caching layer (is_vision_model -> cache -> _is_vision_model_uncached)
- Removed ForConditionalGeneration as a VLM signal (caused seq2seq false positives)
- Added comprehensive _VLM_MODEL_TYPES as safety net for known VLMs
- Added _is_vlm_config() for unified detection across config types (dict and object)
- Added _load_model_config_metadata() as raw config.json fallback for t5 models
- Added _classify_detection_error() for consistent permanent/transient error handling
- Updated _VISION_CHECK_SCRIPT subprocess to match
- Preserved all error classification and caching semantics from main
|
/gemini review |
|
@codex review |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
There was a problem hiding this comment.
Code Review
This pull request refactors the VLM detection logic by centralizing it into a new _is_vlm_config helper and introducing a fallback mechanism that retrieves raw config.json metadata when AutoConfig fails. It also expands the list of supported VLM model types and improves error classification for caching purposes. Review feedback suggests removing qwen3_5 from the VLM list to prevent incorrect detection of standard LLMs and recommends using the centralized helper within the detection script to eliminate code duplication.
| "gemma4", | ||
| "qwen2_vl", | ||
| "qwen2_5_vl", | ||
| "qwen3_5", |
There was a problem hiding this comment.
qwen3_5 appears to be the model type for the Qwen 3.5 LLM family (e.g., the MoE models), not specifically a VLM. Including it in _VLM_MODEL_TYPES will cause standard LLMs to be incorrectly detected as vision models. The VLM variant is typically identified as qwen3_vl (which is already in the list).
| "qwen3_5", | |
| "paligemma", |
| model_type = getattr(config, "model_type", None) | ||
| audio_only_types = {"csm", "whisper"} | ||
|
|
||
| is_vlm = False | ||
| if hasattr(config, "architectures"): | ||
| is_vlm = any( | ||
| x.endswith(("ForConditionalGeneration", "ForVisionText2Text")) | ||
| for x in config.architectures | ||
| ) | ||
| if not is_vlm and hasattr(config, "vision_config"): | ||
| is_vlm = True | ||
| if not is_vlm and hasattr(config, "img_processor"): | ||
| is_vlm = True | ||
| if not is_vlm and hasattr(config, "image_token_index"): | ||
| is_vlm = True | ||
| if not is_vlm and hasattr(config, "model_type"): | ||
| vlm_types = {"phi3_v","llava","llava_next","llava_onevision", | ||
| "internvl_chat","cogvlm2","minicpmv"} | ||
| if config.model_type in vlm_types: | ||
| if model_type not in audio_only_types: | ||
| if hasattr(config, "vision_config"): | ||
| is_vlm = True | ||
| if not is_vlm and hasattr(config, "img_processor"): | ||
| is_vlm = True | ||
| if not is_vlm and hasattr(config, "image_token_index"): | ||
| is_vlm = True | ||
| if not is_vlm and hasattr(config, "architectures"): | ||
| is_vlm = any( | ||
| x.endswith("ForVisionText2Text") | ||
| for x in config.architectures | ||
| ) | ||
| if not is_vlm and model_type in { | ||
| "phi3_v", "llava", "llava_next", "llava_onevision", | ||
| "internvl_chat", "cogvlm2", "minicpmv", "gemma3", "gemma4", | ||
| "qwen2_vl", "qwen2_5_vl", "qwen3_5", "paligemma", | ||
| "pix2struct", "video_llava", "blip-2", "blip_2", | ||
| "idefics2", "idefics3", "mllama", "chameleon", | ||
| "xgenmm", "qwen3_vl", | ||
| }: | ||
| is_vlm = True |
There was a problem hiding this comment.
The VLM detection logic and model lists are duplicated inside _VISION_CHECK_SCRIPT. Since backend_dir is added to sys.path, the script can import and use the centralized _is_vlm_config helper. This improves maintainability and ensures consistency between in-process and subprocess detection.
| model_type = getattr(config, "model_type", None) | |
| audio_only_types = {"csm", "whisper"} | |
| is_vlm = False | |
| if hasattr(config, "architectures"): | |
| is_vlm = any( | |
| x.endswith(("ForConditionalGeneration", "ForVisionText2Text")) | |
| for x in config.architectures | |
| ) | |
| if not is_vlm and hasattr(config, "vision_config"): | |
| is_vlm = True | |
| if not is_vlm and hasattr(config, "img_processor"): | |
| is_vlm = True | |
| if not is_vlm and hasattr(config, "image_token_index"): | |
| is_vlm = True | |
| if not is_vlm and hasattr(config, "model_type"): | |
| vlm_types = {"phi3_v","llava","llava_next","llava_onevision", | |
| "internvl_chat","cogvlm2","minicpmv"} | |
| if config.model_type in vlm_types: | |
| if model_type not in audio_only_types: | |
| if hasattr(config, "vision_config"): | |
| is_vlm = True | |
| if not is_vlm and hasattr(config, "img_processor"): | |
| is_vlm = True | |
| if not is_vlm and hasattr(config, "image_token_index"): | |
| is_vlm = True | |
| if not is_vlm and hasattr(config, "architectures"): | |
| is_vlm = any( | |
| x.endswith("ForVisionText2Text") | |
| for x in config.architectures | |
| ) | |
| if not is_vlm and model_type in { | |
| "phi3_v", "llava", "llava_next", "llava_onevision", | |
| "internvl_chat", "cogvlm2", "minicpmv", "gemma3", "gemma4", | |
| "qwen2_vl", "qwen2_5_vl", "qwen3_5", "paligemma", | |
| "pix2struct", "video_llava", "blip-2", "blip_2", | |
| "idefics2", "idefics3", "mllama", "chameleon", | |
| "xgenmm", "qwen3_vl", | |
| }: | |
| is_vlm = True | |
| from utils.models.model_config import _is_vlm_config | |
| is_vlm = _is_vlm_config(config) | |
| model_type = getattr(config, "model_type", None) |
- Add image_token_id as an explicit vision signal in _is_vlm_config and subprocess script (used by SmolVLM-style configs) - Add gemma3n, qwen3_vl_moe, smolvlm to _VLM_MODEL_TYPES - Keep subprocess script model type list in sync
danielhanchen
left a comment
There was a problem hiding this comment.
Review Loop 4 - Consolidated Findings
Ran 7 reviewer.py reviews + 3 Sonnet subagents + gemini_review. Results: 5/7 APPROVE, 2/7 REQUEST_CHANGES.
| Reviewers | Severity | Finding |
|---|---|---|
| 2/7 | High | Missing VLM model types (gemma3n, qwen3_vl_moe, smolvlm) and missing image_token_id as a vision signal |
Fixes Applied
-
Added
image_token_idas an explicit vision signal in both_is_vlm_config()and_VISION_CHECK_SCRIPT. This is used by SmolVLM-style configs that haveimage_token_idinstead ofimage_token_index. -
Added
gemma3n,qwen3_vl_moe,smolvlmto_VLM_MODEL_TYPESand subprocess script.
Verification
- SmolVLM2 (
model_type=smolvlm) has bothvision_configandimage_token_id-- detected via vision_config, image_token_id now also serves as a secondary signal. - All previously verified models (Qwen3.5, Gemma4, Whisper, T5, etc.) remain correctly classified.
These VLMs have no explicit vision signals (vision_config, img_processor, image_token_index, image_token_id) in their config.json, relying solely on model_type for detection.
|
/gemini review |
|
@codex review |
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
…ation - _is_vlm_config now checks values are not None, not just key presence. Prevents false positives from configs with "vision_config": null - _VISION_CHECK_SCRIPT updated to use getattr with None checks - _classify_detection_error now treats EntryNotFoundError and RevisionNotFoundError as permanent (cacheable as False)
danielhanchen
left a comment
There was a problem hiding this comment.
Review Loop 5 - Consolidated Findings
Ran 7 reviewer.py reviews + 3 Sonnet subagents + gemini_review. Results: 5/7 APPROVE, 2/7 REQUEST_CHANGES.
| Reviewers | Severity | Finding |
|---|---|---|
| 1/7 | High | Null placeholder vision fields (e.g. "vision_config": null) treated as positive VLM signal -- key presence check should verify value is not None |
| 1/7 | Medium | EntryNotFoundError/RevisionNotFoundError not in permanent error classification -- causes repeated probes for repos missing config.json |
Fixes Applied
-
Null-safe vision field checks:
_is_vlm_config()now retrieves field values and checksis not Noneinstead of just key/attribute presence. Prevents false positives from configs with placeholder null fields. Updated both in-process and subprocess detection. -
Added EntryNotFoundError and RevisionNotFoundError to
_classify_detection_error()as permanent errors. Repos that exist but don't have config.json are now cached as False instead of retrying on every call.
Use getattr with None default instead of hasattr to prevent TypeError when config.architectures is None (hasattr returns True for None values, but iterating None raises TypeError).
|
Fixes pushed to original PR unslothai#4868. Closing staging copy. |
Staging mirror of unslothai#4868
Original PR: unslothai#4868
Author: Datta0
This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.
Original description
Fixes: unslothai#4859
This PR contains code changes only (1 files). Test changes are in a separate PR.
Changed files:
studio/backend/utils/models/model_config.py