Studio: expose image size setting in training UI #5743
Studio vision fine-tuning had no explicit way to cap image resolution, so
users could not trade visual detail against context and memory use from the
training UI, YAML config, or API payload. Add a nullable `vision_image_size`
setting that keeps the current model default when unset and applies a
max-side resize when provided.
- Add `vision_image_size` to the training request model, route payload, backend
training config, and frontend API/types plumbing.
- Validate the value server-side as either null or an integer in the supported
256-2048 range.
- Surface an Image Size selector for vision LoRA training with Default plus
common preset sizes.
- Include the value in training start payloads only for image-dataset vision
models, and serialize it into vision-aware YAML configs.
- Map backend model defaults back into the training store and reset the value
when reapplying model defaults.
- Pass the resize through the Torch trainer via `UnslothVisionDataCollator`
using max-dimension semantics.
- Apply the same max-dimension resize in the MLX VLM path before mlx-vlm's
internal collation, preserving aspect ratio and avoiding upscaling.
- Add backend validation coverage and MLX resize-size tests for the new
behavior.
for more information, see https://pre-commit.ci
Code Review
This pull request adds support for configurable vision image size in VLM training, including backend resizing logic for MLX and Unsloth, parameter validation, and frontend UI updates. Review feedback recommends unconditionally converting images to RGB mode to handle diverse input formats and prevent processing errors, regardless of whether resizing occurs.
…rray
- trainer.py: DeepSeek OCR collator now honors the new vision_image_size setting as image_size. Falls back to 640 when null. base_size stays at 1024 and crop_mode stays True so the Gundam preset's dynamic cropping of large documents keeps working.
- worker.py: _resize_mlx_vlm_image returns np.array(image, copy=True) instead of np.asarray(image). The PIL view from np.asarray is not writable, which makes HF VLM processors emit "The given NumPy array is not writable, and PyTorch does not support non-writable tensors..." when they call torch.from_numpy. copy=True keeps the same shape and dtype but produces a writable buffer.
Pushed two small follow-up fixes on top of your branch ( 1. DeepSeek OCR now honors The 2. Writable ndarray on the MLX path (
Both changes leave the Two small nits I did not change but worth considering in a follow-up:
Thanks for the PR.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 64d7e7897d
    vision_image_size:
      config.isVisionModel && config.isDatasetImage === true
        ? config.visionImageSize
        : null,
Preserve vision_image_size when dataset probe is inconclusive
The payload currently drops vision_image_size unless isDatasetImage === true, so any transient/failed dataset-format check (which leaves isDatasetImage as null) silently forces vision_image_size to null even after the user selected a value. In that case training still proceeds, but the new image-size setting is ignored and backend falls back to model default, making this feature unreliable for valid vision datasets when detection is inconclusive.
…opdown
- training-section.tsx: handleSaveConfig now passes isVisionModel && isDatasetImage === true to serializeConfigToYaml, matching buildTrainingStartPayload. Stops vision_image_size from leaking into exported YAML for text-only datasets where the API would have sent null.
- params-section.tsx: add 256 to visionImageSizePresets so the dropdown spans the validator's full [256, 2048] range. Also render a synthetic SelectItem for the current value when it was loaded from YAML or model defaults and is not in the preset list, so the controlled Select always shows the active size.
mapBackendModelConfigToTrainingPatch now mirrors the backend validator at studio/backend/models/training.py:169 by dropping any value that is not an integer in [256, 2048]. Pre-fix, an imported YAML like vision_image_size: 4096 or 640.5 would land in the store and the UI would happily display it, only to fail when Start Training posted to the backend. With this guard the store never holds a value the backend would reject.
Switch the field_validator to mode="before" so True/False surface as bool (not Pydantic's coerced 1/0) and give a precise "must be an integer or null" message instead of the misleading "must be in [256, 2048] (got 1)". Also explicitly accepts numpy Integral and integral Real scalars so YAML or programmatic callers using numpy ints keep working.
Regression guard for the validator switch to mode="before". Pre-fix, vision_image_size: True was rejected with "must be in [256, 2048] (got 1)" because Pydantic coerced before our check ran. New test asserts the message now reads "integer or null".
Pushed four more follow-up commits on top. Full set of follow-ups now applied:
Total surface area: 7 files, +81/-16. All 27 in-tree backend tests pass; frontend typecheck and Default ( Thanks for the PR.
Round 2 of follow-up review surfaced three usability issues:
- model-defaults.ts: switching to a model whose backend YAML omits vision_image_size now explicitly resets the store value to null. Pre-fix, a stale 2048 from a previous model would silently apply to the new run because every checked-in model-default file omits the key.
- training-section.tsx: handleSaveConfig now includes vision fields unless isDatasetImage is definitively false. isDatasetImage is null during dataset checks, after dataset edits, and on import; treating unknown as "drop" would silently lose the user's selection in those windows. Confirmed-text-only datasets still drop the value.
- worker.py: _mlx_vlm_max_resized_size now mirrors the Torch collator's integer formula (w * size + size_func // 2) // size_func instead of Python round(), which uses banker's rounding and disagreed by 1px on half-pixel inputs like 333x1000 with target 500 (was 166, now 167). Test_mlx_training_worker_config gains parity assertions.
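The 1px disagreement described in the worker.py item can be reproduced with plain integer arithmetic. The sketch below is illustrative only: `max_side_resize` is a hypothetical name, it assumes `size_func` in the formula is the longer image side, and the clamp to at least 1 pixel is an added assumption rather than something taken from `_mlx_vlm_max_resized_size`.

```python
from typing import Tuple


def max_side_resize(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side.

    Hypothetical helper: preserves aspect ratio, never upscales, and rounds
    half up using integer arithmetic only.
    """
    longest = max(width, height)  # assumed meaning of `size_func`
    if longest <= max_side:
        return width, height  # already small enough: no upscaling

    # (x * size + longest // 2) // longest rounds x * size / longest half up.
    new_w = (width * max_side + longest // 2) // longest
    new_h = (height * max_side + longest // 2) // longest

    # Assumption: clamp so extreme aspect ratios never produce a 0px side.
    return max(1, new_w), max(1, new_h)


# Half-pixel case from the commit message: 333 * 500 / 1000 == 166.5
integer_result = max_side_resize(333, 1000, 500)
bankers_result = round(333 * 500 / 1000)  # round() rounds half to even
print(integer_result, bankers_result)
```

Keeping both backends on one integer formula avoids depending on how a floating-point half is rounded, which is where the two paths previously diverged.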
Round 2 of the parallel-reviewer pass surfaced three more issues. Pushed
Total surface area now: 8 files, +113/-22. Twenty-nine in-tree backend tests pass.
mapBackendModelConfigToTrainingPatch resets stale image size on the success path, but if the /api/models/config endpoint throws, training-config-store.ts falls through to checkVisionModel and only updates capability flags. Pre-fix that left a stale 2048 (or any prior selection) in the store, so once dataset detection marked the new dataset as image, the next training start would silently apply the previous model's size. The error branch now also resets to the DEFAULT_HYPERPARAMS.visionImageSize sentinel.
Round 4 of the parallel-reviewer pass flagged that the Torch trainer exclusion I added did not have a matching MLX guard, and that the UI still offered the dropdown for DeepSeek OCR even though the backend ignores it.
- worker.py: _run_mlx_training now mirrors the Torch exclusion. When the model name matches DeepSeek OCR, vision_image_size is forced back to None before _adapt_for_mlx_vlm sees it, so dataset images pass through unchanged just like the Torch path. Emits a clear status line when this happens.
- params-section.tsx: the Image Size Row is now gated on showVisionImageSize (showVisionLora && !isDeepseekOcr) instead of showVisionLora alone, so DeepSeek OCR users no longer see a control that silently has no effect.
- mappers.ts: buildTrainingStartPayload sends null for vision_image_size whenever the selected model is DeepSeek OCR, so the backend log line about ignoring the value never fires from a UI-driven start.
Round 4 of the parallel-reviewer pass: an MLX/Torch symmetry gap. Pushed
Nine files changed across all my follow-ups, +156/-31 net. 27 backend tests pass; frontend typecheck and
Round 5 of the parallel-reviewer pass came back 12/12 APPROVE with zero findings. The PR has converged. Full set of follow-ups pushed on top of the original commit:
Net: 9 commits, 9 files touched, +156/-31 lines. Twenty-seven backend tests pass; frontend typecheck and Default behavior ( Thanks again for the PR.
Two YAML-path asymmetries that could leak a stale image size into training:
- parseYamlConfig now treats a missing training.vision_image_size as null. Without this, importing a YAML saved before this feature (or any config that omits the key) preserved whatever value the user had previously set on a different model. The model-defaults reload path still uses Object.hasOwn so same-model defaults reloads do not wipe a manual selection; only file import normalises the missing key.
- handleSaveConfig now passes a DeepSeek-OCR-specific guard to serializeConfigToYaml so saved YAML matches what the API mapper actually sends. Previously a state with visionImageSize set could emit the key even though Studio ignored it at training time for DeepSeek OCR, and a later import for a non-DeepSeek vision model would activate the stale value. serializeConfigToYaml gains an optional third parameter includeVisionImageSize defaulting to includeVisionFields, preserving the existing 2-arg call signature for backwards compatibility.
Round 9 of
Verification on head
Studio end-to-end Playwright smoke against the converged pre-round-9 head also captured: VLM model + image dataset shows the Image Size dropdown on the Memory tab, opening + selecting 512 + back to Default work, and switching to a text-only model hides the dropdown.
Round 9's parseYamlConfig normalization only fired when the YAML had a
training mapping that omitted vision_image_size. A lora-only or
logging-only YAML (or one with `training: null`) still left trainingObj
unset, the mapper saw no vision_image_size key, and the previously
selected store value persisted into the next training run.
Now an absent or null training section is synthesised as
{ vision_image_size: null } so model-defaults.ts always patches
visionImageSize back to Default on file import. Same-model defaults
reloads still preserve manual choices via the existing Object.hasOwn
gate in mapBackendModelConfigToTrainingPatch.
Round 10 (12 reviewers on Fixed in Extended
Round 11 (12 reviewers on CI on Summary of the round-9 + round-10 follow-ups since the earlier convergence at
Net additional surface area: 2 files, +42/-3 lines on top of the prior converged state. Aggregate ~720+ simulation assertions and 27 backend pytest still green.
While re-testing this PR across every preset image size × every plausible image shape (1x1 pixel, 14x14 Qwen-patch, 8192x1 extreme aspect, 4k/8k squares, prime sides, off-by-one from powers of 2, A4/HD/cinema/iPhone aspects, LaTeX_OCR-style bands, real Qwen3-VL + Gemma-3 processors), I found one latent bug in unsloth_zoo, not in this PR:
Reachable from this PR's Submitted the one-line fix to unsloth-zoo as unslothai/unsloth-zoo#696. After that PR lands, the same matrix passes 180/180. Aggregate green count on PR 5743 head
The 117 SKIPs are downstream HF Gemma image-processor limitations on shape
Net: zero regressions traceable to PR 5743 across ~1610 assertions covering every preset size, every shape, every cap × image-side boundary, every PIL mode, real Qwen3-VL and Gemma-3 processors, the full 78-entry Studio model catalog, and Chromium/Firefox/WebKit.
A fresh static review (Opus subagent) flagged P3-1: parseYamlConfig
only synthesised vision_image_size: null when raw.training was either
absent or a plain object missing the key. If raw.training is a scalar
or an array (malformed but still parseable), the value was passed
through unchanged, the mapper's Object.hasOwn returned false, and any
previously selected visionImageSize persisted - the same stale-state
leak the lora-only fallback was added to close.
Treat any non-plain-object raw.training (null, array, scalar) as a
malformed/missing section and reset to { vision_image_size: null }.
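The normalisation rule above is small enough to state as code. The real `parseYamlConfig` lives in the TypeScript frontend and is not shown in this thread, so what follows is only a Python transliteration of the described behaviour; `normalise_training_section` is a hypothetical name, and a Python `dict` stands in for a "plain object".

```python
from typing import Any, Dict


def normalise_training_section(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Return a training mapping that always carries `vision_image_size`.

    Hypothetical sketch: any missing or malformed `training` value (None,
    list, scalar) is replaced by {"vision_image_size": None} so a stale
    selection cannot survive a file import.
    """
    training = raw.get("training")

    # Absent, null, array, or scalar: treat as a missing section.
    if not isinstance(training, dict):
        return {"vision_image_size": None}

    # Plain mapping that omits the key: add an explicit null.
    if "vision_image_size" not in training:
        return {**training, "vision_image_size": None}

    # Key present: pass the section through untouched.
    return dict(training)


print(normalise_training_section({"lora": {"r": 16}}))
print(normalise_training_section({"training": [1, 2, 3]}))
print(normalise_training_section({"training": {"vision_image_size": 512}}))
```

Because the key is always present after import, a downstream "does the patch own this key" check (the `Object.hasOwn` gate mentioned above) sees an explicit null instead of silently skipping the reset.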
Spawned two fresh Opus subagents to re-audit this PR from scratch in parallel.
Static + security audit (Opus, 17 files, no tool fast path)
Found 1 P1 (already fixed via unsloth-zoo#696), 2 P2 (one already empirically dismissed by the runtime audit's 2,000-case Torch/MLX equivalence sweep, one acknowledged UX trade-off documented in the existing code comment), and 4 P3 nits. Categories where the auditor explicitly confirmed "no findings": backwards compat with
Acted on P3-1 (cheap unification) in
Runtime + integration audit (Opus, real GPU + real models + 3,200+ fresh assertions)
Designed and ran six brand-new test scripts in addition to re-validating the existing 21-harness battery:
Runtime auditor verdict: APPROVE. Zero functional bugs across ~3,200 fresh assertions. The validator is robust against scale-1000 random fuzz with no uncaught exceptions and no out-of-contract acceptances. Torch and MLX resize formulas are pixel-identical for 2,000 random inputs including known banker's-rounding edge cases. End-to-end VLM training with the new knob completes with stable, finite losses on real data. The DeepSeek-OCR guard resists Cyrillic homoglyph spoofing while matching all legitimate casings. Updated aggregate green count on head
Tightened code comments across all 10 touched Studio files. Net -34 comment lines (61 removed, 27 added). 5-9 line essays compressed to 1-3 lines each, keeping the WHY where non-obvious and dropping restated mechanical details (line numbers, paraphrases of the code beside them, history of prior reviewer rounds). Full battery still green: 27/27 pytest, all sims OK, 180/180 weird-shapes, 92/92 all-sizes synthetic matrix. Pushed as
…ntext
Two issues surfaced by a fresh adversarial review of the validator:
1. v.strip().lstrip("+-").isdigit() let "++512" / "--256" / "+-+512"
slip past the gate, then int("++512") raised an uncaught ValueError
and Pydantic surfaced "invalid literal for int() with base 10: '++512'"
instead of the contracted "vision_image_size must be an integer or null".
2. str.isdigit() returns True for Unicode digit families (full-width '５１２',
Arabic-Indic '٥١٢', Devanagari '१०२४'), and int() coerces them, so the
value reaching the backend wasn't the ASCII the user typed.
Replaced the lstrip+isdigit pair with re.fullmatch(r'[+-]?[0-9]+', stripped),
which rejects both shapes with the precise error and accepts the documented
ones ('256', '+512', ' 1024 '). Added 8 regression test cases covering
multi-sign strings, lone sign, and the three Unicode digit families.
Also restored comment context lost in f9c3933:
- model-defaults.ts: name studio/backend/models/training.py:_check_vision_image_size
as the spec the [256, 2048] range mirrors, so a maintainer changing the
cap in one file can find the other.
- training-section.tsx: enumerate the three windows in which isDatasetImage
is null (before a check, after dataset edits, on import) so a future
maintainer doesn't simplify the gate to `isCheckingDataset`.
- worker.py: qualify the writable-ndarray comment with "when a resize is
requested" so it doesn't misadvertise the resize=None early-return.
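To make the two validator holes concrete, here is a minimal stand-alone sketch of the contract described in this commit message. It is not the code in `studio/backend/models/training.py`: `check_vision_image_size` is a hypothetical plain function with no Pydantic wiring, the accepted non-string types are an assumption based on the earlier "integral Real scalars" note, and only the regex and the two error messages are taken from the text above.

```python
import re
from numbers import Integral
from typing import Any, Optional

MIN_SIZE, MAX_SIZE = 256, 2048
_TYPE_ERROR = "vision_image_size must be an integer or null"
# [0-9] matches ASCII digits only; the optional sign may appear at most once.
_ASCII_INT = re.compile(r"[+-]?[0-9]+")


def check_vision_image_size(v: Any) -> Optional[int]:
    """Hypothetical validator: returns None or an int in [256, 2048]."""
    if v is None:
        return None
    # bool is a subclass of int, so reject it before any integer handling.
    if isinstance(v, bool):
        raise ValueError(_TYPE_ERROR)
    if isinstance(v, str):
        stripped = v.strip()
        if not _ASCII_INT.fullmatch(stripped):
            raise ValueError(_TYPE_ERROR)
        value = int(stripped)
    elif isinstance(v, Integral):
        value = int(v)
    elif isinstance(v, float) and v.is_integer():
        value = int(v)  # assumption: integral floats such as 512.0 are accepted
    else:
        raise ValueError(_TYPE_ERROR)
    if not MIN_SIZE <= value <= MAX_SIZE:
        raise ValueError(
            f"vision_image_size must be in [{MIN_SIZE}, {MAX_SIZE}] (got {value})"
        )
    return value


for candidate in ["256", "+512", " 1024 ", "++512", "５１２", True, 4096]:
    try:
        print(repr(candidate), "->", check_vision_image_size(candidate))
    except ValueError as exc:
        print(repr(candidate), "-> rejected:", exc)
```

Checking the string shape with `re.fullmatch` before calling `int()` means malformed input never reaches `int()`, so the only errors a caller sees are the two contracted messages.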
Spawned two more fresh Opus subagents to re-audit this PR from scratch. Both treated it as a from-scratch read; one was instructed to be adversarial, the other to design new empirical stress tests. Adversarial reviewer found a real validator bug I missed; pushed the fix as `ab06fa1b` (rebased onto upstream).
Bug found
`_check_vision_image_size` in `studio/backend/models/training.py` used `v.strip().lstrip("+-").isdigit()` to gate `int(v)`. `lstrip("+-")` strips arbitrarily many sign chars, so `"++512"` / `"--256"` / `"+-+512"` passed the gate, then `int("++512")` raised an uncaught ValueError and Pydantic surfaced `invalid literal for int() with base 10: '++512'` instead of the contracted `vision_image_size must be an integer or null`. Reachable from any direct API POST. The regression test in this PR (`test_bool_error_says_integer_not_range`) explicitly enforces this contract for booleans; the sign-prefix case bypassed it.
A second related issue: `str.isdigit()` returns True for Unicode digit families. Full-width Japanese `'５１２'`, Arabic-Indic `'٥١٢'`, Devanagari `'१०२४'` all silently coerced to 512 / 1024, so the value reaching the backend wasn't the ASCII the user typed.
Fix
Replaced the `lstrip` + `isdigit` pair with `re.fullmatch(r'[+-]?[0-9]+', stripped)`, which rejects both shapes with the precise contracted error and accepts the documented ones (`'256'`, `'+512'`, `' 1024 '`). Added 8 regression test cases covering multi-sign strings (`'++512'`, `'--256'`, `'+-+512'`, `'+'`, `'-'`) and the three Unicode digit families.
Also restored 3 over-tightened comments
The adversarial reviewer flagged 3 cases where `f9c39331`'s comment shortening lost load-bearing context:
Empirical stress tester verdict: APPROVE
162 fresh probes (8 zustand-migration, 22 comment-regression, 27 degenerate-images across 11 PIL modes including animated GIF, 31 REST endpoint round-trips through FastAPI TestClient with monkey-patched auth, 35 YAML-hostile, 58 adversarial validator inputs including JSON-pointer/Decimal/Fraction/FakeInt). Real bugs found: 0. The same Unicode-digit behaviour was acknowledged as a "documentation-grade behavioural note" (now fixed).
Final aggregate on head `ab06fa1b` (with unsloth-zoo PR #696 applied)
PR continues to be safe to merge once unsloth-zoo#696 lands.
On the "only MLX is changed" question: the PR actually changes both training paths. The Torch change is small because End-to-end evidence on PR head
Same Drivers committed locally as
Thanks so much for the PR - this is really good!
One thing that has always bothered me was not being able to change or override the downscaling of images when training for image-specific tasks, to trade quality for speed or the other way around. Now this is possible thanks to a new "Image Size" option exposed under the Memory tab on the Parameters page.
It has the following settings in the dropdown:
When Default is selected, nothing changes compared to before this PR. Higher resolutions require more context; if not enough context is present, this error message will appear upon starting training:
The setting is saved in the YAML training config. I have added tests.
To test robustness, I ran multiple training runs on my NVIDIA GPU with qwen3.5 and gemma4 models, using LoRA, QLoRA, and full fine-tuning. I tested MLX but did not do a full training run with it. I tested the changes in WSL (Ubuntu).
A custom implementation for MLX was required, as I did not find any documentation of an existing backend feature that already implements this. :)