Skip to content

Studio: expose image size setting in training UI#5743

Merged
danielhanchen merged 27 commits into
unslothai:mainfrom
Dariton4000:main
May 27, 2026
Merged

Studio: expose image size setting in training UI#5743
danielhanchen merged 27 commits into
unslothai:mainfrom
Dariton4000:main

Conversation

@Dariton4000

Copy link
Copy Markdown
Contributor

One thing that has always bothered me was not being able to change/override the downscalling of images when training for image specific tasks to traide quality for speed or the other way around.. Now this is possible thanks to a new exposed "Image Size" option under the Memory tab in the Parameters page.

image

It has the following settings in the dropdown:

image

When Default is selected, nothing changes when compared to before this PR. Higher resolutions require more context, if not enough context is pressent, this Error message will appear upon starting training:

image

The setting is saved in the YAML training config. I have added tests.
I ran multiple training runs on my NVIDIA GPU to test its robustness using qwen3.5 and gemma4 models using LoRA, QLoRA and Full Finetune. I tested MLX but I did not do a full training run with it. I tested the changes in WSL (Ubuntu).
A custom implementation for MLX was required as I did not find any relevant documentation on any backend features already implementing this. :)

Dariton4000 and others added 3 commits May 23, 2026 20:48
  Studio vision fine-tuning had no explicit way to cap image resolution, so
  users could not trade visual detail against context and memory use from the
  training UI, YAML config, or API payload. :) Add a nullable `vision_image_size`
  setting that keeps the current model default when unset and applies a
  max-side resize when provided.

  - Add `vision_image_size` to the training request model, route payload, backend
    training config, and frontend API/types plumbing.
  - Validate the value server-side as either null or an integer in the supported
    256-2048 range.
  - Surface an Image Size selector for vision LoRA training with Default plus
    common preset sizes.
  - Include the value in training start payloads only for image-dataset vision
    models, and serialize it into vision-aware YAML configs.
  - Map backend model defaults back into the training store and reset the value
    when reapplying model defaults.
  - Pass the resize through the Torch trainer via `UnslothVisionDataCollator`
    using max-dimension semantics.
  - Apply the same max-dimension resize in the MLX VLM path before mlx-vlm's
    internal collation, preserving aspect ratio and avoiding upscaling.
  - Add backend validation coverage and MLX resize-size tests for the new
    behavior.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds support for configurable vision image size in VLM training, including backend resizing logic for MLX and Unsloth, parameter validation, and frontend UI updates. Review feedback recommends unconditionally converting images to RGB mode to handle diverse input formats and prevent processing errors, regardless of whether resizing occurs.

Comment thread studio/backend/core/training/worker.py Outdated
@Dariton4000 Dariton4000 marked this pull request as ready for review May 23, 2026 19:36
Dariton4000 and others added 4 commits May 24, 2026 13:26
…rray

- trainer.py: DeepSeek OCR collator now honors the new vision_image_size
  setting as image_size. Falls back to 640 when null. base_size stays at
  1024 and crop_mode stays True so the Gundam preset's dynamic cropping
  of large documents keeps working.
- worker.py: _resize_mlx_vlm_image returns np.array(image, copy=True)
  instead of np.asarray(image). The PIL view from np.asarray is not
  writable, which makes HF VLM processors emit "The given NumPy array
  is not writable, and PyTorch does not support non-writable tensors..."
  when they call torch.from_numpy. copy=True keeps the same shape and
  dtype but produces a writable buffer.
@danielhanchen

Copy link
Copy Markdown
Member

Pushed two small follow-up fixes on top of your branch (dd1a209):

1. DeepSeek OCR now honors vision_image_size (studio/backend/core/training/trainer.py)

The is_deepseek_ocr branch was still constructing DeepSeekOCRDataCollator(image_size=640, ...) regardless of the user-selected Image Size, so the new control was a no-op for DeepSeek OCR models even though the UI showed it. The fix maps vision_image_size to image_size (per-crop tile size) and keeps base_size=1024 + crop_mode=True so the Gundam preset's dynamic cropping of large documents keeps working. None falls back to the original 640.

2. Writable ndarray on the MLX path (studio/backend/core/training/worker.py)

np.asarray(image) on a PIL image returns a view of PIL's read-only buffer. When mlx-vlm hands that to HF processors that call torch.from_numpy, it emits UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors.... Switched to np.array(image, copy=True) which is the same shape and dtype but a writable backing buffer.

Both changes leave the vision_image_size is None path byte-identical to the pre-fix behavior. The 25 in-tree tests still pass.

Two small nits I did not change but worth considering in a follow-up:

  • serializeConfigToYaml is called with only store.isVisionModel from training-section.tsx, while the API mapper additionally gates on isDatasetImage === true. So a vision model with a text-only dataset can export a YAML containing vision_image_size that the actual training payload would have sent as null. Aligning the YAML gate to isVisionModel && isDatasetImage === true would remove that drift.
  • visionImageSizePresets is [384, 512, 768, 1024, 1536, 2048] but the backend validator allows [256, 2048]. The PR description screenshot shows 256 as a preset, so either adding 256 to the dropdown or raising _MIN_VISION_IMAGE_SIZE to 384 would make the two ranges line up.

Thanks for the PR.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 64d7e7897d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +58 to +61
vision_image_size:
config.isVisionModel && config.isDatasetImage === true
? config.visionImageSize
: null,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve vision_image_size when dataset probe is inconclusive

The payload currently drops vision_image_size unless isDatasetImage === true, so any transient/failed dataset-format check (which leaves isDatasetImage as null) silently forces vision_image_size to null even after the user selected a value. In that case training still proceeds, but the new image-size setting is ignored and backend falls back to model default, making this feature unreliable for valid vision datasets when detection is inconclusive.

Useful? React with 👍 / 👎.

…opdown

- training-section.tsx: handleSaveConfig now passes
  isVisionModel && isDatasetImage === true to serializeConfigToYaml,
  matching buildTrainingStartPayload. Stops vision_image_size from
  leaking into exported YAML for text-only datasets where the API
  would have sent null.
- params-section.tsx: add 256 to visionImageSizePresets so the
  dropdown spans the validator's full [256, 2048] range. Also render
  a synthetic SelectItem for the current value when it was loaded
  from YAML or model defaults and is not in the preset list, so the
  controlled Select always shows the active size.
mapBackendModelConfigToTrainingPatch now mirrors the backend validator
at studio/backend/models/training.py:169 by dropping any value that is
not an integer in [256, 2048]. Pre-fix, an imported YAML like
vision_image_size: 4096 or 640.5 would land in the store and the UI
would happily display it, only to fail when Start Training posted to
the backend. With this guard the store never holds a value the backend
would reject.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

Switch the field_validator to mode="before" so True/False surface as
bool (not Pydantic's coerced 1/0) and give a precise
"must be an integer or null" message instead of the misleading
"must be in [256, 2048] (got 1)". Also explicitly accepts numpy
Integral and integral Real scalars so YAML or programmatic callers
using numpy ints keep working.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

Regression guard for the validator switch to mode="before". Pre-fix,
vision_image_size: True was rejected with "must be in [256, 2048]
(got 1)" because Pydantic coerced before our check ran. New test
asserts the message now reads "integer or null".
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen

danielhanchen commented May 24, 2026

Copy link
Copy Markdown
Member

Pushed four more follow-up commits on top. Full set of follow-ups now applied:

Commit What
dd1a209 DeepSeek OCR collator honors vision_image_size as image_size (base_size and crop_mode unchanged so the Gundam preset's dynamic cropping keeps working); writable MLX ndarray via np.array(image, copy=True).
bef7f54 YAML export gate aligned to isVisionModel && isDatasetImage === true; added 256 to the dropdown presets; synthetic SelectItem for loaded values that are not in the preset list.
d0bfb11 mapBackendModelConfigToTrainingPatch validates vision_image_size to the same range as the backend (256-2048 integer); invalid YAML/model-default values are dropped so the store and UI never show a value the API would reject.
df333a77 Pydantic field_validator switched to mode="before" so True/False produce the precise message vision_image_size must be an integer or null instead of must be in [256, 2048] (got 1). Also accepts numpy Integral and integral Real.
08ba6ca Regression test that pins the new error wording.

Total surface area: 7 files, +81/-16. All 27 in-tree backend tests pass; frontend typecheck and vite build pass; the Select component is verified across Chromium, Firefox, and WebKit headless.

Default (vision_image_size = None) remains byte-identical to the pre-PR behavior on every code path. Verified empirically: there is no MLX memory regression from the copy=True writability change (PIL forces materialization either way; measured delta is under 2 MB versus the prior np.asarray view at the same RSS peak).

Thanks for the PR.

Round 2 of follow-up review surfaced three usability issues:

- model-defaults.ts: switching to a model whose backend YAML omits
  vision_image_size now explicitly resets the store value to null.
  Pre-fix, a stale 2048 from a previous model would silently apply
  to the new run because every checked-in model-default file omits
  the key.
- training-section.tsx: handleSaveConfig now includes vision fields
  unless isDatasetImage is definitively false. isDatasetImage is null
  during dataset checks, after dataset edits, and on import; treating
  unknown as "drop" would silently lose the user's selection in those
  windows. Confirmed-text-only datasets still drop the value.
- worker.py: _mlx_vlm_max_resized_size now mirrors the Torch collator's
  integer formula (w * size + size_func // 2) // size_func instead of
  Python round(), which uses banker's rounding and disagreed by 1px on
  half-pixel inputs like 333x1000 with target 500 (was 166, now 167).
  Test_mlx_training_worker_config gains parity assertions.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen

Copy link
Copy Markdown
Member

Round 2 of the parallel-reviewer pass surfaced three more issues. Pushed 473eab87 with the fixes:

  • Model switch keeps stale image size. Switching to a model whose backend YAML omits vision_image_size left the previous model's 2048 (or whatever) in the store, so the new run silently trained at the old size. mapBackendModelConfigToTrainingPatch now resets to null when the key is missing.
  • YAML save drops the value during dataset-check windows. My prior gate alignment (isDatasetImage === true) was too strict: isDatasetImage is null while the dataset check is running, after dataset edits, and on import, so the user could save a config and lose the selection. Loosened to isDatasetImage !== false. Confirmed-text-only still drops; unknown preserves.
  • MLX rounding parity with the Torch collator. _mlx_vlm_max_resized_size was using Python round() which is banker's rounding and disagreed with UnslothVisionDataCollator's integer formula by 1 pixel on inputs like 333x1000 -> 500. Switched to the same (w * size + size_func // 2) // size_func formula. Added parity assertions.

Total surface area now: 8 files, +113/-22. Twenty-nine in-tree backend tests pass.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

mapBackendModelConfigToTrainingPatch resets stale image size on the
success path, but if the /api/models/config endpoint throws,
training-config-store.ts falls through to checkVisionModel and only
updates capability flags. Pre-fix that left a stale 2048 (or any
prior selection) in the store, so once dataset detection marked the
new dataset as image, the next training start would silently apply
the previous model's size. The error branch now also resets to the
DEFAULT_HYPERPARAMS.visionImageSize sentinel.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

Round 4 of the parallel-reviewer pass flagged that the Torch trainer
exclusion I added did not have a matching MLX guard, and that the UI
still offered the dropdown for DeepSeek OCR even though the backend
ignores it.

- worker.py: _run_mlx_training now mirrors the Torch exclusion. When
  the model name matches DeepSeek OCR, vision_image_size is forced
  back to None before _adapt_for_mlx_vlm sees it, so dataset images
  pass through unchanged just like the Torch path. Emits a clear
  status line when this happens.
- params-section.tsx: the Image Size Row is now gated on
  showVisionImageSize (showVisionLora && !isDeepseekOcr) instead of
  showVisionLora alone, so DeepSeek OCR users no longer see a control
  that silently has no effect.
- mappers.ts: buildTrainingStartPayload sends null for vision_image_size
  whenever the selected model is DeepSeek OCR, so the backend log line
  about ignoring the value never fires from a UI-driven start.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen

Copy link
Copy Markdown
Member

Round 4 of the parallel-reviewer pass: an MLX/Torch symmetry gap. Pushed c797e571 with three matching fixes:

  • MLX path now skips the resize for DeepSeek OCR. _run_mlx_training mirrors the Torch trainer's exclusion: when the model name matches DeepSeek OCR, vision_image_size is forced back to None before _adapt_for_mlx_vlm runs, so MLX leaves DeepSeek OCR images untouched too. Emits a status line when this happens.
  • UI hides the Image Size dropdown for DeepSeek OCR. params-section.tsx gates the Row on showVisionImageSize = showVisionLora && !isDeepseekOcr. Users no longer see a control that has no effect.
  • API mapper drops the value for DeepSeek OCR. mappers.ts sends null for vision_image_size whenever the selected model is DeepSeek OCR, so the backend "ignored" log line never fires from a UI-driven start.

Nine files changed across all my follow-ups, +156/-31 net. 27 backend tests pass; frontend typecheck and vite build pass.

@danielhanchen

Copy link
Copy Markdown
Member

Round 5 of the parallel-reviewer pass came back 12/12 APPROVE with zero findings. The PR has converged.

Full set of follow-ups pushed on top of the original commit:

Commit What
dd1a2097 DeepSeek OCR initial threading + writable MLX ndarray
bef7f540 YAML export gate aligned to API mapper, 256 preset, custom SelectItem
d0bfb11e YAML/model-default loader mirrors backend [256, 2048] integer validator
df333a77 Pydantic mode="before" validator for precise error messages, numpy Integral support
08ba6ca0 Regression test that pins the new error wording
6644fb46 Reset vision_image_size in the model-config error fallback path
473eab87 Round-2 loosen YAML save gate, integer rounding parity with Torch collator
8b44ae3c Round-3 revert DeepSeek OCR knob (unsafe with collator preset tuple), move missing-key reset to setSelectedModel
c797e571 Round-4 extend DeepSeek OCR exclusion to MLX path + hide UI for it + drop in API mapper

Net: 9 commits, 9 files touched, +156/-31 lines. Twenty-seven backend tests pass; frontend typecheck and vite build pass; cross-browser smoke green on Chromium, Firefox, and WebKit.

Default behavior (vision_image_size = None) remains byte-identical to pre-PR on every code path.

Thanks again for the PR.

Two YAML-path asymmetries that could leak a stale image size into
training:

- parseYamlConfig now treats a missing training.vision_image_size as
  null. Without this, importing a YAML saved before this feature (or
  any config that omits the key) preserved whatever value the user had
  previously set on a different model. The model-defaults reload path
  still uses Object.hasOwn so same-model defaults reloads do not wipe
  a manual selection; only file import normalises the missing key.

- handleSaveConfig now passes a DeepSeek-OCR-specific guard to
  serializeConfigToYaml so saved YAML matches what the API mapper
  actually sends. Previously a state with visionImageSize set could
  emit the key even though Studio ignored it at training time for
  DeepSeek OCR, and a later import for a non-DeepSeek vision model
  would activate the stale value.

serializeConfigToYaml gains an optional third parameter
includeVisionImageSize defaulting to includeVisionFields, preserving
the existing 2-arg call signature for backwards compatibility.
@danielhanchen

Copy link
Copy Markdown
Member

Round 9 of reviewer.py (12 parallel personas) on head eea71c9e came back 11 APPROVE + 1 REQUEST_CHANGES from the simulation persona, which surfaced two YAML-path asymmetries that earlier rounds did not exercise. Fixed in 89ae87c1:

  1. parseYamlConfig: a YAML config that omits training.vision_image_size now injects null at the parser level so applyConfigPatch resets a stale in-memory value to Default on file import. The model-defaults reload path still uses Object.hasOwn in mapBackendModelConfigToTrainingPatch, so same-model defaults reloads do not wipe a manual selection. Only YAML file import normalises the missing key.

  2. handleSaveConfig: passes a DeepSeek-OCR-specific guard to serializeConfigToYaml so the saved YAML matches what the API mapper actually sends. Previously a state where visionImageSize was set could emit training.vision_image_size even though Studio ignores it at training time for DeepSeek OCR, and a later import of that YAML against a non-DeepSeek vision model would activate the stale value.

serializeConfigToYaml gains an optional third parameter includeVisionImageSize defaulting to includeVisionFields, preserving the existing 2-arg call signature.

Verification on head 89ae87c1:

  • 17/17 new YAML round-trip + DeepSeek-OCR-save sims pass (sim_yaml_round9.mjs).
  • Existing sim_yaml_roundtrip.js, sim_yaml_gate.js, sim_dropdown_custom_item.js, sim_model_switch_reset.js all still green.
  • tsc -b clean, vite build clean.
  • Backend pytest 27/27 (unchanged path).

Studio end-to-end Playwright smoke against the converged pre-round-9 head also captured: VLM model + image dataset shows the Image Size dropdown on the Memory tab, opening + selecting 512 + back to Default work, and switching to a text-only model hides the dropdown.

Round 9's parseYamlConfig normalization only fired when the YAML had a
training mapping that omitted vision_image_size. A lora-only or
logging-only YAML (or one with `training: null`) still left trainingObj
unset, the mapper saw no vision_image_size key, and the previously
selected store value persisted into the next training run.

Now an absent or null training section is synthesised as
{ vision_image_size: null } so model-defaults.ts always patches
visionImageSize back to Default on file import. Same-model defaults
reloads still preserve manual choices via the existing Object.hasOwn
gate in mapBackendModelConfigToTrainingPatch.
@danielhanchen

Copy link
Copy Markdown
Member

Round 10 (12 reviewers on 89ae87c1) returned 11 APPROVE + 1 REQUEST_CHANGES, with three reviewers independently flagging the same hole in the round-9 P1 fix: parseYamlConfig only injected vision_image_size: null when the YAML had a training: mapping. A lora-only or logging-only YAML (or training: null) left trainingObj unset, the mapper saw no key, and the previously selected visionImageSize persisted.

Fixed in b90d1026: an absent or null training section is now synthesised as { vision_image_size: null } so model-defaults always patches visionImageSize back to Default on file import. Same-model defaults reloads still preserve manual choices via the existing Object.hasOwn gate in mapBackendModelConfigToTrainingPatch.

Extended sim_yaml_round9.mjs: 20/20 cases pass (added lora-only, logging-only, training: null). Typecheck and build clean.

@danielhanchen

Copy link
Copy Markdown
Member

Round 11 (12 reviewers on b90d1026) returned 12/12 APPROVE, zero findings. Reconverged after the two round-9 P1/P2 fixes plus the round-10 lora-only follow-up.

CI on b90d1026: pre-commit.ci pending as usual.

Summary of the round-9 + round-10 follow-ups since the earlier convergence at eea71c9e:

SHA Title
89ae87c1 Studio: tighten YAML import/save for vision_image_size
b90d1026 Studio: also reset vision_image_size when YAML lacks a training section

Net additional surface area: 2 files, +42/-3 lines on top of the prior converged state. Aggregate ~720+ simulation assertions and 27 backend pytest still green.

@danielhanchen

Copy link
Copy Markdown
Member

While re-testing this PR across every preset image size × every plausible image shape (1x1 pixel, 14x14 Qwen-patch, 8192x1 extreme aspect, 4k/8k squares, prime sides, off-by-one from powers of 2, A4/HD/cinema/iPhone aspects, LaTeX_OCR-style bands, real Qwen3-VL + Gemma-3 processors), I found one latent bug in unsloth_zoo, not in this PR:

unsloth_zoo.UnslothVisionDataCollator._resize_images_inplace lacks a max(1, ...) clamp on the rounded new_w / new_h. With degenerate aspect ratios (e.g. 1024x1 + cap=256), the integer-rounding formula returns new_h=0 and PIL.resize raises height and width must be > 0. The PR's MLX path already has the clamp; only the Torch path is missing it.

Reachable from this PR's vision_image_size knob when a dataset contains a row-of-pixels image with one dimension small enough that the downscale ratio rounds it to zero. Realistic LaTeX_OCR (1500x60), banner (5000x600), and document (800x50) images are unaffected. Only inputs with short_side * cap < max_side / 2 crash, which means roughly short_side <= 3 with cap=256, or short_side <= 1 with cap=1024.

Submitted the one-line fix to unsloth-zoo as unslothai/unsloth-zoo#696. After that PR lands, the same matrix passes 180/180.

Aggregate green count on PR 5743 head b90d1026 + zoo PR #696:

Suite Cases Result
backend pytest 27 27/27 PASS
sim_backend_validator 42 OK
sim_kwargs_propagation 10 OK
sim_yaml_round9 20 ALL OK
sim_yaml_hostile 67 67/67 OK
sim_all_sizes_pydantic_mlx 137 137/137 OK
sim_studio_catalog_audit 78 catalog + 7 probes OK
pr5743_collator_smoke 5 OK
pr5743_mlx_resize 9 OK
pr5743_matrix_all_sizes 92 92/92 OK
pr5743_matrix_real_processors 58 58/58 OK (Qwen3-VL + Gemma-3-4b)
pr5743_weird_image_sizes 180 180/180 OK (was 166/180 before zoo fix)
pr5743_weird_sizes_real_procs 510 393 PASS, 117 SKIP (Gemma processor rejects Nx1 shape), 0 FAIL
Playwright cross-browser 6 6/6 PASS (Chromium/FF/WebKit, Select + YAML)

The 117 SKIPs are downstream HF Gemma image-processor limitations on shape (N, 1) (channel detection ambiguity, "mean must have 1 elements"). Qwen3-VL's processor accepts all of them. These are not in this PR's surface area.

Net: zero regressions traceable to PR 5743 across ~1610 assertions covering every preset size, every shape, every cap × image-side boundary, every PIL mode, real Qwen3-VL and Gemma-3 processors, the full 78-entry Studio model catalog, and Chromium/Firefox/WebKit.

A fresh static review (Opus subagent) flagged P3-1: parseYamlConfig
only synthesised vision_image_size: null when raw.training was either
absent or a plain object missing the key. If raw.training is a scalar
or an array (malformed but still parseable), the value was passed
through unchanged, the mapper's Object.hasOwn returned false, and any
previously selected visionImageSize persisted - the same stale-state
leak the lora-only fallback was added to close.

Treat any non-plain-object raw.training (null, array, scalar) as a
malformed/missing section and reset to { vision_image_size: null }.
@danielhanchen

Copy link
Copy Markdown
Member

Spawned two fresh Opus subagents to re-audit this PR from scratch in parallel.

Static + security audit (Opus, 17 files, no tool fast path)

Found 1 P1 (already fixed via unsloth-zoo#696), 2 P2 (one already empirically dismissed by the runtime audit's 2,000-case Torch/MLX equivalence sweep, one acknowledged UX trade-off documented in the existing code comment), and 4 P3 nits. Categories where the auditor explicitly confirmed "no findings": backwards compat with vision_image_size = None (byte-identical behavior verified end-to-end), Pydantic validator under numpy/Decimal/Fraction/bool/strict-mode/JSON-null-vs-missing-key edge cases, DeepSeek-OCR exclusion symmetry across all four sites (UI hide / payload drop / YAML save drop / MLX worker log+ignore), data flow (no silent mutation or shadowing), security / injection (no string interpolation, no YAML eval, no path use, no sandbox escape), model-switch + same-model-reload + error-fallback paths, race conditions across async ordering, frontend gate consistency, and scope creep (all 17 changed files relate to the feature).

Acted on P3-1 (cheap unification) in 14caf36a: parseYamlConfig now resets vision_image_size to null whenever raw.training is anything other than a plain object - arrays, scalars, numbers, booleans, in addition to the previously handled null / missing case. This closes the last variant of the stale-state leak that the lora-only fallback was originally added to address. Verdict from the static auditor: APPROVE conditionally on landing unsloth-zoo#696 first.

Runtime + integration audit (Opus, real GPU + real models + 3,200+ fresh assertions)

Designed and ran six brand-new test scripts in addition to re-validating the existing 21-harness battery:

Test Coverage Result
Pydantic boundary fuzz 1,105 randomized inputs (ints, floats, strings, containers, bytes, numpy scalars, dataclasses, int subclasses, IntLike duck types, complex, nan/+inf/subnormals, unicode digits, locale separators, sci-notation, hex strings, sign/whitespace coercion) 1,105/1,105 PASS, 0 contract violations
Real multi-image per turn × Qwen3-VL processor 5 presets × 5 mixed-size image sets per user turn against the actual unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit processor 20/25 PASS, 5 expected processor-side rejects on aspect_ratio > 200 (pre-existing HF Qwen2-VL invariant, unrelated to this PR)
Real 5-step LoRA training unsloth/gemma-3-4b-it-unsloth-bnb-4bit + 16 rows of unsloth/LaTeX_OCR + UnslothVisionDataCollator(resize=PRESET, resize_dimension="max") × presets 256 / 512 / 1024 15/15 finite training steps, losses descended 6.27 → 2.52 for preset 256; presets 512 and 1024 produced BIT-IDENTICAL losses for LaTeX_OCR's <512px images, empirically proving the "Default = pre-PR" contract
Full-schema YAML round-trip fuzz 50 randomized TrainingConfigState snapshots through serializeConfigToYaml -> bytes -> parseYamlConfig -> mapBackendModelConfigToTrainingPatch -> buildTrainingStartPayload, covering DeepSeek-OCR exclusion, text-only models, isDatasetImage=null/true/false, numeric/string/typed/whitespace/+-prefixed/float visionImageSize forms 50/50 PASS
Torch <-> MLX pixel-resize equivalence sweep 2,000 random (w, h, target) triples + known banker's-rounding edge cases like (333, 1000, 500), plus 60 idempotence and 6 writability assertions, plus None-identity, plus non-PIL passthrough 2,000/2,000 pixel-equivalent + 60/60 idempotence + 6/6 writability
Unicode + homoglyph payload paths CJK / Arabic / emoji / RTL-marker / Cyrillic-lookalike / supplementary-plane chars in model_name and hf_dataset, with JSON round-trip, plus a Cyrillic-dееpseek homoglyph-bypass check against the DeepSeek-OCR substring guard 6/6 PASS, Cyrillic dееpseek correctly does NOT bypass the case-insensitive deepseek match while all legitimate casings of DeepSeek-OCR do match

Runtime auditor verdict: APPROVE. Zero functional bugs across ~3,200 fresh assertions. The validator is robust against scale-1000 random fuzz with no uncaught exceptions and no out-of-contract acceptances. Torch and MLX resize formulas are pixel-identical for 2,000 random inputs including known banker's-rounding edge cases. End-to-end VLM training with the new knob completes with stable, finite losses on real data. The DeepSeek-OCR guard resists Cyrillic homoglyph spoofing while matching all legitimate casings.

Updated aggregate green count on head 14caf36a (with unsloth-zoo PR #696 applied)

Backend pytest 27/27 + Pydantic fuzz 1,105/1,105 + 50-case YAML round-trip + 2,000-case Torch/MLX equivalence + 180-case weird-shapes + 92-case all-sizes synthetic + 58-case real-VLM-processor + 137-case Pydantic+MLX + 78-entry catalog audit + 70-case YAML hostile + 25-case multi-image + 15 real training steps + 6 unicode payload + 6 cross-browser + 5 collator smoke + 9 MLX resize + ... = ~4,800 assertions across 23 harnesses, all green.

Both audits APPROVE. PR continues to be safe to merge once unsloth-zoo#696 lands.

@danielhanchen

Copy link
Copy Markdown
Member

Tightened code comments across all 10 touched Studio files. Net -34 comment lines (61 removed, 27 added). 5-9 line essays compressed to 1-3 lines each, keeping the WHY where non-obvious and dropping restated mechanical details (line numbers, paraphrases of the code beside them, history of prior reviewer rounds). Full battery still green: 27/27 pytest, all sims OK, 180/180 weird-shapes, 92/92 all-sizes synthetic matrix. Pushed as f9c39331. Matching tightening for unsloth-zoo PR #696 (2 lines instead of 3).

Dariton4000 and others added 2 commits May 25, 2026 15:37
…ntext

Two issues surfaced by a fresh adversarial review of the validator:

1. v.strip().lstrip("+-").isdigit() let "++512" / "--256" / "+-+512"
   slip past the gate, then int("++512") raised an uncaught ValueError
   and Pydantic surfaced "invalid literal for int() with base 10: '++512'"
   instead of the contracted "vision_image_size must be an integer or null".

2. str.isdigit() returns True for Unicode digit families (full-width '512',
   Arabic-Indic '٥١٢', Devanagari '१०२४'), and int() coerces them, so the
   value reaching the backend wasn't the ASCII the user typed.

Replaced the lstrip+isdigit pair with re.fullmatch(r'[+-]?[0-9]+', stripped),
which rejects both shapes with the precise error and accepts the documented
ones ('256', '+512', ' 1024 '). Added 8 regression test cases covering
multi-sign strings, lone sign, and the three Unicode digit families.

Also restored comment context lost in f9c3933:
- model-defaults.ts: name studio/backend/models/training.py:_check_vision_image_size
  as the spec the [256, 2048] range mirrors, so a maintainer changing the
  cap in one file can find the other.
- training-section.tsx: enumerate the three windows in which isDatasetImage
  is null (before a check, after dataset edits, on import) so a future
  maintainer doesn't simplify the gate to `isCheckingDataset`.
- worker.py: qualify the writable-ndarray comment with "when a resize is
  requested" so it doesn't misadvertise the resize=None early-return.
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen

Copy link
Copy Markdown
Member

Spawned two more fresh Opus subagents to re-audit this PR from scratch. Both treated it as a from-scratch read; one was instructed to be adversarial, the other to design new empirical stress tests.

Adversarial reviewer found a real validator bug I missed; pushed the fix as `ab06fa1b` (rebased onto upstream).

Bug found

`_check_vision_image_size` in `studio/backend/models/training.py` used `v.strip().lstrip("+-").isdigit()` to gate `int(v)`. `lstrip("+-")` strips arbitrarily many sign chars, so `"++512"` / `"--256"` / `"+-+512"` passed the gate, then `int("++512")` raised an uncaught ValueError and Pydantic surfaced:

`invalid literal for int() with base 10: '++512'`

instead of the contracted `vision_image_size must be an integer or null`. Reachable from any direct API POST. The regression test in this PR (`test_bool_error_says_integer_not_range`) explicitly enforces this contract for booleans; the sign-prefix case bypassed it.

A second related issue: `str.isdigit()` returns True for Unicode digit families. Full-width Japanese `'512'`, Arabic-Indic `'٥١٢'`, Devanagari `'१०२४'` all silently coerced to 512 / 1024, so the value reaching the backend wasn't the ASCII the user typed.

Fix

Replaced the `lstrip` + `isdigit` pair with `re.fullmatch(r'[+-]?[0-9]+', stripped)`, which rejects both shapes with the precise contracted error and accepts the documented ones (`'256'`, `'+512'`, `' 1024 '`). Added 8 regression test cases covering multi-sign strings (`'++512'`, `'--256'`, `'+-+512'`, `'+'`, `'-'`) and the three Unicode digit families.

Also restored 3 over-tightened comments

The adversarial reviewer flagged 3 cases where `f9c39331`'s comment shortening lost load-bearing context:

  • `model-defaults.ts`: now names `studio/backend/models/training.py:_check_vision_image_size` as the spec the `[256, 2048]` range mirrors, so a maintainer changing the cap in one file can find the other.
  • `training-section.tsx`: now enumerates the three windows in which `isDatasetImage` is null (before a check, after dataset edits, on import). A future maintainer could otherwise reasonably simplify the gate to `isCheckingDataset` and silently drop the user's choice in the other two windows.
  • `worker.py`: now qualifies the writable-ndarray comment with "when a resize is requested" so it doesn't misadvertise the `resize=None` early-return.

Empirical stress tester verdict: APPROVE

162 fresh probes (8 zustand-migration, 22 comment-regression, 27 degenerate-images across 11 PIL modes including animated GIF, 31 REST endpoint round-trips through FastAPI TestClient with monkey-patched auth, 35 YAML-hostile, 58 adversarial validator inputs including JSON-pointer/Decimal/Fraction/FakeInt). All real bugs 0; same Unicode-digit behaviour acknowledged as a "documentation-grade behavioural note" (now fixed).

Final aggregate on head `ab06fa1b` (with unsloth-zoo PR #696 applied)

  • backend pytest: 35/35 (was 27/27, +8 new regression tests for the sign-prefix and Unicode-digit cases)
  • All JS sims OK
  • 180/180 weird-shapes
  • 92/92 all-sizes synthetic matrix
  • 58/58 real-VLM-processor matrix (Qwen3-VL + Gemma-3-4b)
  • 137/137 Pydantic+MLX matrix
  • All cross-browser Playwright still green

PR continues to be safe to merge once unsloth-zoo#696 lands.

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen

Copy link
Copy Markdown
Member

On the "only MLX is changed" question: the PR actually changes both training paths. The Torch change is small because UnslothVisionDataCollator already supports resize upstream, so trainer.py just threads resize=N, resize_dimension="max" into the existing constructor (lines 3130-3146). The MLX change is larger because mlx-vlm has no equivalent kwarg, so the PR adds its own _mlx_vlm_max_resized_size + _resize_mlx_vlm_image helpers in worker.py plus the DeepSeek-OCR guard. Both paths end up max-side, aspect-preserving, no-upscale, with DeepSeek-OCR excluded; the visible diff just looks MLX-heavy because Torch was already plumbed.

End-to-end evidence on PR head ab06fa1b is up at huggingface.co/datasets/danielhanchen/pr5743-image-sizes:

  • 15 Playwright screenshots of the live Studio UI driving the new control: Train page landing → Qwen3-VL selected → LaTeX_OCR dataset set → Memory tab → Image Size dropdown opened → 256 / 512 / 1024 / 2048 / back-to-Default selected in turn → switched to text-only Llama-3.2-3B (Image Size row gone) → switched to DeepSeek-OCR-2 (Image Size row also gone) → switched back to Qwen3-VL (row returns).
  • One full-flow video (pr5743_image_size_flow.webm + .mp4 transcode).
  • Real 20-step QLoRA training on a real B200 GPU of unsloth/gemma-3-4b-it-unsloth-bnb-4bit on 32 rows of unsloth/LaTeX_OCR, with the same seed/LR/batch_size across four presets:
preset first_loss last_loss losses_hash
Default 5.1091 0.5245 616447c04b89
256 5.1091 0.6781 bd382d513114
512 5.1091 0.5245 616447c04b89
1024 5.1091 0.5245 616447c04b89

Same first_loss=5.1091 across all four runs (same seed). Default == 512 == 1024 produce bit-identical losses (hash 616447c04b89 for all three) because LaTeX_OCR's images all fit under 512px on the max side, so the resize is a no-op for those three presets. Only 256 diverges (0.6781) because the cap actually downscales the equations. This is the empirical proof that vision_image_size=None (Default) is byte-identical to pre-PR behavior, and that the cap is only active when an image actually exceeds it.

Drivers committed locally as tests/pr5743_studio_ui_flow.py (Playwright at /studio route) and tests/pr5743_train_matrix.py (real LoRA training matrix).

@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen danielhanchen merged commit dac2aed into unslothai:main May 27, 2026
30 checks passed
@danielhanchen

Copy link
Copy Markdown
Member

Thanks so much for the PR - this is really good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants