MLX Training updates #5656

Open
mmathew23 wants to merge 21 commits into unslothai:main from mmathew23:explore/mlx

Conversation

@mmathew23

Collaborator

Expose max grad values and set default random seeds in studio.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces support for elementwise gradient value clipping (max_grad_value) in MLX training and refines random state handling by separating model and LoRA initialization seeds. The changes span the backend training logic, API models, and frontend types. Feedback was provided regarding a discrepancy between the max_grad_value field description and its actual default implementation in the worker, which should be aligned to avoid confusion.

Comment on lines +270 to +277
max_grad_value: Optional[float] = Field(
    None,
    ge = 0,
    description = (
        "MLX-only elementwise gradient value clipping threshold. "
        "If unset, MLX uses its runtime default."
    ),
)

Contributor

medium

The description for max_grad_value states that MLX uses its runtime default if unset. However, the implementation in worker.py (line 1396) explicitly defaults it to 1.0 if it is None. To avoid confusion and ensure the API documentation matches the implementation, the description should be updated to reflect that it defaults to 1.0 in this environment.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8b79ba4235

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread studio/backend/core/training/worker.py Outdated
Comment on lines +1159 to +1160
model_random_state = config.get("model_random_state", 3407)
lora_random_state = config.get("lora_random_state", 3407)

P1 Badge Fall back MLX init seeds to random_seed

_run_mlx_training now reads model_random_state/lora_random_state directly, defaulting both to 3407, but Studio never sets those keys (the request/config path still only forwards random_seed in routes/training.py and core/training/training.py). As a result, a user-supplied random_seed no longer controls model load or LoRA initialization RNG, so seed-based reproducibility is broken for any non-3407 seed (including the API default path where random_seed is 42).
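
A minimal sketch of the fallback this comment asks for, assuming the worker config is a plain dict. `resolve_init_seeds` is a hypothetical helper name, not a function in worker.py; only the three config keys come from the review.

```python
def resolve_init_seeds(config: dict) -> tuple:
    """Return (model_seed, lora_seed), inheriting random_seed when a key is absent."""
    random_seed = config.get("random_seed", 3407)
    # Fall back to the user-facing seed instead of a hard-coded 3407.
    model_seed = config.get("model_random_state", random_seed)
    lora_seed = config.get("lora_random_state", random_seed)
    return model_seed, lora_seed


print(resolve_init_seeds({"random_seed": 42}))                          # (42, 42)
print(resolve_init_seeds({"random_seed": 42, "lora_random_state": 7}))  # (42, 7)
```

With this shape, a request that only carries random_seed still controls both init seeds, and a caller that does set the dedicated keys keeps its override.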

Comment thread studio/backend/core/training/worker.py Outdated
Comment on lines +1395 to +1396
max_grad_value = config.get("max_grad_value")
max_grad_value = 1.0 if max_grad_value is None else float(max_grad_value)

P2 Badge Honor null max_grad_value as MLX runtime default

The new request model documents that leaving max_grad_value unset should defer to MLX defaults, but this code forces None to 1.0 before building MLXTrainingConfig. Because the frontend now sends max_grad_value: null by default, callers cannot actually use the trainer/runtime default and the API behavior contradicts its contract.
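
For illustration, the difference between the two normalizations can be sketched as below. Both function names are hypothetical; the first mirrors the quoted worker.py lines, the second is one way to keep the documented "null means runtime default" contract.

```python
from typing import Optional


def forced_default(raw) -> float:
    """Behaviour of the quoted lines: None is rewritten to 1.0."""
    return 1.0 if raw is None else float(raw)


def preserve_none(raw) -> Optional[float]:
    """Keep None as None so the trainer can apply its own default."""
    return None if raw is None else float(raw)


print(forced_default(None))   # 1.0
print(preserve_none(None))    # None
print(preserve_none("0.5"))   # 0.5
```

Whatever consumes the result then has to accept None; that part depends on the trainer config and is not shown here.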

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e82926845d

Comment thread studio/backend/core/training/worker.py Outdated
Comment on lines +1159 to +1160
model_random_state = config.get("model_random_state", 3407)
lora_random_state = config.get("lora_random_state", 3407)

P1 Badge Use request random_seed for MLX model/LoRA init

_run_mlx_training now reads model_random_state/lora_random_state, but the request path still only forwards random_seed (routes/training.py and core/training/training.py). In practice, user-provided seeds no longer control from_pretrained(... random_state=...) or get_peft_model(... random_state=...), so runs with non-3407 seeds are not reproducible on MLX even though seed is still set for the trainer loop.

Comment thread studio/backend/core/training/worker.py Outdated
Comment on lines +1395 to +1396
max_grad_value = config.get("max_grad_value")
max_grad_value = 1.0 if max_grad_value is None else float(max_grad_value)

P2 Badge Preserve null max_grad_value instead of forcing 1.0

The new API contract says max_grad_value can be unset to let MLX use its runtime default (TrainingStartRequest.max_grad_value description), but this code rewrites None to 1.0 before building MLXTrainingConfig. That makes null behaviorally different from the documented contract and prevents callers from actually opting into the trainer/runtime default.

@Datta0 Datta0 left a comment

Collaborator

LGTM

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bfb4203400

Comment thread studio/backend/tests/test_training_raw_support.py

@danielhanchen danielhanchen left a comment

Member

Looks good, Matthew. Static review (Studio backend MLX worker is Apple Silicon only, so this is review only on the CUDA host I have).

End-to-end trace of max_grad_value:

  • studio/backend/models/training.py:267-281 accepts Optional[float] (None preserved by Pydantic).
  • studio/backend/routes/training.py:218 forwards into the worker dict unchanged.
  • studio/backend/core/training/training.py:218-225 forwards into the config dict unchanged.
  • studio/backend/core/training/worker.py:1389-1397 reads it, leaves it None, only coerces to float when a numeric value is present.
  • studio/frontend/src/features/training/api/mappers.ts:84 sends max_grad_value: null by default.

No x or 1.0 fallback downstream, so the API contract null in -> null out is preserved end-to-end. The weight_decay = 0.001 if weight_decay is None else float(weight_decay) normalization at worker.py:1395-1396 is also cleaner than the previous float(config.get(..., 0.001) or 0.001) (which had the well-known "user explicitly passes 0.0 -> coerced to 0.001" trap).

model_random_state / lora_random_state defaulting to random_seed when absent (worker.py:1156-1170) reads correctly. test_training_backend_forwards_random_seed_without_internal_mlx_seed_keys asserts the absent-key path; one thing the test suite does not assert is the present-but-None case (config["model_random_state"] = None would override the seed with None, because config.get("model_random_state", random_seed) only falls back when the key is missing, not when it's present-and-None). Probably not reachable from Studio today since the request schema doesn't expose those keys, but if anything downstream ever does, the semantics may surprise. Easy fix: config.get("model_random_state") or random_seed if you want explicit-null to mean "inherit", or document the present-vs-absent distinction.
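
The two traps mentioned above share one cause: `or` falls back on any falsy value, not only on None. A small sketch with hypothetical helper names follows. Note that the `or random_seed` shortcut would also replace an explicit seed of 0, so an explicit None check is probably the safer form if 0 should remain a valid seed.

```python
def with_or(config: dict, key: str, default):
    """Falls back on every falsy value: None, 0, 0.0, ""."""
    return config.get(key) or default


def with_none_check(config: dict, key: str, default):
    """Falls back only when the key is absent or explicitly None."""
    value = config.get(key)
    return default if value is None else value


cfg = {"weight_decay": 0.0, "model_random_state": 0, "lora_random_state": None}

print(with_or(cfg, "weight_decay", 0.001))          # 0.001 (explicit 0.0 is lost)
print(with_none_check(cfg, "weight_decay", 0.001))  # 0.0
print(with_or(cfg, "model_random_state", 3407))     # 3407 (explicit seed 0 is lost)
print(with_none_check(cfg, "lora_random_state", 3407))  # 3407
```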

tests/studio/run_real_mlx_smoke.py 30-step refresh matches the gate from #5537. Dropping the eos_id append in _compute_loss_and_grad_norm so the smoke loss probe matches Studio's text dataset path is the right move; the prior 7-step assertion was stale per #5622. I can't actually run the smoke from here (CUDA-only host), so this rides on macOS CI evidence.

This PR pairs with unsloth-zoo#684 - cast_norm_output_to_input_dtype and max_grad_value=None semantics only do useful work once MLXTrainingArguments accepts them. Worth a merge-order note in case #684 lands later.

Approving subject to MLX CI green on the smoke test refresh.

danielhanchen pushed a commit to unslothai/unsloth-staging-1 that referenced this pull request May 24, 2026
test_training_raw_support.py transitively imports the full studio
backend (core.training.training -> matplotlib, etc.). Adding every
transitive dep to the Windows install smoke is whack-a-mole and
defeats the smoke's purpose.

test_mlx_training_worker_config.py already covers PR unslothai#5656's wiring
(model_random_state / lora_random_state fallback, max_grad_value
None preservation, dataset_order=torch_randperm) via source-text
assertions on worker.py. The test stubs out structlog/loggers/utils
itself, so it works with just stdlib.

Drop the broader test from the Windows job.
studio/backend/core/training/worker.py
  `config.get("model_random_state", random_seed)` only fills the
  default when the key is absent. When a caller passes
  `config["model_random_state"] = None` explicitly (which happens
  any time a JSON payload sends an explicit `null`), the old code
  forwarded `None` to FastMLXModel and disabled deterministic init
  silently. Same for `lora_random_state`. Treat absent and explicit
  None the same way: fall back to random_seed.

studio/backend/tests/test_training_raw_support.py
  Update the source-string assertions to match the new lines.
@danielhanchen

Member

Pushed one small follow-up on top of a404dfd3 (now bff5b443):

studio/backend/core/training/worker.py:1156-1170 was using config.get("model_random_state", random_seed) which only falls back when the key is absent. If a caller serializes {"model_random_state": null} (which Pydantic / JSON happily do for Optional fields), dict.get returns None instead of random_seed and that None reaches FastMLXModel.from_pretrained(random_state=None) and get_peft_model(random_state=None), silently disabling deterministic init. Same for lora_random_state. Reworked to explicit None-check so absent and explicit-null behave identically.
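
The JSON case can be reproduced with the standard library alone. This is a sketch of the described behaviour, not the worker code; the payload is made up for the example.

```python
import json

# A caller that serializes an Optional field as an explicit null.
payload = json.loads('{"random_seed": 42, "model_random_state": null}')
random_seed = payload["random_seed"]

# dict.get only uses its default when the key is missing, so null leaks through.
leaky = payload.get("model_random_state", random_seed)

# Explicit None check: absent and explicit-null behave identically.
raw = payload.get("model_random_state")
fixed = random_seed if raw is None else int(raw)

print(leaky)  # None
print(fixed)  # 42
```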

studio/backend/tests/test_training_raw_support.py::test_mlx_worker_falls_back_init_seeds_to_random_seed updated to match the new lines.

The Pydantic schema does not expose model_random_state / lora_random_state today so this is theoretical for the Studio HTTP path, but any non-Studio caller (CLI tests, future REST shape, downstream forks) that did set the keys to null would otherwise get non-reproducible runs. The CUDA workspace I have here cannot run the MLX smoke, but the change is structural and the source-string test pins the new shape.

danielhanchen pushed a commit to unslothai/unsloth-staging-1 that referenced this pull request May 24, 2026
The PR unslothai#684 and PR unslothai#5656 heads were just updated with maintainer
fixes (restored compiler.py UNSLOTH_RETURN_LOGITS elif, GPT-2 ln_*
matching, Qwen3-VL flag wiring, default-branch reseed; plus seed
present-but-None fix). Bump the three workflow files (comment-only)
so the paths filters re-fire and we get a fresh signal on all three
runners against the updated PR heads.
danielhanchen pushed a commit to unslothai/unsloth-staging-1 that referenced this pull request May 24, 2026
Round 2 of reviewer-driven fixes landed on the PR heads:
  zoo PR unslothai#684: 0753b115
    - merged origin/main (restores unslothai#690 / unslothai#691 gpt-oss eager attn)
    - cleaned up norm cast monkey patch in train() finally
    - raise on streaming+dataset_order text combo
    - VLM baseline CE full-sequence forward parity with CCE
    - scheduler test now matches HF linear-no-warmup behavior
  unsloth PR unslothai#5656: bff5b44 (unchanged since last run)

Re-fire all three workflows so we get a fresh signal.
… PR unslothai#5656

The MLX worker now passes `cast_norm_output_to_input_dtype` and
`dataset_order` only when the linked unsloth-zoo dataclass actually
declares them. Released zoo trees that predate the paired PR can still
construct `MLXTrainingConfig` without raising
`TypeError: unexpected keyword argument`. Once the dependency floor is
bumped to a release that contains both fields, the feature-detect
guards become no-ops.

`random_seed = config.get("random_seed", 3407)` was unguarded against
explicit `None` from raw / backend callers. The same value seeded the
trainer and was the fallback target for `model_random_state` /
`lora_random_state`. Normalize once at the top of the function and use
the normalized value everywhere so an explicit `None` cannot reach
FastMLXModel / get_peft_model / MLXTrainingConfig.

Existing seed source-pattern test updated to match the new normalize
helper. New test asserts the feature-detection guards exist and that
the unconditional kwargs do not include the gated fields.
@chatgpt-codex-connector


Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

@danielhanchen

Member

Pushed 56e32b75 on top of bff5b443. Addresses the two P1 items that the round 2 review consensus flagged.

  1. studio/backend/core/training/worker.py:1411 MLXTrainingConfig kwargs are no longer all-or-nothing. The previous version always passed cast_norm_output_to_input_dtype and dataset_order, which would raise TypeError: unexpected keyword argument against any released unsloth-zoo that predated the paired unsloth-zoo#684 change. Switched to building the kwargs dict, then gating the two new fields with getattr(MLXTrainingConfig, "__dataclass_fields__", {}). Released zoo trees that lack the fields keep working; zoo trees that have them get the same behavior as before. Once the dependency floor is bumped to a release containing both fields, the guards become no-ops with no behavior change.

  2. studio/backend/core/training/worker.py:1159 random_seed is now normalized once at the top of the function. The previous code used config.get("random_seed", 3407) which only inserts the default for absent keys; an explicit None from raw / backend callers passed straight through to FastMLXModel, get_peft_model, and MLXTrainingConfig(seed=None). After this PR's earlier round 1 fix for model_random_state / lora_random_state, the random_seed source itself was the last leg that still leaked None. Now _raw_seed = config.get("random_seed", 3407) is followed by random_seed = 3407 if _raw_seed is None else int(_raw_seed), and the model / LoRA seed overrides also int() their value when not None. Trainer seed = random_seed reads the normalized value directly.
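
To make item 1 concrete, here is a sketch of the feature-detection pattern using two stand-in dataclasses. `OldConfig`, `NewConfig`, and `build_config` are hypothetical; only the `getattr(..., "__dataclass_fields__", {})` check and the two gated field names come from the comment above.

```python
from dataclasses import dataclass


@dataclass
class OldConfig:
    """Stand-in for a released config that predates the new fields."""
    seed: int = 3407


@dataclass
class NewConfig:
    """Stand-in for a config that declares the new fields."""
    seed: int = 3407
    dataset_order: str = "default"
    cast_norm_output_to_input_dtype: bool = True


def build_config(config_cls, required: dict, optional: dict):
    """Pass optional kwargs only when the dataclass declares them."""
    declared = getattr(config_cls, "__dataclass_fields__", {})
    kwargs = dict(required)
    for name, value in optional.items():
        if name in declared:
            kwargs[name] = value
    return config_cls(**kwargs)


optional = {"dataset_order": "torch_randperm", "cast_norm_output_to_input_dtype": True}
old = build_config(OldConfig, {"seed": 1}, optional)  # no TypeError, fields dropped
new = build_config(NewConfig, {"seed": 1}, optional)  # fields forwarded
print(old)
print(new.dataset_order)  # torch_randperm
```

The guard costs one dict lookup per optional field, which is why it can stay in place harmlessly after the dependency floor moves.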

Tests:

PYTHONPATH=. python -m pytest studio/backend/tests/test_training_raw_support.py studio/backend/tests/test_mlx_training_worker_config.py -q
14 passed

Added one new assertion to test_mlx_worker_falls_back_init_seeds_to_random_seed for the seed normalize helper and one new test test_mlx_worker_feature_detects_optional_mlx_config_fields covering the dataclass field guard.

Not addressing in this PR:

  • Frontend Studio UI control for max_grad_value and cast_norm_output_to_input_dtype (one reviewer P2; the request mapper currently hardcodes max_grad_value: null and has no cast_norm_output_to_input_dtype field). Backend now accepts both via the new schema, and external callers can supply them. Wiring a Studio UI control is a separate frontend task; the API is exposed and validated, the runtime path on the worker side is now portable, so the runtime regression set is closed. Happy to do the UI control in a follow-up.

Yell if anything looks off.

…othai#5656

Round-3 review consensus: the per-field guards that landed in the MLX
worker only protect the MLX path. The same `TrainingBackend.start_training`
config still reaches the CUDA/text trainer at `worker.py:2267`, the
embedding LoRA init at `worker.py:2450`, and embedding TrainingArguments
at `worker.py:2624` with raw `None` values, so an explicit
`random_seed=None` from a raw / backend caller still breaks non-MLX
training even after the previous fix.

Move the normalization into `TrainingBackend.start_training` itself,
where it runs once for every training mode:

- `_coerce_seed(value)`: explicit `None`, non-int, or absent all become
  3407. Every downstream worker now sees an int.
- `_coerce_optional_bool(value, default)`: explicit `None` falls back
  to `default` instead of `bool(None) == False`. Also normalizes the
  common raw-config / YAML string aliases ("true" / "false" / "0" /
  "1"). Used for `cast_norm_output_to_input_dtype`.
- `_coerce_optional_nonneg_float(name, value)`: rejects negative
  numerics from raw / backend callers, matching the Pydantic
  `ge=0` constraint the HTTP route already enforces. Used for
  `max_grad_value`.

worker.py MLX path: the existing `bool(config.get(key, True))` for
`cast_norm_output_to_input_dtype` was changed to also fall back on
explicit `None`, so direct worker callers (bypassing
`TrainingBackend.start_training`) are equally safe. `max_grad_value`
also raises on negative values inside the worker for the same reason.

TrainingStartRequest.random_seed default bumped from 42 to 3407 so
direct REST callers that omit the field receive the same default as
the Studio frontend and the MLX worker.

New regression test exercises the three new helpers across explicit
None, valid values, string aliases, and negative-value rejection.

danielhanchen pushed a commit to unslothai/unsloth-staging-1 that referenced this pull request May 24, 2026
@danielhanchen

Member

Pushed 1a026435 on top of 56e32b75. Addresses the consensus P1 / P2 items from a fresh 12-reviewer pass after round 2 landed.

  1. studio/backend/core/training/training.py:225 (asymmetric seed normalization). The round-2 fix added the random_seed=None fallback only inside the MLX worker, but TrainingBackend.start_training still stored None in the shared config. That config is also consumed by the non-MLX trainer (worker.py:2267), embedding LoRA init (worker.py:2450), and embedding TrainingArguments (worker.py:2624). A raw / backend caller passing random_seed=None for any non-MLX run would still hit transformers.set_seed(None) -> TypeError. Moved the normalize into a small _coerce_seed helper that runs ONCE in start_training for every training mode.

  2. studio/backend/core/training/training.py:222 (cast_norm_output_to_input_dtype=None becomes False). Round-2 added the field via kwargs.get(key, True) and the worker did bool(config.get(key, True)). Both have the same explicit-None blind spot: a raw / backend caller passing None rebinds the field to False and silently disables the MLX norm-output cast even though the documented and schema default is True. Added _coerce_optional_bool(value, default), which maps explicit None to the default and normalizes common string aliases ("true" / "false" / "0" / "1" / "yes" / "no") to booleans, and applied it at both the backend boundary and the worker for direct callers.

  3. studio/backend/core/training/training.py:221 (no validation on max_grad_value raw path). The Pydantic route model already rejects negative max_grad_value with ge=0, but TrainingBackend.start_training(**kwargs) accepts arbitrary kwargs without validation, so a raw / backend caller passing max_grad_value=-1 reached the MLX trainer as -1.0. unsloth-zoo treats non-positive elementwise clip as "off", silently disabling the new public knob. Added _coerce_optional_nonneg_float(name, value) which preserves None, coerces numerics, and raises ValueError on negatives. Worker mirrors the check for direct callers.

  4. studio/backend/models/training.py:285 (REST schema default was random_seed=42). The Studio frontend, backend default, and worker fallback are all 3407; only the REST schema still defaulted to 42, so HTTP clients that omitted random_seed got a different seed than every other Studio entry point. Bumped to 3407 to match.

  5. New test test_training_backend_normalizes_explicit_none_seed_and_dtypes exercises the three helpers across explicit None, valid values, string aliases, and negative-value rejection. Updated test_mlx_worker_falls_back_init_seeds_to_random_seed and test_mlx_worker_preserves_null_max_grad_value_for_trainer_default to match the new worker source.
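
A rough reconstruction of the three helpers from the descriptions above, for readers who want to see the semantics in one place. The bodies are assumptions; the real helpers in training.py may differ in details such as how unknown strings or bool seeds are treated.

```python
DEFAULT_SEED = 3407
TRUE_ALIASES = {"true", "1", "yes"}
FALSE_ALIASES = {"false", "0", "no"}


def coerce_seed(value, default: int = DEFAULT_SEED) -> int:
    """None or anything not convertible to int becomes the default seed."""
    if value is None or isinstance(value, bool):  # rejecting bool is an assumption
        return default
    try:
        return int(value)
    except (TypeError, ValueError, OverflowError):
        return default


def coerce_optional_bool(value, default: bool) -> bool:
    """Explicit None falls back to default; string aliases are normalized."""
    if value is None:
        return default
    if isinstance(value, str):
        lowered = value.strip().lower()
        if lowered in TRUE_ALIASES:
            return True
        if lowered in FALSE_ALIASES:
            return False
        return default  # unknown strings fall back (assumption)
    return bool(value)


def coerce_optional_nonneg_float(name: str, value):
    """Preserve None, coerce numerics, reject negatives."""
    if value is None:
        return None
    number = float(value)
    if number < 0:
        raise ValueError(f"{name} must be >= 0, got {number}")
    return number


print(coerce_seed(None))                                      # 3407
print(coerce_optional_bool(None, True))                       # True
print(coerce_optional_bool("false", True))                    # False
print(coerce_optional_nonneg_float("max_grad_value", None))   # None
```

The point of the bool helper is that bool(None) is False, so a plain bool(...) cast cannot distinguish "not provided" from "explicitly disabled".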

Background context (already in earlier rounds):

  • Feature-detection on MLXTrainingConfig.__dataclass_fields__ so the worker still constructs MLXTrainingConfig against released unsloth-zoo (which predates cast_norm_output_to_input_dtype and dataset_order). Once the floor bumps, the guards become no-ops.
  • All staging CI legs (mlx-compiler-linux, mlx-smoke-macos, install-smoke-windows) ran green against the prior round-2 head; re-triggering against 1a026435 + 23751c84 now.

Not changed in this PR:

  • Studio frontend wiring for max_grad_value / cast_norm_output_to_input_dtype. The backend schema accepts both, raw / backend callers can supply them, and the worker side is now portable across unsloth-zoo releases. Adding the UI control is a separate frontend task.

Yell if anything looks off.

The block-extraction used [...], which stops at the
first inner closing paren (e.g. [...])
and would silently miss a future unconditional
[...] / [...] added later in the same dict literal. Switched to
proper paren-depth tracking so the unconditional block is checked end-to-end.
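
As an illustration of paren-depth tracking, the sketch below extracts a call block from source text. The original pattern is not recoverable from this message, so the non-greedy regex here is only one plausible way to get the early-stop behaviour; `extract_call_block` and the sample source are hypothetical.

```python
import re


def extract_call_block(source: str, opener: str) -> str:
    """Return the text from `opener` through its matching closing paren."""
    start = source.index(opener)
    depth = 0
    for pos in range(start, len(source)):
        char = source[pos]
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:
                return source[start : pos + 1]
    raise ValueError("unbalanced parentheses")


sample = "kwargs = dict(seed = int(seed), lr = float(lr))\nother = 1"

naive = re.search(r"dict\(.*?\)", sample).group(0)
full = extract_call_block(sample, "dict(")

print(naive)  # dict(seed = int(seed)
print(full)   # dict(seed = int(seed), lr = float(lr))
```

This simple counter does not skip parentheses inside string literals or comments, which is acceptable for a source-text assertion on known code but not for arbitrary input.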

@mmathew23

Collaborator Author

Added commit 65cd01954 to wire the new zoo append_eos option from Studio by training mode.

Rationale / parity check:

  • unsloth-zoo now defaults MLXTrainingConfig.append_eos=True, which is correct for generic/direct raw text callers because it preserves the old mlx-lm-style EOS behavior.
  • Studio SFT formatting is different: for Alpaca/chat-template formatted rows, Studio has already rendered the final text example, and CUDA Studio/TRL does not add an extra EOS for the Qwen3 formatted SFT text path we are using for parity.
  • I verified on the CUDA side with unsloth/Qwen3-0.6B: the processed SFT row remains length 44 and does not end with eos_token_id.
  • On MLX with zoo default append_eos=True, the same fixture changes the data surface: 2-step smoke reports 88 tokens and first losses [4.8962, 4.8962].
  • With Studio passing append_eos=False for SFT formatted text, MLX returns to the parity surface: 2-step smoke reports 86 tokens and first losses [4.6446, 4.6446].

The patch is intentionally conditional:

raw_text_mode = training_type == "Continued Pretraining" or format_type == "raw"
mlx_config_kwargs["append_eos"] = bool(raw_text_mode)

So raw/CPT text still lets MLX append EOS, matching the CUDA raw-text path, while formatted SFT text does not get an extra EOS behind Studio's back.
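
The same rule written as a small function, to show which combinations end up with an appended EOS. `should_append_eos` is a hypothetical wrapper around the two lines quoted above, and the "SFT", "alpaca", and "chatml" strings are made-up example values; only "Continued Pretraining" and "raw" come from the patch.

```python
def should_append_eos(training_type: str, format_type: str) -> bool:
    """Append EOS only for raw / continued-pretraining text."""
    raw_text_mode = training_type == "Continued Pretraining" or format_type == "raw"
    return bool(raw_text_mode)


print(should_append_eos("Continued Pretraining", "alpaca"))  # True
print(should_append_eos("SFT", "raw"))                       # True
print(should_append_eos("SFT", "chatml"))                    # False
```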

Validation run locally:

pytest -q studio/backend/tests/test_training_raw_support.py
# 11 passed

Rationale / guardrails for the local Studio/vision push:

When callers provide explicit VLM LoRA target_modules together with layer filters, FastVisionModel still needs to route the explicit targets through get_peft_regex. Otherwise the layer filters are ignored and adapters can be attached outside the requested language/vision scope.

Do not revert this to plain list(target_modules) for explicit module lists. The CUDA/Studio-facing contract is that explicit targets and layer filters compose: target_modules selects module names, while finetune_language_layers / finetune_vision_layers / finetune_attention_modules / finetune_mlp_modules constrain where those targets are allowed.

The regression test covers the language-only explicit q_proj case and source-checks that explicit targets are wrapped through get_peft_regex when filters are active.
@mmathew23

Collaborator Author

Reviewer / maintainer guardrail for the next Studio/vision push:

The local VLM LoRA targeting fix is intentional and should not be reverted to plain list(target_modules) handling.

When callers provide explicit VLM target_modules together with layer filters, FastVisionModel still needs to route those explicit targets through get_peft_regex. The intended contract is compositional:

  • target_modules selects module names, and
  • finetune_language_layers / finetune_vision_layers / finetune_attention_modules / finetune_mlp_modules constrain where those targets are allowed.

Without the regex wrapping, explicit target lists can ignore the language/vision layer filters and attach adapters outside the requested scope. The added test covers the language-only explicit q_proj case and source-checks that explicit targets are wrapped through get_peft_regex when filters are active.
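
A toy illustration of the compositional contract, using only the standard `re` module. This is not get_peft_regex; `build_target_regex` and the module paths are hypothetical and exist only to show why a plain name list over-selects.

```python
import re


def build_target_regex(target_modules, scopes):
    """Match module paths under an allowed scope that end in a target name."""
    scope_part = "|".join(re.escape(s) for s in scopes)
    target_part = "|".join(re.escape(t) for t in target_modules)
    return re.compile(rf"^.*(?:{scope_part}).*\.(?:{target_part})$")


modules = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.up_proj",
    "model.vision_tower.blocks.0.attn.q_proj",
]

# Plain list matching: every module whose last component is a target name.
plain = [m for m in modules if m.split(".")[-1] in ["q_proj"]]

# Composed matching: target name AND language-only scope.
pattern = build_target_regex(["q_proj"], ["language_model"])
scoped = [m for m in modules if pattern.match(m)]

print(len(plain))  # 2, includes the vision tower q_proj
print(scoped)      # ['model.language_model.layers.0.self_attn.q_proj']
```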

@danielhanchen

Member

Verified the Studio-side wiring for max_grad_leaf_norm (commit d66f4a7) against the underlying unsloth-zoo trainer change. Posting end-to-end evidence so the API contract is clear.

Plumbing path verified

request.max_grad_leaf_norm (Pydantic Optional[float], ge=0)
-> TrainingBackend.start_training (coerced via _coerce_optional_nonneg_float)
-> studio/backend/core/training/worker.py (re-validates non-negative, feature-detects __dataclass_fields__ on MLXTrainingConfig)
-> MLXTrainingConfig(max_grad_leaf_norm=...) on the worker side.

Feature-detect guard keeps backwards compat with older unsloth-zoo releases that predate the field: if the dataclass doesn't expose it, the kwarg is dropped and the worker falls through to the trainer's runtime default. So this PR is safe to land before or after PR unslothai/unsloth-zoo#684.

Tests passing (15)

studio/backend/tests/test_mlx_training_worker_config.py    4 passed
studio/backend/tests/test_training_raw_support.py         11 passed

Including the new test_training_backend_forwards_grad_clipping_controls which pins the kwarg surface against silent regression.

Why the new default matters in Studio

Studio users default to LoRA training on Apple Silicon where memory headroom is tight. The new MLX default (max_grad_leaf_norm=1.0 instead of max_grad_value=1.0):

  • Preserves each tensor's gradient direction (closer to CUDA max_grad_norm dynamics, the HF reference).
  • Pays no cross-tree reduction memory cost (verified on macos-14: max_grad_norm is +2.7 MB peak and +9-10% step time vs leaf_norm on gemma-3-270m, scales linearly with trainable params).
  • Doesn't break existing Studio runs: identical convergence step on a 30-step memorisation fixture (mean abs delta 0.01 loss vs the prior elementwise default).
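
For readers unfamiliar with the three modes, a pure-Python sketch of the usual definitions follows. It assumes gradients are a flat dict of lists; whether unsloth-zoo implements each mode exactly this way is not confirmed by this thread.

```python
import math


def clip_value(grads, limit):
    """Elementwise: clamp every entry to [-limit, limit]."""
    return {k: [max(-limit, min(limit, g)) for g in v] for k, v in grads.items()}


def clip_leaf_norm(grads, limit):
    """Per tensor: rescale each tensor whose own L2 norm exceeds limit."""
    out = {}
    for k, v in grads.items():
        norm = math.sqrt(sum(g * g for g in v))
        scale = min(1.0, limit / norm) if norm > 0 else 1.0
        out[k] = [g * scale for g in v]
    return out


def clip_global_norm(grads, limit):
    """Whole tree: one scale from the L2 norm over all tensors together."""
    total = math.sqrt(sum(g * g for v in grads.values() for g in v))
    scale = min(1.0, limit / total) if total > 0 else 1.0
    return {k: [g * scale for g in v] for k, v in grads.items()}


grads = {"a": [3.0, 4.0], "b": [0.3, 0.4]}

print(clip_value(grads, 1.0))        # "a" becomes [1.0, 1.0]: the 3:4 direction is lost
print(clip_leaf_norm(grads, 1.0))    # "a" is rescaled to norm 1, "b" is untouched
print(clip_global_norm(grads, 1.0))  # both tensors shrink by the same factor
```

Only the global variant needs a reduction across every tensor before any of them can be scaled, which is the extra cost the second bullet refers to.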

Detailed parity data in unslothai/unsloth-zoo#684 review comment.

Tested across precedence

| Studio request input | Resolved trainer mode |
| --- | --- |
| nothing set | ("leaf_norm", 1.0) (new MLX default) |
| max_grad_value=1.5 | ("value", 1.5) (preserves API meaning) |
| max_grad_leaf_norm=2.5 | ("leaf_norm", 2.5) |
| max_grad_norm=1.0 | ("global_norm", 1.0) |
| both max_grad_value=1.0 and max_grad_leaf_norm=1.0 | ("value", 1.0) (explicit value wins) |

Each resolution path covered by tests/test_mlx_max_grad_value_none.py in the zoo PR (13 tests).
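
One way to express the precedence listed above as code. `resolve_clip_mode` is a hypothetical function, not the zoo resolver; the ordering between max_grad_leaf_norm and max_grad_norm when both are set is an assumption, since that combination is not listed.

```python
def resolve_clip_mode(max_grad_value=None, max_grad_leaf_norm=None, max_grad_norm=None):
    """Return (mode, threshold) for the given request inputs."""
    if max_grad_value is not None:
        return ("value", float(max_grad_value))  # explicit value wins
    if max_grad_leaf_norm is not None:
        return ("leaf_norm", float(max_grad_leaf_norm))
    if max_grad_norm is not None:
        return ("global_norm", float(max_grad_norm))
    return ("leaf_norm", 1.0)  # default when nothing is set


print(resolve_clip_mode())                                            # ('leaf_norm', 1.0)
print(resolve_clip_mode(max_grad_value=1.5))                          # ('value', 1.5)
print(resolve_clip_mode(max_grad_value=1.0, max_grad_leaf_norm=1.0))  # ('value', 1.0)
```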

LGTM from my side on the Studio plumbing.

Trim the 11-line comment block to 5 lines and correct the stale claim
that MLXTrainingConfig defaults to max_grad_value=1.0. The new default
is max_grad_leaf_norm=1.0 (same memory profile as elementwise but
direction-preserving). The smoke still pins max_grad_value=1.0
explicitly to keep the 13-seed pass-rate fixture stable.

Merges 116 main commits (gemini provider, oxc validator package-lock,
uninstall script relocation, lockfile audit, etc). Two content conflicts
resolved:

  - studio/backend/tests/test_mlx_training_worker_config.py: both branches
    appended a new test (HEAD's tokenizer dual-purpose check, main's VLM
    resize math). Both kept side-by-side; both pass.
  - tests/studio/run_real_mlx_smoke.py: HEAD's stronger len + train_steps
    assertion kept; main's auto-following comment kept.

16 Studio backend tests pass post-merge.

Comment thread unsloth/models/vision.py
tuple,
str,
)
if type(target_modules) in (list, tuple) and (

Collaborator

NIT: Should we at least warn that both were specified and that we are choosing one over the other?

"weight_decay": request.weight_decay,
"max_grad_norm": request.max_grad_norm,
"max_grad_value": request.max_grad_value,
"cast_norm_output_to_input_dtype": request.cast_norm_output_to_input_dtype,

Collaborator

NIT: Shouldn't there be a max_grad_leaf_norm entry here?
