
studio: improve onboarding UX, tooltips, and training defaults #4355

Merged
danielhanchen merged 64 commits into main from studio/onboarding-improvements
Mar 17, 2026
Conversation

@danielhanchen

Member

Summary

  • Change splash text to "Train and run LLMs locally"
  • Add "Chat Only" card with BubbleChatIcon on step 1 to skip directly to /chat
  • Add Skip/Skip to Chat buttons in sidebar (step 1 = "Skip to Chat", steps 2+ = "Skip") and footer (step 1 Back button = "Skip" returning to splash screen)
  • Make all tooltips clickable in addition to hover (via React context in tooltip component)
  • Strip surrounding quotes from pasted HF tokens (fixes "invalid or expired token" when pasting "hf_...")
  • Rename "Eval Split" to "Evaluation Split"
  • Add SparklesIcon to "Auto Detect" format option in dataset step
  • Change step 4 heading from "Training" to "Choose your training parameters"
  • Default max_steps changed to 60
  • Learning rate displayed in scientific notation (e.g. 2e-4) with +/- steppers following natural sequence (1e-4, 2e-4, ..., 9e-4, 1e-3, ...)
  • Context length options capped by model's max_position_embeddings fetched via AutoConfig; falls back to 64K if unavailable
  • Fix "QLORA"/"LORA" to "QLoRA"/"LoRA" in summary step
  • Change "Watch video guide" to "Get started with our guide" linking to https://unsloth.ai/docs/new/studio/start
  • Update intro text to "Choose a path - fine-tune LLMs, vision, embedding, audio models or just chat"
  • Backend: add max_position_embeddings field to model config endpoint response (via AutoConfig with text_config fallback)
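The learning-rate stepper described in the list above can be illustrated with a short sketch. This is a Python illustration of the 1e-4, 2e-4, ..., 9e-4, 1e-3 sequence only; the real formatLR and stepLR helpers live in the TypeScript frontend, and the names `format_lr` and `step_lr`, as well as the rounding to one significant digit, are assumptions made for this example.

```python
def _split(value: float) -> tuple[int, int]:
    """Round a positive value to one significant digit: (mantissa, exponent)."""
    mantissa, exponent = f"{value:.0e}".split("e")  # "2e-04" -> ("2", "-04")
    return int(mantissa), int(exponent)


def format_lr(value: float) -> str:
    """Render a learning rate in compact scientific notation, e.g. 2e-4."""
    mantissa, exponent = _split(value)
    return f"{mantissa}e{exponent}"


def step_lr(value: float, direction: int) -> float:
    """Move one step up (direction > 0) or down along 1e-4, 2e-4, ..., 9e-4, 1e-3."""
    mantissa, exponent = _split(value)
    if direction > 0:
        # 9e-4 rolls over to 1e-3
        mantissa, exponent = (mantissa + 1, exponent) if mantissa < 9 else (1, exponent + 1)
    else:
        # 1e-3 rolls back to 9e-4
        mantissa, exponent = (mantissa - 1, exponent) if mantissa > 1 else (9, exponent - 1)
    return float(f"{mantissa}e{exponent}")
```

Working on the mantissa and exponent directly, instead of adding a fixed increment, is what keeps the sequence "natural" across decade boundaries.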

Test plan

  • Verify onboarding splash screen shows "Train and run LLMs locally"
  • Click "Chat Only" card on step 1 -- should navigate to /chat
  • Verify Skip/Skip to Chat buttons appear correctly per step
  • Click tooltip info icons -- should toggle on click and hover
  • Paste HF token with quotes -- should strip quotes and validate
  • Verify context length dropdown is capped by model's max context
  • Verify learning rate shows scientific notation and +/- works
  • Verify summary step shows QLoRA/LoRA instead of QLORA/LORA
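For the quoted-token case in the test plan, the intended behavior can be sketched as below. This is an assumption about the behavior rather than the actual setHfToken code (which is TypeScript); the function name `strip_surrounding_quotes` is hypothetical, and whether mismatched quotes should also be stripped is not stated in the PR.

```python
def strip_surrounding_quotes(token: str) -> str:
    """Remove one pair of matching quotes wrapped around a pasted token."""
    token = token.strip()
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "\"'":
        return token[1:-1].strip()
    return token  # unquoted input is returned unchanged (apart from trimming)
```

A pasted value such as `"hf_..."` would then reach validation as `hf_...`, which is the case the "invalid or expired token" fix targets.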

- Change splash text to "Train and run LLMs locally"
- Add "Chat Only" card with BubbleChatIcon to skip directly to chat
- Add Skip/Skip to Chat buttons in sidebar and footer
- Back button on step 1 returns to splash screen instead of being disabled
- Change "Watch video guide" to "Get started with our guide" with new URL
- Update intro text to mention all model types + chat
- Make all tooltips clickable (in addition to hover) via React context
- Strip surrounding quotes from pasted HF tokens
- Rename "Eval Split" to "Evaluation Split"
- Add SparklesIcon to "Auto Detect" format option
- Change step 4 heading to "Choose your training parameters"
- Default max_steps to 60
- Learning rate displayed in scientific notation with +/- stepper
- Context length options capped by model's max_position_embeddings (via AutoConfig)
- Fix "QLORA"/"LORA" to "QLoRA"/"LoRA" in summary step
- Backend: add max_position_embeddings to model config endpoint
@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the user experience of the Unsloth Studio onboarding process. It introduces clearer navigation options, improves the usability of interactive elements like tooltips and input fields, and refines the presentation of training parameters. Additionally, it adds backend support for model context length, allowing for more intelligent default settings and user guidance during model configuration.

Highlights

  • Onboarding UX Improvements: Updated splash screen text, introduced a 'Chat Only' card for direct navigation, and added 'Skip'/'Skip to Chat' buttons in the sidebar and footer.
  • Tooltip Functionality: Made all tooltips clickable in addition to their existing hover functionality.
  • Hugging Face Token Handling: Implemented automatic stripping of surrounding quotes from pasted Hugging Face tokens to prevent validation errors.
  • Training Parameter Refinements: Renamed 'Eval Split' to 'Evaluation Split', added a SparklesIcon to 'Auto Detect' dataset format, changed the step 4 heading, and defaulted max_steps to 60.
  • Learning Rate Display and Input: Updated learning rate display to scientific notation and introduced steppers that follow a natural sequence (e.g., 1e-4, 2e-4, ..., 1e-3).
  • Context Length Configuration: Capped context length options based on the model's max_position_embeddings, with a fallback to 64K if not available.
  • Terminology Consistency: Corrected 'QLORA'/'LORA' to 'QLoRA'/'LoRA' in the summary step.
  • Documentation and Guidance: Updated the introductory text and changed 'Watch video guide' to 'Get started with our guide' with a new link.
  • Backend Model Configuration: Added max_position_embeddings to the model config endpoint response, fetched via AutoConfig.

Changelog
  • studio/backend/models/models.py
    • Added max_position_embeddings field to ModelDetails for storing the maximum context length.
  • studio/backend/routes/models.py
    • Implemented logic to extract max_position_embeddings from ModelConfig or AutoConfig and include it in the ModelDetails response.
  • studio/frontend/src/components/ui/tooltip.tsx
    • Modified the Tooltip component to support toggling visibility on click, in addition to hover.
  • studio/frontend/src/config/training.ts
    • Updated the default maxSteps hyperparameter from 0 to 60.
  • studio/frontend/src/features/onboarding/components/splash-screen.tsx
    • Changed the splash screen subtitle text to "Train and run LLMs locally".
  • studio/frontend/src/features/onboarding/components/steps/dataset-step.tsx
    • Imported SparklesIcon and added it next to the "Auto Detect" option in the dataset format selection.
  • studio/frontend/src/features/onboarding/components/steps/hyperparameters-step.tsx
    • Removed unused Input component import.
    • Added formatLR and stepLR utility functions for scientific notation learning rate handling.
    • Updated the FieldLegend text for the training parameters section.
    • Implemented logic to filter CONTEXT_LENGTHS based on the model's maxPositionEmbeddings.
    • Replaced the learning rate input field with custom stepper buttons that display and adjust values in scientific notation.
  • studio/frontend/src/features/onboarding/components/steps/model-type-step.tsx
    • Imported BubbleChatIcon, markOnboardingDone, and useNavigate.
    • Updated the introductory text and the link for the video guide.
    • Added a new "Chat Only" card that allows users to skip directly to the chat interface.
  • studio/frontend/src/features/onboarding/components/steps/summary-step.tsx
    • Corrected the display of training methods from "QLORA"/"LORA" to "QLoRA"/"LoRA".
    • Changed the learning rate display to use exponential notation (toExponential()).
  • studio/frontend/src/features/onboarding/components/wizard-footer.tsx
    • Modified the "Back" button to act as a "Skip" button, returning to the splash screen when on the first step.
  • studio/frontend/src/features/onboarding/components/wizard-layout.tsx
    • Passed the onBackToSplash callback to WizardFooter.
  • studio/frontend/src/features/onboarding/components/wizard-sidebar.tsx
    • Added "Skip" and "Skip to Chat" buttons to the sidebar, visible on both mobile and desktop.
  • studio/frontend/src/features/training/api/models-api.ts
    • Added max_position_embeddings as an optional field to the ModelConfigResponse interface.
  • studio/frontend/src/features/training/components/hf-dataset-subset-split-selectors.tsx
    • Renamed the "Eval Split" label to "Evaluation Split" in the dataset selectors.
  • studio/frontend/src/features/training/stores/training-config-store.ts
    • Added maxPositionEmbeddings to the initial state and non-persisted keys.
    • Updated loadAndApplyModelDefaults to store max_position_embeddings from model details.
    • Modified setHfToken to automatically strip surrounding quotes from the input token.
  • studio/frontend/src/features/training/types/config.ts
    • Added maxPositionEmbeddings to the TrainingConfigState interface.
  • studio/frontend/src/hooks/use-hf-token-validation.ts
    • Implemented logic to strip surrounding quotes from the debounced Hugging Face token before validation.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces several valuable user experience enhancements to the Studio onboarding process. The changes, which span both the frontend and backend, include making tooltips clickable, improving training parameter controls with scientific notation for learning rates, and adding more flexible navigation options like skipping to chat. The backend now correctly determines and exposes the model's maximum context length. The code is generally well-structured, but I've noted a couple of areas for improvement: one regarding code duplication in the backend that could be refactored for better maintainability, and a minor UX inconsistency in a navigation flow.

Comment on lines +400 to +403
if max_position_embeddings is None:
    try:
        from transformers import AutoConfig as _AutoConfig


medium

This logic for extracting max_position_embeddings is a duplicate of the logic in lines 388-391. To improve maintainability and reduce redundancy, consider extracting this into a helper function.

For example:

def _get_max_pos_embeddings(config_obj):
    if hasattr(config_obj, "max_position_embeddings"):
        return config_obj.max_position_embeddings
    if hasattr(config_obj, "text_config") and hasattr(config_obj.text_config, "max_position_embeddings"):
        return config_obj.text_config.max_position_embeddings
    return None

You could then call this helper in both places to keep the code DRY.
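As a quick check of the suggested helper, the sketch below exercises it against stand-in config objects. `SimpleNamespace` is used here only to imitate a config object; in the backend the argument would presumably be whatever `AutoConfig.from_pretrained` returns, and the sample values are made up for illustration.

```python
from types import SimpleNamespace


def _get_max_pos_embeddings(config_obj):
    """Helper as proposed in the review comment above."""
    if hasattr(config_obj, "max_position_embeddings"):
        return config_obj.max_position_embeddings
    if hasattr(config_obj, "text_config") and hasattr(
        config_obj.text_config, "max_position_embeddings"
    ):
        return config_obj.text_config.max_position_embeddings
    return None


# Text-only model: the field sits on the top-level config.
text_model = SimpleNamespace(max_position_embeddings=4096)
# Multimodal model: the field is nested under text_config.
vision_model = SimpleNamespace(text_config=SimpleNamespace(max_position_embeddings=8192))
# Config that exposes neither location.
unknown_model = SimpleNamespace()

limits = [_get_max_pos_embeddings(c) for c in (text_model, vision_model, unknown_model)]
```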

className="mt-3 hidden text-xs text-muted-foreground md:flex"
onClick={() => {
markOnboardingDone();
navigate({ to: "/studio" });


medium

There's a minor UX inconsistency here. The button's label is "Skip to Chat" on the first step, but it navigates to /studio. For a more intuitive experience, it should navigate to /chat to match the label, similar to how the "Chat Only" card on the same step behaves.

Suggested change
- navigate({ to: "/studio" });
+ navigate({ to: currentStep === 1 ? "/chat" : "/studio" });

Manan17 and others added 20 commits March 17, 2026 10:28

studio: disable thinking for Qwen3.5 <9B and always for AI Assist

- Change Qwen3.5 thinking threshold from <=2B to <9B (0.8B, 2B, 4B
  all disable thinking by default; 9B+ enables it)
- Always pass enable_thinking=False in AI Assist helper calls
  (_run_with_helper and _generate_with_backend) regardless of chat
  thinking settings

studio: address PR review comments

- Extract _get_max_position_embeddings helper to DRY config extraction
- Fix "Skip to Chat" to navigate to /chat on step 1 (was /studio)

studio: skip Shiki highlighting for incomplete SVG code fences

While streaming SVG content, the syntax highlighter (Shiki) re-parses
the entire growing SVG on every token, blocking the main thread and
freezing the code area until the fence closes. Show a plain-text
preview for incomplete SVG fences instead, similar to how Mermaid
diagrams show a placeholder while streaming.

studio: fix default top_k from 50/40 to 20 for chat inference

Per Qwen3.5 docs (unsloth.ai/docs/models/qwen3.5), top_k should be 20
for both thinking and non-thinking modes. The model-specific config in
inference_defaults.json already had top_k=20 for Qwen3.5, but the
generic fallback defaults were wrong:
- Frontend DEFAULT_INFERENCE_PARAMS.topK: 50 -> 20
- Backend generate_chat_completion top_k: 40 -> 20
- Backend generate_chat_completion_with_tools top_k: 40 -> 20
- Frontend title generation top_k: 40 -> 20

studio: set universal inference defaults for unknown models

Default params for any model without specific config:
  temperature=0.6, top_p=0.95, top_k=20, min_p=0.01,
  presence_penalty=0.0, repetition_penalty=1.0

Models with entries in inference_defaults.json (Qwen3.5, Gemma-3,
Llama, etc.) override these with their recommended values.

Updated in: frontend DEFAULT_INFERENCE_PARAMS, backend Pydantic
request models, and backend generate_chat_completion defaults.

studio: only trust_remote_code for unsloth/ models in AutoConfig

Only set trust_remote_code=True when the model name starts with
"unsloth/". All other models default to False for safety.

studio: move Generating spinner above the composer

The "Generating" spinner was below the send message bar, causing
the bar to jump up and down. Move it above the composer in both
the regular thread view and the welcome/empty view.

studio: adjust toast close button position away from edge

Move the X close button on toasts (like "Starting model...") from
top-1.5 to top-3 and add right-3, giving more breathing room from
the top-right corner.

studio: make Think button smaller with tighter icon-text gap

Reduce gap from 1.5 to 0.5, padding from px-2.5/py-1 to px-2/py-0.5,
and icon from size-3.5 to size-3.

studio: multiple onboarding and chat UX improvements

- Move Generating spinner above composer (fixes jumping send bar)
- Make Think button smaller with tighter icon-text gap
- Chat card now inside grid (same size as Audio/Embeddings cards)
- Rename "Chat Only" to "Chat"
- Chat card requires Continue to proceed (no auto-advance)
- Continue on Chat selection skips onboarding and goes to /chat
- Tooltip (i) click on Chat card doesn't trigger navigation
- Step 1 footer Back button goes back to splash (label is "Back")
- Splash "Skip Onboarding" renamed to "Skip to Chat", navigates to /chat
- Toast close button moved away from edge

studio: align Skip to Chat button, add Skip to footer

- Sidebar "Skip to Chat" now uses primary (green) Button style with
  arrow icon, full width, aligned like step items. Shows on all steps.
- Footer: added "Skip" outline button next to Continue that goes
  directly to /studio with progress saved (markOnboardingDone)

studio: change default max steps from 30 to 60 in toggle hook

The DEFAULT_MAX_STEPS in use-max-steps-epochs-toggle.ts was still 30,
used as fallback when toggling from epochs back to max steps.

studio: extend context length options to 262K

CONTEXT_LENGTHS now includes 65536, 131072, 262144 in addition to
the existing 512-32768 range. The onboarding step filters these by
the model's max_position_embeddings (e.g. Nemotron-3-Nano-4B has
262144), showing powers of 2 up to the model's maximum.

studio: auto-select LoRA vs QLoRA based on model size and GPU memory

After selecting a model in onboarding, detect the total model weight
file size from HF Hub (safetensors/bin files). Then estimate memory
needed: model_size_gb * 1.5 * context_scale, where context_scale is:
  - <=8192 tokens: 1.0x
  - >8192 tokens: 1.7x
  - >=16384 tokens: 2.0x
  - >=32768 tokens: 4.0x

If the estimate fits in free GPU VRAM, default to LoRA (16-bit).
Otherwise default to QLoRA (4-bit).

Backend changes:
- Add model_size_bytes to ModelDetails (models.py)
- Add _get_model_size_bytes() using HfApi.repo_info (routes/models.py)
- Add vram_free_gb to get_gpu_summary (hardware.py)

Frontend changes:
- Add autoSelectTrainingMethod() in training-config-store.ts
- Called after model defaults are loaded
- Add model_size_bytes to ModelConfigResponse type
- Add vramFreeGb to HardwareInfo hook

danielhanchen and others added 26 commits March 17, 2026 12:28
For GGUF repos, the trash icon now appears on each downloaded variant
row inside the quantization expander instead of on the repo-level row.
Backend accepts optional variant param to delete specific GGUF files
(blob + symlink) rather than the entire repo cache.

The Max Tokens slider was capped at 32768 on page refresh because
ggufContextLength was not restored from the status response.
Now set it from statusRes.context_length on reconnect.

The train-on-responses-only feature uses template markers to find
where the assistant response starts. The Qwen3.5 response marker
included '<think>\n' which is only present when thinking mode is
enabled. With thinking disabled (default for <9B), the marker
never matched, causing 100% of samples to be dropped.

Changed response marker from '<|im_start|>assistant\n<think>\n'
to '<|im_start|>assistant\n' which works regardless of thinking mode.
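The marker mismatch described in this commit can be reproduced with plain string search. The sketch below is illustrative only: `response_start` is a hypothetical helper, not the function used by the train-on-responses-only feature, and the sample conversation is made up.

```python
OLD_MARKER = "<|im_start|>assistant\n<think>\n"
NEW_MARKER = "<|im_start|>assistant\n"


def response_start(text: str, marker: str):
    """Return the index where the assistant response begins, or None if the marker is absent."""
    idx = text.find(marker)
    return None if idx == -1 else idx + len(marker)


# A sample rendered with thinking disabled: no <think> block after the assistant tag.
sample = (
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\nHello!<|im_end|>\n"
)

old_hit = response_start(sample, OLD_MARKER)  # marker never matches, so the sample would be dropped
new_hit = response_start(sample, NEW_MARKER)  # matches with or without a <think> block
```

Because the shorter marker is a prefix of the longer one, it also matches samples rendered with thinking enabled.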
Register python and terminal tools alongside web search. Python
executor validates imports (stdlib only) via unsloth_zoo
rl_environments, runs code in a subprocess sandbox with 5-min
timeout and cancel support. Terminal executor blocks dangerous
commands (rm, sudo, etc.) and runs in a temp directory.

Update llama_cpp tool loop to show tool-specific status messages
and pass cancel_event through to executors. Rename composer
toggle from "Search" to "Tools" and show TerminalIcon for
execution status pills.

… port binding

Backend:
- Dynamic transformers 5.x detection via tokenizer_config.json fetch
  (checks for TokenizersBackend class, cached per-model)
- Bump transformers 5.x version from 5.2.0 to 5.3.0 across all workers,
  setup scripts (setup.sh, setup.ps1)
- Auto-enable trust_remote_code for unsloth/* models needing transformers 5.x
  (workaround for NemotronH config parsing bug in transformers)
- Auto-install mamba-ssm/causal-conv1d for SSM models (NemotronH, Falcon-H1)
  with --no-build-isolation --no-deps to avoid torch version conflicts
- Add SO_REUSEADDR to port check in run.py (fixes Colab proxy stale connection
  falsely reporting port as in-use)

Frontend:
- Fix "Skip to Chat" navigation: use window.location.href instead of React
  Router navigate() to bypass useEffect redirect race
- Fix "Skip Onboarding" on splash: navigates to /studio (not /chat)
- Fix onboarding guard: only check isOnboardingDone() on initial mount
- Fix Chat card on step 1: add sr-only spacer for consistent alignment
- Fix Chat+Text both selected: clear RadioGroup value when Chat is selected

Replace the single "Tools" toggle with two independent toggles:
- "Search" (globe icon) enables web search only
- "Code" (terminal icon) enables Python and terminal execution

Add enabled_tools list field to the inference payload so the
backend only registers the tools the user has toggled on. Both
toggles appear in the main composer and the compare composer.

Replace unsloth_zoo-dependent import checker with a standalone
ast-based validator using sys.stdlib_module_names. This properly
blocks non-stdlib imports (numpy, requests, etc.) and returns a
clear error message to the model so it can rewrite using only
stdlib.

Add full traceback to tool streaming error logs for debugging.
gpt-oss models emit multi-channel output via harmony protocol tokens
(<|channel|>analysis<|message|>... and <|channel|>final<|message|>...).
TextIteratorStreamer with skip_special_tokens=True strips the special
tokens but leaves channel names concatenated with content, producing
garbled output like "analysisWe need to...assistantfinalHello!".

Add HarmonyTextStreamer that decodes with skip_special_tokens=False,
parses harmony markup via regex, and emits <think>analysis</think>
for the analysis channel and plain text for the final channel --
reusing the existing frontend reasoning UI.

Also expose supports_reasoning=True for non-GGUF gpt-oss models in
the /status endpoint so the frontend enables the Think toggle.

Set UNSLOTH_IS_PRESENT=1 and import check_python_modules and
check_signal_escape_patterns directly from unsloth_zoo instead
of a standalone fallback. This gives us the full Unsloth
validation including stdlib-only import checks and signal/timeout
escape pattern detection.

Remove stdlib-only import restriction. Keep signal escape
pattern detection via unsloth_zoo for safety.

The 0.5s read timeout used for cancel-checking during streaming
also fires when waiting for the first response from llama-server
(e.g. reasoning model thinking for 15+ seconds). Add
_stream_with_retry() context manager that retries on ReadTimeout
while checking cancel_event, so the model has unlimited time to
think before producing the first token. Applied to both the
regular streaming path and the tool-calling final pass.
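The retry-on-timeout idea can be shown without an HTTP client. In the sketch below a `queue.Queue` stands in for the llama-server stream and `queue.Empty` stands in for the ReadTimeout; both substitutions, the function name `first_item_with_retry`, and the polling interval are assumptions for illustration, and the real `_stream_with_retry()` is a context manager whose code is not shown in this PR text.

```python
import queue
import threading


def first_item_with_retry(
    source: queue.Queue,
    cancel_event: threading.Event,
    poll_seconds: float = 0.05,
):
    """Wait without an overall deadline for the first item, waking up to honor cancellation."""
    while not cancel_event.is_set():
        try:
            return source.get(timeout=poll_seconds)
        except queue.Empty:
            continue  # a timeout only means "nothing yet", so keep waiting
    return None  # cancelled before anything arrived
```

The short timeout is kept purely as a wake-up interval for the cancel check, so a slow first token no longer counts as a failure.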
The delta-on-transformed approach had two critical bugs:

1. Before the full <|channel|>X<|message|> pattern was complete, the
   strip-tokens fallback emitted "analysis" as plain text. Then when
   the regex matched, _transform returned a completely different format
   (<think>...</think>) and the delta was computed against the wrong
   base string, producing fragments like "think>", "nk>", ">".

2. Even with full matches, the closing </think> tag shifted position
   as content grew, so text[prev_len:] produced garbled deltas.

Replace with stateful incremental parsing that:
- Buffers until a complete channel+message pair is seen
- Emits <think> once when analysis channel first appears
- Streams analysis content deltas (computed on channel content directly)
- Emits </think> once when final channel first appears
- Streams final content deltas
- Closes open think tags in end()

Also skip the generic all_special_tokens stripping in
_clean_generated_text for gpt-oss since HarmonyTextStreamer already
produces clean output and the generic stripping was mangling <think>
tags.
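The stateful approach listed above can be sketched as follows. This is a simplified illustration, not the HarmonyTextStreamer from the PR: the class name `HarmonyStreamSketch` is hypothetical, it assumes at most one analysis and one final segment per response, and it treats any `<|` as the start of a control token.

```python
import re

# Matches a complete "<|channel|>NAME<|message|>" header.
_HEADER = re.compile(r"<\|channel\|>(\w+)<\|message\|>")


def _visible(segment: str) -> str:
    """Return the part of a channel body that is safe to show so far."""
    cut = segment.find("<|")
    if cut != -1:
        return segment[:cut]  # stop at the next control token
    # Hold back a trailing "<" that may be the start of a control token.
    return segment[:-1] if segment.endswith("<") else segment


class HarmonyStreamSketch:
    def __init__(self) -> None:
        self.buffer = ""  # raw decoded text, special tokens included
        self.sent = {"analysis": 0, "final": 0}  # characters already emitted
        self.think_open = False
        self.think_done = False

    def feed(self, chunk: str) -> str:
        """Add newly decoded text and return only the new output to display."""
        self.buffer += chunk
        headers = list(_HEADER.finditer(self.buffer))
        out = []
        for i, header in enumerate(headers):
            channel = header.group(1)
            if channel not in self.sent:
                continue
            end = headers[i + 1].start() if i + 1 < len(headers) else len(self.buffer)
            body = _visible(self.buffer[header.end():end])
            if channel == "analysis" and not (self.think_open or self.think_done):
                out.append("<think>")  # emitted once
                self.think_open = True
            if channel == "final" and self.think_open:
                out.append("</think>")  # emitted once
                self.think_open, self.think_done = False, True
            if len(body) > self.sent[channel]:
                out.append(body[self.sent[channel]:])  # delta on channel content only
                self.sent[channel] = len(body)
        return "".join(out)

    def end(self) -> str:
        """Close a think tag left open by a truncated stream."""
        return "</think>" if self.think_open else ""
```

Because deltas are computed on each channel's own content rather than on the transformed string, the position of the closing tag never affects what has already been emitted.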
Integrates generalized model comparison into the onboarding-improvements
branch. Resolves import conflict in shared-composer.tsx and fixes unused
variable in compare flow.

…bset

The gpt-oss tokenizer has added tokens like <|return|> (id=200002) that
are not part of the harmony channel protocol but can leak into output.
The previous regex only stripped channel|message|start|end tokens.

Broaden the _clean_generated_text regex for gpt-oss to <\|[a-z_]+\|>
which catches all pipe-delimited tokens (return, constrain, reserved,
etc.) without matching <think>/<\/think> tags.

Verified: gpt-oss all_special_tokens are only <|return|>,
<|reserved_200017|>, <|startoftext|> -- none overlap with <think>.
The harmony tokens (channel, message, start, end) are added_tokens
but not in all_special_tokens.
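The effect of the broadened pattern can be checked directly. The sketch below applies the regex exactly as written in the commit message; `strip_pipe_tokens` is a hypothetical wrapper, not the body of `_clean_generated_text`.

```python
import re

# Pattern as quoted in the commit message: lowercase letters and underscores only.
PIPE_TOKEN_RE = re.compile(r"<\|[a-z_]+\|>")


def strip_pipe_tokens(text: str) -> str:
    """Remove pipe-delimited tokens such as <|return|> while leaving <think> tags alone."""
    return PIPE_TOKEN_RE.sub("", text)


cleaned = strip_pipe_tokens("<think>plan</think>Hello!<|return|>")
# Note: the character class has no digits, so a token like
# <|reserved_200017|> is not matched by this pattern as written.
untouched = strip_pipe_tokens("x<|reserved_200017|>")
```

If digit-bearing tokens also need to be stripped, the character class would have to be widened; whether the committed code does so is not visible from the commit message.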
Repos that only have metadata/config files cached (no .safetensors or
.bin weight files) were showing up in the Downloaded list with tiny
sizes like "1.8 KB" or "24 KB". These are just leftover config
snapshots from architecture checks, not usable models.

Filter the cached-models endpoint to only include repos that contain
actual model weight files (.safetensors or .bin).
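The filter amounts to a suffix check over each cached repo's file list. The sketch below assumes the caller already has the file names per repo; `has_weight_files`, `usable_repos`, and the example repo names are hypothetical, and how the endpoint actually enumerates the cache is not shown in this PR text.

```python
WEIGHT_SUFFIXES = (".safetensors", ".bin")


def has_weight_files(filenames: list[str]) -> bool:
    """True if at least one file looks like a model weight file."""
    return any(name.endswith(WEIGHT_SUFFIXES) for name in filenames)


def usable_repos(cached: dict[str, list[str]]) -> list[str]:
    """Keep only repos that contain weights, dropping config-only snapshots."""
    return [repo for repo, files in cached.items() if has_weight_files(files)]


cached = {
    "org/full-model": ["config.json", "model-00001-of-00002.safetensors"],
    "org/config-only": ["config.json", "tokenizer_config.json"],
}
downloaded = usable_repos(cached)
```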
Add explicit !text-muted-foreground to toast description classNames
so secondary text (e.g. "Releases VRAM and resets inference state.")
is readable in dark mode.

Replace sr-only span (takes no space) with a size-4 shrink-0 div
matching the RadioGroupItem dimensions in other cards, so the Chat
icon aligns vertically with Text/Audio/Vision/Embeddings icons.

danielhanchen merged commit 0acd1c7 into main Mar 17, 2026
5 checks passed
danielhanchen deleted the studio/onboarding-improvements branch March 17, 2026 14:46
shibizhao pushed a commit to shibizhao/unsloth-npu that referenced this pull request Apr 7, 2026
…thai#4355)

* studio: improve onboarding UX, tooltips, and training defaults

- Change splash text to "Train and run LLMs locally"
- Add "Chat Only" card with BubbleChatIcon to skip directly to chat
- Add Skip/Skip to Chat buttons in sidebar and footer
- Back button on step 1 returns to splash screen instead of being disabled
- Change "Watch video guide" to "Get started with our guide" with new URL
- Update intro text to mention all model types + chat
- Make all tooltips clickable (in addition to hover) via React context
- Strip surrounding quotes from pasted HF tokens
- Rename "Eval Split" to "Evaluation Split"
- Add SparklesIcon to "Auto Detect" format option
- Change step 4 heading to "Choose your training parameters"
- Default max_steps to 60
- Learning rate displayed in scientific notation with +/- stepper
- Context length options capped by model's max_position_embeddings (via AutoConfig)
- Fix "QLORA"/"LORA" to "QLoRA"/"LoRA" in summary step
- Backend: add max_position_embeddings to model config endpoint

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* compare for 2 diff models

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolving gemini comments

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: disable thinking for Qwen3.5 <9B and always for AI Assist

- Change Qwen3.5 thinking threshold from <=2B to <9B (0.8B, 2B, 4B
  all disable thinking by default; 9B+ enables it)
- Always pass enable_thinking=False in AI Assist helper calls
  (_run_with_helper and _generate_with_backend) regardless of chat
  thinking settings
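The size threshold can be illustrated with a small sketch. How Studio actually determines a model's size is not stated in this commit message, so parsing the parameter count from the model name, the function name `thinking_enabled_by_default`, and the behavior for names without a size are all assumptions.

```python
import re

# First "<number>B" in the name, e.g. "0.8B", "4B", "27B".
_SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)B\b", re.IGNORECASE)


def thinking_enabled_by_default(model_name: str) -> bool:
    """Apply the <9B rule: smaller models start with thinking disabled."""
    match = _SIZE_RE.search(model_name)
    if match is None:
        return True  # assumption: unknown size keeps thinking enabled
    return float(match.group(1)) >= 9
```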

* studio: address PR review comments

- Extract _get_max_position_embeddings helper to DRY config extraction
- Fix "Skip to Chat" to navigate to /chat on step 1 (was /studio)

* fix: comment out debug print statements

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: skip Shiki highlighting for incomplete SVG code fences

While streaming SVG content, the syntax highlighter (Shiki) re-parses
the entire growing SVG on every token, blocking the main thread and
freezing the code area until the fence closes. Show a plain-text
preview for incomplete SVG fences instead, similar to how Mermaid
diagrams show a placeholder while streaming.

* studio: fix default top_k from 50/40 to 20 for chat inference

Per Qwen3.5 docs (unsloth.ai/docs/models/qwen3.5), top_k should be 20
for both thinking and non-thinking modes. The model-specific config in
inference_defaults.json already had top_k=20 for Qwen3.5, but the
generic fallback defaults were wrong:
- Frontend DEFAULT_INFERENCE_PARAMS.topK: 50 -> 20
- Backend generate_chat_completion top_k: 40 -> 20
- Backend generate_chat_completion_with_tools top_k: 40 -> 20
- Frontend title generation top_k: 40 -> 20

* studio: set universal inference defaults for unknown models

Default params for any model without specific config:
  temperature=0.6, top_p=0.95, top_k=20, min_p=0.01,
  presence_penalty=0.0, repetition_penalty=1.0

Models with entries in inference_defaults.json (Qwen3.5, Gemma-3,
Llama, etc.) override these with their recommended values.

Updated in: frontend DEFAULT_INFERENCE_PARAMS, backend Pydantic
request models, and backend generate_chat_completion defaults.
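The override behavior amounts to a dictionary merge. The sketch below uses the generic values listed in this commit message; `resolve_inference_params` is a hypothetical name, and how inference_defaults.json is matched to a model is left out because the commit message does not describe it.

```python
GENERIC_DEFAULTS = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.01,
    "presence_penalty": 0.0,
    "repetition_penalty": 1.0,
}


def resolve_inference_params(model_overrides=None):
    """Start from the generic defaults and let model-specific values win."""
    return {**GENERIC_DEFAULTS, **(model_overrides or {})}


# Unknown model: nothing to override.
unknown = resolve_inference_params()
# Model with a recommended value for one parameter (value chosen for illustration).
tuned = resolve_inference_params({"presence_penalty": 1.5})
```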

* studio: only trust_remote_code for unsloth/ models in AutoConfig

Only set trust_remote_code=True when the model name starts with
"unsloth/". All other models default to False for safety.

* studio: move Generating spinner above the composer

The "Generating" spinner was below the send message bar, causing
the bar to jump up and down. Move it above the composer in both
the regular thread view and the welcome/empty view.

* studio: adjust toast close button position away from edge

Move the X close button on toasts (like "Starting model...") from
top-1.5 to top-3 and add right-3, giving more breathing room from
the top-right corner.

* studio: make Think button smaller with tighter icon-text gap

Reduce gap from 1.5 to 0.5, padding from px-2.5/py-1 to px-2/py-0.5,
and icon from size-3.5 to size-3.

* studio: multiple onboarding and chat UX improvements

- Move Generating spinner above composer (fixes jumping send bar)
- Make Think button smaller with tighter icon-text gap
- Chat card now inside grid (same size as Audio/Embeddings cards)
- Rename "Chat Only" to "Chat"
- Chat card requires Continue to proceed (no auto-advance)
- Continue on Chat selection skips onboarding and goes to /chat
- Tooltip (i) click on Chat card doesn't trigger navigation
- Step 1 footer Back button goes back to splash (label is "Back")
- Splash "Skip Onboarding" renamed to "Skip to Chat", navigates to /chat
- Toast close button moved away from edge

* studio: align Skip to Chat button, add Skip to footer

- Sidebar "Skip to Chat" now uses primary (green) Button style with
  arrow icon, full width, aligned like step items. Shows on all steps.
- Footer: added "Skip" outline button next to Continue that goes
  directly to /studio with progress saved (markOnboardingDone)

* studio: change default max steps from 30 to 60 in toggle hook

The DEFAULT_MAX_STEPS in use-max-steps-epochs-toggle.ts was still 30,
used as fallback when toggling from epochs back to max steps.

* studio: extend context length options to 262K

CONTEXT_LENGTHS now includes 65536, 131072, 262144 in addition to
the existing 512-32768 range. The onboarding step filters these by
the model's max_position_embeddings (e.g. Nemotron-3-Nano-4B has
262144), showing powers of 2 up to the model's maximum.
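The filtering step can be sketched as below. The exact contents of CONTEXT_LENGTHS are an assumption (powers of 2 from 512 to 262144, inferred from the ranges named in this commit message), the 64K fallback comes from the PR summary, and `available_context_lengths` is a hypothetical name for logic that lives in the TypeScript onboarding step.

```python
# Assumed list: every power of 2 from 512 up to 262144.
CONTEXT_LENGTHS = [512 * 2**i for i in range(10)]
FALLBACK_MAX = 65536  # 64K, used when the model's limit is unavailable


def available_context_lengths(max_position_embeddings=None):
    """Return the context lengths a model can be offered."""
    cap = max_position_embeddings or FALLBACK_MAX
    return [n for n in CONTEXT_LENGTHS if n <= cap]
```

A model whose limit is not itself in the list (for example 40960) is offered every option up to the largest one that fits.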

* studio: auto-select LoRA vs QLoRA based on model size and GPU memory

After selecting a model in onboarding, detect the total model weight
file size from HF Hub (safetensors/bin files). Then estimate memory
needed: model_size_gb * 1.5 * context_scale, where context_scale is:
  - <=8192 tokens: 1.0x
  - >8192 tokens: 1.7x
  - >=16384 tokens: 2.0x
  - >=32768 tokens: 4.0x

If the estimate fits in free GPU VRAM, default to LoRA (16-bit).
Otherwise default to QLoRA (4-bit).

Backend changes:
- Add model_size_bytes to ModelDetails (models.py)
- Add _get_model_size_bytes() using HfApi.repo_info (routes/models.py)
- Add vram_free_gb to get_gpu_summary (hardware.py)

Frontend changes:
- Add autoSelectTrainingMethod() in training-config-store.ts
- Called after model defaults are loaded
- Add model_size_bytes to ModelConfigResponse type
- Add vramFreeGb to HardwareInfo hook
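The estimate above can be worked through in a few lines. The sketch reads the context_scale tiers as "highest matching threshold wins" and converts bytes to GB with 1024**3; both readings, and the function names, are assumptions, since autoSelectTrainingMethod() itself is TypeScript and is not shown here.

```python
def context_scale(context_length: int) -> float:
    """Scale factor for the memory estimate, checked from the largest tier down."""
    if context_length >= 32768:
        return 4.0
    if context_length >= 16384:
        return 2.0
    if context_length > 8192:
        return 1.7
    return 1.0


def pick_training_method(model_size_bytes: int, vram_free_gb: float, context_length: int) -> str:
    """Default to LoRA (16-bit) when the estimate fits in free VRAM, else QLoRA (4-bit)."""
    model_size_gb = model_size_bytes / 1024**3  # assumed unit conversion
    estimate_gb = model_size_gb * 1.5 * context_scale(context_length)
    return "LoRA" if estimate_gb <= vram_free_gb else "QLoRA"
```

For example, 8 GB of weights at a 2048-token context gives 8 * 1.5 * 1.0 = 12 GB, which fits in 24 GB of free VRAM; at 32768 tokens the same model is estimated at 48 GB and falls back to QLoRA.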

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: rename "Importing ML libraries..." to "Importing Unsloth..."

* studio: show model/dataset in training status, fix LoRA/QLoRA casing

- Training status now shows 'Training "model_name"' and 'Dataset = ...'
  instead of generic "Starting training..."
- Fix Studio progress section to show QLoRA/LoRA instead of QLORA/LORA

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: rename 'Skip to Chat' to 'Skip Onboarding' on splash screen

* studio: add presence_penalty support for chat inference

Add presence_penalty as a parameter across the full stack:
- Backend: llama_cpp.py generate_chat_completion/with_tools, Pydantic
  models (inference.py), routes/inference.py pass-through
- Frontend: InferenceParams type, DEFAULT_INFERENCE_PARAMS (0.0),
  chat-adapter.ts payload, chat-settings-sheet.tsx slider (0-2),
  model defaults loading from inference_defaults.json
- Set Qwen3.5 default presence_penalty to 1.5 per official docs
- Default for unknown models is 0.0 (off)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: fix Chat card deselecting Text and aligning with other cards

* studio: fix presence_penalty not loading from inference defaults

The inference_config.py load_inference_config() was not including
presence_penalty in the returned config dict, so the Qwen3.5
default of 1.5 from inference_defaults.json never reached the
frontend. Added it to the config builder.

* studio: add delete button for cached models in model selector

Add trash icon on each downloaded model row (GGUF and safetensors) with
confirmation dialog. Backend DELETE /api/models/delete-cached endpoint
uses huggingface_hub scan_cache_dir + delete_revisions to cleanly remove
cached repos, refusing if the model is currently loaded.

* studio: restore inference defaults, reasoning, and tools on page refresh

On page refresh with a model already loaded, the frontend was not
re-applying model-specific inference defaults (presence_penalty,
temperature, etc.) or restoring reasoning/tools support flags.

Backend: Add inference config, supports_reasoning, supports_tools,
and context_length to InferenceStatusResponse.

Frontend: In the refresh callback, when an active model is detected,
apply mergeRecommendedInference and restore reasoning/tools flags
with proper Qwen3.5 size-based defaults.

* studio: fix delete dialog closing before async completes

Prevent AlertDialogAction's default close behavior with
e.preventDefault() so the dialog stays open during deletion.
Also block onOpenChange dismiss while deleting is in progress.

* fix: add Dict and Any imports to inference models

* studio: fix Qwen3.5 reasoning threshold in frontend load path

The frontend loadModel handler had the old threshold (<=2) for
disabling reasoning on small Qwen3.5 models. Changed to <9 to
match the backend. With the old threshold, the 4B model did not
disable thinking by default when auto-loaded.

* studio: move GGUF delete to per-variant level

For GGUF repos, the trash icon now appears on each downloaded variant
row inside the quantization expander instead of on the repo-level row.
Backend accepts optional variant param to delete specific GGUF files
(blob + symlink) rather than the entire repo cache.

* studio: restore ggufContextLength on page refresh

The Max Tokens slider was capped at 32768 on page refresh because
ggufContextLength was not restored from the status response.
Now set it from statusRes.context_length on reconnect.

* fix: remove <think> from Qwen3.5 response template marker

The train-on-responses-only feature uses template markers to find
where the assistant response starts. The Qwen3.5 response marker
included '<think>\n' which is only present when thinking mode is
enabled. With thinking disabled (default for <9B), the marker
never matched, causing 100% of samples to be dropped.

Changed response marker from '<|im_start|>assistant\n<think>\n'
to '<|im_start|>assistant\n' which works regardless of thinking mode.
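
A small sketch of why the old marker dropped every sample. The two
rendered conversations are hypothetical examples of the chat format,
and response_start() is an illustrative helper, not the trainer's
actual masking code:

```python
OLD_MARKER = "<|im_start|>assistant\n<think>\n"
NEW_MARKER = "<|im_start|>assistant\n"


def response_start(text: str, marker: str):
    """Return the index where the assistant response begins, or None."""
    idx = text.find(marker)
    return None if idx == -1 else idx + len(marker)


# Hypothetical rendered sample with thinking disabled.
sample_no_think = (
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\nHello!<|im_end|>\n"
)

# Hypothetical rendered sample with thinking enabled.
sample_think = (
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\n<think>\nGreeting.\n</think>\n\nHello!<|im_end|>\n"
)

if __name__ == "__main__":
    print(response_start(sample_no_think, OLD_MARKER))  # no match
    print(response_start(sample_no_think, NEW_MARKER))
    print(response_start(sample_think, NEW_MARKER))
```

The shorter marker is a prefix of the longer one, so it matches in both
modes.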

* studio: fix sloth ASCII art alignment in training overlay

* fix: correct sloth ASCII art alignment to match Unsloth banner

* studio: add Python and terminal tool calling to chat

Register python and terminal tools alongside web search. Python
executor validates imports (stdlib only) via unsloth_zoo
rl_environments, runs code in a subprocess sandbox with 5-min
timeout and cancel support. Terminal executor blocks dangerous
commands (rm, sudo, etc.) and runs in a temp directory.

Update llama_cpp tool loop to show tool-specific status messages
and pass cancel_event through to executors. Rename composer
toggle from "Search" to "Tools" and show TerminalIcon for
execution status pills.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: fix Nemotron/transformers 5.x support, onboarding navigation, port binding

Backend:
- Dynamic transformers 5.x detection via tokenizer_config.json fetch
  (checks for TokenizersBackend class, cached per-model)
- Bump transformers 5.x version from 5.2.0 to 5.3.0 across all workers
  and setup scripts (setup.sh, setup.ps1)
- Auto-enable trust_remote_code for unsloth/* models needing transformers 5.x
  (workaround for NemotronH config parsing bug in transformers)
- Auto-install mamba-ssm/causal-conv1d for SSM models (NemotronH, Falcon-H1)
  with --no-build-isolation --no-deps to avoid torch version conflicts
- Add SO_REUSEADDR to port check in run.py (fixes Colab proxy stale connection
  falsely reporting port as in-use)
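
For the port check, a minimal sketch of a bind test with SO_REUSEADDR
set. The function name and default host are assumptions, not the code
in run.py. Setting the option lets the bind succeed when only stale
connections (for example in TIME_WAIT) remain on the port, while a
live listener still makes the bind fail on Linux:

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Try to bind the port; report it as in use only if the bind fails."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Ignore leftover connections from a previous server instance.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    print(is_port_free(8000))
```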

Frontend:
- Fix "Skip to Chat" navigation: use window.location.href instead of React
  Router navigate() to bypass useEffect redirect race
- Fix "Skip Onboarding" on splash: navigates to /studio (not /chat)
- Fix onboarding guard: only check isOnboardingDone() on initial mount
- Fix Chat card on step 1: add sr-only spacer for consistent alignment
- Fix Chat+Text both selected: clear RadioGroup value when Chat is selected

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: split tools toggle into Search and Code buttons

Replace the single "Tools" toggle with two independent toggles:
- "Search" (globe icon) enables web search only
- "Code" (terminal icon) enables Python and terminal execution

Add enabled_tools list field to the inference payload so the
backend only registers the tools the user has toggled on. Both
toggles appear in the main composer and the compare composer.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: fix tool calling import validation and error logging

Replace unsloth_zoo-dependent import checker with a standalone
ast-based validator using sys.stdlib_module_names. This properly
blocks non-stdlib imports (numpy, requests, etc.) and returns a
clear error message to the model so it can rewrite using only
stdlib.

Add full traceback to tool streaming error logs for debugging.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: parse gpt-oss harmony channels for clean safetensors chat output

gpt-oss models emit multi-channel output via harmony protocol tokens
(<|channel|>analysis<|message|>... and <|channel|>final<|message|>...).
TextIteratorStreamer with skip_special_tokens=True strips the special
tokens but leaves channel names concatenated with content, producing
garbled output like "analysisWe need to...assistantfinalHello!".

Add HarmonyTextStreamer that decodes with skip_special_tokens=False,
parses harmony markup via regex, and emits <think>analysis</think>
for the analysis channel and plain text for the final channel --
reusing the existing frontend reasoning UI.

Also expose supports_reasoning=True for non-GGUF gpt-oss models in
the /status endpoint so the frontend enables the Think toggle.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: use unsloth_zoo for Python sandbox validation

Set UNSLOTH_IS_PRESENT=1 and import check_python_modules and
check_signal_escape_patterns directly from unsloth_zoo instead
of a standalone fallback. This gives us the full Unsloth
validation including stdlib-only import checks and signal/timeout
escape pattern detection.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: allow all imports in Python tool sandbox

Remove stdlib-only import restriction. Keep signal escape
pattern detection via unsloth_zoo for safety.

* studio: fix ReadTimeout on tool streaming final pass

The 0.5s read timeout used for cancel-checking during streaming
also fires when waiting for the first response from llama-server
(e.g. reasoning model thinking for 15+ seconds). Add
_stream_with_retry() context manager that retries on ReadTimeout
while checking cancel_event, so the model has unlimited time to
think before producing the first token. Applied to both the
regular streaming path and the tool-calling final pass.
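
The retry idea can be sketched as a plain function rather than the
context manager described above. ReadTimeout here is a stand-in class
for the HTTP client's exception, and read_once is a hypothetical
callable that performs one short-timeout read:

```python
import threading


class ReadTimeout(Exception):
    """Stand-in for the HTTP client's read-timeout exception."""


def wait_for_first_chunk(read_once, cancel_event: threading.Event):
    """Retry short reads until data arrives or the user cancels.

    Each call to read_once() is expected to block for at most the short
    read timeout, so the cancel flag is checked between attempts.
    """
    while not cancel_event.is_set():
        try:
            return read_once()
        except ReadTimeout:
            continue  # The model is still thinking; try again.
    return None  # Cancelled before any data arrived.
```

The short timeout still bounds how long a cancel takes to be noticed,
but it no longer bounds how long the model may take to produce the
first token.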

* fix: rewrite HarmonyTextStreamer with stateful incremental parsing

The delta-on-transformed approach had two critical bugs:

1. Before the full <|channel|>X<|message|> pattern was complete, the
   strip-tokens fallback emitted "analysis" as plain text. Then when
   the regex matched, _transform returned a completely different format
   (<think>...</think>) and the delta was computed against the wrong
   base string, producing fragments like "think>", "nk>", ">".

2. Even with full matches, the closing </think> tag shifted position
   as content grew, so text[prev_len:] produced garbled deltas.

Replace with stateful incremental parsing that:
- Buffers until a complete channel+message pair is seen
- Emits <think> once when analysis channel first appears
- Streams analysis content deltas (computed on channel content directly)
- Emits </think> once when final channel first appears
- Streams final content deltas
- Closes open think tags in end()

Also skip the generic all_special_tokens stripping in
_clean_generated_text for gpt-oss since HarmonyTextStreamer already
produces clean output and the generic stripping was mangling <think>
tags.
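
A simplified sketch of the stateful approach, not the actual
HarmonyTextStreamer. The class and method names are hypothetical, and
it assumes each special token arrives whole in the decoded text. Deltas
are computed on the channel content itself, and each think tag is
emitted exactly once:

```python
ANALYSIS_HEADER = "<|channel|>analysis<|message|>"
FINAL_HEADER = "<|channel|>final<|message|>"
STOP_TOKENS = ("<|end|>", "<|return|>", "<|start|>")


def channel_content(text, header):
    """Return the content after header up to the next stop token, or None."""
    start = text.find(header)
    if start == -1:
        return None  # Header not complete yet: keep buffering.
    start += len(header)
    ends = [i for i in (text.find(tok, start) for tok in STOP_TOKENS) if i != -1]
    return text[start:min(ends)] if ends else text[start:]


class HarmonyParser:
    def __init__(self):
        self.think_open = False
        self.think_closed = False
        self.sent_analysis = 0  # Characters of analysis already emitted.
        self.sent_final = 0     # Characters of final answer already emitted.

    def feed(self, text):
        """Take the full decoded text so far; return only the new output."""
        out = []
        analysis = channel_content(text, ANALYSIS_HEADER)
        if analysis is not None:
            if not self.think_open:
                out.append("<think>")
                self.think_open = True
            out.append(analysis[self.sent_analysis:])
            self.sent_analysis = len(analysis)
        final = channel_content(text, FINAL_HEADER)
        if final is not None:
            if self.think_open and not self.think_closed:
                out.append("</think>")
                self.think_closed = True
            out.append(final[self.sent_final:])
            self.sent_final = len(final)
        return "".join(out)

    def end(self):
        """Close a think tag left open if generation stopped early."""
        if self.think_open and not self.think_closed:
            self.think_closed = True
            return "</think>"
        return ""
```

Because the tags are never part of the string the deltas are sliced
from, growing content cannot shift them.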

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: strip all <|...|> tokens in gpt-oss cleanup, not just harmony subset

The gpt-oss tokenizer has added tokens like <|return|> (id=200002) that
are not part of the harmony channel protocol but can leak into output.
The previous regex only stripped channel|message|start|end tokens.

Broaden the _clean_generated_text regex for gpt-oss to <\|[a-z_]+\|>
which catches all pipe-delimited tokens (return, constrain, reserved,
etc.) without matching <think>/</think> tags.
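
A quick sketch of the pattern quoted above (the helper name is
hypothetical). Note that the character class as written has no digits,
so a token whose name contains digits would need something like
[a-z0-9_]; whether the actual code accounts for this is not stated
here:

```python
import re

# Pattern as quoted in this commit message.
PIPE_TOKEN = re.compile(r"<\|[a-z_]+\|>")


def strip_pipe_tokens(text: str) -> str:
    """Remove <|name|> tokens while leaving <think> tags untouched."""
    return PIPE_TOKEN.sub("", text)


if __name__ == "__main__":
    print(strip_pipe_tokens("<think>plan</think>Hello!<|return|>"))
```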

Verified: gpt-oss all_special_tokens are only <|return|>,
<|reserved_200017|>, <|startoftext|> -- none overlap with <think>.
The harmony tokens (channel, message, start, end) are added_tokens
but not in all_special_tokens.

* fix: hide config-only model repos from cached models list

Repos that only have metadata/config files cached (no .safetensors or
.bin weight files) were showing up in the Downloaded list with tiny
sizes like "1.8 KB" or "24 KB". These are just leftover config
snapshots from architecture checks, not usable models.

Filter the cached-models endpoint to only include repos that contain
actual model weight files (.safetensors or .bin).
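
The filter amounts to a suffix check over each repo's cached file
names. A minimal sketch, where the function names and the dict-based
input shape are assumptions rather than the endpoint's real data
model:

```python
WEIGHT_SUFFIXES = (".safetensors", ".bin")


def has_weight_files(file_names: list[str]) -> bool:
    """True if at least one cached file is a model weight file."""
    return any(name.endswith(WEIGHT_SUFFIXES) for name in file_names)


def usable_cached_repos(repos: dict[str, list[str]]) -> list[str]:
    """Keep only repos whose cache contains actual weights."""
    return [repo_id for repo_id, files in repos.items() if has_weight_files(files)]


if __name__ == "__main__":
    # Hypothetical cache contents for illustration.
    cache = {
        "org/full-model": ["config.json", "model-00001-of-00002.safetensors"],
        "org/config-only": ["config.json", "tokenizer_config.json"],
    }
    print(usable_cached_repos(cache))
```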

* studio: fix toast description text contrast in dark mode

Add explicit !text-muted-foreground to toast description classNames
so secondary text (e.g. "Releases VRAM and resets inference state.")
is readable in dark mode.

* studio: fix Chat card icon alignment with size-4 spacer

Replace sr-only span (takes no space) with a size-4 shrink-0 div
matching the RadioGroupItem dimensions in other cards, so the Chat
icon aligns vertically with Text/Audio/Vision/Embeddings icons.

---------

Co-authored-by: workspace <user@workspace.local>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Manan17 <shahmanan170602@gmail.com>
Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>