studio: improve onboarding UX, tooltips, and training defaults#4355
Conversation
- Change splash text to "Train and run LLMs locally"
- Add "Chat Only" card with BubbleChatIcon to skip directly to chat
- Add Skip/Skip to Chat buttons in sidebar and footer
- Back button on step 1 returns to splash screen instead of being disabled
- Change "Watch video guide" to "Get started with our guide" with new URL
- Update intro text to mention all model types + chat
- Make all tooltips clickable (in addition to hover) via React context
- Strip surrounding quotes from pasted HF tokens
- Rename "Eval Split" to "Evaluation Split"
- Add SparklesIcon to "Auto Detect" format option
- Change step 4 heading to "Choose your training parameters"
- Default max_steps to 60
- Learning rate displayed in scientific notation with +/- stepper
- Context length options capped by model's max_position_embeddings (via AutoConfig)
- Fix "QLORA"/"LORA" to "QLoRA"/"LoRA" in summary step
- Backend: add max_position_embeddings to model config endpoint
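The token clean-up item in the list above can be illustrated with a minimal sketch. The actual change lives in the Studio frontend and is not shown in this thread, so the function name `strip_token_quotes` and the exact rule (trim whitespace, then drop one pair of matching quotes) are assumptions made for illustration only.

```python
def strip_token_quotes(raw: str) -> str:
    """Trim whitespace, then drop one pair of matching surrounding quotes.

    Hypothetical helper: the real Studio implementation may differ.
    """
    token = raw.strip()
    # Only strip when the first and last characters are the same quote character.
    if len(token) >= 2 and token[0] == token[-1] and token[0] in ("'", '"'):
        token = token[1:-1].strip()
    return token
```

A pasted value such as `"hf_abc"` (with the quotes copied from a shell command or `.env` file) would then be stored as `hf_abc`, while an unquoted token passes through unchanged.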
for more information, see https://pre-commit.ci
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the user experience of the Unsloth Studio onboarding process. It introduces clearer navigation options, improves the usability of interactive elements like tooltips and input fields, and refines the presentation of training parameters. Additionally, it adds backend support for model context length, allowing for more intelligent default settings and user guidance during model configuration.
Code Review
This pull request introduces several valuable user experience enhancements to the Studio onboarding process. The changes, which span both the frontend and backend, include making tooltips clickable, improving training parameter controls with scientific notation for learning rates, and adding more flexible navigation options like skipping to chat. The backend now correctly determines and exposes the model's maximum context length. The code is generally well-structured, but I've noted a couple of areas for improvement: one regarding code duplication in the backend that could be refactored for better maintainability, and a minor UX inconsistency in a navigation flow.
if max_position_embeddings is None:
    try:
        from transformers import AutoConfig as _AutoConfig
This logic for extracting max_position_embeddings is a duplicate of the logic in lines 388-391. To improve maintainability and reduce redundancy, consider extracting this into a helper function.
For example:
def _get_max_pos_embeddings(config_obj):
    if hasattr(config_obj, "max_position_embeddings"):
        return config_obj.max_position_embeddings
    if hasattr(config_obj, "text_config") and hasattr(config_obj.text_config, "max_position_embeddings"):
        return config_obj.text_config.max_position_embeddings
    return None

You could then call this helper in both places to keep the code DRY.
className="mt-3 hidden text-xs text-muted-foreground md:flex"
onClick={() => {
  markOnboardingDone();
  navigate({ to: "/studio" });
There's a minor UX inconsistency here. The button's label is "Skip to Chat" on the first step, but it navigates to /studio. For a more intuitive experience, it should navigate to /chat to match the label, similar to how the "Chat Only" card on the same step behaves.
Suggested change:
- navigate({ to: "/studio" });
+ navigate({ to: currentStep === 1 ? "/chat" : "/studio" });
- Change Qwen3.5 thinking threshold from <=2B to <9B (0.8B, 2B, 4B all disable thinking by default; 9B+ enables it)
- Always pass enable_thinking=False in AI Assist helper calls (_run_with_helper and _generate_with_backend) regardless of chat thinking settings
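The size-based threshold above can be sketched as follows. This is an illustration only: the helper names, the regular expression used to read the size from the model name, and the fallback for names without a parsable size are all assumptions, not the code in this PR. The caller is assumed to have already checked that the model belongs to the Qwen3.5 family.

```python
import re
from typing import Optional


def parse_size_in_billions(model_name: str) -> Optional[float]:
    """Extract a parameter count such as '0.8B' or '9B' from a model name."""
    match = re.search(r"(\d+(?:\.\d+)?)B\b", model_name)
    return float(match.group(1)) if match else None


def default_enable_thinking(model_name: str) -> bool:
    """Thinking is off by default below 9B and on from 9B upward."""
    size = parse_size_in_billions(model_name)
    # Assumption: names without a parsable size keep thinking enabled.
    return True if size is None else size >= 9
```

With this rule, 0.8B, 2B, and 4B variants default to thinking disabled, and 9B or larger variants default to thinking enabled.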
- Extract _get_max_position_embeddings helper to DRY config extraction
- Fix "Skip to Chat" to navigate to /chat on step 1 (was /studio)
While streaming SVG content, the syntax highlighter (Shiki) re-parses the entire growing SVG on every token, blocking the main thread and freezing the code area until the fence closes. Show a plain-text preview for incomplete SVG fences instead, similar to how Mermaid diagrams show a placeholder while streaming.
Per Qwen3.5 docs (unsloth.ai/docs/models/qwen3.5), top_k should be 20 for both thinking and non-thinking modes. The model-specific config in inference_defaults.json already had top_k=20 for Qwen3.5, but the generic fallback defaults were wrong:
- Frontend DEFAULT_INFERENCE_PARAMS.topK: 50 -> 20
- Backend generate_chat_completion top_k: 40 -> 20
- Backend generate_chat_completion_with_tools top_k: 40 -> 20
- Frontend title generation top_k: 40 -> 20
Default params for any model without specific config: temperature=0.6, top_p=0.95, top_k=20, min_p=0.01, presence_penalty=0.0, repetition_penalty=1.0. Models with entries in inference_defaults.json (Qwen3.5, Gemma-3, Llama, etc.) override these with their recommended values. Updated in: frontend DEFAULT_INFERENCE_PARAMS, backend Pydantic request models, and backend generate_chat_completion defaults.
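The layering of universal defaults and per-model overrides might look like the sketch below. The default values are the ones listed above; the function name `resolve_inference_params` and the dictionary shape of the overrides are hypothetical and do not reflect the actual structure of inference_defaults.json.

```python
# Universal fallback values, as listed in the commit message above.
UNIVERSAL_DEFAULTS = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.01,
    "presence_penalty": 0.0,
    "repetition_penalty": 1.0,
}


def resolve_inference_params(model_overrides=None):
    """Start from the universal defaults, then apply model-specific values.

    `model_overrides` is a hypothetical dict of recommended values for one model.
    """
    params = dict(UNIVERSAL_DEFAULTS)  # copy so the defaults are never mutated
    params.update(model_overrides or {})
    return params
```

A model with no entry gets the universal values unchanged; a model whose entry sets only one key (for example a non-zero presence_penalty) keeps the remaining defaults.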
Only set trust_remote_code=True when the model name starts with "unsloth/". All other models default to False for safety.
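The prefix rule above is small enough to show as a sketch. The function name `should_trust_remote_code` is an assumption; only the `unsloth/` prefix check comes from the commit message.

```python
def should_trust_remote_code(model_name: str) -> bool:
    """Allow remote code only for repos under the unsloth/ namespace."""
    return model_name.startswith("unsloth/")


# Intended use (not executed here, requires the transformers package):
#   AutoConfig.from_pretrained(
#       model_name,
#       trust_remote_code=should_trust_remote_code(model_name),
#   )
```

Keeping the decision in one function means every config-loading call site applies the same rule, rather than each passing its own literal.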
The "Generating" spinner was below the send message bar, causing the bar to jump up and down. Move it above the composer in both the regular thread view and the welcome/empty view.
Move the X close button on toasts (like "Starting model...") from top-1.5 to top-3 and add right-3, giving more breathing room from the top-right corner.
Reduce gap from 1.5 to 0.5, padding from px-2.5/py-1 to px-2/py-0.5, and icon from size-3.5 to size-3.
- Move Generating spinner above composer (fixes jumping send bar)
- Make Think button smaller with tighter icon-text gap
- Chat card now inside grid (same size as Audio/Embeddings cards)
- Rename "Chat Only" to "Chat"
- Chat card requires Continue to proceed (no auto-advance)
- Continue on Chat selection skips onboarding and goes to /chat
- Tooltip (i) click on Chat card doesn't trigger navigation
- Step 1 footer Back button goes back to splash (label is "Back")
- Splash "Skip Onboarding" renamed to "Skip to Chat", navigates to /chat
- Toast close button moved away from edge
- Sidebar "Skip to Chat" now uses primary (green) Button style with arrow icon, full width, aligned like step items. Shows on all steps.
- Footer: added "Skip" outline button next to Continue that goes directly to /studio with progress saved (markOnboardingDone)
The DEFAULT_MAX_STEPS in use-max-steps-epochs-toggle.ts was still 30, used as fallback when toggling from epochs back to max steps.
CONTEXT_LENGTHS now includes 65536, 131072, 262144 in addition to the existing 512-32768 range. The onboarding step filters these by the model's max_position_embeddings (e.g. Nemotron-3-Nano-4B has 262144), showing powers of 2 up to the model's maximum.
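The filtering described above can be sketched in a few lines. The real list lives in the frontend; the intermediate values between 512 and 32768 are assumed here to be powers of 2, and the behavior when the model maximum is unknown (return the full list) is also an assumption.

```python
# Assumed contents: powers of 2 from 512 up to 262144.
CONTEXT_LENGTHS = [512, 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144]


def available_context_lengths(max_position_embeddings=None):
    """Return the options that do not exceed the model's maximum context."""
    if max_position_embeddings is None:
        # Assumption: with no known maximum, offer every option.
        return list(CONTEXT_LENGTHS)
    return [n for n in CONTEXT_LENGTHS if n <= max_position_embeddings]
```

A model reporting 262144 would see every option, while a model whose maximum is not itself a power of 2 would be offered the largest power of 2 below it.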
After selecting a model in onboarding, detect the total model weight file size from HF Hub (safetensors/bin files). Then estimate memory needed: model_size_gb * 1.5 * context_scale, where context_scale is:
- <=8192 tokens: 1.0x
- >8192 tokens: 1.7x
- >=16384 tokens: 2.0x
- >=32768 tokens: 4.0x
If the estimate fits in free GPU VRAM, default to LoRA (16-bit). Otherwise default to QLoRA (4-bit).
Backend changes:
- Add model_size_bytes to ModelDetails (models.py)
- Add _get_model_size_bytes() using HfApi.repo_info (routes/models.py)
- Add vram_free_gb to get_gpu_summary (hardware.py)
Frontend changes:
- Add autoSelectTrainingMethod() in training-config-store.ts
- Called after model defaults are loaded
- Add model_size_bytes to ModelConfigResponse type
- Add vramFreeGb to HardwareInfo hook
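The estimate above can be worked through in a short sketch. The formula and the scale factors come from the commit message; the function names, the byte-to-GB conversion (1024**3), the reading of the overlapping thresholds (largest matching tier wins), and treating "fits" as less than or equal are assumptions. The actual logic is in the frontend's autoSelectTrainingMethod(), which is not shown in this thread.

```python
def context_scale(context_length: int) -> float:
    """Map a context length to the multiplier tiers listed above."""
    if context_length >= 32768:
        return 4.0
    if context_length >= 16384:
        return 2.0
    if context_length > 8192:
        return 1.7
    return 1.0


def pick_training_method(model_size_bytes: int, context_length: int, vram_free_gb: float) -> str:
    """Default to LoRA when the estimate fits in free VRAM, else QLoRA."""
    model_size_gb = model_size_bytes / (1024 ** 3)  # assumed GiB conversion
    estimate_gb = model_size_gb * 1.5 * context_scale(context_length)
    return "LoRA" if estimate_gb <= vram_free_gb else "QLoRA"
```

For example, 8 GB of weights at a 2048-token context gives an estimate of 8 * 1.5 * 1.0 = 12 GB, which fits in 24 GB of free VRAM and selects LoRA. The same model at 32768 tokens gives 8 * 1.5 * 4.0 = 48 GB and falls back to QLoRA.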
For GGUF repos, the trash icon now appears on each downloaded variant row inside the quantization expander instead of on the repo-level row. Backend accepts optional variant param to delete specific GGUF files (blob + symlink) rather than the entire repo cache.
The Max Tokens slider was capped at 32768 on page refresh because ggufContextLength was not restored from the status response. Now set it from statusRes.context_length on reconnect.
The train-on-responses-only feature uses template markers to find where the assistant response starts. The Qwen3.5 response marker included '<think>\n' which is only present when thinking mode is enabled. With thinking disabled (default for <9B), the marker never matched, causing 100% of samples to be dropped. Changed response marker from '<|im_start|>assistant\n<think>\n' to '<|im_start|>assistant\n' which works regardless of thinking mode.
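The marker mismatch described above can be reproduced with plain string search. The two marker strings are taken from the commit message; the rendered conversation and the helper `find_response_start` are simplified, hypothetical stand-ins for the real chat template output and masking code.

```python
OLD_MARKER = "<|im_start|>assistant\n<think>\n"
NEW_MARKER = "<|im_start|>assistant\n"

# Simplified rendering of one sample with thinking disabled (hypothetical).
RENDERED_NO_THINK = (
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\nHello!<|im_end|>\n"
)


def find_response_start(rendered: str, marker: str):
    """Return the index just past the marker, or None when it never appears."""
    idx = rendered.find(marker)
    return None if idx == -1 else idx + len(marker)
```

With the old marker the search returns None for this sample, which corresponds to the sample being dropped. With the new marker the search succeeds whether or not a `<think>` block follows the assistant header.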
Register python and terminal tools alongside web search. Python executor validates imports (stdlib only) via unsloth_zoo rl_environments, runs code in a subprocess sandbox with 5-min timeout and cancel support. Terminal executor blocks dangerous commands (rm, sudo, etc.) and runs in a temp directory. Update llama_cpp tool loop to show tool-specific status messages and pass cancel_event through to executors. Rename composer toggle from "Search" to "Tools" and show TerminalIcon for execution status pills.
… port binding
Backend:
- Dynamic transformers 5.x detection via tokenizer_config.json fetch (checks for TokenizersBackend class, cached per-model)
- Bump transformers 5.x version from 5.2.0 to 5.3.0 across all workers, setup scripts (setup.sh, setup.ps1)
- Auto-enable trust_remote_code for unsloth/* models needing transformers 5.x (workaround for NemotronH config parsing bug in transformers)
- Auto-install mamba-ssm/causal-conv1d for SSM models (NemotronH, Falcon-H1) with --no-build-isolation --no-deps to avoid torch version conflicts
- Add SO_REUSEADDR to port check in run.py (fixes Colab proxy stale connection falsely reporting port as in-use)
Frontend:
- Fix "Skip to Chat" navigation: use window.location.href instead of React Router navigate() to bypass useEffect redirect race
- Fix "Skip Onboarding" on splash: navigates to /studio (not /chat)
- Fix onboarding guard: only check isOnboardingDone() on initial mount
- Fix Chat card on step 1: add sr-only spacer for consistent alignment
- Fix Chat+Text both selected: clear RadioGroup value when Chat is selected
Replace the single "Tools" toggle with two independent toggles:
- "Search" (globe icon) enables web search only
- "Code" (terminal icon) enables Python and terminal execution
Add enabled_tools list field to the inference payload so the backend only registers the tools the user has toggled on. Both toggles appear in the main composer and the compare composer.
Replace unsloth_zoo-dependent import checker with a standalone ast-based validator using sys.stdlib_module_names. This properly blocks non-stdlib imports (numpy, requests, etc.) and returns a clear error message to the model so it can rewrite using only stdlib. Add full traceback to tool streaming error logs for debugging.
gpt-oss models emit multi-channel output via harmony protocol tokens (<|channel|>analysis<|message|>... and <|channel|>final<|message|>...). TextIteratorStreamer with skip_special_tokens=True strips the special tokens but leaves channel names concatenated with content, producing garbled output like "analysisWe need to...assistantfinalHello!". Add HarmonyTextStreamer that decodes with skip_special_tokens=False, parses harmony markup via regex, and emits <think>analysis</think> for the analysis channel and plain text for the final channel -- reusing the existing frontend reasoning UI. Also expose supports_reasoning=True for non-GGUF gpt-oss models in the /status endpoint so the frontend enables the Think toggle.
Set UNSLOTH_IS_PRESENT=1 and import check_python_modules and check_signal_escape_patterns directly from unsloth_zoo instead of a standalone fallback. This gives us the full Unsloth validation including stdlib-only import checks and signal/timeout escape pattern detection.
Remove stdlib-only import restriction. Keep signal escape pattern detection via unsloth_zoo for safety.
The 0.5s read timeout used for cancel-checking during streaming also fires when waiting for the first response from llama-server (e.g. reasoning model thinking for 15+ seconds). Add _stream_with_retry() context manager that retries on ReadTimeout while checking cancel_event, so the model has unlimited time to think before producing the first token. Applied to both the regular streaming path and the tool-calling final pass.
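The retry-while-checking-cancel idea above can be shown with a standard-library analogue. This is not the `_stream_with_retry()` from the PR: a `queue.Queue` stands in for the HTTP stream from llama-server, `queue.Empty` stands in for ReadTimeout, and the function name `wait_for_first_item` is hypothetical.

```python
import queue
import threading


def wait_for_first_item(source: queue.Queue, cancel_event: threading.Event,
                        poll_seconds: float = 0.5):
    """Wait without an overall deadline for the first item, polling for cancel.

    Each short timeout is treated as "nothing yet" rather than as a failure,
    so a slow producer is never cut off, but cancellation is still noticed
    within roughly `poll_seconds`.
    """
    while not cancel_event.is_set():
        try:
            return source.get(timeout=poll_seconds)
        except queue.Empty:
            continue  # timed out: loop back and re-check the cancel flag
    return None  # cancelled before anything arrived
```

The design keeps the short timeout, which is what makes cancel responsive, and changes only what a timeout means before the first token arrives.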
The delta-on-transformed approach had two critical bugs:
1. Before the full <|channel|>X<|message|> pattern was complete, the strip-tokens fallback emitted "analysis" as plain text. Then when the regex matched, _transform returned a completely different format (<think>...</think>) and the delta was computed against the wrong base string, producing fragments like "think>", "nk>", ">".
2. Even with full matches, the closing </think> tag shifted position as content grew, so text[prev_len:] produced garbled deltas.
Replace with stateful incremental parsing that:
- Buffers until a complete channel+message pair is seen
- Emits <think> once when analysis channel first appears
- Streams analysis content deltas (computed on channel content directly)
- Emits </think> once when final channel first appears
- Streams final content deltas
- Closes open think tags in end()
Also skip the generic all_special_tokens stripping in _clean_generated_text for gpt-oss since HarmonyTextStreamer already produces clean output and the generic stripping was mangling <think> tags.
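The stateful approach listed above can be sketched as a small class that computes deltas on each channel's content rather than on the transformed string. This is an illustration under simplifying assumptions, not the HarmonyTextStreamer from the PR: the class name `HarmonyParser` is hypothetical, it works on already-decoded text chunks, it assumes at most one analysis and one final channel, and it treats any `<|` as the end of the current channel's content.

```python
class HarmonyParser:
    """Incrementally convert harmony channel markup into <think> output."""

    def __init__(self):
        self.raw = ""                       # everything decoded so far
        self.sent = {"analysis": 0, "final": 0}
        self.think_open = False
        self.think_closed = False

    def _content(self, channel):
        marker = f"<|channel|>{channel}<|message|>"
        idx = self.raw.find(marker)
        if idx == -1:
            return None                     # keep buffering: pair not complete yet
        rest = self.raw[idx + len(marker):]
        cut = rest.find("<|")
        if cut != -1:
            return rest[:cut]               # channel ended at the next special token
        # Hold back a trailing "<" that may be the start of a special token.
        return rest[:-1] if rest.endswith("<") else rest

    def feed(self, chunk):
        self.raw += chunk
        out = []
        analysis = self._content("analysis")
        if analysis is not None:
            if not self.think_open:
                out.append("<think>")
                self.think_open = True
            out.append(analysis[self.sent["analysis"]:])
            self.sent["analysis"] = len(analysis)
        final = self._content("final")
        if final is not None:
            if self.think_open and not self.think_closed:
                out.append("</think>")
                self.think_closed = True
            out.append(final[self.sent["final"]:])
            self.sent["final"] = len(final)
        return "".join(out)

    def end(self):
        if self.think_open and not self.think_closed:
            self.think_closed = True
            return "</think>"
        return ""
```

Because each tag is emitted exactly once and deltas are taken from the channel content itself, the position of `</think>` never shifts underneath a previously computed offset, which is the failure described in the second bug above.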
Integrates generalized model comparison into the onboarding-improvements branch. Resolves import conflict in shared-composer.tsx and fixes unused variable in compare flow.
…bset The gpt-oss tokenizer has added tokens like <|return|> (id=200002) that are not part of the harmony channel protocol but can leak into output. The previous regex only stripped channel|message|start|end tokens. Broaden the _clean_generated_text regex for gpt-oss to <\|[a-z_]+\|> which catches all pipe-delimited tokens (return, constrain, reserved, etc.) without matching <think>/<\/think> tags. Verified: gpt-oss all_special_tokens are only <|return|>, <|reserved_200017|>, <|startoftext|> -- none overlap with <think>. The harmony tokens (channel, message, start, end) are added_tokens but not in all_special_tokens.
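The broadened pattern quoted above can be tried in isolation. The regular expression is copied from the commit message; the function name `strip_pipe_tokens` is hypothetical, and the surrounding `_clean_generated_text` logic is not reproduced here.

```python
import re

# Pattern as quoted in the commit message.
PIPE_TOKEN_RE = re.compile(r"<\|[a-z_]+\|>")


def strip_pipe_tokens(text: str) -> str:
    """Remove pipe-delimited tokens such as <|return|>, leaving <think> tags."""
    return PIPE_TOKEN_RE.sub("", text)
```

`<think>` and `</think>` survive because they contain no pipe characters. Note that the character class `[a-z_]` as written contains no digits, so in this sketch a token such as `<|reserved_200017|>` passes through unchanged; a class like `[a-z0-9_]` would be needed to cover it.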
Repos that only have metadata/config files cached (no .safetensors or .bin weight files) were showing up in the Downloaded list with tiny sizes like "1.8 KB" or "24 KB". These are just leftover config snapshots from architecture checks, not usable models. Filter the cached-models endpoint to only include repos that contain actual model weight files (.safetensors or .bin).
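The weight-file filter above can be sketched over plain file listings. The real endpoint scans the Hugging Face cache; here a dictionary mapping repo id to cached file names stands in for that scan, and both function names are assumptions.

```python
WEIGHT_SUFFIXES = (".safetensors", ".bin")


def has_weight_files(file_names) -> bool:
    """True when at least one cached file looks like a model weight file."""
    return any(name.endswith(WEIGHT_SUFFIXES) for name in file_names)


def filter_cached_repos(repos: dict) -> dict:
    """Keep only repos whose cached files include actual weights.

    `repos` is a hypothetical mapping of repo id -> list of cached file names.
    """
    return {repo_id: files for repo_id, files in repos.items() if has_weight_files(files)}
```

A repo that only has `config.json` and tokenizer files cached is dropped from the result, which matches the intent of hiding the tiny config-only entries from the Downloaded list.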
Add explicit !text-muted-foreground to toast description classNames so secondary text (e.g. "Releases VRAM and resets inference state.") is readable in dark mode.
Replace sr-only span (takes no space) with a size-4 shrink-0 div matching the RadioGroupItem dimensions in other cards, so the Chat icon aligns vertically with Text/Audio/Vision/Embeddings icons.
…thai#4355) * studio: improve onboarding UX, tooltips, and training defaults - Change splash text to "Train and run LLMs locally" - Add "Chat Only" card with BubbleChatIcon to skip directly to chat - Add Skip/Skip to Chat buttons in sidebar and footer - Back button on step 1 returns to splash screen instead of being disabled - Change "Watch video guide" to "Get started with our guide" with new URL - Update intro text to mention all model types + chat - Make all tooltips clickable (in addition to hover) via React context - Strip surrounding quotes from pasted HF tokens - Rename "Eval Split" to "Evaluation Split" - Add SparklesIcon to "Auto Detect" format option - Change step 4 heading to "Choose your training parameters" - Default max_steps to 60 - Learning rate displayed in scientific notation with +/- stepper - Context length options capped by model's max_position_embeddings (via AutoConfig) - Fix "QLORA"/"LORA" to "QLoRA"/"LoRA" in summary step - Backend: add max_position_embeddings to model config endpoint * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * compare for 2 diff models * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolving gemini comments * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: disable thinking for Qwen3.5 <9B and always for AI Assist - Change Qwen3.5 thinking threshold from <=2B to <9B (0.8B, 2B, 4B all disable thinking by default; 9B+ enables it) - Always pass enable_thinking=False in AI Assist helper calls (_run_with_helper and _generate_with_backend) regardless of chat thinking settings * studio: address PR review comments - Extract _get_max_position_embeddings helper to DRY config extraction - Fix "Skip to Chat" to navigate to /chat on step 1 (was /studio) * fix: comment out debug print statements * [pre-commit.ci] auto fixes from pre-commit.com hooks for more 
information, see https://pre-commit.ci * studio: skip Shiki highlighting for incomplete SVG code fences While streaming SVG content, the syntax highlighter (Shiki) re-parses the entire growing SVG on every token, blocking the main thread and freezing the code area until the fence closes. Show a plain-text preview for incomplete SVG fences instead, similar to how Mermaid diagrams show a placeholder while streaming. * studio: fix default top_k from 50/40 to 20 for chat inference Per Qwen3.5 docs (unsloth.ai/docs/models/qwen3.5), top_k should be 20 for both thinking and non-thinking modes. The model-specific config in inference_defaults.json already had top_k=20 for Qwen3.5, but the generic fallback defaults were wrong: - Frontend DEFAULT_INFERENCE_PARAMS.topK: 50 -> 20 - Backend generate_chat_completion top_k: 40 -> 20 - Backend generate_chat_completion_with_tools top_k: 40 -> 20 - Frontend title generation top_k: 40 -> 20 * studio: set universal inference defaults for unknown models Default params for any model without specific config: temperature=0.6, top_p=0.95, top_k=20, min_p=0.01, presence_penalty=0.0, repetition_penalty=1.0 Models with entries in inference_defaults.json (Qwen3.5, Gemma-3, Llama, etc.) override these with their recommended values. Updated in: frontend DEFAULT_INFERENCE_PARAMS, backend Pydantic request models, and backend generate_chat_completion defaults. * studio: only trust_remote_code for unsloth/ models in AutoConfig Only set trust_remote_code=True when the model name starts with "unsloth/". All other models default to False for safety. * studio: move Generating spinner above the composer The "Generating" spinner was below the send message bar, causing the bar to jump up and down. Move it above the composer in both the regular thread view and the welcome/empty view. 
* studio: adjust toast close button position away from edge Move the X close button on toasts (like "Starting model...") from top-1.5 to top-3 and add right-3, giving more breathing room from the top-right corner. * studio: make Think button smaller with tighter icon-text gap Reduce gap from 1.5 to 0.5, padding from px-2.5/py-1 to px-2/py-0.5, and icon from size-3.5 to size-3. * studio: multiple onboarding and chat UX improvements - Move Generating spinner above composer (fixes jumping send bar) - Make Think button smaller with tighter icon-text gap - Chat card now inside grid (same size as Audio/Embeddings cards) - Rename "Chat Only" to "Chat" - Chat card requires Continue to proceed (no auto-advance) - Continue on Chat selection skips onboarding and goes to /chat - Tooltip (i) click on Chat card doesn't trigger navigation - Step 1 footer Back button goes back to splash (label is "Back") - Splash "Skip Onboarding" renamed to "Skip to Chat", navigates to /chat - Toast close button moved away from edge * studio: align Skip to Chat button, add Skip to footer - Sidebar "Skip to Chat" now uses primary (green) Button style with arrow icon, full width, aligned like step items. Shows on all steps. - Footer: added "Skip" outline button next to Continue that goes directly to /studio with progress saved (markOnboardingDone) * studio: change default max steps from 30 to 60 in toggle hook The DEFAULT_MAX_STEPS in use-max-steps-epochs-toggle.ts was still 30, used as fallback when toggling from epochs back to max steps. * studio: extend context length options to 262K CONTEXT_LENGTHS now includes 65536, 131072, 262144 in addition to the existing 512-32768 range. The onboarding step filters these by the model's max_position_embeddings (e.g. Nemotron-3-Nano-4B has 262144), showing powers of 2 up to the model's maximum. 
* studio: auto-select LoRA vs QLoRA based on model size and GPU memory After selecting a model in onboarding, detect the total model weight file size from HF Hub (safetensors/bin files). Then estimate memory needed: model_size_gb * 1.5 * context_scale, where context_scale is: - <=8192 tokens: 1.0x - >8192 tokens: 1.7x - >=16384 tokens: 2.0x - >=32768 tokens: 4.0x If the estimate fits in free GPU VRAM, default to LoRA (16-bit). Otherwise default to QLoRA (4-bit). Backend changes: - Add model_size_bytes to ModelDetails (models.py) - Add _get_model_size_bytes() using HfApi.repo_info (routes/models.py) - Add vram_free_gb to get_gpu_summary (hardware.py) Frontend changes: - Add autoSelectTrainingMethod() in training-config-store.ts - Called after model defaults are loaded - Add model_size_bytes to ModelConfigResponse type - Add vramFreeGb to HardwareInfo hook * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: rename "Importing ML libraries..." to "Importing Unsloth..." * studio: show model/dataset in training status, fix LoRA/QLoRA casing - Training status now shows 'Training "model_name"' and 'Dataset = ...' instead of generic "Starting training..." 
- Fix Studio progress section to show QLoRA/LoRA instead of QLORA/LORA * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: rename 'Skip to Chat' to 'Skip Onboarding' on splash screen * studio: add presence_penalty support for chat inference Add presence_penalty as a parameter across the full stack: - Backend: llama_cpp.py generate_chat_completion/with_tools, Pydantic models (inference.py), routes/inference.py pass-through - Frontend: InferenceParams type, DEFAULT_INFERENCE_PARAMS (0.0), chat-adapter.ts payload, chat-settings-sheet.tsx slider (0-2), model defaults loading from inference_defaults.json - Set Qwen3.5 default presence_penalty to 1.5 per official docs - Default for unknown models is 0.0 (off) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: fix Chat card deselecting Text and aligning with other cards * studio: fix presence_penalty not loading from inference defaults The inference_config.py load_inference_config() was not including presence_penalty in the returned config dict, so the Qwen3.5 default of 1.5 from inference_defaults.json never reached the frontend. Added it to the config builder. * studio: add delete button for cached models in model selector Add trash icon on each downloaded model row (GGUF and safetensors) with confirmation dialog. Backend DELETE /api/models/delete-cached endpoint uses huggingface_hub scan_cache_dir + delete_revisions to cleanly remove cached repos, refusing if the model is currently loaded. * studio: restore inference defaults, reasoning, and tools on page refresh On page refresh with a model already loaded, the frontend was not re-applying model-specific inference defaults (presence_penalty, temperature, etc.) or restoring reasoning/tools support flags. Backend: Add inference config, supports_reasoning, supports_tools, and context_length to InferenceStatusResponse. 
Frontend: In the refresh callback, when an active model is detected, apply mergeRecommendedInference and restore reasoning/tools flags with proper Qwen3.5 size-based defaults. * studio: fix delete dialog closing before async completes Prevent AlertDialogAction's default close behavior with e.preventDefault() so the dialog stays open during deletion. Also block onOpenChange dismiss while deleting is in progress. * fix: add Dict and Any imports to inference models * studio: fix Qwen3.5 reasoning threshold in frontend load path The frontend loadModel handler had the old threshold (<=2) for disabling reasoning on small Qwen3.5 models. Changed to <9 to match the backend. This was causing 4B to not properly disable thinking by default when auto-loaded. * studio: move GGUF delete to per-variant level For GGUF repos, the trash icon now appears on each downloaded variant row inside the quantization expander instead of on the repo-level row. Backend accepts optional variant param to delete specific GGUF files (blob + symlink) rather than the entire repo cache. * studio: restore ggufContextLength on page refresh The Max Tokens slider was capped at 32768 on page refresh because ggufContextLength was not restored from the status response. Now set it from statusRes.context_length on reconnect. * fix: remove <think> from Qwen3.5 response template marker The train-on-responses-only feature uses template markers to find where the assistant response starts. The Qwen3.5 response marker included '<think>\n' which is only present when thinking mode is enabled. With thinking disabled (default for <9B), the marker never matched, causing 100% of samples to be dropped. Changed response marker from '<|im_start|>assistant\n<think>\n' to '<|im_start|>assistant\n' which works regardless of thinking mode. 
* studio: fix sloth ASCII art alignment in training overlay * fix: correct sloth ASCII art alignment to match Unsloth banner * studio: add Python and terminal tool calling to chat Register python and terminal tools alongside web search. Python executor validates imports (stdlib only) via unsloth_zoo rl_environments, runs code in a subprocess sandbox with 5-min timeout and cancel support. Terminal executor blocks dangerous commands (rm, sudo, etc.) and runs in a temp directory. Update llama_cpp tool loop to show tool-specific status messages and pass cancel_event through to executors. Rename composer toggle from "Search" to "Tools" and show TerminalIcon for execution status pills. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: fix Nemotron/transformers 5.x support, onboarding navigation, port binding Backend: - Dynamic transformers 5.x detection via tokenizer_config.json fetch (checks for TokenizersBackend class, cached per-model) - Bump transformers 5.x version from 5.2.0 to 5.3.0 across all workers, setup scripts (setup.sh, setup.ps1) - Auto-enable trust_remote_code for unsloth/* models needing transformers 5.x (workaround for NemotronH config parsing bug in transformers) - Auto-install mamba-ssm/causal-conv1d for SSM models (NemotronH, Falcon-H1) with --no-build-isolation --no-deps to avoid torch version conflicts - Add SO_REUSEADDR to port check in run.py (fixes Colab proxy stale connection falsely reporting port as in-use) Frontend: - Fix "Skip to Chat" navigation: use window.location.href instead of React Router navigate() to bypass useEffect redirect race - Fix "Skip Onboarding" on splash: navigates to /studio (not /chat) - Fix onboarding guard: only check isOnboardingDone() on initial mount - Fix Chat card on step 1: add sr-only spacer for consistent alignment - Fix Chat+Text both selected: clear RadioGroup value when Chat is selected * [pre-commit.ci] auto fixes from pre-commit.com hooks for 
more information, see https://pre-commit.ci * studio: split tools toggle into Search and Code buttons Replace the single "Tools" toggle with two independent toggles: - "Search" (globe icon) enables web search only - "Code" (terminal icon) enables Python and terminal execution Add enabled_tools list field to the inference payload so the backend only registers the tools the user has toggled on. Both toggles appear in the main composer and the compare composer. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: fix tool calling import validation and error logging Replace unsloth_zoo-dependent import checker with a standalone ast-based validator using sys.stdlib_module_names. This properly blocks non-stdlib imports (numpy, requests, etc.) and returns a clear error message to the model so it can rewrite using only stdlib. Add full traceback to tool streaming error logs for debugging. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: parse gpt-oss harmony channels for clean safetensors chat output gpt-oss models emit multi-channel output via harmony protocol tokens (<|channel|>analysis<|message|>... and <|channel|>final<|message|>...). TextIteratorStreamer with skip_special_tokens=True strips the special tokens but leaves channel names concatenated with content, producing garbled output like "analysisWe need to...assistantfinalHello!". Add HarmonyTextStreamer that decodes with skip_special_tokens=False, parses harmony markup via regex, and emits <think>analysis</think> for the analysis channel and plain text for the final channel -- reusing the existing frontend reasoning UI. Also expose supports_reasoning=True for non-GGUF gpt-oss models in the /status endpoint so the frontend enables the Think toggle. 
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: use unsloth_zoo for Python sandbox validation Set UNSLOTH_IS_PRESENT=1 and import check_python_modules and check_signal_escape_patterns directly from unsloth_zoo instead of a standalone fallback. This gives us the full Unsloth validation including stdlib-only import checks and signal/timeout escape pattern detection. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: allow all imports in Python tool sandbox Remove stdlib-only import restriction. Keep signal escape pattern detection via unsloth_zoo for safety. * studio: fix ReadTimeout on tool streaming final pass The 0.5s read timeout used for cancel-checking during streaming also fires when waiting for the first response from llama-server (e.g. reasoning model thinking for 15+ seconds). Add _stream_with_retry() context manager that retries on ReadTimeout while checking cancel_event, so the model has unlimited time to think before producing the first token. Applied to both the regular streaming path and the tool-calling final pass. * fix: rewrite HarmonyTextStreamer with stateful incremental parsing The delta-on-transformed approach had two critical bugs: 1. Before the full <|channel|>X<|message|> pattern was complete, the strip-tokens fallback emitted "analysis" as plain text. Then when the regex matched, _transform returned a completely different format (<think>...</think>) and the delta was computed against the wrong base string, producing fragments like "think>", "nk>", ">". 2. Even with full matches, the closing </think> tag shifted position as content grew, so text[prev_len:] produced garbled deltas. 
Replace with stateful incremental parsing that: - Buffers until a complete channel+message pair is seen - Emits <think> once when analysis channel first appears - Streams analysis content deltas (computed on channel content directly) - Emits </think> once when final channel first appears - Streams final content deltas - Closes open think tags in end() Also skip the generic all_special_tokens stripping in _clean_generated_text for gpt-oss since HarmonyTextStreamer already produces clean output and the generic stripping was mangling <think> tags. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix: strip all <|...|> tokens in gpt-oss cleanup, not just harmony subset The gpt-oss tokenizer has added tokens like <|return|> (id=200002) that are not part of the harmony channel protocol but can leak into output. The previous regex only stripped channel|message|start|end tokens. Broaden the _clean_generated_text regex for gpt-oss to <\|[a-z_]+\|> which catches all pipe-delimited tokens (return, constrain, reserved, etc.) without matching <think>/<\/think> tags. Verified: gpt-oss all_special_tokens are only <|return|>, <|reserved_200017|>, <|startoftext|> -- none overlap with <think>. The harmony tokens (channel, message, start, end) are added_tokens but not in all_special_tokens. * fix: hide config-only model repos from cached models list Repos that only have metadata/config files cached (no .safetensors or .bin weight files) were showing up in the Downloaded list with tiny sizes like "1.8 KB" or "24 KB". These are just leftover config snapshots from architecture checks, not usable models. Filter the cached-models endpoint to only include repos that contain actual model weight files (.safetensors or .bin). * studio: fix toast description text contrast in dark mode Add explicit !text-muted-foreground to toast description classNames so secondary text (e.g. 
"Releases VRAM and resets inference state.") is readable in dark mode. * studio: fix Chat card icon alignment with size-4 spacer Replace sr-only span (takes no space) with a size-4 shrink-0 div matching the RadioGroupItem dimensions in other cards, so the Chat icon aligns vertically with Text/Audio/Vision/Embeddings icons. --------- Co-authored-by: workspace <user@workspace.local> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Manan17 <shahmanan170602@gmail.com> Co-authored-by: Roland Tannous <rolandtannous@gravityq.ai>
Summary
"hf_...")Test plan