Fix Studio q2_k_l GGUF export and new llama.cpp converter package layout#667
Conversation
Two distinct Studio export bugs surfaced during macOS / Apple Silicon
training support testing. Both land in unsloth_zoo.llama_cpp so every
caller (CUDA save path, MLX / Studio export, notebooks) shares one path.
Bug 1: q2_k_l preset crashes the MLX export path
q2_k_l is an Unsloth preset that maps to
llama-quantize --output-tensor-type q8_0 --token-embedding-type q8_0
IN OUT q2_k NTHREADS
not a native llama.cpp ftype. The CUDA path expanded it via
_quantize_q2_k_l in unsloth.save; the MLX path in
unsloth_zoo.mlx.utils.save_pretrained_gguf forwarded the literal
string and llama-quantize aborted with
main: invalid ftype 'q2_k_l'
on Apple Silicon.
Fix: quantize_gguf now recognises the preset and expands it before
building the command. Preset-only: triggers only on
quant_type.strip().lower() == "q2_k_l". Every other ftype, including
the real llama.cpp q3_k_l, takes the existing code path byte-for-byte
unchanged. shell=True + _quote + IS_WINDOWS path-translation
preserved. The display log + error messages keep the preset name
(q2_k_l), not the internal rewrite (q2_k).
Bug 2: patcher drift on the new conversion/ package layout
Upstream llama.cpp split convert_hf_to_gguf.py into a thin entrypoint
plus a conversion/ package (conversion/base.py, conversion/qwen.py).
Three regex patches in _download_convert_hf_to_gguf still targeted
the old monolithic file and silently no-op'd:
- "No supported architectures (TEXT or VISION) could be determined"
ModelBase._model_classes is empty because per-arch modules
register lazily via load_all_models().
- "Metadata branding patch target not found"
self.metadata = gguf.Metadata.load(...) moved to
conversion/base.py, so produced GGUFs no longer carried
quantized_by="Unsloth", repo_url, and tags.
- "Qwen2MoE num_experts patch target not found"
conversion/qwen.py already calls
self.find_hparam(["num_local_experts", "num_experts"]) so the
legacy patch is obsolete.
Fix: _detect_converter_layout distinguishes monolith vs package
structurally (entrypoint contains `from conversion import` and
conversion/__init__.py + conversion/base.py exist on disk). On the
new layout:
- Arch allowlist is AST-extracted from
conversion/__init__.py:TEXT_MODEL_MAP / MMPROJ_MODEL_MAP, the
static dicts that already enumerate every supported arch. Cheaper
and safer than calling load_all_models() which imports every
per-arch module.
- Branding patch is applied in place to conversion/base.py with a
one-line idempotency marker (# UNSLOTH_BRANDING_APPLIED).
- Qwen patch is skipped with an info log when both expert-key
literals are already present in conversion/qwen.py.
Old monolithic layout: every existing patch runs unchanged.
Cache-key extension: _download_convert_hf_to_gguf folds
conversion/{__init__,base,qwen}.py mtimes + sizes into the
@lru_cache(1) key so a re-pulled llama.cpp checkout re-runs the
patcher.
Public return shape (patched_filename, text_archs, vision_archs)
preserved; callers in unsloth.save and unsloth_zoo.mlx.utils stay
source-compatible.
Cross-OS safety:
- q2_k_l branch fires only on that exact preset; q3_k_l (real
llama.cpp ftype) and every other ftype hit the original code path
unchanged.
- Windows shell=True + _quote + build/bin/Release path translation
untouched.
- On old monolithic llama.cpp checkouts the layout flag stays
"monolith" and behaviour is byte-for-byte identical to before.
Tests
tests/test_quantize_gguf_q2_k_l.py (5 tests):
- q2_k_l expands to q2_k + both q8_0 tensor-type flags
- case-insensitive (Q2_K_L / q2_K_L / padded whitespace)
- 17 other ftypes (q2_k, q3_k_s/m/l, q4_*, q5_*, q6_k, q8_0, bf16,
f16, f32) untouched -- the Linux + Windows non-regression gate
- print_output logs both the request and the expansion
- error messages keep the preset name, not the rewrite
tests/test_convert_hf_to_gguf_patcher.py (18 tests):
- synthetic package + monolith fixtures: detection, AST extraction,
branding-patch idempotency, lines around the target preserved,
Qwen alias detection, cache-key invalidation
- live upstream fetch of current llama.cpp master from
raw.githubusercontent.com: layout=package, branding patch
applies, Qwen aliases already handled, TEXT_MODEL_MAP contains
LlamaForCausalLM + Qwen*. Skipped cleanly on offline / rate-
limited runners.
CI
.github/workflows/studio-export-fix-ci.yml runs the unit tests on
ubuntu-latest, macos-14, and windows-latest. paths: filter so
unrelated PRs do not re-fire it. max-parallel: 3 so Windows stays
under the 5-concurrent-runner cap when this job runs alongside
others. Pure-Python; each job finishes in well under 2 minutes.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b6fc286a1c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # load_all_models() runs, so AST-parse the static TEXT_MODEL_MAP / | ||
| # MMPROJ_MODEL_MAP in conversion/__init__.py instead. Monolith | ||
| # layout keeps the original _model_classes introspection. | ||
| _layout = _detect_converter_layout(original_content, LLAMA_CPP_DEFAULT_DIR) |
There was a problem hiding this comment.
Detect the package layout beside the selected converter
When UNSLOTH_LLAMA_CPP_SCRIPTS_DIR points at a checkout with the new conversion/ package (and the Studio flow similarly installs/checks ./llama.cpp before calling this helper), this still inspects LLAMA_CPP_DEFAULT_DIR instead of the directory that contains the converter being patched. If ~/.unsloth/llama.cpp does not contain the matching conversion/__init__.py and conversion/base.py, the package entrypoint is treated as monolith (or its imports fail during the earlier temp-module load), so the new arch enumeration and branding patches do not apply and the generated converter can import a package that is not present. Resolve the package directory from the selected local converter/install rather than always using the default path.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch. Fixed in b3b2766. Added a single helper _get_llama_cpp_dir(local_script_info) that returns os.path.dirname(local_script_info[0]) when UNSLOTH_LLAMA_CPP_SCRIPTS_DIR is set, LLAMA_CPP_DEFAULT_DIR otherwise. Threaded the result through layout detection, AST extraction of TEXT_MODEL_MAP / MMPROJ_MODEL_MAP, the branding patch on conversion/base.py, the Qwen alias check on conversion/qwen.py, and the sibling-info cache key.
Added three tests under tests/test_convert_hf_to_gguf_patcher.py:
test_llama_cpp_dir_defaults_when_no_local_script— default fallback returnsLLAMA_CPP_DEFAULT_DIR.test_llama_cpp_dir_resolves_to_source_dir_when_local_script_set— synthetic 3-tuple override returns the source dir.test_patcher_anchors_on_custom_dir_when_override_set— full custom-tree round-trip: layout, sibling info, AST extraction all anchor on the override dir.
No behaviour change when the env var is unset (the common path).
There was a problem hiding this comment.
Code Review
This pull request updates the llama.cpp conversion script patcher to accommodate the transition from a monolithic script to a package-based layout. It introduces AST-based parsing for architecture extraction and implements layout-aware branding and Qwen-specific patches. Additionally, it adds support for the Q2_K_L quantization preset. The reviewer feedback identifies an opportunity to refactor duplicated directory resolution logic into a shared utility and points out a redundant newline in the branding patch implementation.
| # Resolve env vars + sibling mtimes on every call; both are folded into | ||
| # the @lru_cache key so re-pulled llama.cpp checkouts re-run the patcher. | ||
| return _download_convert_hf_to_gguf_cached( | ||
| name, | ||
| _resolve_local_convert_script(), | ||
| _conversion_sibling_info(LLAMA_CPP_DEFAULT_DIR), | ||
| ) |
There was a problem hiding this comment.
The logic for determining the llama.cpp directory is repeated multiple times. Refactor this into a shared helper function like _get_llama_cpp_dir to improve maintainability, as per the repository's practice for system-specific paths.
| # Resolve env vars + sibling mtimes on every call; both are folded into | |
| # the @lru_cache key so re-pulled llama.cpp checkouts re-run the patcher. | |
| return _download_convert_hf_to_gguf_cached( | |
| name, | |
| _resolve_local_convert_script(), | |
| _conversion_sibling_info(LLAMA_CPP_DEFAULT_DIR), | |
| ) | |
| _local_script_info = _resolve_local_convert_script() | |
| _llama_cpp_dir = _get_llama_cpp_dir(_local_script_info) | |
| return _download_convert_hf_to_gguf_cached( | |
| name, | |
| _local_script_info, | |
| _conversion_sibling_info(_llama_cpp_dir), | |
| ) |
References
- Refactor duplicated logic for finding system-specific paths into a shared helper function to improve maintainability.
There was a problem hiding this comment.
Done in b3b2766. added; threaded through the wrapper, layout detection, AST extraction, branding patch, Qwen check, and sibling-info cache key. The other five suggestions in this review covering the same refactor are addressed by the same commit.
There was a problem hiding this comment.
Done in b3b2766. _get_llama_cpp_dir(local_script_info) added; threaded through the wrapper, layout detection, AST extraction, branding patch, Qwen check, and sibling-info cache key. The other five suggestions in this review covering the same refactor are addressed by the same commit.
| temp_original_file_path = None # Initialize for finally block | ||
| # Set by introspection; read by Patch 2 + Patch 3 below. Default to | ||
| # 'monolith' so a failed introspection still drives the legacy patches. | ||
| _layout = "monolith" |
There was a problem hiding this comment.
Use the shared helper function _get_llama_cpp_dir to derive the base llama.cpp directory from the local script path.
| _layout = "monolith" | |
| _layout = "monolith" | |
| _llama_cpp_dir = _get_llama_cpp_dir(_local_script_info) |
References
- Refactor duplicated logic for finding system-specific paths into a shared helper function to improve maintainability.
| # load_all_models() runs, so AST-parse the static TEXT_MODEL_MAP / | ||
| # MMPROJ_MODEL_MAP in conversion/__init__.py instead. Monolith | ||
| # layout keeps the original _model_classes introspection. | ||
| _layout = _detect_converter_layout(original_content, LLAMA_CPP_DEFAULT_DIR) |
There was a problem hiding this comment.
Layout detection should use the directory derived from the shared helper function to ensure consistency.
| _layout = _detect_converter_layout(original_content, LLAMA_CPP_DEFAULT_DIR) | |
| _layout = _detect_converter_layout(original_content, _llama_cpp_dir) |
References
- Refactor duplicated logic for finding system-specific paths into a shared helper function to improve maintainability.
| logger.info(f"Unsloth: convert_hf_to_gguf layout detected: {_layout}") | ||
|
|
||
| if _layout == "package": | ||
| conv_init_py = os.path.join(LLAMA_CPP_DEFAULT_DIR, "conversion", "__init__.py") |
There was a problem hiding this comment.
Use the derived _llama_cpp_dir to locate the conversion package.
| conv_init_py = os.path.join(LLAMA_CPP_DEFAULT_DIR, "conversion", "__init__.py") | |
| conv_init_py = os.path.join(_llama_cpp_dir, "conversion", "__init__.py") |
References
- Refactor duplicated logic for finding system-specific paths into a shared helper function to improve maintainability.
| if re.search(rb"self\.metadata \= gguf\.Metadata\.load\(", patched_content): logger.warning("Unsloth: Metadata branding patch target found, but regex failed to apply.") | ||
| else: logger.warning("Unsloth: Metadata branding patch target 'self.metadata = gguf.Metadata.load(...)' not found.") | ||
| if _layout == "package": | ||
| conv_base_py = os.path.join(LLAMA_CPP_DEFAULT_DIR, "conversion", "base.py") |
There was a problem hiding this comment.
Use the derived _llama_cpp_dir to locate base.py for patching.
| conv_base_py = os.path.join(LLAMA_CPP_DEFAULT_DIR, "conversion", "base.py") | |
| conv_base_py = os.path.join(_llama_cpp_dir, "conversion", "base.py") |
References
- Refactor duplicated logic for finding system-specific paths into a shared helper function to improve maintainability.
| ) | ||
| _qwen_handled = False | ||
| if _layout == "package": | ||
| conv_qwen_py = os.path.join(LLAMA_CPP_DEFAULT_DIR, "conversion", "qwen.py") |
There was a problem hiding this comment.
Use the derived _llama_cpp_dir to locate qwen.py for alias checking.
| conv_qwen_py = os.path.join(LLAMA_CPP_DEFAULT_DIR, "conversion", "qwen.py") | |
| conv_qwen_py = os.path.join(_llama_cpp_dir, "conversion", "qwen.py") |
References
- Refactor duplicated logic for finding system-specific paths into a shared helper function to improve maintainability.
| + indent + b"if hasattr(self.metadata, 'quantized_by'): self.metadata.quantized_by = 'Unsloth'\n" | ||
| + indent + b"if hasattr(self.metadata, 'repo_url'): self.metadata.repo_url = 'https://huggingface.co/unsloth'\n" | ||
| + indent + b"if hasattr(self.metadata, 'tags'): self.metadata.tags = ['unsloth', 'llama.cpp']\n" |
There was a problem hiding this comment.
The current implementation adds an extra newline before the original next line because suffix already starts with a newline and an additional \n is appended to the last branding line. Removing the trailing \n from the last branding line will result in cleaner file content.
| + indent + b"if hasattr(self.metadata, 'quantized_by'): self.metadata.quantized_by = 'Unsloth'\n" | |
| + indent + b"if hasattr(self.metadata, 'repo_url'): self.metadata.repo_url = 'https://huggingface.co/unsloth'\n" | |
| + indent + b"if hasattr(self.metadata, 'tags'): self.metadata.tags = ['unsloth', 'llama.cpp']\n" | |
| + indent + b"if hasattr(self.metadata, 'quantized_by'): self.metadata.quantized_by = 'Unsloth'\n" | |
| + indent + b"if hasattr(self.metadata, 'repo_url'): self.metadata.repo_url = 'https://huggingface.co/unsloth'\n" | |
| + indent + b"if hasattr(self.metadata, 'tags'): self.metadata.tags = ['unsloth', 'llama.cpp']" |
There was a problem hiding this comment.
Addressed in 3a9a23c. Dropped the trailing \n on the last branding line so the captured suffix (which already starts with newline + indent) doesn't produce an extra blank line. Verified with the existing test_branding_patch_preserves_lines_around_target test and the live-upstream round-trip.
PR #667 review (Codex + Gemini bots): when UNSLOTH_LLAMA_CPP_SCRIPTS_DIR points at a checkout with the new conversion/ package, the layout detection, AST extraction, branding patch, Qwen check, and sibling-info cache key should all resolve against THAT dir, not the hard-coded default. The previous code always inspected LLAMA_CPP_DEFAULT_DIR, which silently mis-detects the layout when the override dir differs from the default install. Introduce a single helper _get_llama_cpp_dir(local_script_info) that returns os.path.dirname(local_script_info[0]) when the env var is set, LLAMA_CPP_DEFAULT_DIR otherwise. Thread the result through: - _download_convert_hf_to_gguf wrapper (sibling-info cache key) - _detect_converter_layout call site - conversion/__init__.py AST extraction - conversion/base.py branding patch target - conversion/qwen.py expert-alias check Adds three tests: - default fallback returns LLAMA_CPP_DEFAULT_DIR - synthetic 3-tuple override returns the source dir - full custom-tree round-trip: layout, sibling info, AST extraction all anchor on the override dir No behaviour change when UNSLOTH_LLAMA_CPP_SCRIPTS_DIR is unset (the common path). When set, the patcher now correctly targets the user's checkout instead of falling through to monolith and no-op'ing.
Three small follow-ups from PR #667 review: - Branding insertion: drop the trailing newline from the last branding line. The match group `suffix` already starts with a newline + indent (it captured the boundary into the line that follows Metadata.load), so the trailing \n on the last branding line produced a blank line between the branding block and the next statement. Aesthetic only; Python parses both forms identically. - Add a one-line explanatory comment inside `_detect_converter_layout`'s bare `except Exception:`. Behaviour unchanged. - Remove unused `logging` and `textwrap` imports from tests/test_convert_hf_to_gguf_patcher.py.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3a9a23c7e3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # load_all_models() runs, so AST-parse the static TEXT_MODEL_MAP / | ||
| # MMPROJ_MODEL_MAP in conversion/__init__.py instead. Monolith | ||
| # layout keeps the original _model_classes introspection. | ||
| _layout = _detect_converter_layout(original_content, _llama_cpp_dir) |
There was a problem hiding this comment.
Move package detection before importing local converters
When UNSLOTH_LLAMA_CPP_SCRIPTS_DIR points to a new package-layout checkout outside LLAMA_CPP_DEFAULT_DIR, this layout detection runs only after _load_module_from_path() has already imported a temp copy from LLAMA_CPP_DEFAULT_DIR (lines 1135-1150). If that default directory does not also contain the matching conversion/ package, the entrypoint's from conversion import ... raises ModuleNotFoundError, so the AST-based arch parsing and branding patch are never reached. Fresh evidence in the current revision is that _llama_cpp_dir is now resolved from the selected converter, but the import still happens before this newly added package branch can use it.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 0a3256e. Hoisted _detect_converter_layout above the temp-write + _load_module_from_path step. On the package layout we now skip the module-load entirely and go straight to AST-parsing TEXT_MODEL_MAP / MMPROJ_MODEL_MAP from the resolved converter directory. The original monolith path is preserved verbatim, just nested in the else branch.
original_module_name is initialized to None at the top of the function so the cleanup if original_module_name is not None and original_module_name in sys.modules stays safe on the package branch.
Added test_package_layout_does_not_require_module_import (tests/test_convert_hf_to_gguf_patcher.py): it monkeypatches _load_module_from_path with a sentinel that raises AssertionError if called, builds a synthetic package-layout tree under UNSLOTH_LLAMA_CPP_SCRIPTS_DIR, drives the patcher end-to-end, and asserts the sentinel never fired, base.py was branded in place under the override dir, and LlamaForCausalLM made it into text_archs. 27/27 local tests still pass.
When UNSLOTH_LLAMA_CPP_SCRIPTS_DIR points at a checkout outside LLAMA_CPP_DEFAULT_DIR, the previous order would temp-write the entrypoint inside the default dir and try to import it; on the new package layout, `from conversion import ...` then resolves against the default dir's missing conversion/ package and raises ModuleNotFoundError before AST arch extraction or branding can run. Hoist _detect_converter_layout above the temp-write/module-load step. On "package" layout, skip the module load entirely and go straight to AST-parsing TEXT_MODEL_MAP / MMPROJ_MODEL_MAP from the selected dir. Monolith path keeps the original behaviour. Added test_package_layout_does_not_require_module_import: a sentinel replaces _load_module_from_path and asserts it is NOT called on the package-layout path.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0a3256e80e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if _layout == "package": | ||
| conv_base_py = os.path.join(_llama_cpp_dir, "conversion", "base.py") | ||
| _branding_status = _apply_branding_patch_to_base(conv_base_py) |
There was a problem hiding this comment.
Save package-layout converters beside their package
When UNSLOTH_LLAMA_CPP_SCRIPTS_DIR points to a package-layout checkout outside LLAMA_CPP_DEFAULT_DIR, this branch patches conversion/base.py in the override tree, but the generated entrypoint is still written under LLAMA_CPP_DEFAULT_DIR later. Running that generated script resolves from conversion import ... relative to its saved location, so installs without a matching LLAMA_CPP_DEFAULT_DIR/conversion fail with ModuleNotFoundError (or import an unrelated package). Fresh evidence in the current revision is that the package branch still targets _llama_cpp_dir here while the patched entrypoint is saved elsewhere.
Useful? React with 👍 / 👎.
| original_module_name = None # Only set on the monolith branch | ||
| # Set by introspection; read by Patch 2 + Patch 3 below. Default to | ||
| # 'monolith' so a failed introspection still drives the legacy patches. | ||
| _layout = "monolith" |
| if local_script_info is not None: | ||
| return os.path.dirname(local_script_info[0]) | ||
| return LLAMA_CPP_DEFAULT_DIR | ||
| pass |
| _stat(base_py), | ||
| _stat(qwen_py) if os.path.isfile(qwen_py) else None, | ||
| ) | ||
| pass |
| # to monolith so the legacy regex patches still run. | ||
| pass | ||
| return "monolith" | ||
| pass |
| if isinstance(node.target, ast.Name) and node.target.id == dict_name: | ||
| _harvest(node.value) | ||
| return keys | ||
| pass |
| except OSError: | ||
| return "pattern-missing" | ||
| return "applied" | ||
| pass |
| except OSError: | ||
| return False | ||
| return (b"num_local_experts" in content) and (b"num_experts" in content) | ||
| pass |
…#677) * Fix Q2_K_L recipe: bumps were too aggressive, output was garbage The recipe shipped in #667 expanded q2_k_l to: llama-quantize --output-tensor-type q8_0 --token-embedding-type q8_0 IN OUT q2_k N which produced garbage output. The correct Unsloth Q2_K_L recipe is: llama-quantize \ --tensor-type ffn_down=Q3_K \ --output-tensor-type Q6_K \ --token-embedding-type Q4_K \ IN OUT q2_k N ffn_down=Q3_K relies on llama-quantize's substring matching to cover both ffn_down.weight (dense FFN) and ffn_down_exps.weight (MoE experts) across every layer id with one flag. Tests updated to match the new recipe; added a coverage test documenting that the bare "ffn_down" pattern intentionally hits both dense + MoE down-projection tensors so the override is not split into per-layer enumerations later. * Q2_K_L: chain anchored regex for ffn_down_exps + ffn_down llama-quantize matches --tensor-type with std::regex_search (not substring) and iterates first-match-wins, so we switch from one bare "ffn_down" pattern to two anchored regex patterns: --tensor-type "\.ffn_down_exps=Q3_K" --tensor-type "\.ffn_down=Q3_K" The leading \. anchors on the GGUF path-separator dot, ruling out accidental hits on hypothetical tensor names that merely contain "ffn_down" as a substring (e.g. attn_ffn_downsample). The more-specific MoE pattern is listed first because the loop in src/llama-quant.cpp breaks on first match -- otherwise the dense regex would absorb every MoE expert by prefix and the explicit override would be dead code. Tests updated: assert both patterns are present, assert MoE pattern precedes dense pattern, drop the substring-matching test in favour of an order-invariant test that captures the new contract. --------- Co-authored-by: Daniel Han <info@unsloth.ai>
|
Re-reviewed end-to-end after #677. Splitting the verdict because the two pieces of this PR landed with very different correctness: 1. Converter-package-layout patcher + tests + CI workflow — clean and unchanged on main.
2. q2_k_l recipe — shipped wrong, fixed by #677. On Expansion gave q2_k base with output=q8_0, embed=q8_0 and no ffn_down override — bloats file size without addressing the dominant inference-quality lever (ffn_down at Q2_K is the main degradation source on a q2_k base). On Output q8_0 → Q6_K, embed q8_0 → Q4_K, plus two Net: the converter-layout half was a clean, regression-free fix and is still useful on main. The q2_k_l half shipped broken and required a same-week followup (#677). The takeaway for next time: split a quant-recipe change from a converter-package-layout change so a recipe regression can be reverted independently, and gate the recipe behind a real quant-quality probe rather than a recipe-shape unit test. Cross-OS validation for similar Studio/MLX backend tests is pinned in danielhanchen/unsloth-staging-2#137 for the parent repo's cohort. |
Summary
Two distinct Studio export bugs surfaced during macOS / Apple Silicon training support testing. Both land in
unsloth_zoo.llama_cppso every caller (CUDA save path, MLX / Studio export, notebooks) shares one path.Bug 1: q2_k_l preset crashes the MLX export path
q2_k_lis an Unsloth preset, not a native llama.cpp ftype. It maps to:The CUDA path expanded it via
_quantize_q2_k_linunsloth.save. The MLX path inunsloth_zoo.mlx.utils.save_pretrained_ggufforwarded the literal string andllama-quantizeaborted on Apple Silicon with:Fix:
quantize_ggufnow recognises the preset and expands it before building the command. Preset-only: triggers only onquant_type.strip().lower() == "q2_k_l". Every other ftype, including the real llama.cppq3_k_l, takes the existing code path byte-for-byte unchanged.shell=True+_quote+IS_WINDOWSpath translation preserved. Display log and error messages keep the preset name (q2_k_l), not the internal rewrite (q2_k).Bug 2: patcher drift on the new conversion/ package layout
Upstream
llama.cppsplitconvert_hf_to_gguf.pyinto a thin entrypoint plus aconversion/package (conversion/base.py,conversion/qwen.py). Three regex patches in_download_convert_hf_to_ggufstill targeted the old monolithic file and silently no-op'd:Unsloth: No supported architectures (TEXT or VISION) could be determined from the original script.ModelBase._model_classesis empty because per-arch modules register lazily viaload_all_models().Unsloth: Metadata branding patch target 'self.metadata = gguf.Metadata.load(...)' not found.That line moved to
conversion/base.py, so produced GGUFs lostquantized_by="Unsloth",repo_url, andtags.Unsloth: Qwen2MoE num_experts patch target not found.conversion/qwen.pyalready callsself.find_hparam(["num_local_experts", "num_experts"])so the legacy patch is obsolete.Fix:
_detect_converter_layoutdistinguishes monolith vs package structurally (entrypoint containsfrom conversion importand bothconversion/__init__.py+conversion/base.pyexist on disk). On the new layout:conversion/__init__.py:TEXT_MODEL_MAP/MMPROJ_MODEL_MAP, the static dicts that already enumerate every supported arch. Cheaper and safer than callingload_all_models()which imports every per-arch module.conversion/base.pywith a one-line idempotency marker# UNSLOTH_BRANDING_APPLIED.conversion/qwen.py.Old monolithic layout: every existing patch runs unchanged.
Cache-key extension:
_download_convert_hf_to_gguffoldsconversion/{__init__,base,qwen}.pymtimes and sizes into the@lru_cache(1)key, so re-pulledllama.cppcheckouts re-run the patcher.Public return shape
(patched_filename, text_archs, vision_archs)preserved; callers inunsloth.saveandunsloth_zoo.mlx.utilsstay source-compatible.Cross-OS safety
q2_k_lbranch fires only on that exact preset.q3_k_l(a real llama.cpp ftype) and every other ftype hit the original code path unchanged.shell=True+_quote+build/bin/Releasepath translation untouched.llama.cppcheckouts the layout flag stays"monolith"and behaviour is byte-for-byte identical to before.Test plan
tests/test_quantize_gguf_q2_k_l.py(5 tests):q3_k_l(Linux + Windows non-regression gate)tests/test_convert_hf_to_gguf_patcher.py(18 tests):llama.cppmaster fromraw.githubusercontent.com: layout=package, branding patch applies, Qwen aliases already handled, TEXT_MODEL_MAP containsLlamaForCausalLMandQwen*. Skipped cleanly on offline / rate-limited runners.CI:
.github/workflows/studio-export-fix-ci.ymlruns the unit tests onubuntu-latest,macos-14, andwindows-latest.paths:filter so unrelated PRs do not re-fire it.max-parallel: 3so Windows stays under the 5-concurrent-runner cap when this job runs alongside others. Pure-Python; each job finishes in well under 2 minutes.Validated end-to-end on all three OSes via staging-fork PRs (unsloth-zoo-staging-1#18 and unsloth-staging-2#127) before opening this one.
Checklist
unsloth.saveleft untouched (byte-for-byte identical behaviour for non-q2_k_l ftypes)