
Conversation


@compilade compilade commented Oct 24, 2025

Follow-up to #14810.

I accidentally broke MXFP4 GPT-OSS conversion in #14810:

  File "/.../llama.cpp/convert_hf_to_gguf.py", line 656, in __init__
    super().__init__(*args, **kwargs)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/.../llama.cpp/convert_hf_to_gguf.py", line 155, in __init__
    self.dequant_model()
    ~~~~~~~~~~~~~~~~~~^^
  File "/.../llama.cpp/convert_hf_to_gguf.py", line 375, in dequant_model
    raise NotImplementedError(f"Quant method is not yet supported: {quant_method!r}")
NotImplementedError: Quant method is not yet supported: 'mxfp4'

This PR makes the convert script avoid running dequant_model() for GPT-OSS when the quant_method of the quantization_config is mxfp4.

Tested with a --dry-run conversion of https://huggingface.co/openai/gpt-oss-20b, which no longer fails.
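
A minimal sketch of the guard described above, assuming the hparams/quantization_config layout visible in the traceback (not the exact upstream patch; the class name is hypothetical):

```python
# Sketch only: skip the generic dequantization step when the checkpoint's
# quantization_config reports MXFP4, which GPT-OSS conversion handles natively.
class ModelBaseSketch:
    def __init__(self, hparams: dict):
        self.hparams = hparams

    def dequant_model(self):
        quant_config = self.hparams.get("quantization_config", {})
        quant_method = quant_config.get("quant_method", "")

        if quant_method == "mxfp4":
            # Leave the MXFP4 tensors as-is instead of falling through to the
            # unsupported-quant error below.
            return

        raise NotImplementedError(f"Quant method is not yet supported: {quant_method!r}")


# A GPT-OSS-style config no longer raises:
ModelBaseSketch({"quantization_config": {"quant_method": "mxfp4"}}).dequant_model()
```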



@compilade compilade requested a review from CISC as a code owner October 24, 2025 11:59
@compilade compilade added the bugfix (fixes an issue or bug) and python (python script changes) labels Oct 24, 2025
@compilade compilade merged commit 5cca254 into master Oct 25, 2025
10 checks passed
wqerrewetw added a commit to wqerrewetw/llama.cpp that referenced this pull request Oct 25, 2025
* model-conversion : add trust_remote_code for orig model run [no ci] (ggml-org#16751)

This commit adds the trust_remote_code=True argument when loading models
with AutoConfig, AutoTokenizer, and AutoModelForCausalLM in the script that
runs the original model.

The motivation for this is that some models require custom code to be
loaded properly, and setting trust_remote_code=True avoids a prompt
asking for user confirmation:
```console
(venv) $ make causal-run-original-model
The repository /path/to/model contains custom code which must be
executed to correctly load the model. You can inspect the repository
content at /path/to/model.

Do you wish to run the custom code? [y/N] N
```

Having this as the default seems like a safe choice: we have to clone or
download the models we convert anyway, so we would already expect to run any
custom code they ship.
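
For context, a minimal sketch of the loading pattern this refers to, assuming the standard Hugging Face transformers API (model_path is a placeholder, not necessarily the script's actual variable name):

```python
from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

model_path = "/path/to/model"  # placeholder for the cloned/downloaded checkpoint

# trust_remote_code=True lets checkpoints that ship custom modeling code load
# without the interactive [y/N] confirmation shown above.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
```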

* webui: support q URL parameter (ggml-org#16728)

* webui: support q URL parameter

Fixes ggml-org#16722
I’ve checked that it works with Firefox’s AI tools

* webui: apply suggestions from code review

Co-authored-by: Aleksander Grygier <[email protected]>

* chore: update webui static build

---------

Co-authored-by: Aleksander Grygier <[email protected]>

* CUDA: use CUB for arbitrary size argsort (ggml-org#16754)

* ggml: fix CUDA grid launch condition for large block_nums.y in binbcast (ggml-org#16742)

* Fix CUDA grid launch condition for large block_nums.y

* add backend ops test

* reduce test repetitions

* convert : avoid dequantizing mxfp4 for GPT-OSS (ggml-org#16756)

* vulkan: Optimize SSM_SCAN (ggml-org#16645)

* vulkan: delete dead code (ggml-org#16732)

ggml_vk_create_buffer_temp is not used anywhere, and it is the only
caller for ggml_vk_pool_malloc.

Signed-off-by: Giuseppe Scrivano <[email protected]>

* model : set res->t_embd in PLaMo2 models (ggml-org#16766)

---------

Signed-off-by: Giuseppe Scrivano <[email protected]>
Co-authored-by: Daniel Bevenius <[email protected]>
Co-authored-by: Florian Badie <[email protected]>
Co-authored-by: Aleksander Grygier <[email protected]>
Co-authored-by: Aman Gupta <[email protected]>
Co-authored-by: leejet <[email protected]>
Co-authored-by: compilade <[email protected]>
Co-authored-by: Jeff Bolz <[email protected]>
Co-authored-by: Giuseppe Scrivano <[email protected]>
Co-authored-by: Shunta Saito <[email protected]>
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 25, 2025