update gguf docs (pytorch#794)
metascroy authored and malfet committed Jul 17, 2024
1 parent bcdbe5e commit 594414f
Showing 2 changed files with 28 additions and 41 deletions.
44 changes: 4 additions & 40 deletions docs/ADVANCED-USERS.md
@@ -132,22 +132,10 @@ GGUF model with the option `--load-gguf ${MODELNAME}.gguf`. Presently,
the F16, F32, Q4_0, and Q6_K formats are supported and converted into
native torchchat models.

You may also dequantize GGUF models with the GGUF quantize tool, and
then load and requantize with torchchat native quantization options.

| GGUF Model | Tested | Eager | torch.compile | AOT Inductor | ExecuTorch | Fits on Mobile |
|-----|--------|-------|-----|-----|-----|-----|
| llama-2-7b.Q4_0.gguf | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 |

**Please note that quantizing and dequantizing is a lossy process, and
you will get the best results by starting with the original
unquantized model checkpoint, not a previously quantized and then
dequantized model.**
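
As a rough sketch of that flow, assuming the GGML tools are installed at `${GGUF}` and using the llama-2-7b checkpoint from the table above (the output file name and the tokenizer path are illustrative):

```
# Dequantize a Q4_0 GGUF checkpoint back to F16 using GGUF's quantize tool
${GGUF}/quantize --allow-requantize llama-2-7b.Q4_0.gguf llama-2-7b.F16.gguf f16

# Load the F16 GGUF into torchchat; torchchat's native quantization options
# (see the quantization section below) can then be applied on top
python3 torchchat.py generate --gguf-path llama-2-7b.F16.gguf --tokenizer-path ${GGUF_TOKENIZER_PATH} --prompt "Once upon a time"
```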


## Conventions used in this document

We use several variables in this example, which may be set as a
@@ -232,7 +220,7 @@ submission guidelines.)

Torchchat supports several devices. You may also let torchchat use
heuristics to select the best available device using
torchchat's virtual device named `fast`.
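
For example, a run that defers device selection to torchchat might look like the sketch below (assuming device selection is exposed through a `--device` flag, and reusing the GGUF paths from docs/GGUF.md as a stand-in for a real model):

```
# Let torchchat's heuristics pick the best available device (CUDA, MPS, or CPU)
python3 torchchat.py generate --device fast --gguf-path ${GGUF_MODEL_PATH} --tokenizer-path ${GGUF_TOKENIZER_PATH} --prompt "Once upon a time"
```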

Torchchat supports execution using several floating-point datatypes.
Please note that the selection of execution floating point type may
@@ -398,9 +386,9 @@ linear operator (asymmetric) with GPTQ | n/a | 4b (group) | n/a |
linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |

## Model precision (dtype precision setting)
On top of quantizing models with quantization schemes mentioned above, models can be converted
to lower precision floating point representations to reduce the memory bandwidth requirement and
take advantage of higher density compute available. For example, many GPUs and some of the CPUs
have good support for bfloat16 and float16. These can be selected via the `--dtype` argument, as shown below.

[skip default]: begin
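For example (a sketch assuming `--dtype` accepts a `bf16` value, with the GGUF paths from docs/GGUF.md standing in for a real model):

```
# Run generation with weights and activations held in bfloat16
python3 torchchat.py generate --dtype bf16 --gguf-path ${GGUF_MODEL_PATH} --tokenizer-path ${GGUF_TOKENIZER_PATH} --prompt "Once upon a time"
```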
@@ -439,30 +427,6 @@ may dequantize them using GGUF tools, and then load the model into
torchchat to quantize with torchchat's quantization workflow.)


## Loading unsupported GGUF formats in torchchat

GGUF formats not presently supported natively in torchchat may be
converted to one of the supported formats with GGUF's
`${GGUF}/quantize` utility to be loaded in torchchat. If you convert
to the FP16 or FP32 formats with GGUF's `quantize` utility, you may
then requantize these models with torchchat's quantization workflow.

**Note that quantizing and dequantizing is a lossy process, and you will
get the best results by starting with the original unquantized model
checkpoint, not a previously quantized and then dequantized
model.** Thus, while you can convert your q4_1 model to FP16 or FP32
GGUF formats and then requantize, you might get better results if you
start with the original FP16 or FP32 GGUF format.

To use the quantize tool, install the GGML tools at ${GGUF}. Then,
you can, for example, convert a quantized model to f16 format:

[end default]: end
```
${GGUF}/quantize --allow-requantize your_quantized_model.gguf fake_unquantized_model.gguf f16
```


## Optimizing your model for server, desktop and mobile devices

While we have shown the export and execution of a small model on CPU
25 changes: 24 additions & 1 deletion docs/GGUF.md
@@ -56,7 +56,6 @@ python3 torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --dso-path ${GGUF_S
```


### ExecuTorch export + generate
Before running this example, you must first [Set-up ExecuTorch](executorch_setup.md).
```
@@ -67,4 +66,28 @@ python3 torchchat.py export --gguf-path ${GGUF_MODEL_PATH} --output-pte-path ${G
python3 torchchat.py generate --gguf-path ${GGUF_MODEL_PATH} --pte-path ${GGUF_PTE_PATH} --tokenizer-path ${GGUF_TOKENIZER_PATH} --temperature 0 --prompt "Once upon a time" --max-new-tokens 15
```

### Advanced: loading unsupported GGUF formats in torchchat
GGUF formats not presently supported natively in torchchat can be
converted to one of the supported formats with GGUF's
[quantize](https://github.com/ggerganov/llama.cpp/tree/master/examples/quantize) utility.
If you convert to the FP16 or FP32 format, you can
then requantize the model with torchchat's native quantization workflow.

**Please note that quantizing and dequantizing is a lossy process, and
you will get the best results by starting with the original
unquantized model, not a previously quantized and then
dequantized model.**

As an example, suppose you have [llama.cpp cloned and installed](https://github.com/ggerganov/llama.cpp) at ~/repos/llama.cpp.
You can then convert a model to FP16 with the following command:

[skip default]: begin
```
~/repos/llama.cpp/quantize --allow-requantize path_of_model_you_are_converting_from.gguf path_for_model_you_are_converting_to.gguf fp16
```
[skip default]: end

After the model is converted to a supported format like FP16, you can proceed using the instructions above.
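
For example, once the conversion above finishes, the FP16 output can be fed straight back into the eager generate flow shown earlier (the tokenizer path below is whatever `tokenizer.model` accompanies your checkpoint):

[skip default]: begin
```
# Generate with the freshly converted FP16 GGUF model
python3 torchchat.py generate --gguf-path path_for_model_you_are_converting_to.gguf --tokenizer-path ${GGUF_TOKENIZER_PATH} --temperature 0 --prompt "Once upon a time" --max-new-tokens 15
```
[skip default]: end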


[end default]: end
