Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 5 additions & 1 deletion examples/text-generation/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -282,7 +282,7 @@ You will also need to add `--torch_compile` and `--parallel_strategy="tp"` in yo
Here is an example:
```bash
PT_ENABLE_INT64_SUPPORT=1 PT_HPU_LAZY_MODE=0 python ../gaudi_spawn.py --world_size 8 run_generation.py \
--model_name_or_path meta-llama/Llama-2-70b-hf \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--trim_logits \
--use_kv_cache \
--attn_softmax_bf16 \
Expand Down Expand Up @@ -593,6 +593,10 @@ For more details see [documentation](https://docs.habana.ai/en/latest/PyTorch/Mo
Llama2-7b in UINT4 weight only quantization is enabled using [AutoGPTQ Fork](https://github.com/HabanaAI/AutoGPTQ), which provides quantization capabilities in PyTorch.
Currently, the support is for UINT4 inference of pre-quantized models only.

```bash
BUILD_CUDA_EXT=0 python -m pip install -vvv --no-build-isolation git+https://github.com/HabanaAI/AutoGPTQ.git
```

You can run a *UINT4 weight quantized* model using AutoGPTQ by setting the following environment variables:
`SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=true` before running the command,
and by adding the argument `--load_quantized_model_with_autogptq`.
Expand Down