Support -fa auto in llama-bench #23714

Merged: gaugarg-nv merged 2 commits into ggml-org:master from gaugarg-nv:fa_auto_llama_bench on May 30, 2026

@gaugarg-nv
Contributor

Support `-fa on|off|auto` in llama-bench, as in the other tools. The default remains `-fa off` so that existing behavior does not change, while `-fa auto` lets llama-bench match the behavior of llama-server and llama-cli.

Change the default value of `-ngl` to -1, as in the other tools. For most models this has no impact, since the previous default was 99.

Update README with the latest usage and examples.
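
The three-valued `-fa` option described above can be pictured as a small string-to-mode mapping. The sketch below is illustrative only and is not the code from this PR: the `fa_mode` enum and the `parse_fa` function are hypothetical names, and the real llama-bench parser may accept further spellings or lists of values.

```cpp
#include <optional>
#include <string_view>

// Hypothetical stand-in for the flash-attention setting.
// This is NOT the enum from llama.h; it only mirrors the on/off/auto idea.
enum class fa_mode { automatic, off, on };

// Map the value given to "-fa" onto a mode.
// Returns std::nullopt for anything other than "on", "off" or "auto",
// so the caller can report an invalid argument.
constexpr std::optional<fa_mode> parse_fa(std::string_view value) {
    if (value == "auto") return fa_mode::automatic;
    if (value == "off")  return fa_mode::off;
    if (value == "on")   return fa_mode::on;
    return std::nullopt;
}
```

A caller would check the returned optional and fall back to the tool's default when `-fa` is not given at all; which default applies is a policy choice that is discussed later in this thread.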

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, to update the readme and for code review.

Make the default value of `-ngl` -1, similar to other tools.

Update README with latest usage and examples
Contributor

@JohannesGaessler JohannesGaessler left a comment

My opinion is that we should just change the llama-bench default to LLAMA_FLASH_ATTN_TYPE_AUTO to be consistent with the rest of the codebase.

Comment thread tools/llama-bench/llama-bench.cpp Outdated
Comment thread tools/llama-bench/README.md Outdated
@gaugarg-nv
Contributor Author

> My opinion is that we should just change the llama-bench default to LLAMA_FLASH_ATTN_TYPE_AUTO to be consistent with the rest of the codebase.

Sure, made it the default now.

Contributor

@JohannesGaessler JohannesGaessler left a comment

LGTM but llama-bench has a lot of stakeholders.

@gaugarg-nv
Contributor Author

Thanks, @JohannesGaessler.

@ggerganov could you please take a look as well?

To give you some background: the request for some of these changes comes from our automation team, which uses llama-bench for some of its regression testing. I would like to ensure that llama-bench behavior stays as close to llama-cli and llama-server as possible.

@gaugarg-nv
Contributor Author

@ggml-org/maintainers, can I get a second approval, please?

@gaugarg-nv gaugarg-nv merged commit aa46bda into ggml-org:master May 30, 2026
27 checks passed
@gaugarg-nv gaugarg-nv deleted the fa_auto_llama_bench branch May 30, 2026 20:34
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026
* Support `-fa auto` in llama-bench

Make the default value of `-ngl` -1, similar to other tools.

Update README with latest usage and examples

* Address review comments