
@DajanaV (Contributor) commented on Nov 1, 2025

Mirrored from ggml-org/llama.cpp#16906

This pull request introduces support for the Janus‑Pro 1B and Janus‑Pro 7B models within the llama.cpp framework.

The focus of this update is image understanding (i.e., visual input → textual or conceptual output).
Image generation is not covered by this PR.

Usage & Current Progress

Convert models to GGUF files:

# Convert the base Janus-Pro 1B model
python convert_hf_to_gguf.py deepseek-community/Janus-Pro-1B \
    --outfile janus-pro-1b-f16.gguf \
    --remote \
    --outtype f16

# Convert the multimodal projection (mmproj) component
python convert_hf_to_gguf.py deepseek-community/Janus-Pro-1B \
    --outfile mmproj-janus-pro-1b-f16.gguf \
    --remote \
    --outtype f16 \
    --mmproj
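
The 7B variant can be converted the same way; only the Hugging Face repository id changes (see the model cards referenced below). The output file names here are only examples.

# Convert the Janus-Pro 7B model and its mmproj analogously
python convert_hf_to_gguf.py deepseek-community/Janus-Pro-7B \
    --outfile janus-pro-7b-f16.gguf \
    --remote \
    --outtype f16

python convert_hf_to_gguf.py deepseek-community/Janus-Pro-7B \
    --outfile mmproj-janus-pro-7b-f16.gguf \
    --remote \
    --outtype f16 \
    --mmproj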

Run the model:

# Build the project:
cmake -B build
cmake --build build --target llama-mtmd-cli

./build/bin/llama-mtmd-cli \
    -m janus-pro-1b-f16.gguf \
    --mmproj mmproj-janus-pro-1b-f16.gguf \
    --chat-template deepseek
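
To exercise image understanding, the CLI can be given an image together with a text prompt. A minimal sketch, with the image path and prompt as placeholders:

# Describe an image (paths and prompt are placeholders)
./build/bin/llama-mtmd-cli \
    -m janus-pro-1b-f16.gguf \
    --mmproj mmproj-janus-pro-1b-f16.gguf \
    --chat-template deepseek \
    --image path/to/image.jpg \
    -p "Describe this image."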

References

Janus-Pro 1B model card (Hugging Face):
https://huggingface.co/deepseek-community/Janus-Pro-1B

Janus-Pro 7B model card (Hugging Face):
https://huggingface.co/deepseek-community/Janus-Pro-7B

Configurations:
https://huggingface.co/deepseek-community/Janus-Pro-1B/blob/main/config.json
https://huggingface.co/deepseek-community/Janus-Pro-7B/blob/main/config.json

HF Implementation:
https://github.com/huggingface/transformers/tree/main/src/transformers/models/janus

ggerganov and others added 30 commits September 29, 2025 17:43
This commit removes the `-dev` suffix from the version string in
CMakeLists.txt and the release script. The version will now be
formatted simply as `MAJOR.MINOR.PATCH`.
* ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (whisper/3426)

* sync : whisper.cpp
* ggml: add spacemit backend

Change-Id: I249bdc043485d815a9c351867137bc1e27cc2e23

* add new line at end of file

Change-Id: I889ed1c85fb45e62350ecde0c06f70450cadfbe2

* add riscv zba extension limit

Change-Id: I321eb200f859751727afe5cae13074dfce2bb0ce

* fixes for review comments: file renamed and reformatted

Change-Id: Ia20b6ec24a36638e62e0fe07cf100916a7cce3ce

* fixed code format after clang-format

Change-Id: I5dc33a0412da3d3f2d77075d8939185d3009eca2

* use _Float16 instead of __fp16

Change-Id: I039fb02bb95270e641bc4442204e658735859d43

* add ci for riscv64-spacemit-ime-native

Change-Id: I711c1033061df1a289ea77891b2997599dfe8279

* update debian-13-riscv64-spacemit-ime-native ci label

Change-Id: Ifb2b891e2fca57b5da604fce2ac255f27731179a

* remove license comment for spacemit ime

Change-Id: If0dc3ca30a958631ccca0a28b62e0b825f9fb0c3

* upgrade binutils for gcc ime

Change-Id: Ibf2fa74c1064408974cb5b45f044d40987e5fb45

* add spacemit ime cross jobs

Change-Id: I80d74909941d41cb9cd09e51d8baf01c985cbfc6

* remove native compile for riscv64-spacemit-ime

Change-Id: I01920afafdc73fa7424014fd648d243f8ec9e25e

* ci : add caching for spacemit ime cross toolchain

Change-Id: Ic54a192019a2fd982bbd58225ce3bbc38f4053de

* ci: bug fixed for cache path and env

Change-Id: I28c42e10b6fff053bb6580926ca2353448cb042a

* Update .github/workflows/build-linux-cross.yml for cache path

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* bugfix for build-linux-cross.yml syntax error

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: cailinxi <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ci : add AMD runners and workflows

* ci : move AMD jobs to separate workflow

* cont : fix paths
…locks (#16326)

* fix: prevent reasoning blocks with quotes from being truncated

* chore: update webui build output

* feat: Improve thinking content parsing

* test: Adds ChatMessage component stories for different thinking blocks

* chore: update webui build output

* fix: ChatMessage story fix

---------

Co-authored-by: Aleksander Grygier <[email protected]>
…ounding differences (#16295)

* tests: override test_set_rows::max_nmse_err to allow for occasional rounding differences

* apply similar error bounds to test_cpy
The JSON parser is temporarily kept only for backward compatibility. It
reads the etag from old .json files to prevent unnecessary re-downloads
for existing users.

This legacy code can be removed in a future version.

Signed-off-by: Adrien Gallouët <[email protected]>
* metal : dynamic simdgroups for MV kernels

* cont : minor
* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs

* fix to ensure test-backend-ops check passes
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.

The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.

Signed-off-by: Adrien Gallouët <[email protected]>
* Work on rope

* Simplify inplace operation generation and combine mul/add generation

* Work on rope variants

* implement neox rope

* rope complete

* Add sub,div,glu operators

* implement scale op

* Update cpy shader to handle cont/more types

* formatting

* Update test vars printing for rope,rms_norm

* Avoid ROPE hardcoded constants

* Add TODO to change ROPE constants to enum

Co-authored-by: Georgi Gerganov <[email protected]>

* fix TODO comment

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* fix: skip empty sampling fields instead of coercing to 0 in chat API options

* chore: update webui build output
* common : disable progress bar without a tty

Signed-off-by: Adrien Gallouët <[email protected]>

* Add missing headers

Signed-off-by: Adrien Gallouët <[email protected]>

---------

Signed-off-by: Adrien Gallouët <[email protected]>
* fix ccache key for ubuntu-cpu-cmake

* set it for release as well [no ci]
…#16359)

* Make a few GLM tensors not required

layer.nextn.shared_head_head and layer.nextn.embed_tokens are both excluded from GLM 4.6, which resulted in the model not loading after conversion/quantization. This marks those tensors as not required, which makes it work.

* Update llama-model.cpp

layer.nextn.shared_head_norm also not required in case of future models
…(#16345)

* make ggml_vk_default_dispatcher support older vulkan headers

* simplify with `using`
* feat: Add a setting to include model name used to generate the message

* feat: UI improvements

* feat: Save model info along with the database message entry creation

* chore: Build webui static output
* feat: Improve code block theming

* chore: update webui build output

* chore: Update webui static build
…onditional rendering for Actions Dropdown for Chat Conversation Items (#16369)

* fix: Render Conversation action dialogs as singletons from Chat Sidebar level

* chore: update webui build output

* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup

* chore: Update webui static build

* fix: Always truncate conversation names

* chore: Update webui static build
CISC and others added 7 commits October 29, 2025 14:09
* sync minja.hpp

Adds Call/EndCall support, used in MiniCPM3 and MiniCPM4-MCP.

* remove spurious semicolon

* sync from ochafik/minja
* CUDA: use fastdiv in set-rows

* add assert about value fitting in u32
* hexagon: remove dspqueue callbacks and do all read processing inplace

* hexagon: there is no need to ref/deref the buffers at this point

We're not going to release the buffers without flushing the session queue.
So there is no need to inc/dec the refcounts for every request.
We also don't need to include those bufs in the response.

* hexagon: bump the thread count in the adb wrapper scripts

We can use more CPU cores now that the dedicated dspqueue polling threads are not used (ie no contention).
Also enable more aggressive polling for now, since we still map Flash Attention (and a few other kernels) to
the CPU and those dspqueue threads were keeping the CPU cores at higher clock freqs.

* hexagon: add lhez as the second code owner
* vulkan: add mmq q2_k integer dot support

* Refactor mmq caching

* Reduce mmq register use

* Load 4 quant blocks into shared memory in one step

* Pack q2_k blocks into caches of 32

* Use 32-bit accumulators for integer dot matmul

* Add q4_k mmq

* Add q3_k mmq

* Add q5_k mmq

* Add q6_k mmq

* Add mxfp4 mmq, enable MMQ MUL_MAT_ID

* Fix mmv dm loads
* vulkan: Update topk_moe fusion to handle gpt's late softmax

Based on #16649.

* Add ggml_check_edges

* Add sync logging to show fusion effects

* handle clamp added in #16655

* Update ggml/src/ggml-impl.h

Co-authored-by: Diego Devesa <[email protected]>
* llama: store mrope data in KV cell

* correct x,y ordering

* address review comments

* add consistency checks

* Update src/llama-kv-cache.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* add TODO

* fix asan error

* kv-cells : improve ext handling

* cont : fix headers

---------

Co-authored-by: Georgi Gerganov <[email protected]>
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: LLaMA.cpp Critical Functions

Critical Function Performance Status

All core inference and processing functions show no measurable performance changes between versions:

Inference Functions

  • llama_decode: 49,003,720 ns response time (no change)
  • llama_encode: 12,329,177 ns response time (no change)
  • llama_tokenize: 834,827 ns response time (no change)

Model Management Functions

  • llama_model_load_from_file: 333,126,340 ns response time (no change)
  • llama_batch_init: 257 ns response time (no change)
  • llama_memory_clear: 49 ns response time (no change)

Function Modification Status

All analyzed critical functions report "is_modified": false, indicating no code changes between versions.

Key Performance Indicator Impact Analysis

1. Tokens Per Second

Status: No impact on inference throughput

  • llama_decode: No change in 49 million ns response time
  • llama_encode: No change in 12 million ns response time
  • llama_tokenize: No change in 835,000 ns response time

Reference Impact: Based on the provided benchmark (a 2 ms llama_decode slowdown corresponds to roughly a 7% tokens/sec reduction), the absence of changes in these functions indicates no tokens-per-second degradation.

2. Power Consumption

Status: Minimal impact on binary level

  • build.bin.libllama.so: -0.0% change (306,978.33 nJ → 306,978.09 nJ)
  • build.bin.libggml-base.so: 0.0% change (90,434 nJ)
  • build.bin.libggml-cpu.so: 0.0% change (151,692 nJ)
  • build.bin.libggml.so: 0.0% change (6,339 nJ)

Impacted Functions: The 0.24 nJ reduction in libllama.so correlates with the __copy_move_b function micro-optimization rather than core inference changes.

3. Quantization Efficiency

Status: No impact

  • llama_model_quantize: Function not present in performance data, indicating no execution during profiling
  • Quantization support functions: No changes detected in core quantization pathways

4. Memory Usage

Status: No impact on memory management functions

  • llama_memory_clear: 49 ns (no change)
  • KV cache functions: No performance changes detected
  • Memory allocation patterns: Stable across versions

5. Batch Processing

Status: No impact on batch operations

  • llama_batch_init: 257 ns (no change)
  • llama_decode batch processing: 49 million ns (no change)
  • Batch allocation functions: No performance degradation

Action Items for Performance Optimization

Build System Optimizations

  1. Address STL micro-regressions: The __copy_move_b function shows 0.08 ns increase

    • Apply link-time optimization (-flto) to reduce PLT overhead
    • Use profile-guided optimization (-fprofile-use) for hot path optimization
    • Consider the -fno-plt compiler flag to eliminate procedure linkage table overhead (a minimal CMake sketch follows this list)
  2. Compiler optimization consistency:

    • Ensure consistent compiler flags across builds
    • Verify identical optimization levels between versions
    • Check for compiler version differences affecting code generation
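
As a rough illustration (not part of this PR), the flags suggested above could be passed through standard CMake variables when configuring the build; note that -fprofile-use additionally requires profile data from a prior -fprofile-generate run, and the project may also expose its own LTO toggle.

# Sketch only: applying -flto / -fno-plt via standard CMake flags
cmake -B build \
    -DCMAKE_C_FLAGS="-flto -fno-plt" \
    -DCMAKE_CXX_FLAGS="-flto -fno-plt"
cmake --build build --target llama-mtmd-cli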

Code-Level Optimizations

  1. Token data structure efficiency: The __copy_move_b regression affects llama_token_data copying

    • Review structure layout for cache alignment
    • Consider structure-of-arrays vs array-of-structures for bulk operations
    • Evaluate vectorization opportunities for 12-byte token data elements
  2. Memory access pattern optimization:

    • Profile memory access patterns in token copying operations
    • Consider prefetching strategies for large token arrays
    • Evaluate SIMD instruction usage for bulk token operations

Performance Impact Assessment

The analysis reveals stable performance across all critical inference functions. The only measurable change is a 0.08 ns micro-regression in STL copy operations, representing less than 0.1% impact on any performance metric.

Key Findings:

  • Core inference pipeline remains unchanged
  • No functional modifications to critical paths
  • Power consumption effectively unchanged
  • All KPIs maintain baseline performance levels

The performance stability indicates that the Janus Pro model additions in PR #32 successfully isolate new functionality without impacting existing inference performance.
