Integrate FARA-7B model by apsonawane · Pull Request #1902 · microsoft/onnxruntime-genai

apsonawane · 2025-12-03T00:33:33Z

This pull request introduces significant improvements to the configuration system and inference pipeline, especially for vision and multimodal models. It adds support for object-style pipeline definitions, enhances flexibility in model configuration, and refines input handling for sliding window models. The changes also include a new Python example for Fara VLM inference using the updated system.

Configuration and Pipeline Enhancements

Added support for object-style pipeline definitions for both decoder and vision models, allowing pipelines to be specified as objects with named stages and parameters, increasing flexibility for complex model architectures. (src/config.cpp [1] [2] [3] [4] [5]; src/config.h [6])
Introduced new configuration fields for vision pipelines, including PipelineModel and WindowIndexing, supporting advanced vision processing workflows. (src/config.h R162-R179)
Added support for image_token_id in the model configuration to enable multimodal token handling. (src/config.cpp [1]; src/config.h [2])
Improved handling of alternate naming for image size inputs, mapping image_grid_thw to image_sizes for greater compatibility. (src/config.cpp R622-R623)

Inference and Example Updates

Added a comprehensive Python example (fara_inference.py) demonstrating multimodal inference with image and text inputs, including detailed prompt formatting and output handling for tool calls. (examples/python/fara_inference.py R1-R142)

Input Handling Improvements

Refined padding logic for input IDs in sliding window models, correctly handling left and right alignment during device allocation for improved inference accuracy. (src/generators.cpp L321-R338)
Updated decoder-only state initialization to use the new position input creation logic, reflecting changes in the configuration system. (src/models/decoder_only.cpp L19-R21)

Dependency Update

Updated the onnxruntime_extensions dependency to a newer commit, ensuring compatibility with recent features. (cmake/deps.txt L17-R17)

These changes collectively make the system more flexible for multimodal and vision models, improve configuration clarity and extensibility, and provide a robust example for users to follow.

tianleiwu · 2025-12-03T19:16:22Z

From AI:

Code Review & Suggestions

1. NPY Binary Parsing Security (`src/models/fara_vl_vision.cpp`)

The custom .npy parser Load1DNpyIndices ([Source: 184]) is a lightweight solution to avoid dependencies, but it carries risks.

Robustness: While you verify the magic string and header existence, parsing N from shape_str ([Source: 203]) and immediately allocating memory can be risky if the file header is malformed or maliciously crafted to declare a massive size.
Endianness: The code assumes Little Endian (<i4, <i8). While standard for x86/ARM, explicitly asserting system endianness matches the file format would be safer.
Recommendation: Ensure N is within a sane limit (e.g., matching expected window sizes) before allocation to prevent OOM attacks, or wrap the file reading in strict bounds checks.
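
The bounds check recommended above can be sketched as follows. This is a Python illustration of the validation order only, not the PR's C++ parser: the function name, the `MAX_INDICES` cap, and the restriction to .npy format version 1.x are assumptions made for the example.

```python
import ast
import struct

MAX_INDICES = 1 << 20  # hypothetical cap; choose one that matches the expected index count
ITEM_SIZES = {"<i4": 4, "<i8": 8}  # little-endian int32 / int64 only


def read_1d_npy_length(data: bytes, max_len: int = MAX_INDICES) -> int:
    """Validate a version-1.x .npy buffer and return its 1-D element count."""
    if len(data) < 10 or data[:6] != b"\x93NUMPY":
        raise ValueError("missing .npy magic string")
    if data[6] != 1:
        raise ValueError("only .npy format version 1.x is handled in this sketch")
    (header_len,) = struct.unpack_from("<H", data, 8)
    header_end = 10 + header_len
    if header_end > len(data):
        raise ValueError("truncated header")
    try:
        header = ast.literal_eval(data[10:header_end].decode("latin1").strip())
    except (SyntaxError, ValueError) as exc:
        raise ValueError("malformed header") from exc
    if not isinstance(header, dict):
        raise ValueError("malformed header")
    descr = header.get("descr")
    if not isinstance(descr, str) or descr not in ITEM_SIZES:
        raise ValueError("unsupported dtype or byte order")
    shape = header.get("shape")
    if not (isinstance(shape, tuple) and len(shape) == 1 and isinstance(shape[0], int)):
        raise ValueError("expected a 1-D shape")
    n = shape[0]
    # Reject absurd sizes before anything is allocated.
    if not 0 <= n <= max_len:
        raise ValueError("declared length is outside the allowed range")
    # The payload must actually contain the declared number of elements.
    if len(data) - header_end < n * ITEM_SIZES[descr]:
        raise ValueError("payload shorter than the declared shape")
    return n
```

The point of the ordering is that the declared length is checked against both a fixed cap and the real payload size before any buffer is sized from it.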

2. Vision Pipeline Memory Allocation (`src/models/fara_vl_vision.cpp`)

In FaraVisionPipeline::Run, vectors are allocated and deallocated on every inference call:

```
// [Source: 233]
std::vector<float> pe_out_buf(num_patches * hidden_dim);
// [Source: 238]
std::vector<float> reordered(seq_len * hidden_dim);
// [Source: 243]
std::vector<float> attn_out_buf(seq_len * hidden_dim);
```

Performance Impact: For a high-throughput scenario, this constant reallocation (malloc/free) will add latency.
Suggestion: Consider making these vectors member variables of FaraVisionPipeline and using reserve()/resize() to reuse capacity across runs, similar to how the KV cache is managed.

3. Vision Embedding Injection Logic (`src/models/fara_vl_model.cpp`)

In InjectVisionEmbeddings ([Source: 168]):

```
if (token_ids_cpu[i] == image_token_id && image_embed_consumed_ < static_cast<size_t>(num_vision_tokens)) {
    // memcpy...
    image_embed_consumed_++;
}
```

Alignment Assumption: This logic assumes a strict 1-to-1 sequential mapping between occurrences of image_token_id in the input stream and the rows in vision_data.
Edge Case: If a user provides a prompt with more image_token_id placeholders than the vision model produced features for (e.g., user copy-pasted a prompt twice but only sent one image), the logic safely stops injecting (image_embed_consumed_ < num), which is good. However, the remaining image_token_ids will likely result in garbage/random embeddings being processed by the decoder.
Suggestion: Log a warning if total_tokens iteration finishes but image_embed_consumed_ != num_vision_tokens.
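
A rough Python sketch of the injection loop with a mismatch warning follows. Embeddings are modeled as plain lists and all names are hypothetical. Note one deliberate difference from the suggestion: when there are more placeholders than vision rows, the consumed count still equals the number of vision rows, so this sketch compares the placeholder count against the vision row count, which catches a mismatch in either direction.

```python
import warnings


def inject_vision_embeddings(token_ids, text_embeds, vision_embeds, image_token_id):
    """Replace each placeholder row with the next vision row; warn on a count mismatch."""
    out = list(text_embeds)
    consumed = 0
    for i, tok in enumerate(token_ids):
        # Same guard as the C++ snippet: stop injecting once vision rows run out.
        if tok == image_token_id and consumed < len(vision_embeds):
            out[i] = vision_embeds[consumed]
            consumed += 1
    placeholders = sum(1 for tok in token_ids if tok == image_token_id)
    if placeholders != len(vision_embeds):
        warnings.warn(
            f"prompt has {placeholders} image placeholders but the vision "
            f"pipeline produced {len(vision_embeds)} rows"
        )
    return out, consumed
```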

4. Python Inference Logic (`examples/python/fara_inference.py`)

There is a discrepancy in how generation length is calculated:

```
# [Source: 21] - eos_ids logic
# [Source: 22] - max_length setup
max_length = min(context_len, 2048)
params.set_search_options(max_length=max_length, ...)
```

Issue: The script accepts --max_new_tokens (default 4096), but the generator parameters are set using max_length. onnxruntime-genai typically treats max_length as the total sequence length (prompt + generation).
Consequence: If the prompt is 2000 tokens and max_length is capped at 2048, the model will only generate 48 tokens, ignoring the user's request for max_new_tokens.
Fix: Calculate max_length as input_length + max_new_tokens.
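
One way to express the fix, as a small sketch: `compute_max_length` is a hypothetical helper, and capping the result at the context length is an assumption about the desired behavior rather than something the script is known to do.

```python
def compute_max_length(input_length: int, max_new_tokens: int, context_len: int) -> int:
    """Total sequence budget: prompt tokens plus requested new tokens, capped at the context."""
    if input_length >= context_len:
        raise ValueError("prompt already fills the context window")
    return min(context_len, input_length + max_new_tokens)
```

The result would then be passed as max_length to params.set_search_options, so a 2000-token prompt with the default --max_new_tokens is no longer limited to 48 generated tokens.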

5. Sliding Window Alignment Fix (`src/generators.cpp`)

The fix for left-alignment padding ([Source: 93]) looks correct and resolves a known issue where padding was being appended to the end rather than prepended for left-aligned models.

```
// [Source: 94]
std::copy(input_ids.begin(), input_ids.end(), cpu_span.begin() + (padded_input_ids_size - input_ids.size()));
```

Verify: Ensure model_->config_->model.pad_token_id is initialized correctly in all paths, as std::fill_n relies on it.
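
For reference, the intended behavior can be sketched in a few lines of Python. The helper and its `pad_left` flag are hypothetical names; `pad_left=True` corresponds to the std::copy offset shown above, where the tokens land at the end of the buffer and the padding fills the front.

```python
def pad_input_ids(input_ids, padded_size, pad_token_id, pad_left):
    """Pad a token list to padded_size, placing padding before or after the tokens."""
    if len(input_ids) > padded_size:
        raise ValueError("input is longer than the padded size")
    padding = [pad_token_id] * (padded_size - len(input_ids))
    return padding + list(input_ids) if pad_left else list(input_ids) + padding
```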

tianleiwu · 2025-12-04T23:39:01Z

Updated review from AI:

Here is a review of the changes to onnxruntime-genai to support the FARA-7B (Qwen 2.5 VL based) model.

High-Level Summary

The implementation introduces a specialized pipeline for Qwen 2.5 VL architectures. Instead of treating the vision encoder as a monolithic ONNX model, it splits the vision tower into three stages (patch_embed, vision_attn, patch_merger). This design choice allows specific targeting of the computationally heavy vision_attn layer to the QNN Execution Provider (NPU), while keeping the embedding and merging layers on the CPU. This is a sophisticated and performant approach for hybrid hardware.

Critical Feedback & Potential Bugs

1. Assumption of Square Images in Pre-processing

In src/models/qwen2_5_vl_image_processor.cpp, the calculation of the grid dimensions assumes the image patches form a perfect square:

```
+    int64_t num_patches = pixel_values_shape[1];
// Single frame
+    int64_t grid_h = static_cast<int64_t>(std::sqrt(num_patches));
+    int64_t grid_w = grid_h;
```

Issue: Qwen 2.5 VL utilizes "Naive Dynamic Resolution," which preserves the aspect ratio of the input image. If the OrtxImagePreProcess pipeline (specifically "smart_resize") outputs a non-square patch count (e.g., a rectangular image), sqrt(num_patches) will truncate or produce incorrect dimensions for grid_h and grid_w.
Recommendation: You must calculate grid_h and grid_w based on the actual height/width output from the pre-processor, or ensure the pre-processor configuration strictly forces square resizing (which might degrade model performance on Qwen 2.5 VL). If OrtxImagePreProcess doesn't return the H/W explicitly, you may need to deduce it from the original image aspect ratio and the patch size.
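
To make the failure concrete, here is a small Python sketch contrasting the two approaches. It assumes the resized height and width are available and that the patch size is 14 (the value used by Qwen 2.5 VL); both function names are hypothetical.

```python
import math


def grid_from_resized(height: int, width: int, patch_size: int = 14):
    """Derive (grid_h, grid_w) from the resized image dimensions."""
    if height % patch_size or width % patch_size:
        raise ValueError("resized dimensions must be multiples of the patch size")
    return height // patch_size, width // patch_size


def grid_from_sqrt(num_patches: int):
    """The square assumption from the PR, kept here only for comparison."""
    side = math.isqrt(num_patches)
    return side, side
```

For a 448x896 image, `grid_from_resized` gives a 32x64 grid (2048 patches), while `grid_from_sqrt(2048)` gives 45x45, which covers only 2025 patches and has the wrong shape.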

2. Hardcoded Image Token ID

In src/models/qwen_vl_model.cpp:

```
+  constexpr int32_t image_token_id = 151655;
```

Issue: While 151655 is the standard for Qwen 2.5 VL, hardcoding token IDs in C++ source files is brittle. If a finetune (like "FARA") modifies the vocabulary or special tokens, this will break.
Recommendation: Fetch this ID from model.config (e.g., look up <|image_pad|> or a specific config entry) or the tokenizer's special token map if available.
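
A minimal sketch of a config-driven lookup, in Python for brevity. The location of the key (`model.image_token_id` in genai_config.json) and the fallback to 151655 are assumptions for illustration; the PR description only states that image_token_id was added to the model configuration.

```python
import json

DEFAULT_IMAGE_TOKEN_ID = 151655  # value quoted in the review above


def resolve_image_token_id(config_text: str, default: int = DEFAULT_IMAGE_TOKEN_ID) -> int:
    """Read image_token_id from the config JSON, falling back to a default."""
    config = json.loads(config_text)
    value = config.get("model", {}).get("image_token_id", default)
    if isinstance(value, bool) or not isinstance(value, int) or value < 0:
        raise ValueError("image_token_id must be a non-negative integer")
    return value
```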

3. Silent Failure in Vision Pipeline

In src/models/qwen_vl_model.cpp:

```
+  try {
+    image_features_buffer_ = vl_model_.vision_pipeline_->Run(pixel_data, pixel_shape_vec, grid_thw);
+  } catch (const std::exception&) {
+    return;  // Silent failure - pipeline already logs errors
+  }
```

Issue: If the vision pipeline fails (e.g., QNN failure, shape mismatch), the code catches the exception and returns void. The vision_ran_ flag remains false (presumably), but the generation might proceed with text-only context or crash later when expecting embeddings.
Recommendation: It is better to re-throw the exception or explicitly abort the generation. Continuing with a failed vision encoding will result in hallucinations or garbage output.
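
The re-throw pattern can be sketched as below. This is a generic Python illustration with hypothetical names, not the PR's C++ code: the original error is kept as the cause so the caller sees why the vision stage failed, and generation never proceeds without embeddings.

```python
class VisionPipelineError(RuntimeError):
    """Raised when the vision stage fails, so generation stops instead of continuing."""


def run_vision_stage(run, *args):
    """Call the vision pipeline and surface any failure to the caller."""
    try:
        return run(*args)
    except Exception as exc:
        # Chain the original exception rather than swallowing it.
        raise VisionPipelineError("vision pipeline failed; aborting generation") from exc
```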

Code Quality & Architecture Review

1. Vision Pipeline "Windowing" Logic

The implementation in src/models/qwen_vl_vision.cpp correctly ports the complex dynamic windowing logic (3D-ish attention) required for Qwen 2.5 VL.

Strengths: The use of std::memcpy for the window reordering is efficient.
Note: The logic explicitly calculates padding (pad_h, pad_w) and creates a reordered buffer. This confirms that the model export expects "windowed" input rather than handling the windowing inside the ONNX graph. This is a valid optimization strategy to simplify the ONNX export for NPUs.

2. QNN Execution Provider Integration

The explicit handling of QNN in qwen_vl_vision.cpp is well done:

It checks for the provider availability.
It configures HTP performance modes (burst).
Suggestion: The library path QnnHtp.dll is Windows-specific. If this code is intended for Linux/Android, ensure qnn_backend_path is adjusted dynamically or passed via config.
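
A sketch of platform-dependent selection with a config override, written in Python for brevity. The function name and the `override` parameter are hypothetical, and `libQnnHtp.so` as the non-Windows library name is an assumption that should be checked against the QNN SDK in use.

```python
import sys
from typing import Optional


def default_qnn_backend_path(platform: str = sys.platform, override: Optional[str] = None) -> str:
    """Pick the QNN HTP backend library, preferring an explicit config value."""
    if override:
        return override
    if platform.startswith("win"):
        return "QnnHtp.dll"
    return "libQnnHtp.so"  # assumed name on Linux/Android
```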

3. Configuration Parsing

The updates to config.cpp are comprehensive.

The addition of image_grid_thw to VisionInputs is correct for this model architecture.
The PipelineModel struct allows defining the 3-stage pipeline in genai_config.json, which keeps the C++ code generic enough to handle slight variations in file names.

4. Python Script (`model-vision.py`)

WinML Dependency: import winml and winml.register_execution_providers appear to be environment-specific. This should likely be removed or wrapped in a try/except block for the upstream PR.
Redundant Config Loading:
```
+                config_path = Path(args.model_path) / "genai_config.json"
+                with open(config_path, "r") as f:
```
You have already loaded config = og.Config(args.model_path) earlier in the script. You should be able to access the context length directly via the og.Config object rather than re-parsing the JSON file manually.
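
For the WinML point, an optional import could look roughly like this. The helper is generic and hypothetical; calling the registration function with no arguments is an assumption, since the script's actual call is not shown here.

```python
import importlib


def call_if_available(module_name: str, function_name: str) -> bool:
    """Import a module and call one of its functions if both exist; report success."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False  # environment-specific package is not installed
    function = getattr(module, function_name, None)
    if not callable(function):
        return False
    function()
    return True
```

In the example script this would be used as `call_if_available("winml", "register_execution_providers")`, so the script still runs on machines without that package.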

Minor Suggestions

Header Guards/Pragma: src/models/qwen2_5_vl_image_processor.h uses #pragma once. Ensure this matches the project's coding standard (some projects prefer #ifndef guards, though #pragma once is widely accepted now).
Naming: The mixture of Fara and Qwen2_5 in naming is slightly confusing. Since Fara is based on Qwen, sticking to Qwen2_5 for class names (as you have mostly done) is cleaner.
Float Conversion: The manual loops for Float16ToFloat32 and BFloat16ToFloat32 are fine, but ensure generators.h defines these clearly or uses onnxruntime utilities if available to benefit from SIMD if possible.
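
As a reference for what the conversion helpers need to compute, here is a Python sketch using only the standard struct module. The function names are hypothetical and operate on raw 16-bit patterns: bfloat16 is the top half of a float32, so the conversion is a 16-bit left shift, while IEEE half precision is decoded with struct's "e" format.

```python
import struct


def bfloat16_bits_to_float32(bits: int) -> float:
    """Widen a bfloat16 bit pattern by placing it in the high 16 bits of a float32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def float16_bits_to_float32(bits: int) -> float:
    """Decode an IEEE 754 half-precision bit pattern."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]
```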

Decision

Approve with modifications.
The C++ core logic for the split pipeline is technically impressive and necessary for NPU support. However, Critical Feedback #1 (Square Image Assumption) needs to be addressed before merging, as it will break inference on any non-square input images, which is a primary feature of the Qwen 2.5 VL architecture.

Co-authored-by: Yi Ren <reny@microsoft.com>

github-advanced-security AI found potential problems Dec 3, 2025

Comment thread examples/python/fara_inference.py Fixed

apsonawane force-pushed the asonawane/vlm branch from b8893d3 to 5939dd4 Compare December 3, 2025 01:16

tianleiwu reviewed Dec 3, 2025

Comment thread src/config.cpp Outdated

tianleiwu reviewed Dec 3, 2025

Comment thread src/config.cpp

tianleiwu reviewed Dec 3, 2025

Comment thread src/config.cpp Outdated

tianleiwu reviewed Dec 3, 2025

Comment thread src/models/qwen_vl_vision.cpp Outdated

apsonawane force-pushed the asonawane/vlm branch from a43780c to a64dd32 Compare December 3, 2025 22:08

github-advanced-security AI found potential problems Dec 3, 2025

Comment thread examples/python/qwen2_5_vl_inference.py Fixed

tianleiwu reviewed Dec 4, 2025

Comment thread examples/python/qwen2_5_vl_inference.py Outdated

apsonawane force-pushed the asonawane/vlm branch from 551c601 to 248148a Compare December 4, 2025 01:46

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread examples/python/qwen2_5_vl_inference.py Outdated

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/config.h Outdated

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/config.cpp Outdated

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/models/kv_cache.cpp Outdated

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/models/model.cpp Outdated

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/models/model.cpp

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/models/qwen_vl_model.cpp Outdated

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/models/qwen_vl_vision.cpp Outdated

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread test/csharp/Microsoft.ML.OnnxRuntimeGenAI.Tests.csproj Outdated

nenad1002 reviewed Dec 4, 2025

Comment thread examples/python/qwen2_5_vl_inference.py Outdated

Comment thread src/models/model.cpp

Comment thread src/models/qwen2_5_vl_image_processor.cpp Outdated

Comment thread src/models/qwen_vl_model.cpp Outdated

nenad1002 reviewed Dec 4, 2025

Comment thread src/models/qwen_vl_vision.cpp

apsonawane added 8 commits December 4, 2025 13:57

Initial model support

781c273

Cleanup

0baab8d

Update inference script

151c15d

More cleanup

014eed6

Use extensions pre-processing

de74b6b

Update model name to Fara

d989306

Update name

0250dc8

Add position ids back as input

a82de4a

apsonawane force-pushed the asonawane/vlm branch from 7d4aa6c to 9e41478 Compare December 4, 2025 21:59

apsonawane enabled auto-merge (squash) December 4, 2025 22:03

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread examples/python/model-vision.py Outdated

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread examples/python/model-vision.py Outdated

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/config.cpp

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/models/qwen_vl_model.cpp

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/models/qwen_vl_model.cpp Outdated

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/models/qwen_vl_vision.cpp

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/models/qwen_vl_vision.cpp Outdated

kunal-vaishnavi reviewed Dec 4, 2025

Comment thread src/models/qwen_vl_vision.cpp Outdated

kunal-vaishnavi reviewed Dec 5, 2025

Comment thread src/models/qwen2_5_vl_image_processor.cpp Outdated

kunal-vaishnavi reviewed Dec 5, 2025

Comment thread src/models/qwen_vl_vision.cpp

apsonawane added 3 commits December 4, 2025 17:07

Comments addressed

7b569ef

Fix linter

eb0f77e

Fix kv cache condition

4e6babc

tianleiwu reviewed Dec 5, 2025

Comment thread src/models/qwen2_5_vl_image_processor.cpp Outdated

kunal-vaishnavi reviewed Dec 5, 2025

Comment thread examples/python/model-vision.py Outdated

Fix using extensions

2c8b8b2

kunal-vaishnavi reviewed Dec 5, 2025

Comment thread src/models/kv_cache.cpp Outdated

kunal-vaishnavi reviewed Dec 5, 2025

Comment thread src/models/kv_cache.cpp Outdated

apsonawane added 2 commits December 4, 2025 20:42

Address comments

ff44339

Fix linter errors

0dfda54

kunal-vaishnavi approved these changes Dec 5, 2025

apsonawane merged commit 31afc78 into main Dec 5, 2025
15 of 16 checks passed

apsonawane deleted the asonawane/vlm branch December 5, 2025 06:01

dependabot Bot mentioned this pull request Dec 15, 2025

Bump Microsoft.ML.OnnxRuntimeGenAI from 0.11.2 to 0.11.4 yuniko-software/qwen3-onnx#10

Merged

dependabot Bot mentioned this pull request Feb 16, 2026

Bump Microsoft.ML.OnnxRuntimeGenAI from 0.11.4 to 0.12.0 yuniko-software/qwen3-onnx#23

Closed

dependabot Bot mentioned this pull request Mar 2, 2026

Bump Microsoft.ML.OnnxRuntimeGenAI from 0.11.4 to 0.12.1 yuniko-software/qwen3-onnx#27

Open
