Integrate FARA-7B model #1902

Merged: apsonawane merged 25 commits into main from asonawane/vlm on Dec 5, 2025

@apsonawane
Contributor

This pull request introduces significant improvements to the configuration system and inference pipeline, especially for vision and multimodal models. It adds support for object-style pipeline definitions, enhances flexibility in model configuration, and refines input handling for sliding window models. The changes also include a new Python example for Fara VLM inference using the updated system.

Configuration and Pipeline Enhancements

  • Added support for object-style pipeline definitions for both decoder and vision models, allowing pipelines to be specified as objects with named stages and parameters, increasing flexibility for complex model architectures. (src/config.cpp [1] [2] [3] [4] [5]; src/config.h [6])
  • Introduced new configuration fields for vision pipelines, including PipelineModel and WindowIndexing, supporting advanced vision processing workflows. (src/config.h R162-R179)
  • Added support for image_token_id in the model configuration to enable multimodal token handling. (src/config.cpp [1]; src/config.h [2])
  • Improved handling of alternate naming for image size inputs, mapping image_grid_thw to image_sizes for greater compatibility. (src/config.cpp R622-R623)

Inference and Example Updates

  • Added a comprehensive Python example (fara_inference.py) demonstrating multimodal inference with image and text inputs, including detailed prompt formatting and output handling for tool calls. (examples/python/fara_inference.py R1-R142)

Input Handling Improvements

  • Refined padding logic for input IDs in sliding window models, correctly handling left and right alignment during device allocation for improved inference accuracy. (src/generators.cpp L321-R338)
  • Updated decoder-only state initialization to use the new position input creation logic, reflecting changes in the configuration system. (src/models/decoder_only.cpp L19-R21)

Dependency Update

  • Updated the onnxruntime_extensions dependency to a newer commit, ensuring compatibility with recent features. (cmake/deps.txt L17-R17)

These changes collectively make the system more flexible for multimodal and vision models, improve configuration clarity and extensibility, and provide a robust example for users to follow.

@tianleiwu
Contributor

From AI:


Code Review & Suggestions

1. NPY Binary Parsing Security (src/models/fara_vl_vision.cpp)

The custom .npy parser Load1DNpyIndices ([Source: 184]) is a lightweight solution to avoid dependencies, but it carries risks.

  • Robustness: While you verify the magic string and header existence, parsing N from shape_str ([Source: 203]) and immediately allocating memory can be risky if the file header is malformed or maliciously crafted to declare a massive size.
  • Endianness: The code assumes Little Endian (<i4, <i8). While standard for x86/ARM, explicitly asserting system endianness matches the file format would be safer.
  • Recommendation: Ensure N is within a sane limit (e.g., matching expected window sizes) before allocation to prevent OOM attacks, or wrap the file reading in strict bounds checks.
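
The order of checks recommended above could look roughly like the following. This is a Python sketch of the validation logic only, not the PR's C++ Load1DNpyIndices; the function name load_1d_npy_indices and the MAX_INDICES cap are hypothetical. The key points are that the declared element count is bounded and compared against the actual payload size before anything is decoded, and that the byte order is stated explicitly rather than inherited from the host.

```python
import ast
import struct

MAGIC = b"\x93NUMPY"
MAX_INDICES = 1 << 20  # hypothetical cap; choose one that matches expected window sizes


def load_1d_npy_indices(blob: bytes, max_elems: int = MAX_INDICES) -> list:
    """Parse a 1-D little-endian int32/int64 .npy payload with strict checks."""
    if len(blob) < 10 or blob[:6] != MAGIC:
        raise ValueError("not an .npy file")
    major = blob[6]
    if major == 1:
        header_start = 10
        (header_len,) = struct.unpack_from("<H", blob, 8)
    elif major in (2, 3):
        header_start = 12
        if len(blob) < header_start:
            raise ValueError("truncated header")
        (header_len,) = struct.unpack_from("<I", blob, 8)
    else:
        raise ValueError("unsupported .npy version")
    data_start = header_start + header_len
    if data_start > len(blob):
        raise ValueError("truncated header")

    # literal_eval only accepts Python literals, so a crafted header cannot run code.
    try:
        header = ast.literal_eval(blob[header_start:data_start].decode("latin1").strip())
    except (ValueError, SyntaxError) as exc:
        raise ValueError("malformed header") from exc
    if not isinstance(header, dict):
        raise ValueError("malformed header")

    descr, shape = header.get("descr"), header.get("shape")
    if descr not in ("<i4", "<i8"):
        raise ValueError("expected little-endian int32 or int64 data")
    if not (isinstance(shape, tuple) and len(shape) == 1 and isinstance(shape[0], int)):
        raise ValueError("expected a 1-D array")

    n = shape[0]
    if n < 0 or n > max_elems:
        raise ValueError("element count out of range")
    itemsize = 4 if descr == "<i4" else 8
    # Reject the file before decoding if the payload disagrees with the header.
    if len(blob) - data_start != n * itemsize:
        raise ValueError("payload size does not match declared shape")

    code = "i" if itemsize == 4 else "q"
    # The explicit "<" decodes little-endian data correctly on any host.
    return list(struct.unpack_from(f"<{n}{code}", blob, data_start))
```

A header that declares a huge shape with an empty payload is rejected by the range check, so no large buffer is ever requested.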

2. Vision Pipeline Memory Allocation (src/models/fara_vl_vision.cpp)

In FaraVisionPipeline::Run, vectors are allocated and deallocated on every inference call:

// [Source: 233]
std::vector<float> pe_out_buf(num_patches * hidden_dim);
// [Source: 238]
std::vector<float> reordered(seq_len * hidden_dim);
// [Source: 243]
std::vector<float> attn_out_buf(seq_len * hidden_dim);
  • Performance Impact: For a high-throughput scenario, this constant reallocation (malloc/free) will add latency.
  • Suggestion: Consider making these vectors member variables of FaraVisionPipeline and using reserve()/resize() to reuse capacity across runs, similar to how the KV cache is managed.

3. Vision Embedding Injection Logic (src/models/fara_vl_model.cpp)

In InjectVisionEmbeddings ([Source: 168]):

if (token_ids_cpu[i] == image_token_id && image_embed_consumed_ < static_cast<size_t>(num_vision_tokens)) {
    // memcpy...
    image_embed_consumed_++;
}
  • Alignment Assumption: This logic assumes a strict 1-to-1 sequential mapping between occurrences of image_token_id in the input stream and the rows in vision_data.
  • Edge Case: If a user provides a prompt with more image_token_id placeholders than the vision model produced features for (e.g., user copy-pasted a prompt twice but only sent one image), the logic safely stops injecting (image_embed_consumed_ < num), which is good. However, the remaining image_token_ids will likely result in garbage/random embeddings being processed by the decoder.
  • Suggestion: Log a warning if total_tokens iteration finishes but image_embed_consumed_ != num_vision_tokens.
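
A minimal sketch of the suggested mismatch check, written in Python for brevity rather than as the PR's C++ InjectVisionEmbeddings. The function name, the list-based embeddings, and the use of the warnings module are illustrative assumptions; the point is only that placeholders and vision rows are counted separately and compared once the loop ends.

```python
import warnings


def inject_vision_embeddings(token_ids, text_embeds, vision_rows, image_token_id):
    """Replace the embedding at each image placeholder with the next vision row."""
    out = list(text_embeds)
    consumed = 0
    placeholders = 0
    for i, tok in enumerate(token_ids):
        if tok != image_token_id:
            continue
        placeholders += 1
        # Same guard as the reviewed code: stop injecting once rows run out.
        if consumed < len(vision_rows):
            out[i] = vision_rows[consumed]
            consumed += 1
    # Covers both directions: too many placeholders or too many vision rows.
    if placeholders != len(vision_rows):
        warnings.warn(
            f"{placeholders} image placeholders but {len(vision_rows)} vision rows; "
            f"injected {consumed}",
            RuntimeWarning,
        )
    return out, consumed
```

Whether a mismatch should only warn or should abort generation is a design choice; a warning keeps existing prompts working while making the problem visible.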

4. Python Inference Logic (examples/python/fara_inference.py)

There is a discrepancy in how generation length is calculated:

# [Source: 21] - eos_ids logic
# [Source: 22] - max_length setup
max_length = min(context_len, 2048)
params.set_search_options(max_length=max_length, ...)
  • Issue: The script accepts --max_new_tokens (default 4096), but the generator parameters are set using max_length. onnxruntime-genai typically treats max_length as the total sequence length (prompt + generation).
  • Consequence: If the prompt is 2000 tokens and max_length is capped at 2048, the model will only generate 48 tokens, ignoring the user's request for max_new_tokens.
  • Fix: Calculate max_length as input_length + max_new_tokens.
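
The suggested fix can be sketched as a small helper. The name compute_max_length is hypothetical, and capping at the context length is an added assumption that the model cannot attend beyond context_len; the arithmetic itself follows the review comment.

```python
def compute_max_length(prompt_len: int, max_new_tokens: int, context_len: int) -> int:
    """Total sequence budget: prompt plus requested new tokens, capped by the context window."""
    if prompt_len >= context_len:
        raise ValueError("prompt does not fit in the context window")
    return min(prompt_len + max_new_tokens, context_len)
```

With a 2000-token prompt, --max_new_tokens 4096, and a sufficiently large context window, this yields 6096 instead of the fixed 2048, and the result would be what is passed as max_length to params.set_search_options.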

5. Sliding Window Alignment Fix (src/generators.cpp)

The fix for left-alignment padding ([Source: 93]) looks correct and resolves a known issue where padding was being appended to the end rather than prepended for left-aligned models.

// [Source: 94]
std::copy(input_ids.begin(), input_ids.end(), cpu_span.begin() + (padded_input_ids_size - input_ids.size()));
  • Verify: Ensure model_->config_->model.pad_token_id is initialized correctly in all paths, as std::fill_n relies on it.
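
For reference, the alignment behaviour described above can be modelled in a few lines of Python. This is an illustration of the intended result, not the code in src/generators.cpp; the function and parameter names are hypothetical.

```python
def pad_input_ids(input_ids, padded_size, pad_token_id, pad_left):
    """Return input_ids padded to padded_size, with padding before or after the tokens."""
    if len(input_ids) > padded_size:
        raise ValueError("input is longer than the padded size")
    # Fill everything with the pad token first (the role std::fill_n plays in C++).
    padded = [pad_token_id] * padded_size
    # With left padding, the real tokens occupy the tail of the buffer.
    offset = padded_size - len(input_ids) if pad_left else 0
    padded[offset:offset + len(input_ids)] = input_ids
    return padded
```

For example, pad_input_ids([5, 6, 7], 5, 0, True) places the tokens at the end, which mirrors the std::copy offset shown in the snippet above.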

@apsonawane apsonawane enabled auto-merge (squash) December 4, 2025 22:03
@tianleiwu
Contributor

Updated review from AI:

Here is a review of the changes to onnxruntime-genai to support the FARA-7B (Qwen 2.5 VL based) model.

High-Level Summary

The implementation introduces a specialized pipeline for Qwen 2.5 VL architectures. Instead of treating the vision encoder as a monolithic ONNX model, it splits the vision tower into three stages (patch_embed, vision_attn, patch_merger). This design choice allows specific targeting of the computationally heavy vision_attn layer to the QNN Execution Provider (NPU), while keeping the embedding and merging layers on the CPU. This is a sophisticated and performant approach for hybrid hardware.

Critical Feedback & Potential Bugs

1. Assumption of Square Images in Pre-processing

In src/models/qwen2_5_vl_image_processor.cpp, the calculation of the grid dimensions assumes the image patches form a perfect square:

+    int64_t num_patches = pixel_values_shape[1];  // Single frame
+    int64_t grid_h = static_cast<int64_t>(std::sqrt(num_patches));
+    int64_t grid_w = grid_h;

Issue: Qwen 2.5 VL utilizes "Naive Dynamic Resolution," which preserves the aspect ratio of the input image. If the OrtxImagePreProcess pipeline (specifically "smart_resize") outputs a non-square patch count (e.g., a rectangular image), sqrt(num_patches) will truncate or produce incorrect dimensions for grid_h and grid_w.
Recommendation: You must calculate grid_h and grid_w based on the actual height/width output from the pre-processor, or ensure the pre-processor configuration strictly forces square resizing (which might degrade model performance on Qwen 2.5 VL). If OrtxImagePreProcess doesn't return the H/W explicitly, you may need to deduce it from the original image aspect ratio and the patch size.
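
One way to derive the grid from the resized image instead of a square root, sketched in Python. It assumes the pre-processor's output height and width are available and that patch_size matches the model's vision configuration (14 is used in the example below as an assumption); the function name is hypothetical, and the temporal dimension is ignored because only a single frame is considered.

```python
import math


def grid_from_resized_size(height: int, width: int, patch_size: int, num_patches: int):
    """Compute (grid_h, grid_w) from the resized image and cross-check the patch count."""
    if height % patch_size or width % patch_size:
        raise ValueError("resized image is not a multiple of the patch size")
    grid_h, grid_w = height // patch_size, width // patch_size
    # Cross-check against the tensor actually produced by the pre-processor.
    if grid_h * grid_w != num_patches:
        raise ValueError("patch count does not match the resized image size")
    return grid_h, grid_w


def sqrt_grid_is_exact(num_patches: int) -> bool:
    """True only when the square-grid shortcut would reproduce num_patches."""
    side = math.isqrt(num_patches)
    return side * side == num_patches
```

For a 392 x 784 image with 14-pixel patches the grid is 28 x 56 (1568 patches), and sqrt_grid_is_exact(1568) is False, which is exactly the case the square-root shortcut gets wrong.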

2. Hardcoded Image Token ID

In src/models/qwen_vl_model.cpp:

+  constexpr int32_t image_token_id = 151655;

Issue: While 151655 is the standard for Qwen 2.5 VL, hardcoding token IDs in C++ source files is brittle. If a finetune (like "FARA") modifies the vocabulary or special tokens, this will break.
Recommendation: Fetch this ID from the model configuration (for example, the image_token_id entry this PR adds) or from the tokenizer's special-token map if available; in Qwen 2.5 VL, 151655 corresponds to <|image_pad|>, not <|vision_start|>.
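
A sketch of reading the ID from configuration rather than hardcoding it, in Python. The JSON location model.image_token_id is an assumption based on the PR description's mention of image_token_id in the model configuration; the helper name is hypothetical.

```python
import json


def resolve_image_token_id(config_text: str) -> int:
    """Read the image placeholder token ID from a genai_config.json-style document."""
    config = json.loads(config_text)
    # Assumed location: {"model": {"image_token_id": ...}}
    token_id = config.get("model", {}).get("image_token_id")
    if not isinstance(token_id, int):
        raise KeyError("model.image_token_id is missing from the configuration")
    return token_id
```

Failing loudly when the entry is absent avoids silently falling back to a value that a finetune may have changed.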

3. Silent Failure in Vision Pipeline

In src/models/qwen_vl_model.cpp:

+  try {
+    image_features_buffer_ = vl_model_.vision_pipeline_->Run(pixel_data, pixel_shape_vec, grid_thw);
+  } catch (const std::exception&) {
+    return;  // Silent failure - pipeline already logs errors
+  }

Issue: If the vision pipeline fails (e.g., QNN failure, shape mismatch), the code catches the exception and returns void. The vision_ran_ flag remains false (presumably), but the generation might proceed with text-only context or crash later when expecting embeddings.
Recommendation: It is better to re-throw the exception or explicitly abort the generation. Continuing with a failed vision encoding will result in hallucinations or garbage output.
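
The fail-fast alternative, sketched in Python to show the control flow only. VisionPipelineError and run_vision_or_fail are hypothetical names; in the C++ code the equivalent would be re-throwing from the catch block instead of returning.

```python
class VisionPipelineError(RuntimeError):
    """Hypothetical error type signalling that vision encoding did not complete."""


def run_vision_or_fail(run_pipeline, *args):
    """Run the vision stage and surface any failure to the caller."""
    try:
        return run_pipeline(*args)
    except Exception as exc:
        # Keep the original exception as the cause so logs show the root failure.
        raise VisionPipelineError(f"vision pipeline failed: {exc}") from exc
```

The caller then never reaches the decoder with missing image features.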

Code Quality & Architecture Review

1. Vision Pipeline "Windowing" Logic

The implementation in src/models/qwen_vl_vision.cpp correctly ports the complex dynamic windowing logic (3D-ish attention) required for Qwen 2.5 VL.

  • Strengths: The use of std::memcpy for the window reordering is efficient.
  • Note: The logic explicitly calculates padding (pad_h, pad_w) and creates a reordered buffer. This confirms that the model export expects "windowed" input rather than handling the windowing inside the ONNX graph. This is a valid optimization strategy to simplify the ONNX export for NPUs.

2. QNN Execution Provider Integration

The explicit handling of QNN in qwen_vl_vision.cpp is well done:

  • It checks for provider availability.
  • It configures HTP performance modes (burst).
  • Suggestion: The library path QnnHtp.dll is Windows-specific. If this code is intended for Linux/Android, ensure qnn_backend_path is adjusted dynamically or passed via config.
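
A sketch of choosing the backend library per platform with a config override, in Python. The option key backend_path and the Linux/Android file name libQnnHtp.so are assumptions based on how the QNN Execution Provider is commonly configured; the function names are hypothetical.

```python
import platform


def default_qnn_backend_path(system: str = "") -> str:
    """Pick the HTP backend library name for the given (or current) platform."""
    system = system or platform.system()
    return "QnnHtp.dll" if system == "Windows" else "libQnnHtp.so"


def resolve_qnn_backend_path(provider_options: dict, system: str = "") -> str:
    """Prefer an explicit path from configuration; otherwise use the platform default."""
    return provider_options.get("backend_path") or default_qnn_backend_path(system)
```

Letting the configuration win keeps unusual deployments (custom install locations, Android packaging) workable without code changes.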

3. Configuration Parsing

The updates to config.cpp are comprehensive.

  • The addition of image_grid_thw to VisionInputs is correct for this model architecture.
  • The PipelineModel struct allows defining the 3-stage pipeline in genai_config.json, which keeps the C++ code generic enough to handle slight variations in file names.

4. Python Script (model-vision.py)

  • WinML Dependency: import winml and winml.register_execution_providers appear to be environment-specific. This should likely be removed or wrapped in a try/except block for the upstream PR.
  • Redundant Config Loading:
    +                config_path = Path(args.model_path) / "genai_config.json"
    +                with open(config_path, "r") as f:
    You have already loaded config = og.Config(args.model_path) earlier in the script. You should be able to access the context length directly via the og.Config object rather than re-parsing the JSON file manually.
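
The try/except wrapping suggested for the WinML import could look like this. It is a sketch: the helper name is hypothetical, and calling register_execution_providers with no arguments is an assumption, since the review does not show its signature.

```python
import importlib


def register_optional_providers(module_name: str = "winml") -> bool:
    """Register extra execution providers if the optional module is installed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The optional dependency is absent; continue with the default providers.
        return False
    register = getattr(module, "register_execution_providers", None)
    if callable(register):
        register()  # assumed to take no required arguments
        return True
    return False
```

The script then runs unchanged on machines without the WinML package.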

Minor Suggestions

  1. Header Guards/Pragma: src/models/qwen2_5_vl_image_processor.h uses #pragma once. Ensure this matches the project's coding standard (some projects prefer #ifndef guards, though #pragma once is widely accepted now).
  2. Naming: The mixture of Fara and Qwen2_5 in naming is slightly confusing. Since Fara is based on Qwen, sticking to Qwen2_5 for class names (as you have mostly done) is cleaner.
  3. Float Conversion: The manual loops for Float16ToFloat32 and BFloat16ToFloat32 are fine, but ensure generators.h defines these clearly or uses onnxruntime utilities if available to benefit from SIMD if possible.
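
For reference on the float conversion point, the bit-level rules that such loops implement can be written compactly in Python. This is an illustration of the formats, not the project's Float16ToFloat32 or BFloat16ToFloat32; the function names are hypothetical.

```python
import struct


def bfloat16_bits_to_float(bits: int) -> float:
    """bfloat16 is the upper 16 bits of an IEEE-754 float32, so shift and reinterpret."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def float16_bits_to_float(bits: int) -> float:
    """IEEE-754 half precision; struct's 'e' format performs the widening."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]
```

The bfloat16 case is a pure shift, which is why it vectorizes easily; half precision needs exponent and mantissa re-biasing, which is where library or SIMD routines help most.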

Decision

Approve with modifications.
The C++ core logic for the split pipeline is technically impressive and necessary for NPU support. However, Critical Feedback #1 (Square Image Assumption) needs to be addressed before merging, as it will break inference on any non-square input images, which is a primary feature of the Qwen 2.5 VL architecture.

@apsonawane apsonawane merged commit 31afc78 into main Dec 5, 2025
15 of 16 checks passed
@apsonawane apsonawane deleted the asonawane/vlm branch December 5, 2025 06:01
kunal-vaishnavi pushed a commit that referenced this pull request Dec 5, 2025
Co-authored-by: Yi Ren <reny@microsoft.com>