Integrate FARA-7B model #1902
From AI: Code Review & Suggestions

1. NPY Binary Parsing Security (
Updated review from AI:

Here is a review of the changes to

High-Level Summary

The implementation introduces a specialized pipeline for Qwen 2.5 VL architectures. Instead of treating the vision encoder as a monolithic ONNX model, it splits the vision tower into three stages (

Critical Feedback & Potential Bugs

1. Assumption of Square Images in Pre-processing

In

    + int64_t num_patches = pixel_values_shape[1];  // Single frame
    + int64_t grid_h = static_cast<int64_t>(std::sqrt(num_patches));
    + int64_t grid_w = grid_h;

Issue: Qwen 2.5 VL utilizes "Naive Dynamic Resolution," which preserves the aspect ratio of the input image. If the

2. Hardcoded Image Token ID

In

    + constexpr int32_t image_token_id = 151655;

Issue: While

3. Silent Failure in Vision Pipeline

In

    + try {
    +   image_features_buffer_ = vl_model_.vision_pipeline_->Run(pixel_data, pixel_shape_vec, grid_thw);
    + } catch (const std::exception&) {
    +   return;  // Silent failure - pipeline already logs errors
    + }

Issue: If the vision pipeline fails (e.g., QNN failure, shape mismatch), the code catches the exception and returns

Code Quality & Architecture Review

1. Vision Pipeline "Windowing" Logic

The implementation in

2. QNN Execution Provider Integration

The explicit handling of QNN in

3. Configuration Parsing

The updates to

4. Python Script (
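To make the square-image concern from the review concrete, here is a minimal sketch of deriving the patch grid from the resized image dimensions instead of taking a square root of the patch count. The helper name `ComputeGridThw`, its parameters, and the idea that the caller knows the resized height, width, and patch size are all assumptions for illustration; none of this is taken from the PR.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (not from the PR): derive the (t, h, w) patch grid from
// the resized image dimensions rather than assuming grid_w == grid_h.
std::array<int64_t, 3> ComputeGridThw(int64_t height, int64_t width,
                                      int64_t patch_size, int64_t num_patches) {
  if (patch_size <= 0 || height <= 0 || width <= 0 ||
      height % patch_size != 0 || width % patch_size != 0) {
    throw std::invalid_argument(
        "height and width must be positive multiples of patch_size");
  }
  const int64_t grid_h = height / patch_size;
  const int64_t grid_w = width / patch_size;
  assert(grid_h > 0 && grid_w > 0);  // follows from the checks above

  // Cross-check against the patch count reported by the pixel tensor so a
  // mismatch fails loudly instead of producing a wrong grid.
  if (grid_h * grid_w != num_patches) {
    throw std::invalid_argument("num_patches does not equal grid_h * grid_w");
  }
  return {1, grid_h, grid_w};  // t = 1: single frame
}
```

With an assumed patch size of 14, a 28x56 input yields a 2x4 grid of 8 patches, whereas `sqrt(8)` truncated to an integer would give a 2x2 grid and silently drop half the patches.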
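One possible alternative to the silent `return` flagged in the review is to rethrow with the failing stage named, so the caller can decide how to react. The sketch below is a generic wrapper under that assumption; `RunStageOrThrow` and the stage name are hypothetical and are not part of the PR or of any library API.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical wrapper (not from the PR): run one pipeline stage and, if it
// throws, rethrow with the stage name attached instead of returning silently.
template <typename Fn>
auto RunStageOrThrow(const std::string& stage_name, Fn&& run_stage)
    -> decltype(run_stage()) {
  assert(!stage_name.empty());
  try {
    return run_stage();
  } catch (const std::exception& e) {
    // Keep the original message so the root cause is still visible upstream.
    throw std::runtime_error(stage_name + " failed: " + e.what());
  }
}
```

A caller could wrap the vision pipeline invocation in a lambda and pass it to this helper. Whether generation should abort or fall back to text-only input when the stage fails is a design decision this sketch does not settle.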
This pull request introduces significant improvements to the configuration system and inference pipeline, especially for vision and multimodal models. It adds support for object-style pipeline definitions, enhances flexibility in model configuration, and refines input handling for sliding window models. The changes also include a new Python example for Fara VLM inference using the updated system.

Configuration and Pipeline Enhancements

* Added support for object-style pipeline definitions for both decoder and vision models, allowing pipelines to be specified as objects with named stages and parameters, increasing flexibility for complex model architectures. (`src/config.cpp` R588-R592, R611, R650-R737, R770-R786, R796-R798; `src/config.h` R162-R179)
* Introduced new configuration fields for vision pipelines, including `PipelineModel` and `WindowIndexing`, supporting advanced vision processing workflows. (`src/config.h` R162-R179)
* Added support for `image_token_id` in the model configuration to enable multimodal token handling. (`src/config.cpp` R973-R974; `src/config.h` R284-R285)
* Improved handling of alternate naming for image size inputs, mapping `image_grid_thw` to `image_sizes` for greater compatibility. (`src/config.cpp` R622-R623)

Inference and Example Updates

* Added a comprehensive Python example (`fara_inference.py`) demonstrating multimodal inference with image and text inputs, including detailed prompt formatting and output handling for tool calls. (`examples/python/fara_inference.py` R1-R142)

Input Handling Improvements

* Refined padding logic for input IDs in sliding window models, correctly handling left and right alignment during device allocation for improved inference accuracy. (`src/generators.cpp` L321-R338)
* Updated decoder-only state initialization to use the new position input creation logic, reflecting changes in the configuration system. (`src/models/decoder_only.cpp` L19-R21)

Dependency Update

* Updated the `onnxruntime_extensions` dependency to a newer commit, ensuring compatibility with recent features. (`cmake/deps.txt` L17-R17)

These changes collectively make the system more flexible for multimodal and vision models, improve configuration clarity and extensibility, and provide a robust example for users to follow.

---------

Co-authored-by: Yi Ren <reny@microsoft.com>
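The left/right alignment mentioned for sliding window input IDs can be illustrated with a small standalone sketch. It shows only the general technique of padding a chunk of token IDs up to a fixed window; it is not the logic in `src/generators.cpp`, and `PadSide`, `PadToWindow`, and the parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

enum class PadSide { Left, Right };  // hypothetical name, not from the PR

// Pad a chunk of token IDs to exactly window_size entries.
// Left padding puts the real tokens at the end of the window;
// right padding puts them at the start.
std::vector<int32_t> PadToWindow(const std::vector<int32_t>& ids,
                                 std::size_t window_size,
                                 int32_t pad_token_id, PadSide side) {
  if (ids.size() > window_size) {
    throw std::invalid_argument("chunk is longer than the window");
  }
  std::vector<int32_t> padded(window_size, pad_token_id);
  const std::size_t offset =
      (side == PadSide::Left) ? window_size - ids.size() : 0;
  assert(offset + ids.size() <= window_size);
  std::copy(ids.begin(), ids.end(),
            padded.begin() + static_cast<std::ptrdiff_t>(offset));
  return padded;
}
```

For a window of 4 and a pad ID of 0, the chunk `{1, 2}` becomes `{0, 0, 1, 2}` with left padding and `{1, 2, 0, 0}` with right padding.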