Merged
103 commits
12750a7
feat(convert): Get language model conversion working for 4.1 vision
gabe-l-hart May 6, 2026
47dd3b1
feat(convert): Skip multimodal tensors for GraniteMoeHybrid (vision 4.0)
gabe-l-hart May 5, 2026
b3a6914
fix: Disable vocab padding for non-hybrid models that use GraniteMoeH…
gabe-l-hart May 6, 2026
f83418c
feat: Plumb python-side vision projector names and mappings
gabe-l-hart May 7, 2026
5b23f80
feat: Add python side architecture name
gabe-l-hart May 7, 2026
a176cbf
feat: Add python-side plumbing for setting FEATURE_LAYERS hparam
gabe-l-hart May 7, 2026
79412a4
feat: Add c++ side tensor naming defines
gabe-l-hart May 7, 2026
623ea2b
feat(mtmd): Convert vision_feature_layer to an ordered vector
gabe-l-hart May 7, 2026
7dda78f
feat(mtmd): Add architecture label plumbing
gabe-l-hart May 7, 2026
5e6184f
feat(wip): Add partial conversion for mmproj
gabe-l-hart May 7, 2026
97600c7
feat: Add gguf_writer and constant support for new hparams and deepst…
gabe-l-hart May 7, 2026
f6d1975
feat: Full conversion for mmproj w/ tensor mappings
gabe-l-hart May 8, 2026
97e612a
fix: Add lm_head skip for mmproj for 4.0
gabe-l-hart May 8, 2026
2a969d3
fix: De-alias text_config architecture in convert_lora_to_gguf.py
gabe-l-hart May 8, 2026
2332686
feat: Add --trust-remote-code arg to convert_lora_to_gguf.py
gabe-l-hart May 8, 2026
fc31cca
fix: De-alias model.language_model. -> model. for lora adapters
gabe-l-hart May 8, 2026
0b03ada
fix: Extend language model tensor dealiasing in adapters
gabe-l-hart May 8, 2026
fb6075b
fix: Remove unnecessary registration for GraniteSpeech in language model
gabe-l-hart May 8, 2026
8e4c0b5
feat: Plumb through mm prefix formatting for qformer tensors
gabe-l-hart May 8, 2026
8aa1268
refactor: Refactor vision projector tensors to use predictor ID as th…
gabe-l-hart May 14, 2026
14fd2cc
feat: Add spatial offests array hparam conversion
gabe-l-hart May 14, 2026
0feeb29
feat: Add stub plumbing for granite vision in mtmd
gabe-l-hart May 15, 2026
5f23c21
feat: Add new hparam and tensor naming in clip-impl.h
gabe-l-hart May 15, 2026
234973d
fix: Move deepstack_layer_arr to llm hparam instead of mmproj
gabe-l-hart May 15, 2026
cb05a27
fix: Remove IS_DEEPSTACK_LAYERS
gabe-l-hart May 15, 2026
1551ec3
refactor: n_deepstack_layers -> deepstack_layer_arr
gabe-l-hart May 15, 2026
5d0f1ee
fix: Use try/catch for single/multi valued deepstack info
gabe-l-hart May 15, 2026
c69e655
feat: Add deepstack injection point for granite LLM
gabe-l-hart May 15, 2026
acf0e98
fix: add missing vision attn layernorm eps
gabe-l-hart May 15, 2026
5ce4b81
refactor: Hoist qformer tensors into qf_block and hold a vector for m…
gabe-l-hart May 15, 2026
520d789
fix: Fix missing prefix template for TN_QF_PROJ_LINEAR
gabe-l-hart May 15, 2026
a460879
fix: Add embedding scale and image grid pinpoints hparams in conversion
gabe-l-hart May 15, 2026
173becf
feat: Add mtmd KEY_ section for hparams shared with the LLM
gabe-l-hart May 15, 2026
d3c174c
feat: Implement c++ hparam parsing
gabe-l-hart May 15, 2026
944f15f
fix: Flatten pinpoints in conversion
gabe-l-hart May 15, 2026
86fef1e
fix: Add missing break
gabe-l-hart May 15, 2026
575f401
fix: No reason to have modality prefix for img_pos
gabe-l-hart May 15, 2026
4d59c0e
feat: Add tensor loading
gabe-l-hart May 15, 2026
0df9de9
fix(convert): Fix confusion between proj.norm and proj.qformer.layernorm
gabe-l-hart May 15, 2026
493111f
fix: Use the right portion of speech for tensor loading!
gabe-l-hart May 15, 2026
5e7231a
feat: Add logging of deepstack_layers_arr if set
gabe-l-hart May 19, 2026
d072dc9
fix: Make sure input embeddings are cont before f_embedding_scale
gabe-l-hart May 19, 2026
0f65a0d
feat: Add init and mmproj_embd cases for g4v
gabe-l-hart May 19, 2026
87e363b
fix: Invert (h, w) -> (w, h) pinpoints
gabe-l-hart May 19, 2026
8c976a0
fix: Reorder projectors based on llm index and skip the first injection
gabe-l-hart May 19, 2026
b1ab316
fix: Fix mmproj hparams in conversion
gabe-l-hart May 19, 2026
02eabed
fix: Fix ordering/logic for deepstack injection in granite
gabe-l-hart May 19, 2026
af636f5
fix: Fix preprocessing config to match what the model needs
gabe-l-hart May 19, 2026
d655ee6
wip: Partial port of Eli's implementation
gabe-l-hart May 19, 2026
3e0508b
fix: Fix the pre-scaling on the input embeddings to correctly invert …
gabe-l-hart May 20, 2026
5792a27
feat: invert embedding multiplier -> base_scale at load
gabe-l-hart May 20, 2026
9a06787
Merge remote-tracking branch 'origin/master' into Granite4Vision
gabe-l-hart May 21, 2026
f2e2de6
fix: Fix setting image_resize_pad after new enum introduced
gabe-l-hart May 21, 2026
6bb918c
fix: Add G4V to mmproj mapping in conversion
gabe-l-hart May 21, 2026
12c085e
fix: Re-add padding disable for non-hybrid hybrid models
gabe-l-hart May 21, 2026
db28b58
refactor: Simplify G4V n_tokens computation
gabe-l-hart May 21, 2026
6f110e7
feat: Add new clip APIs for post-tile-encoding assembly
gabe-l-hart May 22, 2026
db6a998
feat: Add model interfaces for granite 4 vision assembler
gabe-l-hart May 22, 2026
509c0ae
refactor: Remove all g4v-specific branching from mtmd.cpp in favor of…
gabe-l-hart May 22, 2026
3f15957
refactor(mtmd): Consolidate assembler logic into clip_assembler class…
gabe-l-hart May 22, 2026
1754e31
style: Comment improvement
gabe-l-hart May 22, 2026
75452c3
Merge remote-tracking branch 'origin/master' into Granite4Vision
gabe-l-hart May 22, 2026
0a6c5cf
refactor: granite_vision -> granite4_vision
gabe-l-hart May 22, 2026
8dc2b24
fix: Remove dead codepath for Qwen3VL add_vision_is_deepstack
gabe-l-hart May 22, 2026
2c4e167
fix: Oops! I did not mean to commit one of my prompt files
gabe-l-hart May 22, 2026
72355e9
fix: Add missing <algorithm> include for std::find
gabe-l-hart May 22, 2026
33ec796
fix: Fix Flake8 warnings in granite conversion module
gabe-l-hart May 22, 2026
52633fa
refactor: Remove clip_assembler in favor of clip_image_f32.append_token
gabe-l-hart May 26, 2026
5977036
Merge remote-tracking branch 'origin/master' into Granite4Vision
gabe-l-hart May 26, 2026
23990da
refactor(convert): Split n_deepstack_layers and deepstack_layers (array)
gabe-l-hart May 27, 2026
f28a91a
refactor(src): Handle n_deepstack_layers and deepstack_layers GGUF keys
gabe-l-hart May 27, 2026
6b42c74
fix: Fix GGUF key for deepstack_layers_arr
gabe-l-hart May 27, 2026
dd50fb4
refactor: Remove pre-scaling embeddings and skip scaling for raw embd…
gabe-l-hart May 27, 2026
e73aa80
refactor: deepstack_layers(_arr) -> deepstack_mapping(_arr)
gabe-l-hart May 27, 2026
43a6f6e
Merge remote-tracking branch 'origin/master' into Granite4Vision
gabe-l-hart May 27, 2026
094ce7c
refactor: Fully revert changes to n_deepstack_layers and qwen3vl*
gabe-l-hart May 27, 2026
b556366
fix: Revert removal of "is_deepstack_layers" GGUF KV
gabe-l-hart May 27, 2026
11bd6bb
fix: Remove unnecessary ggml_cont and build_forward_expand in cbx
gabe-l-hart May 27, 2026
5c6bd55
style: Clean up comments
gabe-l-hart May 27, 2026
4910062
fix: Tighter and more flexible code for g4v_build_block
gabe-l-hart May 27, 2026
bb81156
fix: Remove unnecessary `unordered_set` include
gabe-l-hart May 27, 2026
4e6a206
fix: Add architecture guard on deepstack_mapping_arr printout
gabe-l-hart May 27, 2026
096ea2c
fix: Remove unnecessary AI-gen comment
gabe-l-hart May 27, 2026
0b46432
fix: Always initialize deepstack_mapping_arr with -1 values
gabe-l-hart May 27, 2026
7c4a791
style: Remove TODO about block/vs non-block tensor mapping
gabe-l-hart May 28, 2026
263a4a3
refactor: Move is_vision_feature_layer logic into clip_hparams
gabe-l-hart May 28, 2026
440f36b
refactor: Use a bool for append_token
gabe-l-hart May 28, 2026
9d8e3e8
style: Remove unnecessary comment
gabe-l-hart May 28, 2026
54546ff
fix: Remove unused get_model api
gabe-l-hart May 28, 2026
70c2302
refactor: Rearrange helpers for g4v to be private members and use bui…
gabe-l-hart May 28, 2026
5d6de27
Merge remote-tracking branch 'origin/master' into Granite4Vision
gabe-l-hart May 28, 2026
55605c0
Merge remote-tracking branch 'origin/master' into Granite4Vision
gabe-l-hart May 29, 2026
ecb247b
fix: Fix off-by-one in vision layer index
gabe-l-hart May 29, 2026
d8d37df
fix: Fix norm/post_norm mixup in conversion
gabe-l-hart May 29, 2026
255f934
style: More descriptive tensor names
gabe-l-hart May 29, 2026
eb8906a
Merge remote-tracking branch 'origin/master' into Granite4Vision
gabe-l-hart May 29, 2026
3323d68
Merge remote-tracking branch 'origin/master' into Granite4Vision
gabe-l-hart Jun 1, 2026
2eea626
Merge remote-tracking branch 'origin/master' into Granite4Vision
gabe-l-hart Jun 4, 2026
9eb1762
fix: Apply PR cleanup for new conversion changes
gabe-l-hart Jun 4, 2026
c5afa80
fix(convert): Remove duplicate V_ENC_EMBD_IMGNL
gabe-l-hart Jun 4, 2026
c12a262
refactor: append_token -> add_newline
gabe-l-hart Jun 4, 2026
d3d5a08
style: Comment cleanup
gabe-l-hart Jun 4, 2026
b200984
feat: Cleaner error handling/checking
gabe-l-hart Jun 4, 2026
1 change: 1 addition & 0 deletions conversion/__init__.py
@@ -253,6 +253,7 @@
     "Glm4vMoeForConditionalGeneration": "qwen3vl",
     "GlmOcrForConditionalGeneration": "qwen3vl",
     "GlmasrModel": "ultravox",
+    "Granite4VisionForConditionalGeneration": "granite",
     "GraniteSpeechForConditionalGeneration": "granite",
     "HunYuanVLForConditionalGeneration": "hunyuan",
     "Idefics3ForConditionalGeneration": "smolvlm",
158 changes: 154 additions & 4 deletions conversion/granite.py
@@ -1,5 +1,6 @@
 from __future__ import annotations

+import re
 from typing import Any, Callable, Iterable, TYPE_CHECKING

 import torch
@@ -13,7 +14,7 @@
 from .mamba import Mamba2Model


-@ModelBase.register("GraniteForCausalLM", "GraniteSpeechForConditionalGeneration")
+@ModelBase.register("GraniteForCausalLM")
 class GraniteModel(LlamaModel):
     """Conversion for IBM's GraniteForCausalLM"""
     model_arch = gguf.MODEL_ARCH.GRANITE
@@ -46,11 +47,29 @@ def set_gguf_parameters(self):
             self.gguf_writer.add_logit_scale(logits_scale)
             logger.info("gguf: (granite) logits_scale = %s", logits_scale)

+        # If being used as the base for Granite4 Vision, add the deepstack mapping array
+        if self.hparams.get("spatial_target_layers") or self.hparams.get("deepstack_layer_map"):
+            normalized_projector_map = Granite4VisionMmprojModel.get_normalized_projector_map(self.hparams)
+            deepstack_mapping_arr = [-1 for _ in range(self.block_count)]  # Populate with -1 sentinels
+            for proj_idx, (_, llm_layer, _, _) in enumerate(normalized_projector_map):
+                # Skip the first projector, which is handled as the base embedding
+                # stream like normal
+                if proj_idx == 0:
+                    continue
+                deepstack_mapping_arr[llm_layer] = proj_idx
+            self.gguf_writer.add_deepstack_mapping(deepstack_mapping_arr)
+
     @classmethod
     def filter_tensors(cls, item: tuple[str, Callable[[], Tensor]]) -> tuple[str, Callable[[], Tensor]] | None:
         name, gen = item
-        if name.startswith("encoder."):
-            return None
+        # Skip multimodal tensors
+        if (
+            name.startswith(("encoder."))
+            or "image_" in name
+            or "layerwise_projectors" in name
+            or "spatial_projectors" in name
+        ):
+            return
         return super().filter_tensors(item)


@@ -241,7 +260,8 @@ def set_gguf_parameters(self):
         assert self.d_inner % d_head == 0, f"SSM inner size {self.d_inner} not a multiple of head dim {d_head}"

     def set_vocab(self):
-        self.hparams["pad_vocab_size_multiple"] = 8
+        # For models with no ssm layers, don't pad for mamba2
+        self.hparams["pad_vocab_size_multiple"] = 8 if self._ssm_layers else 1
         Mamba2Model.set_vocab(self)


@@ -326,3 +346,133 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
             data_torch = data_torch.squeeze(1)

         yield from super().modify_tensors(data_torch, name, bid)
+
+
+@ModelBase.register("Granite4VisionForConditionalGeneration")
+class Granite4VisionMmprojModel(MmprojModel):
+    has_vision_encoder = True
+    has_audio_encoder = False
+
+    @staticmethod
+    def get_normalized_projector_map(global_config: dict) -> list[tuple[int, int, str, int]]:
+        """Normalize both deepstack and spatial projector maps to the form:
+            (vision_layer, llm_layer, <type>, type_index)
+        This is then used to populate the following mappings:
+        - vision_feature_layers (mmproj hparam): ordered list of all
+            vision_layer values where order corresponds with the order of the
+            stacked projector tensors
+            NOTE: Values may appear multiple times for spatial projectors
+        - tensor_prefix_map (mmproj tensors): mapping from tensor prefixes to
+            the index of the corresponding projector in the stacked tensors
+        - deepstack_mapping (llm hparam): per-text-layer array indicating
+            which input vision feature should be injected at that layer
+            (-1 if none)
+        Output: (vision_layer, llm_layer, <type>, type_index)
+        """
+        deepstack_map = global_config.get("deepstack_layer_map", [])  # [[vis_layer, llm_layer], ...]
+        spatial_layers = global_config.get("spatial_target_layers", [])  # [llm_layer, ...]
+        n_text_layers = global_config["text_config"]["num_hidden_layers"]
+        n_vision_layers = global_config["vision_config"]["num_hidden_layers"]
+        normalized_projector_map = []
+        if deepstack_map:
+            for deepstack_idx, (vision_layer, llm_layer) in enumerate(sorted(deepstack_map)):
+                if vision_layer < 0:
+                    vision_layer = n_vision_layers + vision_layer
+                if llm_layer < 0:
+                    llm_layer = n_text_layers + llm_layer
+                normalized_projector_map.append((vision_layer, llm_layer, "layerwise", deepstack_idx))
+        if spatial_layers:
+            spatial_vision_layer = global_config.get("spatial_vision_layer", -1)
+            if spatial_vision_layer < 0:
+                spatial_vision_layer = n_vision_layers + spatial_vision_layer
+            for spatial_idx, llm_layer in enumerate(spatial_layers):
+                normalized_projector_map.append((spatial_vision_layer, llm_layer, "spatial", spatial_idx))
+        return list(sorted(normalized_projector_map, key=(lambda entry: entry[1])))
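To make the normalization concrete, here is a minimal standalone sketch that mirrors the logic of `get_normalized_projector_map` and derives the mappings named in its docstring. The config values are hypothetical and chosen only for illustration, and `normalize_projector_map` is an illustrative name that is not part of the PR.

```python
def normalize_projector_map(config: dict) -> list[tuple[int, int, str, int]]:
    """Standalone re-statement of the normalization shown in the diff."""
    n_text = config["text_config"]["num_hidden_layers"]
    n_vision = config["vision_config"]["num_hidden_layers"]
    entries = []
    # Layerwise (deepstack) projectors: [[vision_layer, llm_layer], ...]
    for idx, (vis, llm) in enumerate(sorted(config.get("deepstack_layer_map", []))):
        vis = n_vision + vis if vis < 0 else vis  # resolve negative indices
        llm = n_text + llm if llm < 0 else llm
        entries.append((vis, llm, "layerwise", idx))
    # Spatial projectors all read from one vision layer
    spatial_vis = config.get("spatial_vision_layer", -1)
    if spatial_vis < 0:
        spatial_vis = n_vision + spatial_vis
    for idx, llm in enumerate(config.get("spatial_target_layers", [])):
        entries.append((spatial_vis, llm, "spatial", idx))
    # Order projectors by the LLM layer they feed
    return sorted(entries, key=lambda entry: entry[1])


# Hypothetical config, not taken from any released checkpoint
config = {
    "text_config": {"num_hidden_layers": 8},
    "vision_config": {"num_hidden_layers": 12},
    "deepstack_layer_map": [[-1, 0], [3, 2]],
    "spatial_target_layers": [4, 6],
    "spatial_vision_layer": -1,
}

proj_map = normalize_projector_map(config)

# mmproj hparams: one entry per projector, in stacked-tensor order
vision_feature_layers = [vis for vis, _, _, _ in proj_map]
spatial_offsets = [i if kind == "spatial" else -1 for _, _, kind, i in proj_map]

# llm hparam: per-text-layer projector index, -1 where nothing is injected.
# Projector 0 is skipped because it feeds the base embedding stream.
deepstack_mapping = [-1] * config["text_config"]["num_hidden_layers"]
for proj_idx, (_, llm_layer, _, _) in enumerate(proj_map):
    if proj_idx > 0:
        deepstack_mapping[llm_layer] = proj_idx

print(proj_map)
print(vision_feature_layers, spatial_offsets, deepstack_mapping)
```

With this toy config the four projectors end up ordered by LLM layer (0, 2, 4, 6), and the vision layer index 11 appears three times in `vision_feature_layers`, which is the repetition the docstring's NOTE describes.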

+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        normalized_projector_map = self.get_normalized_projector_map(self.global_config)
+        self._n_proj = len(normalized_projector_map)
+
+        self._tensor_prefix_map = {
+            f"model.{proj_type}_projectors.{type_idx}": proj_idx
+            for proj_idx, (_, _, proj_type, type_idx) in enumerate(normalized_projector_map)
+        }
+        self._vision_feature_layers = [vision_layer for vision_layer, _, _, _ in normalized_projector_map]
+        self._spatial_offsets = [
+            type_idx if proj_type == "spatial" else -1
+            for _, _, proj_type, type_idx in normalized_projector_map
+        ]
+
+    def set_gguf_parameters(self):
+        assert self.hparams_vision is not None
+        super().set_gguf_parameters()
+
+        self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.GRANITE4_VISION)
+
+        # SigLIP encoder hparams
+        self.gguf_writer.add_vision_attention_layernorm_eps(self.hparams.get("layer_norm_eps", 1e-6))
+        self.gguf_writer.add_vision_use_gelu(True)
+
+        # Preprocessor
+        self.gguf_writer.add_vision_preproc_image_size(self.hparams.get("image_size", 384))
+
+        # QFormer projector config
+        ds_rate = self.global_config["downsample_rate"]
+        ds_parts = ds_rate.split("/")
+        assert len(ds_parts) == 2, f"Invalid 'downsample_rate' value: {ds_rate}"
+        query_side, window_side = [int(p) for p in ds_parts]
+        self.gguf_writer.add_vision_projector_query_side(query_side)
+        self.gguf_writer.add_vision_projector_window_side(window_side)
+
+        # Set vision feature layers
+        self.gguf_writer.add_vision_feature_layers(self._vision_feature_layers)
+
+        # Set the spatial offsets per projector
+        self.gguf_writer.add_vision_spatial_offsets(self._spatial_offsets)
+
+        # Add flattened image grid pinpoints (resolution candidates internally)
+        if pinpoints := self.global_config.get("image_grid_pinpoints"):
+            # Flatten with h, w -> w, h inversion
+            pinpoints = [val for h, w in pinpoints for val in (w, h)]
+            self.gguf_writer.add_vision_image_grid_pinpoints(pinpoints)
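The two small transformations in `set_gguf_parameters` (splitting `downsample_rate` and flattening the pinpoints with an `(h, w)` to `(w, h)` swap) can be sketched in isolation as follows. The helper names and the sample values are assumptions for illustration only; the PR performs these steps inline.

```python
def parse_downsample_rate(ds_rate: str) -> tuple[int, int]:
    """Split a "<query_side>/<window_side>" string into two ints."""
    parts = ds_rate.split("/")
    if len(parts) != 2:
        raise ValueError(f"Invalid 'downsample_rate' value: {ds_rate}")
    query_side, window_side = (int(p) for p in parts)
    return query_side, window_side


def flatten_pinpoints(pinpoints: list[list[int]]) -> list[int]:
    """Flatten [[h, w], ...] into [w0, h0, w1, h1, ...]."""
    return [val for h, w in pinpoints for val in (w, h)]


# Hypothetical values, not read from a real model config
sides = parse_downsample_rate("4/8")
flat = flatten_pinpoints([[384, 768], [768, 384]])
print(sides, flat)
```

Storing the pinpoints as a flat integer array keeps the GGUF value a simple one-dimensional list, at the cost of the reader having to know the pair ordering.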

+    @classmethod
+    def filter_tensors(cls, item: tuple[str, Callable[[], Tensor]]) -> tuple[str, Callable[[], Tensor]] | None:
+        name, _ = item
+        if ("vision_model.head" in name or name.startswith("lm_head")):
+            return None
+        return super().filter_tensors(item)
+
+    def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
+
+        # Detect projector tensors and bin them
+        projector_idx = None
+        for prefix, proj_idx in self._tensor_prefix_map.items():
+            if name.startswith(prefix):
+                projector_idx = proj_idx
+                break
+        if projector_idx is not None:
+            # If this projector tensor has a block id within the projector,
+            # alias the bid to projector_idx
+            #
+            # TODO: currently, none of the Granite 4 Vision models have
+            #   projectors with multiple QFormer layers, so the `layer.{}` index
+            #   is always 0. This allows us to simply map to a single `bid` that
+            #   matches the projector index. If this changes, we'll need a
+            #   convention that merges the two IDs.
+            id_matches = list(re.finditer(r"\.([0-9]+)\.", name))
+            all_ids = [int(m.group(1)) for m in id_matches]
+            assert len(all_ids) >= 1 and len(all_ids) <= 2, "Must have at least 1 and at most 2 ids in tensor names"
+            # If there is no layer id, just use the projector index
+            new_bid = projector_idx
+            if len(all_ids) == 1:
+                new_name = name[:id_matches[0].span(1)[0]] + str(new_bid) + name[id_matches[0].span(1)[1]:]
+            else:  # len(all_ids) == 2
+                new_bid = projector_idx  # + all_ids[1]
+                new_name = name[:id_matches[0].span(0)[0]] + name[id_matches[0].span(1)[1]:id_matches[1].span(1)[0]] + str(new_bid) + name[id_matches[1].span(1)[1]:]
+            yield from super().modify_tensors(data_torch, new_name, new_bid)
+            return
+        yield from super().modify_tensors(data_torch, name, bid)
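The index rewriting in `modify_tensors` is easier to follow on concrete strings. The sketch below reproduces only the string manipulation, using `Match.start`/`Match.end` in place of `span(...)[0]`/`span(...)[1]`. The function name and the tensor names in the example are hypothetical; real checkpoint names may differ.

```python
import re


def remap_projector_name(name: str, projector_idx: int) -> str:
    """Rewrite the numeric ids in a projector tensor name to the stacked index."""
    id_matches = list(re.finditer(r"\.([0-9]+)\.", name))
    if not 1 <= len(id_matches) <= 2:
        raise ValueError("Must have at least 1 and at most 2 ids in tensor names")
    first = id_matches[0]
    if len(id_matches) == 1:
        # Only a projector id: replace it in place
        return name[:first.start(1)] + str(projector_idx) + name[first.end(1):]
    # Projector id plus inner layer id: drop the first id (and its leading
    # dot) and write the projector index where the layer id was
    second = id_matches[1]
    return (
        name[:first.start(0)]
        + name[first.end(1):second.start(1)]
        + str(projector_idx)
        + name[second.end(1):]
    )


# Hypothetical tensor names, for illustration only
one_id = remap_projector_name("model.spatial_projectors.1.norm.weight", 3)
two_ids = remap_projector_name(
    "model.layerwise_projectors.1.qformer.layer.0.attention.weight", 2
)
print(one_id)
print(two_ids)
```

In the two-id case the inner layer index is discarded rather than merged, which matches the TODO in the diff: it only works while every projector has a single QFormer layer.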
16 changes: 11 additions & 5 deletions convert_lora_to_gguf.py
@@ -311,6 +311,10 @@ def parse_args() -> argparse.Namespace:
         "--base-model-id", type=str,
         help="the model ID of the base model, if it is not available locally or in the adapter config. If specified, it will ignore --base and load the base model config from the Hugging Face hub (Example: 'meta-llama/Llama-3.2-1B-Instruct')",
     )
+    parser.add_argument(
+        "--trust-remote-code", default=False, action="store_true",
+        help="trust remote code in the model",
+    )
     parser.add_argument(
         "lora_path", type=Path,
         help="directory containing Hugging Face PEFT LoRA config (adapter_model.json) and weights (adapter_model.safetensors or adapter_model.bin)",
@@ -319,11 +323,11 @@ def parse_args() -> argparse.Namespace:
     return parser.parse_args()


-def load_hparams_from_hf(hf_model_id: str) -> tuple[dict[str, Any], Path | None]:
+def load_hparams_from_hf(hf_model_id: str, trust_remote_code: bool) -> tuple[dict[str, Any], Path | None]:
     from huggingface_hub import try_to_load_from_cache

     # normally, adapter does not come with base model config, we need to load it from AutoConfig
-    config = AutoConfig.from_pretrained(hf_model_id)
+    config = AutoConfig.from_pretrained(hf_model_id, trust_remote_code=trust_remote_code)
     cache_dir = try_to_load_from_cache(hf_model_id, "config.json")
     cache_dir = Path(cache_dir).parent if isinstance(cache_dir, str) else None

@@ -372,13 +376,13 @@ def load_hparams_from_hf(hf_model_id: str) -> tuple[dict[str, Any], Path | None]
     # load base model
     if base_model_id is not None:
         logger.info(f"Loading base model from Hugging Face: {base_model_id}")
-        hparams, dir_base_model = load_hparams_from_hf(base_model_id)
+        hparams, dir_base_model = load_hparams_from_hf(base_model_id, args.trust_remote_code)
     elif dir_base_model is None:
         if "base_model_name_or_path" in lparams:
             model_id = lparams["base_model_name_or_path"]
             logger.info(f"Loading base model from Hugging Face: {model_id}")
             try:
-                hparams, dir_base_model = load_hparams_from_hf(model_id)
+                hparams, dir_base_model = load_hparams_from_hf(model_id, args.trust_remote_code)
             except OSError as e:
                 logger.error(f"Failed to load base model config: {e}")
                 logger.error("Please try downloading the base model and add its path to --base")
@@ -393,7 +397,9 @@ def load_hparams_from_hf(hf_model_id: str) -> tuple[dict[str, Any], Path | None]

     with torch.inference_mode():
         try:
-            model_class = get_model_class(hparams["architectures"][0])
+            model_arch = hparams.get("text_config", {}).get("architectures", hparams["architectures"])[0]
+            logger.info("Using model architecture: %s", model_arch)
+            model_class = get_model_class(model_arch)
         except NotImplementedError:
             logger.error(f"Model {hparams['architectures'][0]} is not supported")
             sys.exit(1)
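The architecture de-aliasing added in this hunk prefers the nested `text_config.architectures` entry and falls back to the top-level one. A minimal sketch of that lookup is shown below; `resolve_architecture` is an illustrative name, and the config dictionaries (including the architecture strings inside `text_config`) are hypothetical.

```python
def resolve_architecture(hparams: dict) -> str:
    """Pick the language-model architecture name from a config dict.

    Note: the top-level "architectures" key is evaluated eagerly as the
    fallback, so it must be present even when text_config supplies a value.
    """
    return hparams.get("text_config", {}).get("architectures", hparams["architectures"])[0]


# Hypothetical multimodal config wrapping a text model
multimodal = {
    "architectures": ["Granite4VisionForConditionalGeneration"],
    "text_config": {"architectures": ["GraniteForCausalLM"]},
}
# Hypothetical text-only config
text_only = {"architectures": ["GraniteForCausalLM"]}

print(resolve_architecture(multimodal))
print(resolve_architecture(text_only))
```

For a LoRA adapter trained on the language model of a multimodal checkpoint, this resolves to the text architecture, so the adapter is converted with the language model's class rather than the wrapper's.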
82 changes: 80 additions & 2 deletions gguf-py/gguf/constants.py
@@ -128,6 +128,7 @@ class LLM:
         MOE_LATENT_SIZE = "{arch}.moe_latent_size"
         NEXTN_PREDICT_LAYERS = "{arch}.nextn_predict_layers"
         NUM_DEEPSTACK_LAYERS = "{arch}.n_deepstack_layers"
+        DEEPSTACK_MAPPING = "{arch}.deepstack_mapping"
         POOLING_TYPE = "{arch}.pooling_type"
         LOGIT_SCALE = "{arch}.logit_scale"
         DECODER_START_TOKEN_ID = "{arch}.decoder_start_token_id"
@@ -325,6 +326,8 @@ class ClipVision:
         WA_PATTERN_MODE = "clip.vision.wa_pattern_mode"  # used by mimovl, per-layer -1/0/1
         IS_DEEPSTACK_LAYERS = "clip.vision.is_deepstack_layers"
         WINDOW_SIZE = "clip.vision.window_size"
+        FEATURE_LAYERS = "clip.vision.feature_layer"  # Granite4 Vision
+        IMAGE_GRID_PINPOINTS = "clip.vision.image_grid_pinpoints"  # Granite4 Vision

         class Attention:
             HEAD_COUNT = "clip.vision.attention.head_count"
@@ -333,6 +336,9 @@ class Attention:

         class Projector:
             SCALE_FACTOR = "clip.vision.projector.scale_factor"
+            QUERY_SIDE = "clip.vision.projector.query_side"
+            WINDOW_SIDE = "clip.vision.projector.window_side"
+            SPATIAL_OFFSETS = "clip.vision.projector.spatial_offsets"

         class SAM:
             BLOCK_COUNT = "clip.vision.sam.block_count"
@@ -821,6 +827,31 @@ class MODEL_TENSOR(IntEnum):
     V_RESMPL_QUERY_768 = auto()  # Deepseek-OCR-2
     V_RESMPL_QUERY_1024 = auto()  # Deepseek-OCR-2

+    # qformer projector (vision) - Granite4 Vision
+    V_QF_PROJ_QUERY = auto()
+    V_QF_PROJ_NORM = auto()
+    V_QF_PROJ_LINEAR = auto()
+    V_QF_SELF_ATTN_Q = auto()
+    V_QF_SELF_ATTN_K = auto()
+    V_QF_SELF_ATTN_V = auto()
+    V_QF_SELF_ATTN_O = auto()
+    V_QF_SELF_ATTN_NORM = auto()
+    V_QF_CROSS_ATTN_Q = auto()
+    V_QF_CROSS_ATTN_K = auto()
+    V_QF_CROSS_ATTN_V = auto()
+    V_QF_CROSS_ATTN_O = auto()
+    V_QF_CROSS_ATTN_NORM = auto()
+    V_QF_FFN_UP = auto()
+    V_QF_FFN_DOWN = auto()
+    V_QF_FFN_NORM = auto()
+    V_PROJ_NORM = auto()
+    # multi-projector (bid => projector id) - Granite4 Vision
+    V_MULTI_PROJ_IMG_POS = auto()
+    V_MULTI_PROJ_QUERY = auto()
+    V_MULTI_PROJ_NORM = auto()
+    V_MULTI_PROJ_LINEAR = auto()
+    V_MULTI_PROJ_POST_NORM = auto()
+
     # audio (mtmd)
     A_ENC_EMBD_POS = auto()
     A_ENC_EMBD_NORM = auto()
@@ -885,7 +916,7 @@ class MODEL_TENSOR(IntEnum):
     A_CTC_OUT = auto()
     A_CTC_OUT_MID = auto()
     A_ENC_ATTN_REL_POS_EMB = auto()
-    # qformer projector
+    # audio qformer projector
     A_QF_PROJ_QUERY = auto()
     A_QF_PROJ_NORM = auto()
     A_QF_PROJ_LINEAR = auto()
@@ -1337,10 +1368,33 @@ class MODEL_TENSOR(IntEnum):
     MODEL_TENSOR.V_SAM_NECK: "v.sam.neck.{bid}",
     MODEL_TENSOR.V_SAM_NET_2: "v.sam.net_2",
     MODEL_TENSOR.V_SAM_NET_3: "v.sam.net_3",
-    MODEL_TENSOR.V_ENC_EMBD_IMGNL: "v.image_newline",  # Deepseek-OCR
+    MODEL_TENSOR.V_ENC_EMBD_IMGNL: "v.image_newline",  # Deepseek-OCR, Granite4Vision
     MODEL_TENSOR.V_ENC_EMBD_VSEP: "v.view_seperator",  # Deepseek-OCR
     MODEL_TENSOR.V_RESMPL_QUERY_768: "v.resample_query_768",  # Deepseek-OCR-2 qwen2
     MODEL_TENSOR.V_RESMPL_QUERY_1024: "v.resample_query_1024",  # Deepseek-OCR-2 qwen2
+    # Granite4 Vision
+    # qformer layers (bid => proj_id)
+    # NOTE: Names align with A_QF_*
+    MODEL_TENSOR.V_QF_SELF_ATTN_Q: "v.proj_blk.{bid}.self_attn_q",
+    MODEL_TENSOR.V_QF_SELF_ATTN_K: "v.proj_blk.{bid}.self_attn_k",
+    MODEL_TENSOR.V_QF_SELF_ATTN_V: "v.proj_blk.{bid}.self_attn_v",
+    MODEL_TENSOR.V_QF_SELF_ATTN_O: "v.proj_blk.{bid}.self_attn_out",
+    MODEL_TENSOR.V_QF_SELF_ATTN_NORM: "v.proj_blk.{bid}.self_attn_norm",
+    MODEL_TENSOR.V_QF_CROSS_ATTN_Q: "v.proj_blk.{bid}.cross_attn_q",
+    MODEL_TENSOR.V_QF_CROSS_ATTN_K: "v.proj_blk.{bid}.cross_attn_k",
+    MODEL_TENSOR.V_QF_CROSS_ATTN_V: "v.proj_blk.{bid}.cross_attn_v",
+    MODEL_TENSOR.V_QF_CROSS_ATTN_O: "v.proj_blk.{bid}.cross_attn_out",
+    MODEL_TENSOR.V_QF_CROSS_ATTN_NORM: "v.proj_blk.{bid}.cross_attn_norm",
+    MODEL_TENSOR.V_QF_FFN_UP: "v.proj_blk.{bid}.ffn_up",
+    MODEL_TENSOR.V_QF_FFN_DOWN: "v.proj_blk.{bid}.ffn_down",
+    MODEL_TENSOR.V_QF_FFN_NORM: "v.proj_blk.{bid}.ffn_norm",
+    # multi-projector (bid => projector ID)
+    MODEL_TENSOR.V_MULTI_PROJ_IMG_POS: "v.proj_blk.{bid}.img_pos",
+    MODEL_TENSOR.V_MULTI_PROJ_QUERY: "v.proj_blk.{bid}.query",
+    MODEL_TENSOR.V_MULTI_PROJ_NORM: "v.proj_blk.{bid}.norm",
+    MODEL_TENSOR.V_MULTI_PROJ_LINEAR: "v.proj_blk.{bid}.linear",
+    MODEL_TENSOR.V_MULTI_PROJ_POST_NORM: "v.proj_blk.{bid}.post_norm",
+
     # audio (mtmd)
     # note: all audio tensor names must use prefix "a." or "mm.a."
     MODEL_TENSOR.A_ENC_EMBD_POS: "a.position_embd",
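Because `{bid}` in the new `v.proj_blk.{bid}.*` templates stands for the projector ID rather than a transformer block, every tensor belonging to one projector shares a prefix. The sketch below expands a few of the templates copied from this hunk with `str.format`. The `TEMPLATES` dictionary, the `tensor_name` helper, and the `.weight` suffix are assumptions for illustration and are not the gguf-py API.

```python
# A handful of templates copied from the hunk above, keyed by enum member name
TEMPLATES = {
    "V_QF_SELF_ATTN_Q": "v.proj_blk.{bid}.self_attn_q",
    "V_QF_CROSS_ATTN_K": "v.proj_blk.{bid}.cross_attn_k",
    "V_MULTI_PROJ_QUERY": "v.proj_blk.{bid}.query",
    "V_MULTI_PROJ_POST_NORM": "v.proj_blk.{bid}.post_norm",
}


def tensor_name(key: str, projector_id: int, suffix: str = ".weight") -> str:
    """Expand a template for one projector (the suffix is an assumption)."""
    return TEMPLATES[key].format(bid=projector_id) + suffix


# Names for a hypothetical model with two projectors
names = [tensor_name(key, pid) for pid in range(2) for key in TEMPLATES]
for n in names:
    print(n)
```

Sharing one `bid` namespace between the QFormer tensors and the multi-projector tensors means a loader can gather everything for projector `i` by the single prefix `v.proj_blk.i.`.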
@@ -1522,6 +1576,29 @@ class MODEL_TENSOR(IntEnum):
         MODEL_TENSOR.V_SAM_NET_3,
         MODEL_TENSOR.V_RESMPL_QUERY_768,
         MODEL_TENSOR.V_RESMPL_QUERY_1024,
+        MODEL_TENSOR.V_PROJ_NORM,
+        MODEL_TENSOR.V_QF_PROJ_QUERY,
+        MODEL_TENSOR.V_QF_PROJ_NORM,
+        MODEL_TENSOR.V_QF_PROJ_LINEAR,
+        MODEL_TENSOR.V_QF_SELF_ATTN_Q,
+        MODEL_TENSOR.V_QF_SELF_ATTN_K,
+        MODEL_TENSOR.V_QF_SELF_ATTN_V,
+        MODEL_TENSOR.V_QF_SELF_ATTN_O,
+        MODEL_TENSOR.V_QF_SELF_ATTN_NORM,
+        MODEL_TENSOR.V_QF_CROSS_ATTN_Q,
+        MODEL_TENSOR.V_QF_CROSS_ATTN_K,
+        MODEL_TENSOR.V_QF_CROSS_ATTN_V,
+        MODEL_TENSOR.V_QF_CROSS_ATTN_O,
+        MODEL_TENSOR.V_QF_CROSS_ATTN_NORM,
+        MODEL_TENSOR.V_QF_FFN_UP,
+        MODEL_TENSOR.V_QF_FFN_DOWN,
+        MODEL_TENSOR.V_QF_FFN_NORM,
+        MODEL_TENSOR.V_QF_PROJ_NORM,
+        MODEL_TENSOR.V_MULTI_PROJ_IMG_POS,
+        MODEL_TENSOR.V_MULTI_PROJ_QUERY,
+        MODEL_TENSOR.V_MULTI_PROJ_LINEAR,
+        MODEL_TENSOR.V_MULTI_PROJ_NORM,
+        MODEL_TENSOR.V_MULTI_PROJ_POST_NORM,
         # audio
         MODEL_TENSOR.A_ENC_EMBD_POS,
         MODEL_TENSOR.A_ENC_EMBD_NORM,
@@ -4388,6 +4465,7 @@ class VisionProjectorType:
     MINICPMV4_6 = "minicpmv4_6"
     GRANITE_SPEECH = "granite_speech"  # audio
     MIMOVL = "mimovl"
+    GRANITE4_VISION = "granite4_vision"


 # Items here are (block size, type size)