mtmd: DeepSeek-OCR multi-tile dynamic resolution batched encoding #24300
sfallah wants to merge 1 commit into
Unfortunately, I don't think we can accept this change as-is: it's a large change and you didn't open any discussion before implementing it. As a consequence, it already conflicts with recent changes from #21858.
//
// v1 weaves newlines onto the grid in-graph;
// v2 just concatenates the per-tile query tokens.
static bool encode_deepseekocr(clip_ctx * ctx_clip,
nope, it's too hacky this way
instead of hacking to benefit only deepseek-ocr, we should discuss & make a proper batching API that can benefit all models
@ngxson It wasn't my aim to actually introduce batching in As always I will be grateful if you tell me how we can proceed.
The more important thing for now is to plan: coding it is easy, but it will be hard to design an API that is both (1) easy to adopt in downstream code and (2) more importantly, model-agnostic.
In theory, batching can be enabled on any multimodal encoder (including image and audio input), not just deepseek-ocr. The simplest way of batching is to allow multiple inputs of the same size to be batched using the 4th dim. Dynamic resolution (i.e. 2D rope, m-rope) can also be supported with a mask, but that's quite complicated, so it can be skipped for now. So point (2) is not very difficult to achieve.
Point (1) is a different story. Models can have text between 2 images, which makes it tricky to keep track of image positions. A new set of APIs (both core API and helpers) will need to be added to handle this case, but I don't have a good picture of it yet.
Either way, I don't think this PR is useful as-is, as it's built mostly around the quirks of deepseek-ocr (for example, no special tokens between 2 images). If we manage to enable model-agnostic batching, then deepseek-ocr will automatically have it, as it's just a sub-case of llava-uhd.
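The simplest scheme mentioned in this comment, batching inputs of identical size along a 4th dimension, could be prepared on the caller side roughly as below. This is only a sketch: `input_desc`, `batch_t` and `group_by_size` are hypothetical names for illustration, not part of mtmd or any existing API. Each batch keeps the original input indices so that embeddings can later be placed back at the right position in the prompt, which is the bookkeeping problem raised under point (1).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical input descriptor (not an mtmd type): only the size matters here.
struct input_desc {
    int width;
    int height;
};

// One batch holds the indices of all inputs that share a size.
// batch.size() would become the extent of the 4th (batch) dimension.
using batch_t = std::vector<size_t>;

// Group inputs by (width, height). Batches appear in order of first occurrence,
// and indices inside a batch keep their original relative order.
static std::vector<batch_t> group_by_size(const std::vector<input_desc> & inputs) {
    std::map<std::pair<int, int>, size_t> slot; // size -> index into `batches`
    std::vector<batch_t> batches;
    for (size_t i = 0; i < inputs.size(); ++i) {
        const auto key = std::make_pair(inputs[i].width, inputs[i].height);
        auto it = slot.find(key);
        if (it == slot.end()) {
            it = slot.emplace(key, batches.size()).first;
            batches.emplace_back();
        }
        batches[it->second].push_back(i);
    }
    return batches;
}
```

For four inputs of sizes 640x640, 1024x1024, 640x640 and 640x640, this yields one batch with indices 0, 2, 3 and a second batch with index 1.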
I'm replacing this one with #24384. Still WIP, but the public API already works with video input (e.g. multiple input images of the same size); it should be trivial to adapt it to the new 4th dim in build_vit().
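As a rough illustration of what a 4th (batch) dimension means for the tensor layout, the sketch below checks that the two shapes named in this PR's description, a flattened 2D [n_embd, n_pos * B] tensor and a 4D [d_head, n_head, n_pos, B] view, address the same elements. It assumes a contiguous buffer in which the first listed dimension varies fastest and n_embd = d_head * n_head; `vit_shape`, `offset_4d`, `offset_2d` and `views_agree` are hypothetical helpers, not ggml functions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shape holder; dimensions are listed innermost-first.
struct vit_shape {
    size_t d_head;
    size_t n_head;
    size_t n_pos;
    size_t n_batch; // B
};

// Linear offset of element (i_d, i_h, i_p, i_b) in the 4D view [d_head, n_head, n_pos, B].
static size_t offset_4d(const vit_shape & s, size_t i_d, size_t i_h, size_t i_p, size_t i_b) {
    return i_d + s.d_head * (i_h + s.n_head * (i_p + s.n_pos * i_b));
}

// Linear offset of element (i_embd, i_col) in the flattened 2D view [n_embd, n_pos * B].
static size_t offset_2d(const vit_shape & s, size_t i_embd, size_t i_col) {
    return i_embd + (s.d_head * s.n_head) * i_col;
}

// Returns true if both views map every element to the same offset.
static bool views_agree(const vit_shape & s) {
    for (size_t b = 0; b < s.n_batch; ++b) {
        for (size_t p = 0; p < s.n_pos; ++p) {
            for (size_t h = 0; h < s.n_head; ++h) {
                for (size_t d = 0; d < s.d_head; ++d) {
                    const size_t i_embd = d + s.d_head * h; // position inside one embedding
                    const size_t i_col  = p + s.n_pos * b;  // flattened (position, batch) column
                    if (offset_4d(s, d, h, p, b) != offset_2d(s, i_embd, i_col)) {
                        return false;
                    }
                }
            }
        }
    }
    return true;
}
```

Under these assumptions the reshape between the two views is free, which is consistent with running everything except self-attention on the flattened tensor.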
@ngxson
Beyond that, something special about DeepSeek-OCR v1 is that newline tensors are appended/concatenated per tile-grid row. This is why a proper DSOCR v1 dynamic-resolution multi-tile implementation couldn't have been done cleanly without batch encoding. I will wait for #24384 to introduce the batching API, and then adapt the DSOCR implementation according to the new API design.
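To make the row-wise newline point concrete, here is a minimal sketch of the token order it implies. The details are assumptions for illustration only: each tile is taken to yield side x side tokens, tiles are numbered row-major in a grid_x x grid_y grid, and one newline token follows each token row of the assembled grid. `token_ref` and `weave_rows` are hypothetical names, and the actual model may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical token reference: which tile a token comes from and where it sits
// inside that tile. tile == -1 marks a newline token.
struct token_ref {
    int tile;
    int row;
    int col;
};

// Emit the output order for a grid_x x grid_y tile grid (assumed layout, see above).
static std::vector<token_ref> weave_rows(int grid_x, int grid_y, int side) {
    std::vector<token_ref> out;
    out.reserve((size_t) grid_y * side * ((size_t) grid_x * side + 1));
    for (int gy = 0; gy < grid_y; ++gy) {
        for (int r = 0; r < side; ++r) {
            // One output row takes token row r from every tile in tile row gy.
            for (int gx = 0; gx < grid_x; ++gx) {
                for (int c = 0; c < side; ++c) {
                    out.push_back({gy * grid_x + gx, r, c});
                }
            }
            out.push_back({-1, -1, -1}); // newline closes the row
        }
    }
    return out;
}
```

Under these assumptions every output row interleaves tokens from grid_x different tiles, so the sequence cannot be formed by simply appending the outputs of tiles encoded one at a time; the outputs of a whole tile row have to be available together.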
Overview
This PR adds batched image encoding for both DeepSeek-OCR (DSOCR) versions and unifies their encoding and preprocessing. It builds on the existing DeepSeek-OCR v1/v2 support.
Instead of encoding each tile iteratively, all tiles are now encoded in a single graph execution.
- build_vit is extended with a batch dimension.
- The v1 and v2 preprocessors are merged into one multi-tile dynamic-resolution preprocessor.
Main changes
- build_vit; the batch dim is flattened everywhere except self-attention, which uses 4D [d_head, n_head, n_pos, B] Q/K/V views
- encode_deepseekocr: the tile grid is encoded as one batch, then the global view is encoded and appended
- clip_n_output_tokens grid-aware for the tiled token counts
- reserve_dsocr_max_tiles to pre-reserve the worst-case tile batch at warmup
Implementation notes
- B = grid_x * grid_y is derived inside build_vit. The body runs on a flattened 2D [n_embd, n_pos * B] tensor; the batch only re-appears in the 4D self-attention views. For B == 1 (all non-DSOCR models and the single global view) the graph is unchanged.
- [tiles..., global] order.
Testing
- tools/mtmd/tests/test-deepseek-ocr.py with multi-tile (dynamic-resolution) cases for both versions
- tests.sh passes for all models (the huge variant is not tested)
- llama-mtmd-cli and llama-server
Caveats
@ngxson
- B = 1).
- sf/deepseek-ocr-mul-tile-dyn-res) that I consider to be a hack. That is why I followed this path more seriously.
- build_vit batching PR and this one, if you'd prefer.
Requirements