Support Kosmos-2.5 #31711
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,196 @@ | ||
<!--Copyright 2025 The HuggingFace Team. All rights reserved. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with | ||
the License. You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on | ||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the | ||
specific language governing permissions and limitations under the License. | ||
--> | ||
|
||
<div style="float: right;"> | ||
<div class="flex flex-wrap space-x-1"> | ||
<img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white"> | ||
<img alt="Flax" src="https://img.shields.io/badge/Flax-29a79b.svg?style=flat&logo="> | ||
<img alt="FlashAttention" src="https://img.shields.io/badge/%E2%9A%A1%EF%B8%8E%20FlashAttention-eae0c8?style=flat"> | ||
<img alt="SDPA" src="https://img.shields.io/badge/SDPA-DE3412?style=flat&logo=pytorch&logoColor=white"> | ||
</div> | ||
</div> | ||
|
||
|
||
# KOSMOS-2.5 | ||
|
||
The Kosmos-2.5 model was proposed in [KOSMOS-2.5: A Multimodal Literate Model](https://arxiv.org/abs/2309.11419/) by Microsoft. | ||
|
||
The abstract from the paper is the following: | ||
|
||
*We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.* | ||
|
||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_ocr.png" | ||
alt="drawing" width="600"/> | ||
|
||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_md.png" | ||
alt="drawing" width="600"/> | ||
|
||
<small> Overview of tasks that KOSMOS-2.5 can handle. Taken from the <a href="https://arxiv.org/abs/2309.11419">original paper</a>. </small> | ||
|
||
The examples below demonstrate how to generate with [`AutoModel`] for both the Markdown and OCR tasks. | ||
|
||
<hfoptions id="usage"> | ||
<hfoption id="AutoModel - Markdown Task"> | ||
|
||
```py | ||
import re | ||
import torch | ||
import requests | ||
from PIL import Image, ImageDraw | ||
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration | ||
|
||
repo = "ydshieh/kosmos-2.5" | ||
device = "cuda:0" | ||
dtype = torch.bfloat16 | ||
model = Kosmos2_5ForConditionalGeneration.from_pretrained(repo, device_map=device, torch_dtype=dtype) | ||
processor = AutoProcessor.from_pretrained(repo) | ||
|
||
# sample image | ||
url = "https://huggingface.co/ydshieh/kosmos-2.5/resolve/main/receipt_00008.png" | ||
image = Image.open(requests.get(url, stream=True).raw) | ||
|
||
prompt = "<md>" | ||
inputs = processor(text=prompt, images=image, return_tensors="pt") | ||
|
||
height, width = inputs.pop("height"), inputs.pop("width") | ||
raw_width, raw_height = image.size | ||
scale_height = raw_height / height | ||
scale_width = raw_width / width | ||
|
||
inputs = {k: v.to(device) if v is not None else None for k, v in inputs.items()} | ||
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype) | ||
generated_ids = model.generate( | ||
    **inputs, | ||
    max_new_tokens=1024, | ||
) | ||
|
||
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) | ||
print(generated_text[0]) | ||
``` | ||
|
||
</hfoption> | ||
<hfoption id="AutoModel - OCR Task"> | ||
|
||
```py | ||
import re | ||
import torch | ||
import requests | ||
from PIL import Image, ImageDraw | ||
from transformers import AutoProcessor, Kosmos2_5ForConditionalGeneration | ||
|
||
repo = "ydshieh/kosmos-2.5" | ||
device = "cuda:0" | ||
dtype = torch.bfloat16 | ||
model = Kosmos2_5ForConditionalGeneration.from_pretrained(repo, device_map=device, torch_dtype=dtype) | ||
processor = AutoProcessor.from_pretrained(repo) | ||
|
||
# sample image | ||
url = "https://huggingface.co/ydshieh/kosmos-2.5/resolve/main/receipt_00008.png" | ||
image = Image.open(requests.get(url, stream=True).raw) | ||
|
||
# bs = 1 | ||
prompt = "<ocr>" | ||
inputs = processor(text=prompt, images=image, return_tensors="pt") | ||
height, width = inputs.pop("height"), inputs.pop("width") | ||
raw_width, raw_height = image.size | ||
scale_height = raw_height / height | ||
scale_width = raw_width / width | ||
|
||
# bs > 1, batch generation | ||
# inputs = processor(text=[prompt, prompt], images=[image,image], return_tensors="pt") | ||
# height, width = inputs.pop("height"), inputs.pop("width") | ||
# raw_width, raw_height = image.size | ||
# scale_height = raw_height / height[0] | ||
# scale_width = raw_width / width[0] | ||
|
||
inputs = {k: v.to(device) if v is not None else None for k, v in inputs.items()} | ||
inputs["flattened_patches"] = inputs["flattened_patches"].to(dtype) | ||
generated_ids = model.generate( | ||
    **inputs, | ||
    max_new_tokens=1024, | ||
) | ||
|
||
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True) | ||
def post_process(y, scale_height, scale_width): | ||
    y = y.replace(prompt, "") | ||
    if "<md>" in prompt: | ||
        return y | ||
    pattern = r"<bbox><x_\d+><y_\d+><x_\d+><y_\d+></bbox>" | ||
    bboxs_raw = re.findall(pattern, y) | ||
    lines = re.split(pattern, y)[1:] | ||
    bboxs = [re.findall(r"\d+", i) for i in bboxs_raw] | ||
    bboxs = [[int(j) for j in i] for i in bboxs] | ||
    info = "" | ||
    for i in range(len(lines)): | ||
        box = bboxs[i] | ||
        x0, y0, x1, y1 = box | ||
        if not (x0 >= x1 or y0 >= y1): | ||
            x0 = int(x0 * scale_width) | ||
            y0 = int(y0 * scale_height) | ||
            x1 = int(x1 * scale_width) | ||
            y1 = int(y1 * scale_height) | ||
            info += f"{x0},{y0},{x1},{y0},{x1},{y1},{x0},{y1},{lines[i]}" | ||
    return info | ||
|
||
output_text = post_process(generated_text[0], scale_height, scale_width) | ||
print(output_text) | ||
|
||
draw = ImageDraw.Draw(image) | ||
lines = output_text.split("\n") | ||
for line in lines: | ||
    # draw the bounding box | ||
    line = list(line.split(",")) | ||
    if len(line) < 8: | ||
        continue | ||
    line = list(map(int, line[:8])) | ||
    draw.polygon(line, outline="red") | ||
image.save("output.png") | ||
``` | ||
|
||
</hfoption> | ||
</hfoptions> | ||
|
||
|
||
## Example | ||
**Markdown Task:** For usage instructions, please refer to [md.py](https://huggingface.co/ydshieh/kosmos-2.5/blob/main/md.py). | ||
|
||
**OCR Task:** For usage instructions, please refer to [ocr.py](https://huggingface.co/ydshieh/kosmos-2.5/blob/main/ocr.py). | ||
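The OCR output interleaves `<bbox>` tags with the recognized text, so it can be useful to turn it into structured `(box, text)` pairs before drawing or further processing. The following is a minimal sketch that relies only on the tag pattern used in `post_process` above; the helper name `extract_boxes` and the sample string are made up for illustration and are not part of the library.

```python
import re

# Same tag layout as the regex in post_process, with capture groups added.
BBOX_PATTERN = re.compile(r"<bbox><x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)></bbox>")


def extract_boxes(text, scale_width=1.0, scale_height=1.0):
    """Return a list of ((x0, y0, x1, y1), line_text) pairs, rescaled by the given factors."""
    results = []
    matches = list(BBOX_PATTERN.finditer(text))
    for idx, match in enumerate(matches):
        x0, y0, x1, y1 = (int(g) for g in match.groups())
        # The text of a line runs from the end of its bbox tag to the start of the next tag.
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(text)
        line_text = text[match.end():end].strip()
        if x0 >= x1 or y0 >= y1:
            continue  # skip degenerate boxes, mirroring the check in post_process
        box = (
            int(x0 * scale_width),
            int(y0 * scale_height),
            int(x1 * scale_width),
            int(y1 * scale_height),
        )
        results.append((box, line_text))
    return results


# Hypothetical decoded output (prompt already removed), not produced by a real model run.
sample = (
    "<bbox><x_10><y_20><x_110><y_40></bbox>TOTAL 12.50\n"
    "<bbox><x_5><y_5><x_5><y_9></bbox>bad\n"
)
boxes = extract_boxes(sample, scale_width=2.0, scale_height=0.5)
print(boxes)
```

Each returned box can be passed to `ImageDraw.Draw(image).rectangle(box, outline="red")`, with `scale_width` and `scale_height` computed as in the OCR example.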
|
||
|
||
|
||
## Kosmos2_5Config | ||
|
||
[[autodoc]] Kosmos2_5Config | ||
|
||
## Kosmos2_5ImageProcessor | ||
|
||
[[autodoc]] Kosmos2_5ImageProcessor | ||
- preprocess | ||
|
||
## Kosmos2_5ImageProcessorFast | ||
|
||
[[autodoc]] Kosmos2_5ImageProcessorFast | ||
- preprocess | ||
|
||
## Kosmos2_5Processor | ||
|
||
[[autodoc]] Kosmos2_5Processor | ||
|
||
## Kosmos2_5Model | ||
|
||
[[autodoc]] Kosmos2_5Model | ||
- forward | ||
|
||
## Kosmos2_5ForConditionalGeneration | ||
|
||
[[autodoc]] Kosmos2_5ForConditionalGeneration | ||
- forward |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -209,6 +209,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin): | |
("jetmoe", "JetMoeModel"), | ||
("jukebox", "JukeboxModel"), | ||
("kosmos-2", "Kosmos2Model"), | ||
("kosmos-2.5", "Kosmos2_5Model"), | ||
Review comment: Is it OK to have a "." in the model name? For other models we use "_", e.g. "qwen2_5_vl".
|
||
("kyutai_speech_to_text", "KyutaiSpeechToTextModel"), | ||
("layoutlm", "LayoutLMModel"), | ||
("layoutlmv2", "LayoutLMv2Model"), | ||
|
@@ -942,6 +943,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin): | |
("instructblip", "InstructBlipForConditionalGeneration"), | ||
("instructblipvideo", "InstructBlipVideoForConditionalGeneration"), | ||
("kosmos-2", "Kosmos2ForConditionalGeneration"), | ||
("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"), | ||
("llava", "LlavaForConditionalGeneration"), | ||
("llava_next", "LlavaNextForConditionalGeneration"), | ||
("llava_next_video", "LlavaNextVideoForConditionalGeneration"), | ||
|
@@ -990,6 +992,7 @@ class _BaseModelWithGenerate(PreTrainedModel, GenerationMixin): | |
("internvl", "InternVLForConditionalGeneration"), | ||
("janus", "JanusForConditionalGeneration"), | ||
("kosmos-2", "Kosmos2ForConditionalGeneration"), | ||
("kosmos-2.5", "Kosmos2_5ForConditionalGeneration"), | ||
("llama4", "Llama4ForConditionalGeneration"), | ||
("llava", "LlavaForConditionalGeneration"), | ||
("llava_next", "LlavaNextForConditionalGeneration"), | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# coding=utf-8 | ||
# Copyright 2024 Microsoft Research and The HuggingFace Inc. team. All rights reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
from typing import TYPE_CHECKING | ||
|
||
from ...utils import _LazyModule | ||
from ...utils.import_utils import define_import_structure | ||
|
||
|
||
if TYPE_CHECKING: | ||
    from .configuration_kosmos2_5 import * | ||
    from .image_processing_kosmos2_5 import * | ||
    from .image_processing_kosmos2_5_fast import * | ||
    from .modeling_kosmos2_5 import * | ||
    from .processing_kosmos2_5 import * | ||
else: | ||
    import sys | ||
|
||
    _file = globals()["__file__"] | ||
    sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__) |
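The `else` branch above swaps the package for a lazily loading module object, so submodules are imported only when one of their names is first accessed. The sketch below illustrates that general idea with the standard library only. It is not the `_LazyModule` implementation; the class name `LazyModule`, its `loaded` attribute, and the use of `json` and `math` as stand-in submodules are assumptions made for the example.

```python
import importlib
import types


class LazyModule(types.ModuleType):
    """Hypothetical stand-in: resolves exported names on first access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Map each exported attribute name to the module that defines it.
        self._attr_to_module = {
            attr: module_name
            for module_name, attrs in import_structure.items()
            for attr in attrs
        }
        self.loaded = []  # records which backing modules were resolved so far

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        if attr.startswith("_") or attr not in self._attr_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module_name = self._attr_to_module[attr]
        value = getattr(importlib.import_module(module_name), attr)
        self.loaded.append(module_name)
        setattr(self, attr, value)  # cache so later lookups bypass __getattr__
        return value


# Stand-in import structure: module name -> names it exports.
lazy = LazyModule("demo", {"json": ["dumps"], "math": ["sqrt"]})
print(lazy.loaded)      # nothing resolved yet
print(lazy.sqrt(9.0))   # triggers the lookup in "math"
print(lazy.loaded)
```

Caching the resolved attribute with `setattr` means the lookup cost is paid once per name, which is the usual reason for this pattern in packages with many heavy submodules.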
Review comment: The docs are missing some snippets showing, for example, how to extract bounding boxes and use the post-processor to plot the boxes on the image.