Add new models (Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR) & refactor processors. #1001

xenova · 2024-10-30T16:19:21Z

New models

Janus (any-to-any)

This PR adds support for deepseek-ai/Janus-1.3B, a novel autoregressive framework that unifies multimodal understanding and generation.

In particular, it can do the following:

text+image to text:

// Example code based on https://github.com/deepseek-ai/Janus/blob/main/inference.py
import { AutoProcessor, MultiModalityCausalLM } from '@huggingface/transformers';

// Load processor and model
const model_id = 'onnx-community/Janus-1.3B-ONNX';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await MultiModalityCausalLM.from_pretrained(model_id, {
    dtype: {
        prepare_inputs_embeds: 'q4',
        language_model: 'q4f16',
        lm_head: 'fp16',
        gen_head: 'fp16',
        gen_img_embeds: 'fp16',
        image_decode: 'fp32',
    },
});

// Prepare inputs
const conversation = [
    {
        role: "User",
        content: "<image_placeholder>\nConvert the formula into latex code.",
        images: ["https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/quadratic_formula.png"],
    },
];
const inputs = await processor(conversation);

// Generate response
const outputs = await model.generate({
    ...inputs,
    max_new_tokens: 150,
    do_sample: false,
});

// Decode output (slice off the prompt tokens so only newly generated tokens are decoded)
const new_tokens = outputs.slice(null, [inputs.input_ids.dims.at(-1), null]);
const decoded = processor.tokenizer.batch_decode(new_tokens, { skip_special_tokens: true });
console.log(decoded);
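
The slicing step in the example, `outputs.slice(null, [inputs.input_ids.dims.at(-1), null])`, keeps every sequence in the batch but drops the leading prompt tokens, so only the generated continuation is decoded. A minimal sketch of the same idea on plain nested arrays (the `stripPrompt` helper and the token ids are hypothetical and stand in for real tensors):

```javascript
// Hypothetical helper: drop the first `promptLength` tokens from every sequence in a batch.
// Plain arrays stand in for the tensors used in the real example.
function stripPrompt(sequences, promptLength) {
    return sequences.map((tokens) => tokens.slice(promptLength));
}

// Toy batch of one sequence: 3 prompt tokens followed by 2 generated tokens.
const promptLength = 3;
const generated = [[101, 7, 8, 42, 43]];

const newTokens = stripPrompt(generated, promptLength);
console.log(newTokens); // [ [ 42, 43 ] ]
```

Without this step, the decoded string would begin with the prompt text itself.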

Example output:

Sure, here is the LaTeX code for the given formula:

```
x = \frac{-b \pm \sqrt{b^2 - 4a c}}{2a}
```

This code represents the mathematical expression for the variable \( x \).

text-to-image:

// Example code based on https://github.com/deepseek-ai/Janus/blob/main/generation_inference.py
import { AutoProcessor, MultiModalityCausalLM } from '@huggingface/transformers';

// Load processor and model
const model_id = 'onnx-community/Janus-1.3B-ONNX';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await MultiModalityCausalLM.from_pretrained(model_id, {
    dtype: {
        prepare_inputs_embeds: 'fp32',
        language_model: 'q4',
        lm_head: 'fp32',
        gen_head: 'fp32',
        gen_img_embeds: 'fp32',
        image_decode: 'fp32',
    },
});

// Prepare inputs
const conversation = [
    {
        role: "User",
        content: "A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors.",
    },
];
const inputs = await processor(conversation, {
    chat_template: "text_to_image",
});

// Generate response
const num_image_tokens = processor.num_image_tokens;
const outputs = await model.generate_images({
    ...inputs,
    min_new_tokens: num_image_tokens,
    max_new_tokens: num_image_tokens,
    do_sample: true,
});

// Save the generated image
await outputs[0].save('test.png');
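
Setting `min_new_tokens` and `max_new_tokens` to the same value forces generation to emit exactly that many tokens, which matters here because the image decoder needs a complete grid of image tokens. A small sketch of the arithmetic, assuming the values used in the referenced Python script (a 384×384 image with 16-pixel patches, giving 576 tokens); whether `processor.num_image_tokens` equals this number is an assumption, and `fixedLengthOptions` is a hypothetical helper:

```javascript
// Assumed values, taken from the reference Python script rather than from Transformers.js.
const imageSize = 384; // output image is imageSize x imageSize pixels
const patchSize = 16;  // each image token covers a patchSize x patchSize patch

const patchesPerSide = imageSize / patchSize;    // 24
const expectedImageTokens = patchesPerSide ** 2; // 576

// Hypothetical helper: pinning min and max to the same value forces an exact length.
function fixedLengthOptions(numTokens) {
    return { min_new_tokens: numTokens, max_new_tokens: numTokens };
}

console.log(expectedImageTokens, fixedLengthOptions(expectedImageTokens));
```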

Example outputs:

This PR also refactors the way processor classes load their image and text preprocessors, aligning more closely with the Python transformers library.

HuggingFaceDocBuilderDev · 2024-10-31T02:02:04Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

kungfooman · 2024-11-06T17:44:15Z

This is an amazing PR, and I thought I'd give it a test in Chrome on Linux. I randomly see these two errors:

Or simply:

Is there anything special I need to do to test this in the browser? So far I couldn't get it to work.

By default it seems to pick WebGPU, but WebGPU is known to work very poorly on Linux, so I tried every other possibility:

const model = await MultiModalityCausalLM.from_pretrained(model_id, {
    dtype: {
        prepare_inputs_embeds: 'fp32',
        language_model: 'q4',
        lm_head: 'fp32',
        gen_head: 'fp32',
        gen_img_embeds: 'fp32',
        image_decode: 'fp32',
    },
    // Pick one: webnn-npu, webnn-gpu, webnn-cpu, webnn, webgpu, wasm
    device: 'wasm',
});

None worked 🙈

Thank you anyway, looking forward to getting this to work somehow!
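
For anyone repeating the device sweep above, trying each `device` value in a loop and recording the error from each attempt makes the failures easier to report. This is only a sketch: `loadWithFallback` is a hypothetical helper, and `fakeLoader` is a stand-in for a real call such as `MultiModalityCausalLM.from_pretrained(model_id, { dtype, device })`.

```javascript
// Hypothetical helper: try each device in order and return the first that loads,
// together with the error messages collected from the devices that failed.
async function loadWithFallback(devices, loadModel) {
    const errors = {};
    for (const device of devices) {
        try {
            const model = await loadModel(device);
            return { device, model, errors };
        } catch (err) {
            errors[device] = err instanceof Error ? err.message : String(err);
        }
    }
    throw new Error(`No device worked: ${JSON.stringify(errors)}`);
}

// Stand-in loader for illustration only: pretends every device except 'wasm' fails.
const fakeLoader = async (device) => {
    if (device !== 'wasm') throw new Error(`${device} unavailable`);
    return { name: 'fake-model' };
};

const fallbackResult = loadWithFallback(['webgpu', 'webnn', 'wasm'], fakeLoader);
fallbackResult.then(({ device, errors }) => {
    console.log(`Loaded on ${device}`, errors);
});
```

This does not fix the underlying errors; it only collects one message per device so they can be included in a bug report.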

pdufour · 2024-11-13T01:18:53Z

This is great, will this PR support Qwen2-VL? 🙏

xenova · 2024-11-20T12:20:02Z

> This is great, will this PR support Qwen2-VL? 🙏

Hey @pdufour, I was originally planning on doing this in a separate PR, but I've been following your work on getting it running (great work BTW!) and so it might be possible to squeeze into this PR! 👀

xenova · 2024-11-26T16:07:15Z

Merging to put out Transformers.js v3.1. Follow-up patches may be needed, but it's good to go for now imo!

xenova added 4 commits October 28, 2024 09:57

Extract processor classes into separate folders

bef6361

Fix typo

628d59f

Define which classes use processor_config.json

03f6662

[WIP] Add support for deepseek-ai/Janus-1.3B

d040e81

xenova marked this pull request as draft October 30, 2024 16:19

xenova added 4 commits October 30, 2024 16:41

Fix unit tests

ae5a29d

Remove redundant extends JSDoc

931579e

Fix JSDoc

2ade0ba

Update Janus JSDoc

a4ecf08

xenova added 9 commits November 14, 2024 15:05

Improve VLChatProcessor processor types

29097c0

Expose ImageFeatureExtractor as copy of ImageProcessor

76f8c33

Add support for LLaVA-OneVision

c2005c9

Add support for ViTPose

f00456c

Add ViTPose to README

c681ed9

Merge branch 'main' into add-janus

c5164f3

Bump dependencies

545354f

Add support for MGP-STR models

5ac2cb9

Documentation fixes

c4b1d63

xenova mentioned this pull request Nov 20, 2024

Add prettier CI step #1044

Open

xenova added 7 commits November 20, 2024 15:31

Add support for Qwen2VLImageProcessor

a06fbc6

Format tests folder

155bb9d

Use AutoImageProcessor for image processors

76c132f

Add support for Qwen2VLProcessor

6146f0b

Fix image_grid_thw dtype

8138b23

Fix bigint product

29de4b0

[WIP] Support for qwen2vl models

54073c8

xenova added 12 commits November 22, 2024 10:34

Add support for JinaCLIP models

cb0e09b

Add listed support for Janus

dba3b2f

Fix qwen2vl processor unit test

cf0714f

Update dependency versions

41a0755

Export logits processors

7d60bfe

Expose batch_decode for processor

95688fc

Qwen2VL - Implement get_rope_index

2e945a5

Add Qwen2VLForConditionalGeneration unit tests

e28ac7a

Update dependencies

79fe412

Update onnxslim==0.1.42

83da94b

tokenizer.default_chat_template has been removed

3019e55

Merge branch 'main' into add-janus

a3992a0

xenova changed the title ~~[WIP] Add support for deepseek-ai/Janus-1.3B (any-to-any)~~ Add new models (Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, MGP-STR) & refactor processors. Nov 26, 2024

Add listed support for Qwen2-VL

3017402

xenova changed the title ~~Add new models (Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, MGP-STR) & refactor processors.~~ Add new models (Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR) & refactor processors. Nov 26, 2024

xenova marked this pull request as ready for review November 26, 2024 12:57

Fix .from_pretrained function type

cbfdc07

xenova merged commit e848907 into main Nov 26, 2024
4 checks passed

xenova deleted the add-janus branch November 27, 2024 01:30
