
[WIP] Add support for deepseek-ai/Janus-1.3B (any-to-any) #1001

Draft: wants to merge 8 commits into main

Conversation

@xenova (Collaborator) commented Oct 30, 2024

This PR adds support for deepseek-ai/Janus-1.3B, a novel autoregressive framework that unifies multimodal understanding and generation.

In particular, it can do the following:

  • text+image to text:

    // Example code based on https://github.com/deepseek-ai/Janus/blob/main/inference.py
    import { AutoProcessor, MultiModalityCausalLM } from '@huggingface/transformers';
    
    // Load processor and model
    const model_id = 'onnx-community/Janus-1.3B-ONNX';
    const processor = await AutoProcessor.from_pretrained(model_id);
    const model = await MultiModalityCausalLM.from_pretrained(model_id, {
        dtype: {
            prepare_inputs_embeds: 'q4',
            language_model: 'q4f16',
            lm_head: 'fp16',
            gen_head: 'fp16',
            gen_img_embeds: 'fp16',
            image_decode: 'fp32',
        },
    });
    
    // Prepare inputs
    const conversation = [
        {
            role: "User",
            content: "<image_placeholder>\nConvert the formula into latex code.",
            images: ["https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/quadratic_formula.png"],
        },
    ];
    const inputs = await processor(conversation);
    
    // Generate response
    const outputs = await model.generate({
        ...inputs,
        max_new_tokens: 150,
        do_sample: false,
    });
    
    // Decode only the newly-generated tokens (i.e., everything after the prompt)
    const new_tokens = outputs.slice(null, [inputs.input_ids.dims.at(-1), null]);
    const decoded = processor.tokenizer.batch_decode(new_tokens, { skip_special_tokens: true });
    console.log(decoded);

    Example output:

    Sure, here is the LaTeX code for the given formula:
    
    ```
    x = \frac{-b \pm \sqrt{b^2 - 4a c}}{2a}
    ```
    
    This code represents the mathematical expression for the variable \( x \).
    
  • text-to-image:

    // Example code based on https://github.com/deepseek-ai/Janus/blob/main/generation_inference.py
    import { AutoProcessor, MultiModalityCausalLM } from '@huggingface/transformers';
    
    // Load processor and model
    const model_id = 'onnx-community/Janus-1.3B-ONNX';
    const processor = await AutoProcessor.from_pretrained(model_id);
    const model = await MultiModalityCausalLM.from_pretrained(model_id, {
        dtype: {
            prepare_inputs_embeds: 'fp32',
            language_model: 'q4',
            lm_head: 'fp32',
            gen_head: 'fp32',
            gen_img_embeds: 'fp32',
            image_decode: 'fp32',
        },
    });
    
    // Prepare inputs
    const conversation = [
        {
            role: "User",
            content: "A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors.",
        },
    ];
    const inputs = await processor(conversation, {
        chat_template: "text_to_image",
    });
    
    // Generate image tokens (generation must emit exactly `num_image_tokens` tokens)
    const num_image_tokens = processor.num_image_tokens;
    const outputs = await model.generate_images({
        ...inputs,
        min_new_tokens: num_image_tokens,
        max_new_tokens: num_image_tokens,
        do_sample: true,
    });
    
    // Save the generated image
    await outputs[0].save('test.png');

    Example outputs:

    (eight generated sample images: fox_1 through fox_8)
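Note that `min_new_tokens` and `max_new_tokens` are both set to `num_image_tokens` in the example above: the image decoder consumes a fixed-size grid of image tokens, so generation must emit exactly that many. The toy sketch below illustrates the reshape this implies; the 576-token / 24×24 grid figure is an assumption for illustration only, not read from the model config:

```javascript
// Illustration of why min_new_tokens === max_new_tokens for image generation:
// the decoder expects a fixed-size grid of image tokens. The 576 / 24x24
// numbers below are assumptions for illustration.
const num_image_tokens = 576;
const grid_size = Math.sqrt(num_image_tokens); // 24

// Reshape a flat token list into rows of the grid
function to_grid(tokens, size) {
  const grid = [];
  for (let i = 0; i < tokens.length; i += size) {
    grid.push(tokens.slice(i, i + size));
  }
  return grid;
}

const tokens = Array.from({ length: num_image_tokens }, (_, i) => i);
const grid = to_grid(tokens, grid_size);
console.log(grid.length, grid[0].length); // 24 24
```

If the model produced fewer or more tokens, the flat sequence could not be reshaped into the grid the decoder expects, which is why sampling is pinned to an exact length.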

This PR also refactors how processor classes load image/text preprocessors, aligning more closely with the python transformers library.
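For context, that refactor can be pictured as a processor class that composes a tokenizer with an image processor and merges their outputs, in the spirit of python transformers' `ProcessorMixin`. The sketch below is purely illustrative; none of these class or method names are the actual transformers.js API:

```javascript
// Illustrative sketch only: a processor composing text and image preprocessors,
// in the style of python transformers' ProcessorMixin. These names are NOT
// the real transformers.js classes.
class SimpleProcessor {
  constructor(tokenizer, image_processor) {
    this.tokenizer = tokenizer;
    this.image_processor = image_processor;
  }

  // Run both sub-processors and merge their outputs into one inputs object
  process(text, images) {
    return {
      ...this.tokenizer(text),
      ...this.image_processor(images),
    };
  }
}

// Toy sub-processors standing in for the real tokenizer / image processor
const tokenizer = (text) => ({ input_ids: text.split(/\s+/).map((_, i) => i) });
const image_processor = (images) => ({ pixel_values: images.map(() => [0, 0, 0]) });

const processor = new SimpleProcessor(tokenizer, image_processor);
const inputs = processor.process("hello world", ["img.png"]);
console.log(Object.keys(inputs)); // keys: input_ids, pixel_values
```

Composing the two sub-processors behind a single `process` call is what lets model code accept one `inputs` object, as in the `await processor(conversation)` calls above.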

@xenova xenova marked this pull request as draft October 30, 2024 16:19
