Add support for SmolVLM2 (#1196)
* Add support for SmolVLM

* Always flush text streamer after prompt

* [WIP] video.js

* Fix streamer unit tests

* Export video.js

* Video processing improvements
xenova authored Feb 26, 2025
1 parent cfd3e55 commit 591a112
Showing 12 changed files with 153 additions and 5 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -402,6 +402,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
1. **[SmolVLM](https://huggingface.co/docs/transformers/main/model_doc/smolvlm)** (from Hugging Face) released with the blog posts [SmolVLM - small yet mighty Vision Language Model](https://huggingface.co/blog/smolvlm) and [SmolVLM Grows Smaller – Introducing the 250M & 500M Models!](https://huggingface.co/blog/smolervlm) by the Hugging Face TB Research team.
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
1 change: 1 addition & 0 deletions docs/snippets/6_supported-models.snippet
@@ -117,6 +117,7 @@
1. **[SegFormer](https://huggingface.co/docs/transformers/model_doc/segformer)** (from NVIDIA) released with the paper [SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers](https://arxiv.org/abs/2105.15203) by Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo.
1. **[Segment Anything](https://huggingface.co/docs/transformers/model_doc/sam)** (from Meta AI) released with the paper [Segment Anything](https://arxiv.org/pdf/2304.02643v1.pdf) by Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alex Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick.
1. **[SigLIP](https://huggingface.co/docs/transformers/main/model_doc/siglip)** (from Google AI) released with the paper [Sigmoid Loss for Language Image Pre-Training](https://arxiv.org/abs/2303.15343) by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer.
1. **[SmolVLM](https://huggingface.co/docs/transformers/main/model_doc/smolvlm)** (from Hugging Face) released with the blog posts [SmolVLM - small yet mighty Vision Language Model](https://huggingface.co/blog/smolvlm) and [SmolVLM Grows Smaller – Introducing the 250M & 500M Models!](https://huggingface.co/blog/smolervlm) by the Hugging Face TB Research team.
1. **[SpeechT5](https://huggingface.co/docs/transformers/model_doc/speecht5)** (from Microsoft Research) released with the paper [SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing](https://arxiv.org/abs/2110.07205) by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
1. **[SqueezeBERT](https://huggingface.co/docs/transformers/model_doc/squeezebert)** (from Berkeley) released with the paper [SqueezeBERT: What can computer vision teach NLP about efficient neural networks?](https://arxiv.org/abs/2006.11316) by Forrest N. Iandola, Albert E. Shaw, Ravi Krishna, and Kurt W. Keutzer.
1. **[StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm)** (from Stability AI) released with the paper [StableLM 3B 4E1T (Technical Report)](https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-3B-4E1T--VmlldzoyMjU4?accessToken=u3zujipenkx5g7rtcj9qojjgxpconyjktjkli2po09nffrffdhhchq045vp0wyfo) by Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz, Duy Phung, Maksym Zhuravinskyi, Nathan Cooper, Nikhil Pinnaparaju, Reshinth Adithyan, and James Baicoianu.
1 change: 1 addition & 0 deletions src/configs.js
@@ -70,6 +70,7 @@ function getNormalizedConfig(config) {
case 'florence2':
case 'llava_onevision':
case 'idefics3':
case 'smolvlm':
// @ts-expect-error TS2339
init_normalized_config = getNormalizedConfig(config.text_config);
break;
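As with Idefics3, SmolVLM is a composite model whose language-model settings are nested under `text_config`, which is why the new `smolvlm` case recurses into that sub-config. A minimal sketch of the shape being handled (field names and values are illustrative, not copied from a real checkpoint):

```js
// Hypothetical, simplified SmolVLM config.json. The decoder settings that
// getNormalizedConfig needs (layers, heads, hidden size, ...) live under
// `text_config`, so the function is re-applied to that nested object.
const exampleConfig = {
  model_type: "smolvlm",
  text_config: {
    model_type: "llama",      // SmolLM2-style text decoder
    num_hidden_layers: 30,
    num_attention_heads: 9,
    num_key_value_heads: 3,
    hidden_size: 576,
  },
  vision_config: {            // SigLIP-style vision encoder settings
    hidden_size: 768,
  },
};
// getNormalizedConfig(exampleConfig) therefore returns the normalized view
// of exampleConfig.text_config.
```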
7 changes: 4 additions & 3 deletions src/generation/streamers.js
@@ -72,9 +72,10 @@ export class TextStreamer extends BaseStreamer {
throw Error('TextStreamer only supports batch size of 1');
}

- if (this.skip_prompt && this.next_tokens_are_prompt) {
+ const is_prompt = this.next_tokens_are_prompt;
+ if (is_prompt) {
this.next_tokens_are_prompt = false;
- return;
+ if (this.skip_prompt) return;
}

const tokens = value[0];
@@ -85,7 +86,7 @@
const text = this.tokenizer.decode(this.token_cache, this.decode_kwargs);

let printable_text;
- if (text.endsWith('\n')) {
+ if (is_prompt || text.endsWith('\n')) {
// After the symbol for a new line, we flush the cache.
printable_text = text.slice(this.print_len);
this.token_cache = [];
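The streamer change above means the prompt is now flushed as its own chunk (and dropped entirely when `skip_prompt` is set) instead of being glued onto the first generated token, as the updated unit test below shows. A rough usage sketch, assuming the options visible in this diff (`skip_prompt`, `callback_function`) and a placeholder model id:

```js
import { pipeline, TextStreamer } from "@huggingface/transformers";

// Placeholder model id; any Transformers.js text-generation checkpoint works.
const generator = await pipeline("text-generation", "Xenova/gpt2");

const chunks = [];
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,                              // don't emit the prompt itself
  callback_function: (text) => chunks.push(text), // called once per flushed chunk
});

await generator("Once upon a time,", { max_new_tokens: 32, streamer });
console.log(chunks.join(""));
```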
11 changes: 10 additions & 1 deletion src/models.js
@@ -3692,7 +3692,7 @@ export class Idefics3PreTrainedModel extends PreTrainedModel {
}

/**
- * The LLAVA model which consists of a vision backbone and a language model.
+ * The Idefics3 model which consists of a vision backbone and a language model.
*/
export class Idefics3ForConditionalGeneration extends Idefics3PreTrainedModel {

@@ -3715,6 +3715,13 @@ export class Idefics3ForConditionalGeneration extends Idefics3PreTrainedModel {
}
//////////////////////////////////////////////////

/**
* The SmolVLM Model with a language modeling head.
* It is made up of a SigLIP vision encoder, with a language modeling head on top.
*/
export class SmolVLMForConditionalGeneration extends Idefics3ForConditionalGeneration { }

//////////////////////////////////////////////////
export class Phi3VPreTrainedModel extends PreTrainedModel {
forward_params = [
'input_ids',
@@ -7316,6 +7323,7 @@ const MODEL_FOR_QUESTION_ANSWERING_MAPPING_NAMES = new Map([
const MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES = new Map([
['vision-encoder-decoder', ['VisionEncoderDecoderModel', VisionEncoderDecoderModel]],
['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
['smolvlm', ['SmolVLMForConditionalGeneration', SmolVLMForConditionalGeneration]],
]);

const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
@@ -7325,6 +7333,7 @@ const MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = new Map([
['florence2', ['Florence2ForConditionalGeneration', Florence2ForConditionalGeneration]],
['qwen2-vl', ['Qwen2VLForConditionalGeneration', Qwen2VLForConditionalGeneration]],
['idefics3', ['Idefics3ForConditionalGeneration', Idefics3ForConditionalGeneration]],
['smolvlm', ['SmolVLMForConditionalGeneration', SmolVLMForConditionalGeneration]],
['paligemma', ['PaliGemmaForConditionalGeneration', PaliGemmaForConditionalGeneration]],
]);

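Because `SmolVLMForConditionalGeneration` simply extends `Idefics3ForConditionalGeneration` and is registered in the same vision-to-sequence and image-text-to-text mappings, running it should mirror the existing Idefics3 flow. A hedged sketch, assuming the Idefics3-style processor API (`apply_chat_template`, `batch_decode`) carries over unchanged; the checkpoint id is a placeholder for an ONNX-converted SmolVLM repo:

```js
import { AutoProcessor, AutoModelForVision2Seq, RawImage } from "@huggingface/transformers";

// Placeholder checkpoint id; an ONNX-converted SmolVLM repository is assumed.
const model_id = "HuggingFaceTB/SmolVLM-256M-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id);

// Any RGB image will do; the URL is a placeholder.
const image = await RawImage.fromURL("https://example.com/cat.jpg");

// Chat-style prompt with an image placeholder.
const messages = [{
  role: "user",
  content: [
    { type: "image" },
    { type: "text", text: "Describe this image in one sentence." },
  ],
}];
const prompt = processor.apply_chat_template(messages, { add_generation_prompt: true });

const inputs = await processor(prompt, [image]);
const generated_ids = await model.generate({ ...inputs, max_new_tokens: 100 });

// Decode only the newly generated tokens (everything after the prompt).
const answer = processor.batch_decode(
  generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(answer[0]);
```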
1 change: 1 addition & 0 deletions src/models/image_processors.js
@@ -32,6 +32,7 @@ export * from './rt_detr/image_processing_rt_detr.js'
export * from './sam/image_processing_sam.js'
export * from './segformer/image_processing_segformer.js'
export * from './siglip/image_processing_siglip.js'
export * from './smolvlm/image_processing_smolvlm.js'
export * from './swin2sr/image_processing_swin2sr.js'
export * from './vit/image_processing_vit.js'
export * from './vitmatte/image_processing_vitmatte.js'
1 change: 1 addition & 0 deletions src/models/processors.js
@@ -11,6 +11,7 @@ export * from './paligemma/processing_paligemma.js';
export * from './pyannote/processing_pyannote.js';
export * from './qwen2_vl/processing_qwen2_vl.js';
export * from './sam/processing_sam.js';
export * from './smolvlm/processing_smolvlm.js';
export * from './speecht5/processing_speecht5.js';
export * from './wav2vec2/processing_wav2vec2.js';
export * from './wav2vec2_with_lm/processing_wav2vec2_with_lm.js';
2 changes: 2 additions & 0 deletions src/models/smolvlm/image_processing_smolvlm.js
@@ -0,0 +1,2 @@

export { Idefics3ImageProcessor as SmolVLMImageProcessor } from "../idefics3/image_processing_idefics3.js";
2 changes: 2 additions & 0 deletions src/models/smolvlm/processing_smolvlm.js
@@ -0,0 +1,2 @@

export { Idefics3Processor as SmolVLMProcessor } from "../idefics3/processing_idefics3.js";
1 change: 1 addition & 0 deletions src/transformers.js
@@ -20,6 +20,7 @@ export * from './configs.js';

export * from './utils/audio.js';
export * from './utils/image.js';
export * from './utils/video.js';
export * from './utils/tensor.js';
export * from './utils/maths.js';

128 changes: 128 additions & 0 deletions src/utils/video.js
@@ -0,0 +1,128 @@
import { RawImage } from "./image.js";
import { apis } from "../env.js";

export class RawVideoFrame {

/**
* @param {RawImage} image
* @param {number} timestamp
*/
constructor(image, timestamp) {
this.image = image;
this.timestamp = timestamp;
}
}

export class RawVideo {
/**
* @param {RawVideoFrame[]|RawImage[]} frames
* @param {number} duration
*/
constructor(frames, duration) {
if (frames.length > 0 && frames[0] instanceof RawImage) {
// Assume uniform timestamps
frames = frames.map((image, i) => new RawVideoFrame(image, (i + 1) / (frames.length + 1) * duration));
}
this.frames = /** @type {RawVideoFrame[]} */ (frames);
this.duration = duration;
}

get width() {
return this.frames[0].image.width;
}
get height() {
return this.frames[0].image.height;
}

get fps() {
return this.frames.length / this.duration;
}
}


/**
* Loads a video.
*
* @param {string|Blob|HTMLVideoElement} src The video to process.
* @param {Object} [options] Optional parameters.
* @param {number} [options.num_frames=null] The number of frames to sample uniformly.
* @param {number} [options.fps=null] The number of frames to sample per second.
*
* @returns {Promise<RawVideo>} The loaded video.
*/
export async function load_video(src, { num_frames = null, fps = null } = {}) {
if (!apis.IS_BROWSER_ENV) {
throw new Error("`load_video` is currently only supported in browser environments.");
}

// TODO: Support efficiently loading all frames using the WebCodecs API.
// Specifically, https://developer.mozilla.org/en-US/docs/Web/API/VideoDecoder
if (num_frames == null && fps == null) {
throw new Error("Either num_frames or fps must be provided.");
}

const frames = [];

const video = document.createElement("video");
video.crossOrigin = "anonymous";
video.muted = true; // mute to allow autoplay and seeking

if (typeof src === 'string') {
video.src = src;
} else if (src instanceof Blob) {
video.src = URL.createObjectURL(src);
} else if (src instanceof HTMLVideoElement) {
video.src = src.src;
} else {
throw new Error("Invalid URL or video element provided.");
}
// Wait for metadata to load to obtain duration
await new Promise((resolve) => video.onloadedmetadata = resolve);

if (video.seekable.start(0) === video.seekable.end(0)) {
// Fallback: Download entire video if not seekable
const response = await fetch(video.src);
const blob = await response.blob();
video.src = URL.createObjectURL(blob);
await new Promise((resolve) => video.onloadedmetadata = resolve);
}

const duration = video.duration;

let count, step;
if (num_frames != null) {
count = num_frames;
step = num_frames === 1 ? 0 : duration / (num_frames - 1);
} else {
step = 1 / fps;
count = Math.floor(duration / step);
}

// Build an array of sample times based on num_frames or fps
let sampleTimes = [];
for (let i = 0; i < count; ++i) {
sampleTimes.push(num_frames === 1 ? duration / 2 : i * step);
}

const canvas = document.createElement("canvas");
canvas.width = video.videoWidth;
canvas.height = video.videoHeight;
const ctx = canvas.getContext("2d", { willReadFrequently: true });
for (const t of sampleTimes) {
video.currentTime = t;
await new Promise((resolve) => {
video.onseeked = resolve;
});
ctx.drawImage(video, 0, 0, canvas.width, canvas.height);
const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);
const frameData = new RawImage(imageData.data, canvas.width, canvas.height, 4);

const frame = new RawVideoFrame(frameData, t);
frames.push(frame);
}

// Clean up video element.
video.remove();

return new RawVideo(frames, duration);
}
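The new `load_video` helper is exported from the package root via the `src/transformers.js` change above and, per the check at the top of the function, currently only works in browser environments. It returns a `RawVideo` whose frames each wrap a `RawImage` plus the timestamp they were sampled at. A small usage sketch based on the API defined in this file; the video URL is a placeholder:

```js
import { load_video } from "@huggingface/transformers";

// Placeholder URL; it must be fetchable from the browser (CORS permitting).
const video = await load_video("https://example.com/clip.mp4", { num_frames: 8 });
// Alternatively, sample by rate instead of count:
//   await load_video("https://example.com/clip.mp4", { fps: 1 });

console.log(video.width, video.height, video.duration, video.fps);

for (const frame of video.frames) {
  // Each frame carries a RawImage and the time (in seconds) it was sampled at.
  console.log(frame.timestamp, frame.image.width, frame.image.height);
}
```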
2 changes: 1 addition & 1 deletion tests/utils/generation.test.js
@@ -195,7 +195,7 @@ describe("Streamers", () => {
it(
"batch_size=1",
async () => {
- const target_chunks = ["helloerdingsdelete ", "melytabular ", "Stadiumoba ", "alcune ", "drug"];
+ const target_chunks = ["hello", "erdingsdelete ", "melytabular ", "Stadiumoba ", "alcune ", "drug"];
const chunks = [];
const callback_function = (text) => {
chunks.push(text);
