A compact vision-language model that you can pretrain and finetune on a single consumer GPU such as an NVIDIA RTX 4090 with 24 GB of VRAM.
- 08/23/2025: Created a new model based on Qwen3-0.6B-base with SigLIP2-so400m: keeeeenw/MicroLlava-Qwen3-0.6B-base-siglip2-so400m. This model has ~1B parameters and achieves a 78.5 VQAv2 score, on par with the original LLaVA 1.5 (7B).
- 08/23/2025: Added Qwen3 support to TinyLLaVA_Factory, including:
  - A new chat template for Qwen3 integration
  - Training and evaluation scripts with hyperparameters for a single NVIDIA RTX 4090
  - Various compatibility fixes, such as the transformers upgrade required by the new Qwen3-0.6B-base model
 
- 08/17/2025: The Hugging Face repo was renamed to https://huggingface.co/keeeeenw/MicroLlava.
- 08/17/2025: Improved the VQAv2 test-dev score from 44.01% to 56.91% by upgrading the vision tower from SigLIP to SigLIP2.
- 08/09/2025: Initial version of MicroLlava released.
 
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model from Hugging Face
hf_path = 'keeeeenw/MicroLlava'
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
# model.cuda()  # Enable CUDA if needed - model runs fairly quickly on CPU
# Setup tokenizer
config = model.config
tokenizer = AutoTokenizer.from_pretrained(
    hf_path, 
    use_fast=False, 
    model_max_length=config.tokenizer_model_max_length,
    padding_side=config.tokenizer_padding_side
)
# Run inference
prompt = "What are the things I should be cautious about when I visit here?"
image_url = "https://llava-vl.github.io/static/images/view.jpg"
output_text, generation_time = model.chat(
    prompt=prompt,
    image=image_url,
    tokenizer=tokenizer
)
print(f'Model output: {output_text}')
print(f'Generation time: {generation_time}')
```

| Component | Details |
|---|---|
| Framework | Transformers + PyTorch | 
| Language Model | MicroLlama (~300M parameters) | 
| Vision Encoder | SigLIP2-SO400M | 
| Training Hardware | Single NVIDIA RTX 4090 | 
| Checkpoint Format | SafeTensors | 
| License | Apache 2.0 | 
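
As a quick sanity check against the sizes in the table, you can count the parameters of the model loaded in the quickstart above. This is a minimal sketch using plain PyTorch; submodule attribute names (language model vs. vision tower) vary between implementations, so only the total is reported:

```python
# Minimal sketch: total parameter count of the `model` object from the quickstart above.
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params / 1e6:.1f}M")  # roughly the LM plus the vision encoder
```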
- Single-GPU Training: Train on consumer hardware without DeepSpeed
- Fast Training: Pretraining takes ~5 hours and finetuning ~12 hours on an RTX 4090
- Compact: Only ~300M language model parameters
- Vision-Language Tasks: Visual question answering and image captioning
- Easy Iteration: Well suited to research and experimentation
 
With the SigLIP2 vision tower (current model):

| Question Type | Accuracy |
|---|---|
| Yes/No | 72.32% | 
| Number | 43.89% | 
| Other | 46.65% | 
| Overall | 56.91% | 
Evaluated on VQAv2 test-dev split
With the original SigLIP vision tower (for comparison):

| Question Type | Accuracy |
|---|---|
| Yes/No | 65.08% | 
| Number | 28.97% | 
| Other | 29.32% | 
| Overall | 44.01% | 
Evaluated on VQAv2 test-dev split
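
For reference, the per-type accuracies in both tables follow the standard VQAv2 scoring rule, where a predicted answer earns partial credit based on how many of the ten human annotators gave that answer. Below is a minimal sketch of the commonly quoted form of that rule; the official evaluation additionally normalizes answers (articles, punctuation, number words) before matching, which is omitted here:

```python
# Commonly quoted form of the VQAv2 accuracy rule:
#   accuracy(answer) = min(#annotators who gave that answer / 3, 1.0)
# averaged over all questions. Official answer normalization is omitted in this sketch.
from collections import Counter

def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    counts = Counter(a.strip().lower() for a in human_answers)
    return min(counts[predicted.strip().lower()] / 3.0, 1.0)

# Examples with the 10 human answers VQAv2 provides per question:
print(vqa_accuracy("yes", ["yes"] * 4 + ["no"] * 6))  # 1.0  -> full credit (3+ annotators agree)
print(vqa_accuracy("2", ["2"] * 2 + ["3"] * 8))       # ~0.67 -> partial credit (2 annotators agree)
```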
Additional benchmarks to evaluate:

- VQAv2 test set (instead of test-dev)
- Datasets from the TinyLLaVA evaluation suite

Community contributions with benchmark results are welcome and encouraged.
 
This model is based on TinyLLaVA Factory with optimizations for single GPU training.
- Pretraining: ~5 hours on LAION-CC-SBU-558K
 - Finetuning: ~12 hours on TinyLLaVA datasets
 
Pretraining Hyperparameters:

- gradient_accumulation_steps: 2 → 8
- learning_rate: 1e-3 → 2.5e-4
- warmup_ratio: 0.03 → 0.06
- bfloat16: True after the SigLIP2 upgrade (improved stability)
Finetuning:

- Precision: bfloat16 (improved stability)
- Same major hyperparameters as the original TinyLLaVA
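
For reference, here is a sketch of the overrides above written as Hugging Face TrainingArguments-style fields. The exact flag names used by the TinyLLaVA_Factory launch scripts may differ, so treat this as illustrative rather than a drop-in config:

```python
# Illustrative only: the single-RTX-4090 settings listed above, expressed as
# TrainingArguments-style fields. Flag names in the actual TinyLLaVA_Factory
# scripts may differ.
pretrain_overrides = {
    "gradient_accumulation_steps": 8,  # raised from 2
    "learning_rate": 2.5e-4,           # lowered from 1e-3
    "warmup_ratio": 0.06,              # raised from 0.03
    "bf16": True,                      # enabled after the SigLIP2 upgrade for stability
}

finetune_overrides = {
    "bf16": True,  # otherwise the original TinyLLaVA finetuning hyperparameters are kept
}
```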
 
- Clone the training repository:
 
```bash
git clone https://github.com/keeeeenw/TinyLLaVA_Factory.git
cd TinyLLaVA_Factory
```

- Follow the training guides in the repository for pretraining and finetuning steps.
 
- Research: Vision-language experimentation on limited hardware
 - Education: Learning VLM concepts and implementations
 - Prototyping: Quick iteration for domain-specific applications
 - Finetuning: Starting point for specialized vision-language tasks
 
- Small model size may limit complex reasoning capabilities
 - OCR performance may be limited compared to larger models
 - Performance varies with image quality and domain
 - Minimal safety filtering - implement safeguards for production use
 
Warning: This model should not be used for safety-critical applications without thorough human review and additional safeguards.
- MicroLlama - The base language model
 - TinyLLaVA Factory - Training framework
 - SigLIP2 - Vision encoder
 
```bibtex
@misc{wang2025microllava,
  title        = {MicroLLaVA: a TinyLLaVA based VLM with MicroLlama 300M for single GPU training},
  author       = {Zixiao Ken Wang},
  year         = {2025},
  url          = {https://huggingface.co/keeeeenw/MicroLlava}
}
```

We welcome contributions! Please see our Contributing Guidelines for details.
- Additional evaluation benchmarks
 - Performance optimizations
 - Documentation improvements
 - Example applications
 
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Special thanks to:
- TinyLLaVA Factory team for the training framework
 - SigLIP2 authors for the efficient vision encoder
 - LAION community for the pretraining datasets
 - Hugging Face for model hosting and tools
 
⭐ Star this repository if you find it useful! ⭐
For questions and support, please open an issue or check out the Hugging Face model page.