Stars
A suite of image and video neural tokenizers
HLS-based framework to accelerate the implementation of 2-D DP kernels on FPGA
[ICLR 2025] VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer
HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
A sparse attention kernel supporting mixed sparse patterns
PyTorch implementation of MAR+DiffLoss https://arxiv.org/abs/2406.11838
[ICML 2024] LLM and Simulation as Bilevel Optimizers: A New Paradigm to Advance Physical Scientific Discovery
Code for the paper DisCo-Diff: Enhancing Continuous Diffusion Models with Discrete Latents, ICML 2024
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
SEED-Voken: A Series of Powerful Visual Tokenizers
[NeurIPS 2024] Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
Tile primitives for speedy kernels
(NeurIPS 2024 Oral 🔥) Improved Distribution Matching Distillation for Fast Image Synthesis
LaVIT: Empower the Large Language Model to Understand and Generate Visual Content
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Model Compression Toolbox for Large Language Models and Diffusion Models
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
PyTorch emulation library for Microscaling (MX)-compatible data formats
[NeurIPS 2024 Best Paper][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ult…
Code for the NeurIPS 2024 paper QuaRot: end-to-end 4-bit inference for large language models
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV 2024)