60 changes: 60 additions & 0 deletions README.md
@@ -1,3 +1,63 @@
# vLLM-omni: Multi-modal Extension for vLLM

vLLM-omni extends vLLM to support multi-modal model inference and serving, with a particular focus on non-autoregressive architectures and non-textual outputs.

## 🎯 Overview

vLLM itself is limited to text-based, autoregressive generation. vLLM-omni removes this limitation by adding support for:

- **Multi-modal Models**: Text, image, video, audio, and sensor data processing
- **Non-autoregressive Architectures**: Diffusion Transformers (DiT) and other parallel generation models
- **Heterogeneous Outputs**: Beyond traditional text generation to structured, binary, and streaming outputs

## 🏗️ Architecture

vLLM-omni is built on a modular architecture that extends vLLM's core functionality.


## 🚀 Key Features

### Multi-Engine Support

- **Autoregressive Engine**: Traditional text generation with enhanced KV-caching
- **Diffusion Engine**: Support for DiT models and iterative generation
- **Hybrid Engine**: Combined AR+DiT processing pipelines

### Modality Processing

- **Text**: Advanced tokenization and embedding generation
- **Image**: Vision encoder integration (CLIP, etc.)
- **Audio**: Speech processing and audio embedding
- **Video**: Frame-by-frame and temporal processing
- **Sensor**: IoT and sensor data interpretation

### Output Formats

- **Structured Data**: JSON, XML, and custom formats
- **Binary Outputs**: Images, audio, and video generation
- **Streaming**: Real-time progressive generation
- **Multipart**: Combined multi-modal responses

## 📋 Supported Models

### AR + Diffusion Transformer (DiT) Models
- Qwen-Image (Image generation and editing)
- Qwen-omni (Thinker-Talker-Codec structure)
- Custom DiT and hybrid architectures

## 🛠️ Installation

### Prerequisites

- Python 3.8+
- PyTorch 2.0+
- CUDA 11.8+ (for GPU acceleration)

### Install from Source

```bash
git clone https://github.com/your-org/vllm-omni.git
cd vllm-omni
pip install -r requirements.txt
pip install -e .
```
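
After installing, offline usage could look like the following. This is a minimal sketch that assumes the package exposes an `OmniLLM` class (see docs/PRD.md) with a `generate` interface mirroring vLLM's `LLM`; exact parameter names and output shapes may differ.

```python
# Hypothetical quickstart: assumes vllm_omni exposes OmniLLM with an
# interface mirroring vLLM's LLM class. Names here are illustrative.
from vllm_omni import OmniLLM

# Load a multi-stage (AR + DiT) model.
llm = OmniLLM(model="Qwen/Qwen2.5-Omni-7B")

# Run a prompt through the full AR -> DiT pipeline.
outputs = llm.generate(["Draw a watercolor fox sitting in the snow."])

for output in outputs:
    # Outputs may carry text, image bytes, or both, depending on the model.
    print(output)
```
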
179 changes: 179 additions & 0 deletions docs/PRD.md
@@ -0,0 +1,179 @@
# vLLM-omni Product Requirements Document (PRD)

## 1. Product Overview

### 1.1 Product Name
vLLM-omni: Multi-modality model inference and serving with non-autoregressive structures

### 1.2 Product Vision
Extend vLLM beyond traditional text-based, autoregressive generation to support multi-modality models with non-autoregressive structures and non-textual outputs while maintaining vLLM's proven architecture and performance.

### 1.3 Target Users
- AI researchers working with multimodal models
- ML engineers building production inference systems
- Developers integrating DiT (Diffusion Transformer) models
- Organizations requiring efficient multimodal model serving

## 2. Core Requirements

### 2.1 Functional Requirements

#### 2.1.1 Multi-Stage Processing
- **REQ-001**: Support stage-based model processing where each stage can use different engine types (AR/DiT)
- **REQ-002**: Enable sequential processing through multiple stages with data flow between stages
- **REQ-003**: Support both autoregressive (AR) and diffusion (DiT) model stages
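
Conceptually, these three requirements reduce to a sequential loop over heterogeneous stages. The sketch below is illustrative only; the `Stage` protocol is a hypothetical stand-in, not vLLM-omni's actual interface.

```python
# Illustrative sketch of REQ-001..REQ-003; the Stage protocol is hypothetical.
from typing import Any, List, Protocol


class Stage(Protocol):
    engine_type: str  # "ar" or "dit" (REQ-001, REQ-003)

    def run(self, inputs: Any) -> Any: ...


def run_pipeline(stages: List[Stage], request: Any) -> Any:
    """Feed each stage's output into the next stage (REQ-002)."""
    data = request
    for stage in stages:
        data = stage.run(data)
    return data
```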

#### 2.1.2 vLLM Compatibility
- **REQ-004**: Maintain full compatibility with vLLM V1 architecture (AsyncLLM and EngineCore patterns)
- **REQ-005**: Support existing vLLM CLI commands with `--omni` flag extension
- **REQ-006**: Reuse vLLM's multiprocess worker architecture for scalability

#### 2.1.3 Multimodal Support
- **REQ-007**: Support text, image, and latent space inputs/outputs
- **REQ-008**: Enable image-to-image, text-to-image, and text-to-text generation
- **REQ-009**: Support hidden state passing between AR and DiT stages

#### 2.1.4 CLI and API
- **REQ-010**: Provide CLI command: `vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8000`
- **REQ-011**: Support both online (AsyncOmniLLM) and offline (OmniLLM) inference modes
- **REQ-012**: Maintain vLLM's existing API compatibility
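
For the online mode, a client can talk to a server started per REQ-010 through vLLM's existing OpenAI-compatible API (REQ-012). A hedged example; whether omni outputs surface through the standard chat endpoint is an assumption:

```python
# Assumes the server was started as in REQ-010:
#   vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8000
# and that vLLM's OpenAI-compatible endpoints remain available (REQ-012).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Omni-7B",
    messages=[{"role": "user", "content": "Generate an image of a red panda."}],
)
print(response.choices[0].message.content)
```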

### 2.2 Non-Functional Requirements

#### 2.2.1 Performance
- **REQ-013**: Maintain vLLM's inference performance for AR stages
- **REQ-014**: Optimize DiT stage processing with caching mechanisms
- **REQ-015**: Support distributed inference across multiple GPUs

#### 2.2.2 Scalability
- **REQ-016**: Support horizontal scaling through vLLM's worker process pattern
- **REQ-017**: Enable efficient memory management for large multimodal models
- **REQ-018**: Support batch processing for multiple requests

#### 2.2.3 Extensibility
- **REQ-019**: Easy integration of new modalities and model architectures
- **REQ-020**: Pluggable scheduler and executor components
- **REQ-021**: Support for future non-autoregressive model types

## 3. Technical Architecture

### 3.1 Core Components

#### 3.1.1 Entry Points
- **OmniServeCommand**: CLI wrapper that intercepts vLLM commands with `--omni` flag
- **OmniLLM**: Offline inference class supporting multi-stage processing
- **AsyncOmniLLM**: Online inference class with asynchronous processing

#### 3.1.2 Stage Management
- **OmniStageConfig**: Configuration for each processing stage
- **Stage Engine List**: Multiple AsyncLLM instances for each stage
- **Stage I/O Management**: Data flow between stages
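
As a sketch, a stage configuration could carry the engine type, model, and inter-stage wiring. `OmniStageConfig` is the component named above; the fields shown here are assumptions:

```python
# Hypothetical shape of OmniStageConfig; all field names are assumptions.
from dataclasses import dataclass, field
from typing import List


@dataclass
class OmniStageConfig:
    name: str                 # e.g. "thinker", "talker"
    engine_type: str          # "ar" or "dit"
    model: str                # model path or HF repo id
    inputs: List[str] = field(default_factory=list)   # upstream stage outputs
    outputs: List[str] = field(default_factory=list)  # produced modalities


# A two-stage AR -> DiT pipeline for text-to-image generation.
stages = [
    OmniStageConfig("ar", "ar", "Qwen/Qwen2.5-Omni-7B",
                    outputs=["hidden_states"]),
    OmniStageConfig("dit", "dit", "Qwen/Qwen-Image",
                    inputs=["hidden_states"], outputs=["image"]),
]
```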

#### 3.1.3 Engine Components
- **EngineCore**: Reused from vLLM (no changes needed)
- **OmniDiffusionScheduler**: New scheduler for DiT models
- **DiTCacheManager**: Caching system for DiT optimization
- **MultiprocExecutor**: Reused from vLLM for DiT models that run without the diffusers library
- **DiffusersPipelineExecutor**: New executor for diffusers integration

#### 3.1.4 Model Runners
- **OmniDiffusionModelRunner**: Handles DiT model execution
- **OmniARModelRunner**: Handles AR model execution with hidden state output
- **ModelRunnerOutput**: Extended to support multimodal outputs via pooler_output
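
The AR→DiT handoff implied here (and by REQ-009) could look roughly as follows; the stand-in types and the `run` method are assumptions, with only the `pooler_output` field taken from the list above:

```python
# Hedged sketch of the AR -> DiT hidden-state handoff (REQ-009).
# ARStageOutput and the dit_runner interface are hypothetical stand-ins.
from dataclasses import dataclass

import torch


@dataclass
class ARStageOutput:
    text: str
    pooler_output: torch.Tensor  # final hidden states, [num_tokens, hidden_dim]


def bridge_ar_to_dit(ar_out: ARStageOutput, dit_runner) -> torch.Tensor:
    # The DiT stage consumes AR hidden states as conditioning for denoising.
    return dit_runner.run(conditioning=ar_out.pooler_output)
```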

#### 3.1.5 Output Processing
- **MultimodalOutputProcessor**: Handles final multimodal output processing
- **RequestState**: Extended to support pooling outputs
- **Output Handlers**: Type-specific output processing
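
Type-specific handling could be a simple dispatch on the output modality; the handler table below is an illustrative assumption, not the processor's actual design:

```python
# Illustrative modality dispatch; handler table and names are assumptions.
from typing import Any, Callable, Dict

OUTPUT_HANDLERS: Dict[str, Callable[[Any], bytes]] = {
    "text": lambda o: o.encode("utf-8"),
    "image": lambda o: o,   # e.g. already-encoded PNG bytes from a DiT stage
    "audio": lambda o: o,   # e.g. WAV bytes from a codec stage
}


def process_output(modality: str, payload: Any) -> bytes:
    # MultimodalOutputProcessor would route each output by modality.
    return OUTPUT_HANDLERS[modality](payload)
```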

### 3.2 Data Flow
```
API Server → OmniLLM/AsyncOmniLLM → LLMEngine/AsyncLLM → Engine Core
→ Scheduler (AR/DiT) → Executor (AR/DiT) → Worker (AR/DiT)
→ ModelRunner (AR/DiT) → RequestState → OutputProcessor → Final Output
```

## 4. Implementation Phases

### Phase 1: Foundation (Weeks 1-2)
- Package structure and dependencies
- Basic OmniLLM and AsyncOmniLLM classes
- Stage configuration system
- CLI integration

### Phase 2: Core Components (Weeks 3-4)
- DiT scheduler implementation
- Model runners for AR and DiT
- Basic output processing

### Phase 3: Advanced Features (Weeks 5-6)
- Caching system implementation
- Multimodal output processing
- Request state management

### Phase 4: Integration & Testing (Weeks 7-8)
- End-to-end integration
- Comprehensive testing
- Performance optimization
- Documentation

## 5. Success Criteria

### 5.1 Functional Success
- [ ] Successfully run `vllm serve model --omni` command
- [ ] Process multi-stage AR→DiT pipelines
- [ ] Generate multimodal outputs (text + image)
- [ ] Maintain vLLM API compatibility

### 5.2 Performance Success
- [ ] AR stage performance within 5% of native vLLM
- [ ] DiT stage processing with reasonable latency
- [ ] Memory usage comparable to vLLM for equivalent models

### 5.3 Quality Success
- [ ] 90%+ test coverage
- [ ] All integration tests passing
- [ ] Documentation complete and accurate

## 6. Risk Assessment

### 6.1 Technical Risks
- **High**: vLLM API changes breaking compatibility
- **Medium**: DiT model integration complexity
- **Low**: Performance overhead from multi-stage processing

### 6.2 Mitigation Strategies
- Regular vLLM compatibility testing
- Incremental DiT integration with fallback options
- Performance benchmarking at each stage

## 7. Dependencies

### 7.1 External Dependencies
- vLLM >= 0.10.2
- PyTorch >= 2.7
- Transformers >= 4.30.0
- FastAPI, Uvicorn for API serving
- Ray for distributed computing

### 7.2 Optional Dependencies
- xDiT for DiT acceleration
- Cache-DiT for advanced caching
- Diffusers for pipeline-based DiT models

## 8. Future Roadmap

### 8.1 Short-term (3 months)
- Additional DiT model support
- Performance optimizations
- Enhanced caching strategies

### 8.2 Medium-term (6 months)
- Support for video generation models
- Advanced scheduling strategies
- Multi-GPU DiT optimization

### 8.3 Long-term (12 months)
- Custom model architecture support
- Advanced multimodal fusion
- Production deployment tools