
Commit 406673c

Update README (#118)

* Changed README to LaRae's version
* Remove arrows from table
* Add table with people & projects to follow
* Update images and links in README.md

1 parent 3316908, commit 406673c

1 file changed: README.md (115 additions, 63 deletions)
@@ -1,3 +1,5 @@

<img width="1280" height="360" alt="Readme" src="https://github.com/user-attachments/assets/80c437dc-a80a-45da-bd18-0545740a3358" />

# Open Vision Agents by Stream

[![build](https://github.com/GetStream/Vision-Agents/actions/workflows/ci.yml/badge.svg)](https://github.com/GetStream/Vision-Agents/actions)
@@ -6,21 +8,29 @@

[![License](https://img.shields.io/github/license/GetStream/Vision-Agents)](https://github.com/GetStream/Vision-Agents/blob/main/LICENSE)
[![Discord](https://img.shields.io/discord/1108586339550638090)](https://discord.gg/RkhX9PxMS6)

---

## Build Real-Time Vision AI Agents

<a href="https://youtu.be/Hpl5EcCpLw8">
  <img src="assets/demo_thumbnail.png" alt="Watch the demo" style="width:100%; max-width:900px;">
</a>

### Multi-modal AI agents that watch, listen, and understand video.

Vision Agents give you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.

### Key Highlights

- **Video AI:** Built for real-time video AI. Combine YOLO, Roboflow, and others with Gemini/OpenAI in real time.
- **Low Latency:** Join quickly (~500 ms) and keep audio/video latency under 30 ms using [Stream's edge network](https://getstream.io/video/).
- **Open:** Built by Stream, but works with any video edge network.
- **Native APIs:** Native SDK methods from OpenAI (`create response`), Gemini (`generate`), and Claude (`create message`), so you can always use the latest LLM capabilities.
- **SDKs:** SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency network.

---

## See It In Action

### Sports Coaching

@@ -45,7 +55,7 @@ Combining a fast object detection model (like YOLO) with a full realtime AI is u

For example: drone fire detection, sports or video game coaching, physical therapy, workout coaching, Just Dance-style games, and more.

<a href="https://x.com/nash0x7e2/status/1950341779745599769">
  <img src="assets/golf_example_tweet.png" alt="Golf Example" style="width:100%; max-width:800px;">
</a>

### Cluely-style Invisible Assistant (coming soon)
@@ -66,82 +76,124 @@ agent = Agent(
)
```
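
The hunk above only shows the tail of this example. For orientation, here is a minimal sketch of what the full `Agent` construction can look like; the import paths and parameter names (`edge`, `agent_user`, `instructions`, `llm`, `processors`) are assumptions drawn from the plugin list below, not a verbatim copy of the elided code:

```python
# A sketch only: module paths, class names, and parameters are assumptions.
from vision_agents.core import Agent, User
from vision_agents.plugins import getstream, openai, ultralytics

agent = Agent(
    edge=getstream.Edge(),               # Stream's edge network as the transport
    agent_user=User(name="Golf coach"),  # how the agent appears in the call
    instructions="Watch the swing and give short, actionable coaching tips.",
    llm=openai.Realtime(),               # realtime speech-to-speech LLM
    processors=[ultralytics.YOLOPoseProcessor()],  # fast pose model beside the LLM
)
```
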

## Quick Start

**Step 1: Install via uv**

`uv add vision-agents`

**Step 2: (Optional) Install with extra integrations**

`uv add "vision-agents[getstream, openai, elevenlabs, deepgram]"`

**Step 3: Obtain your Stream API credentials**

Get a free API key from [Stream](https://getstream.io/). Developers receive **333,000 participant minutes** per month, plus extra credits via the Maker Program.
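
To tie the three steps together, a hedged end-to-end sketch: the call helpers (`create_user`, `create_call`, `join`, `finish`) are illustrative names rather than verified API, and the Stream credentials are assumed to be read from environment variables by the `getstream` plugin.

```python
# Hypothetical quick-start flow; helper names are illustrative, not verified API.
import asyncio
from uuid import uuid4

async def main() -> None:
    # `agent` as constructed in the earlier sketch.
    await agent.create_user()                          # register the agent with Stream
    call = agent.create_call("default", str(uuid4()))  # create a fresh call to join
    with await agent.join(call):                       # stream audio/video both ways
        await agent.finish()                           # run until the call ends

asyncio.run(main())
```
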
## Features

| **Feature** | **Description** |
| --- | --- |
| **True real-time via WebRTC** | Stream directly to model providers that support it for instant visual understanding. |
| **Interval/processor pipeline** | For providers without WebRTC, process frames with pluggable video processors (e.g., YOLO, Roboflow, or custom PyTorch/ONNX) before/after model calls. |
| **Turn detection & diarization** | Keep conversations natural; know when the agent should speak or stay quiet, and who's talking. |
| **Voice activity detection (VAD)** | Trigger actions intelligently and use resources efficiently. |
| **Speech↔Text↔Speech** | Enable low-latency loops for smooth, conversational voice UX. |
| **Tool/function calling** | Execute arbitrary code and APIs mid-conversation. Create Linear issues, query weather, trigger telephony, or hit internal services. |
| **Built-in memory via Stream Chat** | Agents recall context naturally across turns and sessions. |
| **Text back-channel** | Message the agent silently during a call. |
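
As a concrete illustration of the tool/function calling row above, here is a sketch of registering a tool the model can invoke mid-conversation. The `register_function` decorator is a hypothetical name for whatever registration hook the LLM plugin exposes, and `agent` is the instance from the earlier sketch:

```python
# Sketch of tool calling; `register_function` is a hypothetical decorator name.
import httpx

@agent.llm.register_function(description="Get current weather for a city")
async def get_weather(city: str) -> dict:
    # The model can call this mid-conversation, e.g. when a caller asks
    # "do I need an umbrella?". wttr.in is a public demo weather endpoint.
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"https://wttr.in/{city}", params={"format": "j1"})
        resp.raise_for_status()
        return resp.json()
```
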
## Out-of-the-Box Integrations

| **Plugin Name** | **Description** | **Docs Link** |
| --- | --- | --- |
| **Cartesia** | TTS plugin for realistic voice synthesis in real-time voice applications | [View Docs](https://visionagents.ai/integrations/cartesia) |
| **Deepgram** | STT plugin for fast, accurate real-time transcription with speaker diarization | [View Docs](https://visionagents.ai/integrations/deepgram) |
| **ElevenLabs** | TTS plugin with highly realistic and expressive voices for conversational agents | [View Docs](https://visionagents.ai/integrations/elevenlabs) |
| **Kokoro** | Local TTS engine for offline voice synthesis with low latency | [View Docs](https://visionagents.ai/integrations/kokoro) |
| **Moonshine** | STT plugin optimized for fast, locally runnable transcription on constrained devices | [View Docs](https://visionagents.ai/integrations/moonshine) |
| **OpenAI** | LLM plugin for real-time reasoning, conversation, and multimodal capabilities using OpenAI's Realtime API | [View Docs](https://visionagents.ai/integrations/openai) |
| **Gemini** | Multimodal plugin for real-time audio, video, and text understanding powered by Google's Gemini Live models | [View Docs](https://visionagents.ai/integrations/gemini) |
| **Silero** | VAD plugin for voice activity detection and turn-taking in low-latency real-time conversations | [View Docs](https://visionagents.ai/integrations/silero) |
| **Wizper** | Real-time variant of OpenAI's Whisper v3 for speech-to-text and on-the-fly translation, hosted by Fal.ai | [View Docs](https://visionagents.ai/integrations/wizper) |
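
Plugins are designed to be mixed and matched. Below is a sketch of a Speech↔Text↔Speech pipeline composed from the table above; every constructor name (`gemini.LLM()`, `deepgram.STT()`, `elevenlabs.TTS()`, `silero.VAD()`) is an assumption following the obvious convention and should be checked against the linked docs.

```python
# Assumed constructor names; consult each plugin's docs linked above.
from vision_agents.core import Agent, User
from vision_agents.plugins import deepgram, elevenlabs, gemini, getstream, silero

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Assistant"),
    instructions="Answer questions about what you see and hear.",
    llm=gemini.LLM(),    # text LLM; the STT/TTS plugins below close the speech loop
    stt=deepgram.STT(),  # transcription with speaker diarization
    tts=elevenlabs.TTS(),  # expressive voice output
    vad=silero.VAD(),    # voice activity detection for turn-taking
)
```
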
## Processors

Processors let your agent **manage state** and **handle audio/video** in real time.

They take care of the hard stuff, like:

- Running smaller models (such as YOLO or Roboflow) next to the LLM
- Making API calls to maintain relevant info or game state
- Transforming and capturing media, for instance avatars

… so you can focus on your agent logic.
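
To make that concrete, here is a hedged sketch of a custom processor. The base class, import path, and hook name (`Processor`, `process_frame`) are illustrative; the real interface lives in the library's processor module.

```python
# Illustrative only: base class, import path, and hook name are assumptions.
from vision_agents.core.processors import Processor  # assumed import path

class BallTracker(Processor):
    """Counts frames containing a ball and exposes the count as agent state."""

    def __init__(self) -> None:
        super().__init__()
        self.state = {"ball_visible_frames": 0}

    async def process_frame(self, frame):
        # Swap this stub for a cheap detector (YOLO, Roboflow, custom ONNX, ...).
        if self._detect_ball(frame):
            self.state["ball_visible_frames"] += 1
        return frame  # pass the frame through unmodified

    def _detect_ball(self, frame) -> bool:
        return False  # placeholder so the sketch runs end to end
```

The same shape works for transforming media: return a modified frame instead of the original.
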

## Documentation

Check out our getting started guide at [VisionAgents.ai](https://visionagents.ai/).

- **Quickstart:** [Building a Voice AI app](https://visionagents.ai/introduction/voice-agents)
- **Quickstart:** [Building a Video AI app](https://visionagents.ai/introduction/video-agents)
- **Tutorial:** [Building real-time sports coaching](https://github.com/GetStream/Vision-Agents/tree/main/examples/02_golf_coach_example)
- **Tutorial:** [Building a real-time meeting assistant](https://github.com/GetStream/Vision-Agents#)

## Development

See [DEVELOPMENT.md](DEVELOPMENT.md)

## Open Platform

Want to add your platform or provider? Reach out to **[email protected]**.

## Awesome Video AI

Our favorite people & projects to follow for vision AI:

| [<img src="https://github.com/user-attachments/assets/9149e871-cfe8-4169-a4ce-4073417e645c" width="80"/>](https://x.com/demishassabis) | [<img src="https://github.com/user-attachments/assets/2e1335d3-58af-4988-b879-1db8d862cd34" width="80"/>](https://x.com/OfficialLoganK) | [<img src="https://github.com/user-attachments/assets/c9249ae9-e66a-4a70-9393-f6fe4ab5c0b0" width="80"/>](https://x.com/ultralytics) |
| :---: | :---: | :---: |
| [@demishassabis](https://x.com/demishassabis)<br>CEO @ Google DeepMind<br><sub>Nobel Prize winner</sub> | [@OfficialLoganK](https://x.com/OfficialLoganK)<br>Product Lead @ Gemini<br><sub>Posts about robotics vision</sub> | [@ultralytics](https://x.com/ultralytics)<br>Various fast vision AI models<br><sub>Pose, detect, segment, classify</sub> |

| [<img src="https://github.com/user-attachments/assets/c1fe873d-6f41-4155-9be1-afc287ca9ac7" width="80"/>](https://x.com/skalskip92) | [<img src="https://github.com/user-attachments/assets/43359165-c23d-4d5d-a5a6-1de58d71fabd" width="80"/>](https://x.com/moondreamai) | [<img src="https://github.com/user-attachments/assets/490d349c-7152-4dfb-b705-04e57bb0a4ca" width="80"/>](https://x.com/kwindla) |
| :---: | :---: | :---: |
| [@skalskip92](https://x.com/skalskip92)<br>Open Source Lead @ Roboflow<br><sub>Building tools for vision AI</sub> | [@moondreamai](https://x.com/moondreamai)<br>The tiny vision model that could<br><sub>Lightweight, fast, efficient</sub> | [@kwindla](https://x.com/kwindla)<br>Pipecat / Daily<br><sub>Sharing AI and vision insights</sub> |

| [<img src="https://github.com/user-attachments/assets/d7ade584-781f-4dac-95b8-1acc6db4a7c4" width="80"/>](https://x.com/juberti) | [<img src="https://github.com/user-attachments/assets/00a1ed37-3620-426d-b47d-07dd59c19b28" width="80"/>](https://x.com/romainhuet) | [<img src="https://github.com/user-attachments/assets/eb5928c7-83b9-4aaa-854f-1d4f641426f2" width="80"/>](https://x.com/thorwebdev) |
| :---: | :---: | :---: |
| [@juberti](https://x.com/juberti)<br>Head of Realtime AI @ OpenAI<br><sub>Realtime AI systems</sub> | [@romainhuet](https://x.com/romainhuet)<br>Head of DX @ OpenAI<br><sub>Developer tooling & APIs</sub> | [@thorwebdev](https://x.com/thorwebdev)<br>ElevenLabs<br><sub>Voice and AI experiments</sub> |

| [<img src="https://github.com/user-attachments/assets/ab5ef918-7c97-4c6d-be10-2e2aeefec015" width="80"/>](https://x.com/mervenoyann) | [<img src="https://github.com/user-attachments/assets/af936e13-22cf-4000-a35b-bfe30d44c320" width="80"/>](https://x.com/stash_pomichter) |
| :---: | :---: |
| [@mervenoyann](https://x.com/mervenoyann)<br>Hugging Face<br><sub>Posts extensively about Video AI</sub> | [@stash_pomichter](https://x.com/stash_pomichter)<br>Spatial memory for robots<br><sub>Robotics & AI navigation</sub> |

## Inspiration

- Livekit Agents: Great syntax, Livekit only
- Pipecat: Flexible, but more verbose
- OpenAI Agents: Focused on OpenAI only

## Roadmap

### 0.1 – First Release

- Support for 10+ out-of-the-box [integrations](https://visionagents.ai/integrations/introduction-to-integrations)
- Video processors
- Native Stream Chat integration for memory
- MCP & function calling for Gemini and OpenAI
- Realtime WebRTC video and voice with GPT Realtime

### Coming Soon

- [ ] Improved Python WebRTC library
- [ ] Hosting & production deploy example
- [ ] More built-in YOLO processors (object & person detection)
- [ ] Roboflow support
- [ ] Computer use support
- [ ] AI avatar integrations (e.g., Tavus)
- [ ] Qwen3 vision support
- [ ] Buffered video capture (for "catch the moment" scenarios)
- [ ] Moondream vision

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=GetStream/vision-agents&type=timeline&legend=top-left)](https://www.star-history.com/#GetStream/vision-agents&type=timeline&legend=top-left)
