
@tbarbugli (Member) commented Oct 23, 2025

  • Simplified STT base class
  • AI docs for STT
  • AI docs for Audio utils
  • Human docs for STT and Audio utils
  • Cleanup of tests for all STT plugins
  • Base tests for the STT class
  • STT plugin using AI instructions (single-shot)
  • OpenAI TTS plugin
  • AWS Polly TTS plugin
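
For context, the two new plugins above construct roughly like this. This is a hedged sketch based only on the constructor descriptions in this PR: the keyword names (region, voice) and the import path for the AWS plugin are assumptions, not verified against the plugin code.

from vision_agents.plugins import aws, openai

# Defaults per the PR summary: model "gpt-4o-mini-tts", voice "alloy";
# api_key is optional (presumably read from the environment).
tts_openai = openai.TTS()

# The Polly constructor "configures region, voice, text type, engine,
# language code, lexicon names"; exact kwarg names are assumed here.
tts_polly = aws.TTS(region="us-east-1", voice="Joanna")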

Summary by CodeRabbit

Release Notes

  • New Features

    • Added AWS Polly text-to-speech support with customizable voices and audio formats.
    • Added OpenAI text-to-speech integration with configurable models and voices.
    • Enhanced TTS output format configuration with support for multiple sample rates and channels.
  • Improvements

    • Unified audio handling across all TTS providers for consistent format support.
    • Added audio testing utilities including WAV generation and non-blocking verification.
    • Improved multi-channel audio support with enhanced resampling capabilities.

@coderabbitai (bot) commented Oct 23, 2025

Walkthrough

Audio handling architecture refactored to use PCM-centric data streams with a new OutputAudioTrack protocol. Core PcmData class enhanced for multi-channel samples, resampling, and WAV serialization. TTS base class updated with set_output_format configuration and chunk-based audio event emission. Multiple TTS plugins (Cartesia, ElevenLabs, Fish, Kokoro) and new plugins (OpenAI, AWS Polly) adapted to return PcmData streams. Testing utilities introduced for session-based result collection and non-blocking verification.
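
As a sketch of the new contract, the protocol can be pictured as below. This is reconstructed from the walkthrough, not copied from agents-core/vision_agents/core/edge/types.py; the parameter names and whether stop() is async are assumptions.

from typing import Protocol


class OutputAudioTrack(Protocol):
    """Sink that the Agent routes synthesized PCM into."""

    async def write(self, data: bytes) -> None:
        """Accept a chunk of interleaved PCM bytes for playback."""
        ...

    async def stop(self) -> None:
        """Stop the track and drop any pending audio."""
        ...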

Changes

Cohort / File(s) — Summary
Core Audio Infrastructure
agents-core/vision_agents/core/edge/types.py
Added OutputAudioTrack protocol with async write() and stop() methods. Enhanced PcmData with channels field, stereo property, duration_ms property, multi-channel resampling, WAV serialization (to_wav_bytes, to_bytes), from_data classmethod, and from_response streaming factory. Added top-level play_pcm_with_ffplay() async utility for WAV playback.
TTS Base Class Refactor
agents-core/vision_agents/core/tts/tts.py
Added set_output_format(sample_rate, channels, audio_format) for output configuration. Introduced _iter_pcm() to normalize provider responses, _emit_chunk() for PCM resampling/serialization with event emission, and updated stream_audio() to return Union[bytes, Iterator[bytes], AsyncIterator[bytes], PcmData, Iterator[PcmData], AsyncIterator[PcmData]]. Added stop_audio() public method. Updated error handling and latency recording; removed PluginInitializedEvent, added PluginClosedEvent emission.
Agent RTC Integration
agents-core/vision_agents/core/agents/agents.py
Changed _audio_track type from Optional[aiortc.AudioStreamTrack] to Optional[OutputAudioTrack]. Updated _prepare_rtc to call TTS set_output_format() instead of set_output_track(). Added TTSAudioEvent import and event handler hook. Set default framerate to 48000 Hz and stereo to True when not in realtime mode.
TTS Testing Utilities
agents-core/vision_agents/core/tts/testing.py, agents-core/vision_agents/core/tts/manual_test.py
New TTSSession class for event-driven result collection (wait_for_result() with timeout). New TTSResult dataclass. New manual_tts_to_wav() async helper for TTS-to-WAV conversion with optional ffplay playback. New assert_tts_send_non_blocking() utility to verify event-loop responsiveness during TTS sends.
Observability Updates
agents-core/vision_agents/core/observability/metrics.py, agents-core/vision_agents/core/observability/__init__.py
Refactored OpenTelemetry initialization to defer provider configuration to application. Added tts_events_emitted counter metric. Updated tracer and meter to use fixed library identifiers via trace.get_tracer() and metrics.get_meter(). Removed CALL_ATTRS export.
Edge Transport Abstraction
agents-core/vision_agents/core/edge/edge_transport.py, plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py
Updated create_audio_track() return type to OutputAudioTrack. Added OutputAudioTrack import. Minor formatting adjustments for multi-line signatures.
Cartesia TTS Plugin
plugins/cartesia/vision_agents/plugins/cartesia/tts.py
Removed get_required_framerate(), get_required_stereo(), set_output_track(). Updated stream_audio() signature to return PcmData | Iterator[PcmData] | AsyncIterator[PcmData] via PcmData.from_response(). Changed stop_audio() to a logging no-op. Updated imports: added AsyncIterator, Iterator; removed AudioStreamTrack. (This shared plugin pattern is sketched after the table.)
ElevenLabs TTS Plugin
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py
Removed framerate/stereo methods (get_required_framerate(), get_required_stereo(), set_output_track()). Updated stream_audio() to wrap response via PcmData.from_response() with sample_rate=16000, channels=1, format="s16". Changed stop_audio() to no-op. Added PcmData and os imports.
Fish TTS Plugin
plugins/fish/vision_agents/plugins/fish/tts.py
Updated stream_audio() signature and return type to PcmData | Iterator[PcmData] | AsyncIterator[PcmData]. Changed default reference_id to concrete string. Added support for FISH_AUDIO_API_KEY env var. Refactored TTSRequest construction with **tts_request_kwargs. Wrapped stream via PcmData.from_response(). Changed stop_audio() to no-op. Added Iterator, PcmData imports; removed AudioStreamTrack.
Kokoro TTS Plugin
plugins/kokoro/vision_agents/plugins/kokoro/tts.py
Removed get_required_framerate(), get_required_stereo(), set_output_track(). Updated stream_audio() to yield PcmData.from_bytes() objects instead of raw bytes; return type now PcmData | Iterator[PcmData] | AsyncIterator[PcmData]. Changed stop_audio() to no-op. Updated typing imports.
OpenAI TTS Plugin (New)
plugins/openai/vision_agents/plugins/openai/tts.py, plugins/openai/vision_agents/plugins/openai/__init__.py, plugins/openai/tests/test_tts_openai.py
New TTS implementation for OpenAI. Constructor takes api_key, model (default "gpt-4o-mini-tts"), voice (default "alloy"), optional client. stream_audio() calls OpenAI API with PCM output, returns PcmData with sample_rate=24000, channels=1, format="s16". stop_audio() is no-op. Added __init__ export and TTS to __all__. New integration tests with fixture and manual WAV generation.
AWS Polly TTS Plugin (New)
plugins/aws/vision_agents/plugins/aws/tts.py, plugins/aws/vision_agents/plugins/aws/__init__.py, plugins/aws/tests/test_tts.py, plugins/aws/example/aws_polly_tts_example.py, plugins/aws/README.md
New TTS implementation for AWS Polly. Constructor configures region, voice, text type, engine, language code, lexicon names, optional client. stream_audio() calls Polly SynthesizeSpeech with 16kHz PCM output, returns PcmData. Added __init__ export and TTS to __all__. Updated README. Added example script and integration tests with credential detection and environment-based gating.
Test Refactoring
plugins/cartesia/tests/test_tts.py, plugins/elevenlabs/tests/test_tts.py, plugins/fish/tests/test_fish_tts.py, plugins/kokoro/tests/test_tts.py, tests/test_tts_base.py, tests/test_pcm_data.py
Removed unit-test mocking infrastructure; replaced with integration-focused tests using TTSSession, manual_tts_to_wav(), and assert_tts_send_non_blocking(). Tests now use pytest fixtures with environment-based credential gating (pytest.skip). Added comprehensive PcmData tests covering interleaving, resampling, duration preservation, and multi-channel handling. New test_tts_base.py validates PCM streaming, error propagation, and event emission for TTS base class.
Documentation & Examples
docs/ai/instructions/ai-tts.md, docs/ai/instructions/ai-tests.md, DEVELOPMENT.md, examples/01_simple_agent_example/simple_agent_example.py
Updated TTS implementation guide with emphasis on stream_audio() returning PcmData and usage of PcmData.from_bytes(). Added non-blocking checks documentation with assert_tts_send_non_blocking() example. New "Audio management" section in DEVELOPMENT.md detailing PCM-centric handling, WAV serialization, resampling, and playback. Updated simple agent example UI flow.
Infrastructure
conftest.py, tests/test_utils.py, plugins/aws/tests/test_aws.py
Minor formatting and stylistic adjustments in conftest.py. Updated test utilities to handle 1D and 2D numpy arrays in PcmData tests. Updated AWS Bedrock test fixture to skip with pytest.skip() when credentials are missing instead of raising.
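
The plugin rows above (Cartesia, ElevenLabs, Fish) share one pattern, sketched below under assumptions: the stream_audio signature is abbreviated, and the provider client and its stream() call are placeholders rather than a real SDK.

from typing import AsyncIterator, Iterator

from vision_agents.core.edge.types import PcmData
from vision_agents.core.tts.tts import TTS


class ExampleTTS(TTS):
    """Hypothetical plugin following the pattern described above."""

    async def stream_audio(
        self, text: str, *args, **kwargs
    ) -> PcmData | Iterator[PcmData] | AsyncIterator[PcmData]:
        # Placeholder provider call; a real plugin would create _client in
        # __init__ and call its SDK here.
        stream = self._client.stream(text)
        # Declare the provider-native format (16 kHz mono s16 here) and let
        # the base class resample to whatever set_output_format() requested.
        return PcmData.from_response(
            stream, sample_rate=16000, channels=1, format="s16"
        )

    async def stop_audio(self) -> None:
        # Playback is controlled by the Agent, so a no-op is acceptable.
        pass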

Sequence Diagram(s)

sequenceDiagram
    participant Agent
    participant TTS
    participant TTSProvider
    participant PcmData
    participant OutputAudioTrack
    
    Agent->>TTS: set_output_format(sample_rate, channels)
    activate TTS
    TTS->>TTS: store desired format
    deactivate TTS
    
    Agent->>TTS: send(text)
    activate TTS
    TTS->>TTSProvider: synthesize_speech(text)
    activate TTSProvider
    TTSProvider-->>TTS: audio stream (bytes/chunks)
    deactivate TTSProvider
    
    loop for each chunk
        TTS->>TTS: _iter_pcm(chunk)
        TTS->>PcmData: from_bytes(chunk, ...)
        TTS->>TTS: resample to output_format
        TTS->>TTS: _emit_chunk(pcm)
        TTS->>TTS: emit TTSAudioEvent
        TTS->>OutputAudioTrack: write(pcm_bytes)
        activate OutputAudioTrack
        OutputAudioTrack-->>Agent: audio routed to WebRTC
        deactivate OutputAudioTrack
    end
    
    TTS->>TTS: emit TTSSynthesisCompleteEvent
    deactivate TTS
sequenceDiagram
    participant Test
    participant TTSSession
    participant TTS
    participant EventBus
    
    Test->>TTS: set_output_format(sample_rate, channels)
    Test->>TTSSession: new TTSSession(tts)
    activate TTSSession
    TTSSession->>EventBus: subscribe(TTSSynthesisStartEvent, ...)
    TTSSession->>EventBus: subscribe(TTSAudioEvent, ...)
    TTSSession->>EventBus: subscribe(TTSErrorEvent, ...)
    TTSSession->>EventBus: subscribe(TTSSynthesisCompleteEvent, ...)
    deactivate TTSSession
    
    Test->>TTS: send(text)
    activate TTS
    TTS->>EventBus: emit TTSSynthesisStartEvent
    TTS->>EventBus: emit TTSAudioEvent (multiple)
    TTS->>EventBus: emit TTSSynthesisCompleteEvent
    deactivate TTS
    
    Test->>TTSSession: wait_for_result(timeout)
    activate TTSSession
    TTSSession->>TTSSession: await first relevant event or timeout
    TTSSession-->>Test: TTSResult(speeches, errors, started, completed)
    deactivate TTSSession
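The test flow in this diagram maps onto a short pytest sketch. The import path and the set_output_format/send/wait_for_result calls follow the summaries above; the `tts` fixture and the EXAMPLE_TTS_API_KEY variable are placeholders for a concrete plugin.

import os

import pytest

from vision_agents.core.tts.testing import TTSSession


@pytest.mark.integration
async def test_tts_produces_audio(tts):  # `tts` supplied by a plugin fixture
    if not os.environ.get("EXAMPLE_TTS_API_KEY"):  # placeholder variable
        pytest.skip("EXAMPLE_TTS_API_KEY not set")

    tts.set_output_format(sample_rate=48000, channels=2)
    session = TTSSession(tts)

    await tts.send("Hello from the integration test")
    result = await session.wait_for_result(timeout=15.0)

    assert not result.errors
    assert len(result.speeches) > 0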

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • PcmData multi-channel handling and resampling logic (agents-core/vision_agents/core/edge/types.py): Complex logic for shape normalization, 2D/3D sample layouts, interleaving, and resampling with dtype conversions. Validate correctness of channel count calculations and byte ordering (a standalone interleaving demo follows this list).
  • TTS base class PCM emission and event orchestration (agents-core/vision_agents/core/tts/tts.py): New _emit_chunk(), _iter_pcm() methods with latency recording and error handling require careful review of event sequencing, duration calculations, and completion semantics.
  • Agent RTC integration changes (agents-core/vision_agents/core/agents/agents.py): Verify that set_output_format() is called at the correct point in RTC preparation; ensure backward compatibility with realtime mode; validate that OutputAudioTrack protocol is correctly implemented by created tracks.
  • Plugin refactoring consistency: Multiple plugins follow similar patterns (remove framerate/stereo methods, wrap with PcmData.from_response()). Verify all plugins handle sample rates, channels, and formats consistently.
  • Test infrastructure migration: Verify that new TTSSession-based tests capture the same failure scenarios as the removed mock-based tests; ensure environment-based credential gating (pytest.skip) is consistent across all integration tests.
  • AWS Polly thread-pool execution: Validate that stream_audio() thread-pool call to synthesize_speech properly handles timeouts and cancellation without deadlocks.
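
For the first bullet, here is a standalone numpy demonstration of the planar-to-interleaved conversion at the heart of that logic. It is independent of the actual PcmData class (nothing from the repo is imported) and only illustrates the layout concern.

import numpy as np

# Two channels, four samples each: left = 1..4, right = 10..40
planar = np.array([[1, 2, 3, 4], [10, 20, 30, 40]], dtype=np.int16)

# (channels, samples) -> (samples, channels) -> flat interleaved L R L R ...
interleaved = np.ascontiguousarray(planar.T).reshape(-1)
assert interleaved.tolist() == [1, 10, 2, 20, 3, 30, 4, 40]

# Byte-count check: 4 frames * 2 channels * 2 bytes per s16 sample
assert len(interleaved.tobytes()) == 4 * 2 * 2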

Possibly related PRs

  • [AI-195] Fish support #115: Modifies the Fish TTS plugin implementation (plugins/fish/vision_agents/plugins/fish/tts.py), overlapping with this PR's Fish TTS refactoring to use PcmData and remove legacy framerate/stereo constraints.
  • [AI-201] Fish speech to text #121: Modifies agent event registration in agents-core/vision_agents/core/agents/agents.py by adding STT error event logging; this PR also modifies the same file to add TTSAudioEvent handling, creating potential merge conflicts or duplicated subscriber logic.

Suggested reviewers

  • Nash0x7E2
  • maxkahan
  • d3xvn

Poem

Bell jar of bytes descends—
Each PCM chunk resampled, interleaved, pressed
Into wire-thin audio tracks,
The TTS daemon speaks in stereo silence,
Formats standardized, no more guessing:
Forty-eight thousand hertz, the hum of engines,
Two channels deep—
the throat of the machine.

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 48.43%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The PR title "Simplify TTS plugin and audio utils" clearly relates to the changeset: the raw summary confirms the removal of public methods from TTS plugins (get_required_framerate, get_required_stereo, set_output_track), which matches the simplification theme, and the title matches the stated PR objectives ("Simplified TTS base class" plus multiple plugin cleanups). While it does not convey the underlying architectural shift to PCM-centric audio handling, it is concise, specific, and captures the main goal of reducing API complexity across the TTS plugins.


@coderabbitai (bot) left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
plugins/aws/tests/test_aws.py (1)

126-146: Apply consistent credential checking to all tests.

These tests create their own LLM instances and don't use the llm fixture, so they bypass the skip logic on line 41. Without AWS_BEARER_TOKEN_BEDROCK, they will attempt to run and likely fail, creating inconsistent test behavior.

Consider one of these solutions:

Solution 1: Add skip check to each test

 @pytest.mark.integration
 async def test_image_description(self, golf_swing_image):
+    if not os.environ.get("AWS_BEARER_TOKEN_BEDROCK"):
+        pytest.skip("AWS_BEARER_TOKEN_BEDROCK not set – skipping Bedrock tests")
     # Use a vision-capable model (Claude 3 Haiku supports images and is widely available)
     vision_llm = BedrockLLM(

Solution 2: Use the fixture and modify as needed

 @pytest.mark.integration
-async def test_image_description(self, golf_swing_image):
+async def test_image_description(self, llm: BedrockLLM, golf_swing_image):
     # Use a vision-capable model (Claude 3 Haiku supports images and is widely available)
-    vision_llm = BedrockLLM(
+    llm._model = "anthropic.claude-3-haiku-20240307-v1:0"
+    vision_llm = llm
-        model="anthropic.claude-3-haiku-20240307-v1:0", region_name="us-east-1"
-    )

Apply similar changes to test_instruction_following.

Also applies to: 149-161

agents-core/vision_agents/core/observability/metrics.py (1)

77-81: Do not emit spans at import time.

Creating spans during module import causes global side effects and unexpected traffic. Remove these calls; expose helpers to start spans in calling code instead.

-with tracer.start_as_current_span("stt.request", kind=trace.SpanKind.CLIENT) as span:
-    pass
-
-span = tracer.start_span("stt.request")
-span.end()
agents-core/vision_agents/core/agents/agents.py (1)

991-1004: Realtime warning condition is inconsistent with the message.

The second branch warns about “STT, TTS and Turn Detection” but only checks self.stt or self.turn_detection. Include self.tts for consistency.

-            if self.stt or self.turn_detection:
+            if self.stt or self.tts or self.turn_detection:
                 self.logger.warning(
                     "Realtime mode detected: STT, TTS and Turn Detection services will be ignored. "
                     "The Realtime model handles both speech-to-text, text-to-speech and turn detection internally."
                 )
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (1)

39-62: Update docstring to reflect actual return type; fix typo in __init__ docstring.

The SDK usage is correct—output_format="pcm_16000" is the proper format string for PCM at 16 kHz. However, two issues remain:

  1. Line 47 (stream_audio docstring): Change "An async iterator of audio chunks as bytes" to describe the actual return type PcmData | Iterator[PcmData] | AsyncIterator[PcmData].
  2. Line 23 (init docstring): Fix "ElvenLabs Client" → "ElevenLabs Client".
🧹 Nitpick comments (30)
plugins/aws/tests/test_aws.py (1)

43-43: Consider public API for test setup.

The tests access private attributes (_conversation) and methods (_set_instructions) directly. While common in testing, this couples tests to implementation details.

If BedrockLLM provides public methods to configure conversation state and instructions, prefer those. If not, consider adding public test helpers:

# In BedrockLLM class
def configure_for_testing(self, instructions: str = None, conversation = None):
    """Configure LLM for testing purposes."""
    if instructions:
        self._set_instructions(instructions)
    if conversation:
        self._conversation = conversation

Then in tests:

llm.configure_for_testing(conversation=InMemoryConversation("be friendly", []))

Also applies to: 154-154

docs/ai/instructions/ai-tts.md (3)

15-17: Clarify stream_audio return contract.

Current text says “return a single PcmData,” but plugins may return a PcmData or an (async) iterator of PcmData. Please update the guidance to accept both to match the base class behavior and existing plugins.


36-38: Avoid recommending buffering entire streams.

“Buffer streaming SDK audio into a single byte string” risks high memory usage for long utterances. Prefer emitting multiple PcmData chunks (or returning an iterator) and let the Agent handle resampling/assembly.
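
A minimal sketch of the chunked alternative, assuming an async SDK stream and the PcmData.from_bytes signature used elsewhere in this PR (the client object and its stream() method are placeholders):

from typing import AsyncIterator

from vision_agents.core.edge.types import PcmData


async def stream_audio_chunked(client, text: str) -> AsyncIterator[PcmData]:
    # Placeholder SDK streaming call; the point is per-chunk wrapping.
    async for chunk in client.stream(text):
        # Each chunk becomes a PcmData immediately, so the base class can
        # resample and emit TTSAudioEvent as audio arrives, keeping memory
        # bounded for long utterances.
        yield PcmData.from_bytes(
            chunk, sample_rate=16000, channels=1, format="s16"
        )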


80-84: Safer assertion in example.

Use assert result.speeches (or assert len(result.speeches) > 0) instead of indexing result.speeches[0] to avoid IndexError in edge cases.

agents-core/vision_agents/core/tts/manual_test.py (2)

25-31: Fix docstring inaccuracies (Google style).

The function receives a TTS instance; it does not create one via tts_factory(). Please remove that bullet to avoid confusion.

-    - Creates the TTS instance via `tts_factory()`.
     - Sets desired output format via `set_output_format(sample_rate, channels)`.

66-81: Ensure subprocess cleanup on timeout.

After proc.kill(), also await proc.wait() to reap the process.

         try:
             await asyncio.wait_for(proc.wait(), timeout=30.0)
         except asyncio.TimeoutError:
-            proc.kill()
+            proc.kill()
+            try:
+                await proc.wait()
+            except Exception:
+                pass
agents-core/vision_agents/core/observability/metrics.py (3)

35-39: Duplicate meter assignment.

meter is assigned twice (__name__ then "voice-agent.latency"). Keep one to avoid confusion.

-meter = metrics.get_meter(__name__)
-
-
-meter = metrics.get_meter("voice-agent.latency")
+meter = metrics.get_meter("voice-agent.latency")

12-13: Hard-coded OTLP endpoint.

Make OTLP_ENDPOINT configurable via env (e.g., OTLP_ENDPOINT = os.getenv("OTLP_ENDPOINT", "http://localhost:4317")) to work across environments.


69-75: Remove unused sample attrs or mark as example.

CALL_ATTRS appears unused; consider deleting or moving into examples to avoid dead code.

plugins/cartesia/tests/test_tts.py (1)

15-20: Avoid type: ignore by importing the symbol.

Import the concrete class for typing and return it from tts().

-from vision_agents.plugins import cartesia
+from vision_agents.plugins import cartesia
+from vision_agents.plugins.cartesia import TTS as CartesiaTTS
@@
-    def tts(self) -> cartesia.TTS:  # type: ignore[name-defined]
+    def tts(self) -> CartesiaTTS:
@@
-        return cartesia.TTS(api_key=api_key)
+        return CartesiaTTS(api_key=api_key)
plugins/kokoro/tests/test_tts.py (1)

16-18: LGTM overall; add a sanity assertion and optional cleanup.

Capture the returned path and assert it exists; optionally remove it to avoid temp buildup.

-    async def test_kokoro_tts_convert_text_to_audio_manual_test(self, tts):
-        await manual_tts_to_wav(tts, sample_rate=24000, channels=1)
+    async def test_kokoro_tts_convert_text_to_audio_manual_test(self, tts):
+        path = await manual_tts_to_wav(tts, sample_rate=24000, channels=1)
+        assert path and os.path.exists(path)
+        try:
+            os.remove(path)
+        except OSError:
+            pass
agents-core/vision_agents/core/agents/agents.py (3)

306-317: Guard against format mismatches when writing to the audio track.

You assume TTS honored set_output_format, but if a plugin misbehaves, bytes at the wrong rate/channels could hit the track. Log (or drop) mismatched chunks to prevent artifacts.

         async def _on_tts_audio(event: TTSAudioEvent):
             try:
-                if self._audio_track and event.audio_data:
-                    from typing import Any, cast
-
-                    track_any = cast(Any, self._audio_track)
-                    await track_any.write(event.audio_data)
+                if self._audio_track and event.audio_data:
+                    from typing import Any, cast
+                    # Optional: verify negotiated format
+                    try:
+                        expected_rate = getattr(self._audio_track, "framerate", None)
+                        expected_channels = 2 if getattr(self._audio_track, "stereo", False) else 1
+                        if (expected_rate and event.sample_rate != expected_rate) or (
+                            expected_channels and event.channels != expected_channels
+                        ):
+                            self.logger.warning(
+                                "Dropping TTS audio: format mismatch (got %s Hz/%sch, expected %s Hz/%sch)",
+                                event.sample_rate, event.channels, expected_rate, expected_channels,
+                            )
+                            return
+                    except Exception:
+                        # If track doesn’t expose props, proceed optimistically
+                        pass
+                    track_any = cast(Any, self._audio_track)
+                    await track_any.write(event.audio_data)
             except Exception as e:
                 self.logger.error(f"Error writing TTS audio to track: {e}")

1032-1047: Make 48k/stereo defaults configurable and reuse them for validation.

Expose framerate/stereo as Agent init kwargs or class constants, store on self for reuse (e.g., in _on_tts_audio validation). Keeps behavior flexible across environments.

-                framerate = 48000
-                stereo = True
+                framerate = getattr(self, "_audio_out_rate", 48000)
+                stereo = getattr(self, "_audio_out_stereo", True)
                 self._audio_track = self.edge.create_audio_track(
                     framerate=framerate, stereo=stereo
                 )
                 # Inform TTS of desired output format so it can resample accordingly
                 if self.tts:
                     channels = 2 if stereo else 1

311-314: Tiny nit: avoid re-importing typing inside the handler.

Import cast at module top to reduce per-call overhead and keep imports centralized.

plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (2)

37-38: Consider honoring desired sample rate to reduce resampling.

If the agent negotiates 48 kHz stereo, you’ll resample 16 kHz mono to match. If ElevenLabs supports multiple PCM rates, map self._desired_sample_rate to a supported output_format to minimize CPU work. Otherwise, keep current behavior.

Also applies to: 60-62


64-72: Doc says “Clears the queue and stops playing audio” but it’s a no‑op.

Either implement actual cancellation if the SDK supports it or update the docstring to reflect no-op behavior.

-        """
-        Clears the queue and stops playing audio.
-        This method can be used manually or under the hood in response to turn events.
-        ...
-        """
+        """
+        Stop request hook. ElevenLabs SDK streaming is pull-based here; there is no internal
+        playback/queue to flush, so this is a no-op by design.
+        """
plugins/elevenlabs/tests/test_tts.py (2)

29-30: Strengthen assertions to catch regressions early.

Also assert session start to ensure events flow.

-        assert not result.errors
-        assert len(result.speeches) > 0
+        assert not result.errors
+        assert result.started is True
+        assert len(result.speeches) > 0

33-35: Avoid print in tests; prefer logging or assertion of output.

Printing paths is noisy in CI. Use logging or silence by default; optionally assert file exists.

-        path = await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
-        print("ElevenLabs TTS audio written to:", path)
+        path = await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
+        assert os.path.exists(path)

Note: Consider enhancing manual_tts_to_wav to wait until synthesis completes before writing, otherwise it may write only the first chunk. Based on relevant helpers.

agents-core/vision_agents/core/tts/testing.py (2)

70-81: Provide an option to wait until synthesis completes.

wait_for_result returns after the first audio/error, which is great for smoke checks but truncates longer audio for callers like manual_tts_to_wav. Add a mode to wait for TTSSynthesisCompleteEvent (with timeout).

-    async def wait_for_result(self, timeout: float = 10.0) -> TTSResult:
+    async def wait_for_result(
+        self, timeout: float = 10.0, until_complete: bool = False
+    ) -> TTSResult:
         try:
-            await asyncio.wait_for(self._first_event.wait(), timeout=timeout)
+            if until_complete:
+                async def _wait_complete():
+                    # Fast-path if already completed
+                    if self._completed:
+                        return
+                    # Wait until completion toggles (events update the flag)
+                    while not self._completed and not self._errors and not self._speeches:
+                        await asyncio.sleep(0.01)
+                await asyncio.wait_for(_wait_complete(), timeout=timeout)
+            else:
+                await asyncio.wait_for(self._first_event.wait(), timeout=timeout)
         except asyncio.TimeoutError:
             # Return whatever we have so far
             pass
         return TTSResult(
             speeches=list(self._speeches),
             errors=list(self._errors),
             started=self._started,
             completed=self._completed,
         )

42-61: Add a simple teardown to avoid subscriber leaks in long-lived tests.

Store unsubscribe handles (if supported) or expose a close() to deregister callbacks.

 class TTSSession:
@@
-        @tts.events.subscribe
-        async def _on_start(ev: TTSSynthesisStartEvent):  # type: ignore[name-defined]
+        self._subs = []
+        @tts.events.subscribe
+        async def _on_start(ev: TTSSynthesisStartEvent):  # type: ignore[name-defined]
             self._started = True
+        self._subs.append(_on_start)
@@
-        @tts.events.subscribe
+        @tts.events.subscribe
         async def _on_complete(ev: TTSSynthesisCompleteEvent):  # type: ignore[name-defined]
             self._completed = True
+        self._subs.append(_on_complete)
+
+    def close(self) -> None:
+        for cb in getattr(self, "_subs", []):
+            try:
+                self._tts.events.unsubscribe(cb)  # if supported by EventManager
+            except Exception:
+                pass

If EventManager lacks unsubscribe, consider a no-op close() for API consistency. As per coding guidelines.

plugins/cartesia/vision_agents/plugins/cartesia/tts.py (2)

54-58: Docstring: clarify return shapes and native format.

Mention that response may be async iterator and that PcmData is s16 mono at self.sample_rate, to match base expectations.

-    ) -> PcmData | Iterator[PcmData] | AsyncIterator[PcmData]  # noqa: D401
-        """Generate speech and return a stream of PcmData."""
+    ) -> PcmData | Iterator[PcmData] | AsyncIterator[PcmData]:  # noqa: D401
+        """Generate speech and return PcmData stream (s16 mono at sample_rate)."""

80-82: Honor desired channel count if agent requests stereo.

If upstream calls set_output_format(..., channels=2), consider threading that into from_response so downstream resampling has correct provenance.

-        return PcmData.from_response(
-            response, sample_rate=self.sample_rate, channels=1, format="s16"
-        )
+        return PcmData.from_response(
+            response, sample_rate=self.sample_rate, channels=1, format="s16"
+        )

Alternatively, set self._native_channels = 1 in init for clarity; base class will rechannel to desired on emit.

plugins/fish/vision_agents/plugins/fish/tts.py (2)

25-26: Avoid hard-coding a reference voice by default.

A baked-in reference_id can break for users lacking access to that voice. Default to None and document how to set it via config/env.

-        reference_id: Optional[str] = "03397b4c4be74759b72533b663fbd001",
+        reference_id: Optional[str] = None,

86-90: Explicitly declare native format/channel for clarity.

Not required, but setting provider-native format helps future maintainers.

-        return PcmData.from_response(
-            stream, sample_rate=16000, channels=1, format="s16"
-        )
+        # Provider-native is 16kHz mono s16
+        return PcmData.from_response(stream, sample_rate=16000, channels=1, format="s16")
plugins/kokoro/vision_agents/plugins/kokoro/tts.py (2)

47-53: Use get_running_loop() in async context.

get_event_loop() is deprecated when a loop is running; prefer get_running_loop() to avoid warnings on 3.11+.

-        loop = asyncio.get_event_loop()
+        loop = asyncio.get_running_loop()

55-60: Minor: annotate generator return and keep PCM metadata close.

Inline the format/sample_rate once to avoid repetition.

         async def _aiter():
             for chunk in chunks:
-                yield PcmData.from_bytes(
-                    chunk, sample_rate=self.sample_rate, channels=1, format="s16"
-                )
+                yield PcmData.from_bytes(chunk, sample_rate=self.sample_rate, channels=1, format="s16")
agents-core/vision_agents/core/tts/tts.py (3)

125-142: Deduplicate normalization: delegate to PcmData.from_response; also handle memoryview correctly

re-implementing chunk normalization invites edge bugs. Use PcmData.from_response, which already aligns/aggregates and supports bytes/PcmData/iterators.

Apply this refactor:

-    async def _iter_pcm(self, resp: Any) -> AsyncGenerator[PcmData, None]:
-        """Yield PcmData chunks from a provider response of various shapes."""
-        # Single buffer or PcmData
-        if isinstance(resp, (bytes, bytearray, PcmData)):
-            yield self._normalize_to_pcm(resp)
-            return
-        # Async iterable
-        if hasattr(resp, "__aiter__"):
-            async for item in resp:
-                yield self._normalize_to_pcm(item)
-            return
-        # Sync iterable (avoid treating bytes-like as iterable of ints)
-        if hasattr(resp, "__iter__") and not isinstance(resp, (str, bytes, bytearray)):
-            for item in resp:
-                yield self._normalize_to_pcm(item)
-            return
-        raise TypeError(f"Unsupported return type from stream_audio: {type(resp)}")
+    async def _iter_pcm(self, resp: Any) -> AsyncGenerator[PcmData, None]:
+        """Yield PcmData chunks from arbitrary provider responses via PcmData.from_response."""
+        fmt = self._native_format.value if hasattr(self._native_format, "value") else "s16"
+        norm = PcmData.from_response(
+            resp,
+            sample_rate=self._native_sample_rate,
+            channels=self._native_channels,
+            format=fmt,
+        )
+        if isinstance(norm, PcmData):
+            yield norm
+            return
+        if hasattr(norm, "__aiter__"):
+            async for pcm in norm:
+                yield pcm
+            return
+        if hasattr(norm, "__iter__"):
+            for pcm in norm:
+                yield pcm
+            return
+        raise TypeError(f"Unsupported return type from stream_audio: {type(resp)}")

179-186: Update stream_audio docstring to mention PcmData variants

Return annotation includes PcmData types, but the docstring doesn’t. Clarify for implementers.

Apply this doc tweak:

-        Returns:
-            Audio data as bytes, an iterator of audio chunks, or an async iterator of audio chunks
+        Returns:
+            Audio as:
+            - bytes or (async) iterator[bytes], or
+            - PcmData or (async) iterator[PcmData].

As per coding guidelines.

Also applies to: 197-199


277-281: Compute real‑time factor using total send duration, not pre‑stream “setup” time

synthesis_time measures only until stream_audio returns, not the full emission. Use total elapsed before emitting the complete event.

Apply this adjustment:

-            real_time_factor = (
-                (synthesis_time * 1000) / estimated_audio_duration_ms
-                if estimated_audio_duration_ms > 0
-                else None
-            )
+            total_elapsed_ms = (time.time() - start_time) * 1000.0
+            real_time_factor = (
+                total_elapsed_ms / estimated_audio_duration_ms
+                if estimated_audio_duration_ms > 0
+                else None
+            )
@@
-                    synthesis_time_ms=synthesis_time * 1000,
+                    synthesis_time_ms=total_elapsed_ms,

If “synthesis_time_ms” is intended to reflect only provider latency, consider adding a second field (e.g., end_to_end_ms) instead of overloading.

Also applies to: 283-296, 313-317

agents-core/vision_agents/core/edge/types.py (1)

320-337: to_bytes: ensure interleaved view is contiguous before tobytes()

Transpose often creates non‑contiguous views. Make it explicit.

Apply:

-            if arr.ndim == 2:
-                # (channels, samples) -> interleaved (samples, channels)
-                interleaved = arr.T.reshape(-1)
-                return interleaved.tobytes()
+            if arr.ndim == 2:
+                # (channels, samples) -> interleaved (samples, channels)
+                interleaved = np.ascontiguousarray(arr.T).reshape(-1)
+                return interleaved.tobytes()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6a725b0 and d9f79b3.

📒 Files selected for processing (19)
  • agents-core/vision_agents/core/agents/agents.py (3 hunks)
  • agents-core/vision_agents/core/edge/types.py (6 hunks)
  • agents-core/vision_agents/core/observability/__init__.py (2 hunks)
  • agents-core/vision_agents/core/observability/metrics.py (1 hunks)
  • agents-core/vision_agents/core/tts/manual_test.py (1 hunks)
  • agents-core/vision_agents/core/tts/testing.py (1 hunks)
  • agents-core/vision_agents/core/tts/tts.py (5 hunks)
  • docs/ai/instructions/ai-tts.md (1 hunks)
  • examples/01_simple_agent_example/simple_agent_example.py (1 hunks)
  • plugins/aws/tests/test_aws.py (1 hunks)
  • plugins/cartesia/tests/test_tts.py (1 hunks)
  • plugins/cartesia/vision_agents/plugins/cartesia/tts.py (5 hunks)
  • plugins/elevenlabs/tests/test_tts.py (1 hunks)
  • plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (4 hunks)
  • plugins/fish/tests/test_tts.py (1 hunks)
  • plugins/fish/vision_agents/plugins/fish/tts.py (5 hunks)
  • plugins/kokoro/tests/test_tts.py (1 hunks)
  • plugins/kokoro/vision_agents/plugins/kokoro/tts.py (3 hunks)
  • tests/test_tts_base.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • agents-core/vision_agents/core/observability/__init__.py
  • tests/test_tts_base.py
  • plugins/kokoro/tests/test_tts.py
  • plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py
  • examples/01_simple_agent_example/simple_agent_example.py
  • plugins/elevenlabs/tests/test_tts.py
  • agents-core/vision_agents/core/agents/agents.py
  • plugins/kokoro/vision_agents/plugins/kokoro/tts.py
  • plugins/cartesia/vision_agents/plugins/cartesia/tts.py
  • agents-core/vision_agents/core/observability/metrics.py
  • plugins/aws/tests/test_aws.py
  • agents-core/vision_agents/core/edge/types.py
  • plugins/fish/vision_agents/plugins/fish/tts.py
  • plugins/cartesia/tests/test_tts.py
  • agents-core/vision_agents/core/tts/manual_test.py
  • agents-core/vision_agents/core/tts/tts.py
  • plugins/fish/tests/test_tts.py
  • agents-core/vision_agents/core/tts/testing.py
tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

tests/**/*.py: Never use mocking utilities (e.g., unittest.mock, pytest-mock) in test files
Write tests using pytest (avoid unittest.TestCase or other frameworks)
Mark integration tests with @pytest.mark.integration
Do not use @pytest.mark.asyncio; async support is automatic

Files:

  • tests/test_tts_base.py
🧬 Code graph analysis (15)
tests/test_tts_base.py (4)
agents-core/vision_agents/core/tts/tts.py (4)
  • TTS (32-329)
  • stream_audio (177-200)
  • set_output_format (81-99)
  • send (216-317)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSErrorEvent (51-64)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
agents-core/vision_agents/core/edge/types.py (3)
  • PcmData (37-505)
  • _agen (416-448)
  • from_bytes (118-186)
agents-core/vision_agents/core/events/manager.py (1)
  • wait (470-484)
plugins/kokoro/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
plugins/kokoro/vision_agents/plugins/kokoro/tts.py (1)
  • TTS (18-77)
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (6)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (37-505)
  • from_response (382-505)
agents-core/vision_agents/core/tts/tts.py (1)
  • stream_audio (177-200)
plugins/cartesia/vision_agents/plugins/cartesia/tts.py (1)
  • stream_audio (54-82)
plugins/fish/vision_agents/plugins/fish/tts.py (1)
  • stream_audio (56-90)
plugins/kokoro/vision_agents/plugins/kokoro/tts.py (1)
  • stream_audio (47-61)
tests/test_tts_base.py (6)
  • stream_audio (17-21)
  • stream_audio (28-37)
  • stream_audio (44-47)
  • stream_audio (54-58)
  • stream_audio (65-69)
  • stream_audio (76-77)
examples/01_simple_agent_example/simple_agent_example.py (1)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (1)
  • open_demo (329-406)
plugins/elevenlabs/tests/test_tts.py (3)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (1)
  • TTS (10-72)
agents-core/vision_agents/core/agents/agents.py (4)
agents-core/vision_agents/core/tts/events.py (1)
  • TTSAudioEvent (10-21)
agents-core/vision_agents/core/events/manager.py (1)
  • subscribe (299-368)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (1)
  • create_audio_track (291-294)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (81-99)
plugins/kokoro/vision_agents/plugins/kokoro/tts.py (2)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (37-505)
  • from_bytes (118-186)
agents-core/vision_agents/core/tts/tts.py (1)
  • stream_audio (177-200)
plugins/cartesia/vision_agents/plugins/cartesia/tts.py (3)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (37-505)
  • from_response (382-505)
agents-core/vision_agents/core/tts/tts.py (1)
  • stream_audio (177-200)
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (1)
  • stream_audio (39-62)
agents-core/vision_agents/core/edge/types.py (1)
tests/test_tts_base.py (1)
  • _agen (32-35)
plugins/fish/vision_agents/plugins/fish/tts.py (4)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (37-505)
  • from_response (382-505)
agents-core/vision_agents/core/tts/tts.py (3)
  • TTS (32-329)
  • stream_audio (177-200)
  • stop_audio (203-214)
plugins/cartesia/vision_agents/plugins/cartesia/tts.py (3)
  • TTS (18-92)
  • stream_audio (54-82)
  • stop_audio (84-92)
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (3)
  • TTS (10-72)
  • stream_audio (39-62)
  • stop_audio (64-72)
plugins/cartesia/tests/test_tts.py (3)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
plugins/cartesia/vision_agents/plugins/cartesia/tts.py (1)
  • TTS (18-92)
agents-core/vision_agents/core/tts/manual_test.py (2)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/edge/types.py (3)
  • PcmData (37-505)
  • from_bytes (118-186)
  • to_wav_bytes (338-379)
agents-core/vision_agents/core/tts/tts.py (8)
agents-core/vision_agents/core/events/base.py (3)
  • PluginInitializedEvent (56-63)
  • PluginClosedEvent (67-74)
  • AudioFormat (23-30)
agents-core/vision_agents/core/edge/types.py (6)
  • PcmData (37-505)
  • from_bytes (118-186)
  • resample (251-318)
  • to_bytes (320-336)
  • duration_ms (101-103)
  • close (33-34)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
  • TTSErrorEvent (51-64)
plugins/cartesia/vision_agents/plugins/cartesia/tts.py (1)
  • stream_audio (54-82)
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (1)
  • stream_audio (39-62)
plugins/fish/vision_agents/plugins/fish/tts.py (1)
  • stream_audio (56-90)
plugins/kokoro/vision_agents/plugins/kokoro/tts.py (1)
  • stream_audio (47-61)
tests/test_tts_base.py (6)
  • stream_audio (17-21)
  • stream_audio (28-37)
  • stream_audio (44-47)
  • stream_audio (54-58)
  • stream_audio (65-69)
  • stream_audio (76-77)
plugins/fish/tests/test_tts.py (3)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
plugins/fish/vision_agents/plugins/fish/tts.py (1)
  • TTS (12-102)
agents-core/vision_agents/core/tts/testing.py (3)
agents-core/vision_agents/core/tts/tts.py (1)
  • TTS (32-329)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSErrorEvent (51-64)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
tests/test_tts_base.py (8)
  • _on_start (92-93)
  • _on_audio (96-99)
  • _on_audio (126-128)
  • _on_audio (148-150)
  • _on_audio (167-169)
  • _on_audio (188-190)
  • _on_error (207-209)
  • _on_complete (102-103)
🪛 LanguageTool
docs/ai/instructions/ai-tts.md

[style] ~27-~27: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...=1, format="s16") ``` - stop_audio can be a no-op (the Agent controls playback...

(MISSING_IT_THERE)

🔇 Additional comments (14)
plugins/aws/tests/test_aws.py (2)

40-41: LGTM – Proper use of pytest.skip for missing credentials.

The fixture-based skip logic ensures that all tests depending on the llm fixture will be skipped when credentials are unavailable, which is the correct approach for integration tests.


1-161: AI summary inconsistency with file content.

The AI-generated summary describes TTS audio handling and TTS plugins, but this file tests the BedrockLLM language model. The summary appears to describe other files in the PR rather than this one.

examples/01_simple_agent_example/simple_agent_example.py (1)

55-56: Original review comment is not supported by codebase evidence.

The search reveals that the predominant pattern in the codebase (16 of 22 instances) is to call open_demo before the with await agent.join(call): context, not inside it. The review claims moving open_demo inside the join context is an improvement for "reducing race conditions," but this contradicts the established practice. While 6 examples do use the inside pattern, they are the minority. Additionally, no explicit error handling is observed around open_demo calls in any of these examples, so the error-handling concern in the original review is not validated by precedent.

If this change intentionally diverges from the common pattern, that architectural decision should be justified explicitly rather than framed as a general improvement.

Likely an incorrect or invalid review comment.
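
For reference, the two orderings side by side, as hypothetical helpers: the agent/call objects and the agent.edge.open_demo attribute path are assumed from the examples, not verified here.

from typing import Any


async def demo_before_join(agent: Any, call: Any) -> None:
    # Predominant pattern (16 of 22 instances): open the demo UI first.
    await agent.edge.open_demo(call)
    with await agent.join(call):
        ...  # session runs here


async def demo_inside_join(agent: Any, call: Any) -> None:
    # Minority pattern this example moved to: open the demo after joining.
    with await agent.join(call):
        await agent.edge.open_demo(call)
        ...  # session runs here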

agents-core/vision_agents/core/observability/__init__.py (1)

18-19: Export looks good.

tts_events_emitted is properly imported and added to __all__.

Also applies to: 34-35

plugins/cartesia/tests/test_tts.py (1)

21-31: Integration test flow looks solid.

Env‑guard, output format set, session wait, and assertions are appropriate.
If flakes occur, consider increasing timeout to match real API latency.

plugins/elevenlabs/tests/test_tts.py (1)

19-27: No changes needed — asyncio support is already properly configured.

The repository has asyncio_mode = auto configured in pytest.ini, which enables automatic async test execution. The test at lines 19-27 will run correctly with only the @pytest.mark.integration marker; adding @pytest.mark.asyncio is unnecessary and contradicts the established pattern of relying on auto mode.

agents-core/vision_agents/core/tts/tts.py (3)

63-79: Initialization/event plumbing looks solid

Sessioning, provider naming, and PluginInitializedEvent emission are consistent and minimal. No issues.


321-329: Graceful close event emission LGTM

PluginClosedEvent with plugin_type="TTS" is consistent.


143-175: Verification confirms field is properly defined

The user_metadata field is defined in the BaseEvent class (agents-core/vision_agents/core/events/base.py, line 41) as user_metadata: Optional[Participant] = None. Since TTSAudioEvent inherits from PluginBaseEvent, which extends BaseEvent, the field is available and the code at lines 143-175 is correct. No dataclass initialization errors will occur.

agents-core/vision_agents/core/edge/types.py (5)

56-76: Multi‑channel duration and duration_ms: 👍

Handles (channels, samples) correctly and exposes ms helper. Looks good.

Also applies to: 100-104


118-187: from_bytes: alignment + interleaving logic LGTM

Good trimming to sample width and channel‑multiple; returns (channels, samples) for multichannel.


188-250: from_data: pragmatic normalization

Covers bytes and ndarray shapes/dtypes well. Minor note: when ambiguous 2D, assuming first dim as channels is reasonable.


338-380: to_wav_bytes: sensible s16 conversion path

Converts non‑s16 to s16 and writes standard WAV headers. Looks good.


381-505: from_response: versatile and aligns chunks

Covers bytes/PcmData/(a)synchronous iterables and pads trailing partial frames. Good reuse across plugins.

Confirm target providers always return PCM (not compressed) when using this path. If not, gate by format and raise early.
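
A small usage sketch tying these helpers together. The module path follows the file list above, and the trimming assertion encodes the behavior described in the from_bytes comment; the exact semantics are an assumption.

import numpy as np

from vision_agents.core.edge.types import PcmData

# 100 ms of silence: 16 kHz mono s16 -> 1600 samples -> 3200 bytes
raw = np.zeros(1600, dtype=np.int16).tobytes()
pcm = PcmData.from_bytes(raw, sample_rate=16000, channels=1, format="s16")
assert pcm.duration_ms == 100

# Per the note above, a trailing partial sample should be trimmed away,
# leaving the duration unchanged (assumed semantics).
pcm2 = PcmData.from_bytes(raw + b"\x00", sample_rate=16000, channels=1, format="s16")
assert pcm2.duration_ms == 100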

@coderabbitai (bot) left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
plugins/fish/tests/test_fish_tts.py (1)

34-37: Consider capturing the result from wait_for_result for consistency.

While accessing session.errors and session.speeches works because the session maintains internal state, the idiomatic pattern shown in the TTSSession docstring suggests using the returned TTSResult object.

Apply this diff:

         await tts.send(text)
-        await session.wait_for_result(timeout=15.0)
+        result = await session.wait_for_result(timeout=15.0)
 
-        assert not session.errors
-        assert len(session.speeches) > 0
+        assert not result.errors
+        assert len(result.speeches) > 0
agents-core/vision_agents/core/agents/agents.py (2)

318-319: Type-safety bypass requires justification or stronger validation.

Casting to Any silences type checking entirely. If the track genuinely lacks proper type hints for write(), consider adding runtime validation or a comment explaining why the cast is necessary.

-                track_any = cast(Any, self._audio_track)
-                await track_any.write(event.audio_data)
+                # AudioStreamTrack.write() not in type stubs but exists at runtime
+                if not hasattr(self._audio_track, 'write'):
+                    self.logger.error("Audio track does not support write method")
+                    return
+                track_any = cast(Any, self._audio_track)
+                await track_any.write(event.audio_data)

1037-1042: Hardcoded audio format lacks configuration mechanism.

The comment mentions "unless configured differently," but framerate and stereo are hardcoded literals with no constructor parameter, config file, or environment-variable override. This reduces flexibility for deployments requiring different sample rates.

Consider adding constructor parameters:

def __init__(
    self,
    # ... existing params ...
    audio_output_sample_rate: int = 48000,
    audio_output_stereo: bool = True,
    # ... rest of params ...
):

Then reference self.audio_output_sample_rate and self.audio_output_stereo in _prepare_rtc().

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d9f79b3 and 451075f.

📒 Files selected for processing (3)
  • agents-core/vision_agents/core/agents/agents.py (3 hunks)
  • examples/01_simple_agent_example/simple_agent_example.py (1 hunks)
  • plugins/fish/tests/test_fish_tts.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • agents-core/vision_agents/core/agents/agents.py
  • examples/01_simple_agent_example/simple_agent_example.py
  • plugins/fish/tests/test_fish_tts.py
🧬 Code graph analysis (3)
agents-core/vision_agents/core/agents/agents.py (4)
agents-core/vision_agents/core/tts/events.py (1)
  • TTSAudioEvent (10-21)
agents-core/vision_agents/core/events/manager.py (1)
  • subscribe (299-368)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (1)
  • create_audio_track (291-294)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (81-99)
examples/01_simple_agent_example/simple_agent_example.py (1)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (1)
  • open_demo (329-406)
plugins/fish/tests/test_fish_tts.py (3)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (3)
  • TTSSession (23-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (81-99)
🔇 Additional comments (5)
plugins/fish/tests/test_fish_tts.py (3)

1-8: LGTM!

Clean imports that align with the new TTSSession-based testing pattern.


14-17: LGTM!

Straightforward fixture setup following pytest conventions.


19-25: LGTM!

Proper environment variable guard ensures graceful skip when API keys are unavailable.

agents-core/vision_agents/core/agents/agents.py (2)

35-35: LGTM: Import aligns with event-driven TTS architecture.

The TTSAudioEvent import supports the new PCM-based TTS output flow.


311-321: Metadata serves its purpose; handler design is correct.

The audio format metadata is not unused—it documents the converted audio format in the TTSAudioEvent. At tts.py:153, audio is resampled to self._desired_sample_rate and self._desired_channels, and the event is created with matching metadata (audio_format, sample_rate, channels set to the desired values). Since format conversion happens upstream in the TTS pipeline, the handler in agents.py correctly ignores the metadata and writes pre-converted bytes directly to the track.

@coderabbitai (bot) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (7)
plugins/elevenlabs/tests/test_tts.py (3)

10-18: Consider adding docstrings for test documentation.

The test class and fixture method lack docstrings. Adding brief documentation following the Google style guide would improve maintainability and help other developers understand the test setup.

Based on coding guidelines.


20-29: Solid integration test implementation.

The test correctly follows the TTSSession pattern, configures output format, and validates both error absence and audio generation. Consider adding a docstring to document the test's purpose per coding guidelines.

Based on coding guidelines.


31-34: Add assertions to validate the WAV output.

While manual_tts_to_wav handles internal error checking, this test lacks explicit assertions. Consider validating that the returned path exists and the file is non-empty to ensure the test fails appropriately in CI environments.

     async def test_elevenlabs_tts_convert_text_to_audio_manual_test(self, tts):
         path = await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
         print("ElevenLabs TTS audio written to:", path)
+        import os
+        assert os.path.exists(path), f"WAV file not created at {path}"
+        assert os.path.getsize(path) > 0, "WAV file is empty"

Also consider adding a docstring per coding guidelines.

Based on coding guidelines.

plugins/cartesia/tests/test_tts.py (1)

16-21: Consider using @pytest.fixture for synchronous fixtures.

The fixture returns a synchronous result but uses @pytest_asyncio.fixture. While this may work, the standard convention is to use @pytest.fixture for non-async fixtures and reserve @pytest_asyncio.fixture for async ones.

Apply this diff if you prefer strict convention adherence:

-    @pytest_asyncio.fixture
+    @pytest.fixture
     def tts(self) -> cartesia.TTS:  # type: ignore[name-defined]
plugins/openai/tests/test_tts_openai.py (1)

1-8: Consider adding dotenv for test environment consistency.

While environment variables can be set externally, the Cartesia and Fish test modules both use python-dotenv to load .env files, which improves developer experience.

To align with other TTS plugin tests, consider adding:

+from dotenv import load_dotenv
 import os
 import pytest
 import pytest_asyncio

 from vision_agents.plugins import openai as openai_plugin
 from vision_agents.core.tts.testing import TTSSession
 from vision_agents.core.tts.manual_test import manual_tts_to_wav

+# Load environment variables
+load_dotenv()
docs/ai/instructions/ai-tts.md (1)

27-28: Consider clarifying the sentence structure.

Static analysis suggests the sentence could be more complete, though the meaning is clear in context.

If you prefer a complete sentence:

-- `stop_audio` can be a no-op
+- `stop_audio` can be implemented as a no-op
plugins/fish/tests/test_fish_tts.py (1)

14-16: Add API key validation to skip gracefully when credentials are absent.

The fixture instantiates fish.TTS() without checking for required environment variables. If FISH_API_KEY or FISH_AUDIO_API_KEY is missing, tests will fail rather than skip gracefully.

Apply this diff:

     @pytest_asyncio.fixture
     def tts(self) -> fish.TTS:
+        if not (os.environ.get("FISH_API_KEY") or os.environ.get("FISH_AUDIO_API_KEY")):
+            pytest.skip("FISH_API_KEY/FISH_AUDIO_API_KEY not set")
         return fish.TTS()

Note: This addresses the same concern raised in previous review comments about the integration test.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 451075f and e5e0cf5.

📒 Files selected for processing (7)
  • docs/ai/instructions/ai-tts.md (1 hunks)
  • plugins/cartesia/tests/test_tts.py (1 hunks)
  • plugins/elevenlabs/tests/test_tts.py (1 hunks)
  • plugins/fish/tests/test_fish_tts.py (1 hunks)
  • plugins/openai/tests/test_tts_openai.py (1 hunks)
  • plugins/openai/vision_agents/plugins/openai/__init__.py (1 hunks)
  • plugins/openai/vision_agents/plugins/openai/tts.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • plugins/openai/tests/test_tts_openai.py
  • plugins/openai/vision_agents/plugins/openai/__init__.py
  • plugins/openai/vision_agents/plugins/openai/tts.py
  • plugins/fish/tests/test_fish_tts.py
  • plugins/cartesia/tests/test_tts.py
  • plugins/elevenlabs/tests/test_tts.py
🧬 Code graph analysis (6)
plugins/openai/tests/test_tts_openai.py (4)
agents-core/vision_agents/core/tts/testing.py (3)
  • TTSSession (23-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
plugins/openai/vision_agents/plugins/openai/tts.py (1)
  • TTS (10-51)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (81-99)
plugins/openai/vision_agents/plugins/openai/__init__.py (2)
plugins/openai/tests/test_tts_openai.py (1)
  • tts (12-16)
plugins/openai/vision_agents/plugins/openai/tts.py (1)
  • TTS (10-51)
plugins/openai/vision_agents/plugins/openai/tts.py (2)
plugins/openai/tests/test_tts_openai.py (1)
  • tts (12-16)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (37-505)
  • from_bytes (118-186)
plugins/fish/tests/test_fish_tts.py (4)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (3)
  • TTSSession (23-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (81-99)
conftest.py (1)
  • wait_for_result (54-67)
plugins/cartesia/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
plugins/elevenlabs/tests/test_tts.py (4)
plugins/cartesia/tests/test_tts.py (1)
  • tts (17-21)
plugins/openai/tests/test_tts_openai.py (1)
  • tts (12-16)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
🪛 LanguageTool
docs/ai/instructions/ai-tts.md

[style] ~27-~27: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...=1, format="s16") ``` - stop_audio can be a no-op ## init The plugin con...

(MISSING_IT_THERE)


[style] ~43-~43: It’s considered informal to use ‘a couple’ without the preposition ‘of’ before a noun.
Context: ... not necessary - Make to write at least a couple integration tests, use TTSSession to ...

(A_COUPLE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
🔇 Additional comments (13)
plugins/elevenlabs/tests/test_tts.py (1)

1-7: LGTM!

The imports are well-organized and align with the integration test pattern used in other plugins.

plugins/openai/vision_agents/plugins/openai/__init__.py (1)

3-5: LGTM!

The TTS export cleanly extends the public API and aligns with the plugin's module structure.

plugins/cartesia/tests/test_tts.py (3)

1-12: LGTM!

The imports and dotenv configuration follow the established pattern seen across TTS plugin tests.


23-31: LGTM!

The integration test properly exercises the real API with appropriate guards and assertions.


33-35: LGTM!

The manual WAV conversion test correctly delegates to the shared utility.

plugins/openai/tests/test_tts_openai.py (2)

10-16: LGTM!

The async fixture is properly decorated and handles missing credentials gracefully.


18-31: LGTM!

Both integration tests follow the established pattern with proper setup, execution, and assertions.

docs/ai/instructions/ai-tts.md (1)

1-11: LGTM!

The layout conventions align with the actual plugin structure and correctly reference PEP 420 namespace packages.

plugins/fish/tests/test_fish_tts.py (2)

18-20: LGTM!

The manual WAV test correctly delegates to the shared utility function.


22-32: LGTM!

The integration test follows the established pattern and will benefit from improved API key handling in the fixture.

plugins/openai/vision_agents/plugins/openai/tts.py (3)

1-8: LGTM!

The imports are well-organized and follow the project's conventions.


33-47: LGTM!

The stream_audio implementation correctly synthesizes speech to PCM format and returns a properly constructed PcmData buffer. The 24kHz sample rate aligns with OpenAI's TTS output specifications.
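
For orientation, the general shape of that contract; a hedged sketch rather than the actual plugin code (_synthesize_pcm is a hypothetical stand-in for the OpenAI client call):

from vision_agents.core.edge.types import PcmData

async def stream_audio(self, text: str, *_, **__) -> PcmData:
    # Fetch raw s16 PCM from the provider (elided), then wrap it so the
    # base class can resample, chunk, and emit TTSAudioEvents.
    pcm_bytes = await self._synthesize_pcm(text)  # hypothetical helper
    return PcmData.from_bytes(pcm_bytes, sample_rate=24000, channels=1, format="s16")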


49-51: LGTM!

The stop_audio no-op implementation is appropriate given that playback management is handled by the agent.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (5)
plugins/aws/example/aws_polly_tts_example.py (1)

9-12: Add a Google‑style docstring and allow basic env overrides for text/output.

Keeps the example self‑documenting and convenient without adding deps.

 async def main():
-    load_dotenv()
-    tts = TTS(voice_id=os.environ.get("AWS_POLLY_VOICE", "Joanna"))
-    await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
+    """Run AWS Polly TTS example.
+
+    Returns:
+        None
+    """
+    load_dotenv()
+    tts = TTS(voice_id=os.environ.get("AWS_POLLY_VOICE", "Joanna"))
+    text = os.environ.get("TTS_TEXT", "This is a manual TTS playback test.")
+    outfile = os.environ.get("TTS_OUTFILE")
+    await manual_tts_to_wav(
+        tts, sample_rate=16000, channels=1, text=text, outfile_path=outfile
+    )

As per coding guidelines.

plugins/aws/tests/test_tts.py (2)

35-45: Strengthen assertions to catch silent failures.

Also assert synthesis started; keeps failures crisp.

     async def test_aws_polly_tts_speech(self, tts: aws_plugin.TTS):
         tts.set_output_format(sample_rate=16000, channels=1)
         session = TTSSession(tts)
 
         await tts.send("Hello from AWS Polly TTS")
 
         result = await session.wait_for_result(timeout=30.0)
-        assert not result.errors
-        assert len(result.speeches) > 0
+        assert not result.errors
+        assert result.started
+        assert len(result.speeches) > 0

46-48: Avoid temp file leakage; validate WAV artifact.

Use pytest’s tmp_path and check file size.

-    async def test_aws_polly_tts_manual_wav(self, tts: aws_plugin.TTS):
-        await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
+    async def test_aws_polly_tts_manual_wav(self, tts: aws_plugin.TTS, tmp_path):
+        outfile = tmp_path / "polly.wav"
+        path = await manual_tts_to_wav(
+            tts, sample_rate=16000, channels=1, outfile_path=str(outfile)
+        )
+        assert os.path.exists(path)
+        # WAV header is 44 bytes; ensure non-empty audio payload.
+        assert os.path.getsize(path) > 44
plugins/aws/vision_agents/plugins/aws/tts.py (2)

53-57: Configure client timeouts and retries.

Prevents indefinite hangs under network issues.

+from botocore.config import Config
@@
     def client(self):
         if self._client is None:
-            self._client = boto3.client("polly", region_name=self.region_name)
+            cfg = Config(
+                read_timeout=20,
+                connect_timeout=5,
+                retries={"max_attempts": 3, "mode": "standard"},
+            )
+            self._client = boto3.client("polly", region_name=self.region_name, config=cfg)
         return self._client

62-66: Adopt Google‑style docstring for stream_audio.

-        """Synthesize the entire speech to a single PCM buffer.
-
-        Returns PcmData with s16 format and the configured sample rate.
-        """
+        """Synthesize text with Polly and return PCM audio.
+
+        Args:
+            text: Input text or SSML to synthesize.
+            *_, **__: Unused, reserved for BaseTTS compatibility.
+
+        Returns:
+            PcmData with s16 format and the selected sample rate.
+        """

As per coding guidelines.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e5e0cf5 and ff9ebed.

📒 Files selected for processing (5)
  • plugins/aws/README.md (4 hunks)
  • plugins/aws/example/aws_polly_tts_example.py (1 hunks)
  • plugins/aws/tests/test_tts.py (1 hunks)
  • plugins/aws/vision_agents/plugins/aws/__init__.py (1 hunks)
  • plugins/aws/vision_agents/plugins/aws/tts.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • plugins/aws/README.md
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • plugins/aws/tests/test_tts.py
  • plugins/aws/example/aws_polly_tts_example.py
  • plugins/aws/vision_agents/plugins/aws/__init__.py
  • plugins/aws/vision_agents/plugins/aws/tts.py
🧬 Code graph analysis (4)
plugins/aws/tests/test_tts.py (4)
agents-core/vision_agents/core/tts/testing.py (3)
  • TTSSession (23-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
plugins/aws/vision_agents/plugins/aws/tts.py (1)
  • TTS (10-92)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (81-99)
plugins/aws/example/aws_polly_tts_example.py (2)
plugins/aws/vision_agents/plugins/aws/tts.py (1)
  • TTS (10-92)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
plugins/aws/vision_agents/plugins/aws/__init__.py (2)
plugins/aws/tests/test_tts.py (1)
  • tts (29-33)
plugins/aws/vision_agents/plugins/aws/tts.py (1)
  • TTS (10-92)
plugins/aws/vision_agents/plugins/aws/tts.py (2)
plugins/aws/tests/test_tts.py (1)
  • tts (29-33)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (37-505)
  • from_bytes (118-186)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Test "not integration"
🔇 Additional comments (1)
plugins/aws/vision_agents/plugins/aws/__init__.py (1)

3-5: LGTM: export surface updated to include TTS.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
agents-core/vision_agents/core/tts/tts.py (1)

243-258: Critical: Streaming path never marks the final chunk.

All streamed chunks are emitted with is_final_chunk=False (line 254). Downstream consumers cannot detect stream completion, potentially causing audio playback to hang or buffer indefinitely.

Apply a one-element lookahead to mark the final chunk:

             else:
-                async for pcm in self._iter_pcm(response):
-                    bytes_len, dur_ms = self._emit_chunk(
-                        pcm, chunk_index, False, synthesis_id, text, user
-                    )
-                    total_audio_bytes += bytes_len
-                    total_audio_ms += dur_ms
-                    chunk_index += 1
+                ait = self._iter_pcm(response).__aiter__()
+                try:
+                    prev = await ait.__anext__()
+                except StopAsyncIteration:
+                    prev = None
+                while prev is not None:
+                    try:
+                        nxt = await ait.__anext__()
+                        is_final = False
+                    except StopAsyncIteration:
+                        nxt = None
+                        is_final = True
+                    bytes_len, dur_ms = self._emit_chunk(
+                        prev, chunk_index, is_final, synthesis_id, text, user
+                    )
+                    total_audio_bytes += bytes_len
+                    total_audio_ms += dur_ms
+                    chunk_index += 1
+                    prev = nxt
agents-core/vision_agents/core/edge/types.py (1)

272-339: Critical: dtype mismatch and incorrect output format in resample().

Two interrelated bugs:

  1. Input format mismatch (line 303): The AV format is hardcoded based on channel count only, ignoring self.samples.dtype. Passing float32 input will cause AudioFrame.from_ndarray() to fail because it expects int16 data when format is "s16".

  2. Output format inconsistency (line 331): The resampler always outputs s16 (line 311), but the returned PcmData preserves self.format, creating a mismatch between the format field and actual data.

Apply this fix to detect dtype and correct the output format:

-        # Prepare ndarray shape for AV.
-        # Our convention: (channels, samples) for multi-channel, (samples,) for mono.
-        samples = self.samples
-        if samples.ndim == 1:
-            # Mono: reshape to (1, samples) for AV
-            samples = samples.reshape(1, -1)
-        elif samples.ndim == 2:
-            # Already (channels, samples)
-            pass
-
-        # Create AV audio frame from the samples
-        in_layout = "mono" if self.channels == 1 else "stereo"
-        # For multi-channel, use planar format to avoid packed shape errors
-        in_format = "s16" if self.channels == 1 else "s16p"
-        samples = np.ascontiguousarray(samples)
-        frame = av.AudioFrame.from_ndarray(samples, format=in_format, layout=in_layout)
+        # Prepare ndarray shape for AV: (channels, samples)
+        samples = self.samples
+        if samples.ndim == 1:
+            samples = samples.reshape(1, -1)
+        elif samples.ndim != 2:
+            samples = samples.reshape(1, -1)
+        samples = np.ascontiguousarray(samples)
+
+        # Only mono/stereo currently supported
+        if self.channels not in (1, 2):
+            raise NotImplementedError("resample() supports mono or stereo input only")
+        if target_channels not in (1, 2):
+            raise NotImplementedError("resample() supports mono or stereo output only")
+
+        in_layout = "mono" if self.channels == 1 else "stereo"
+        # Pick AV input format based on dtype and planarity
+        if samples.dtype == np.int16:
+            in_format = "s16" if self.channels == 1 else "s16p"
+        elif samples.dtype == np.float32:
+            in_format = "flt" if self.channels == 1 else "fltp"
+        else:
+            samples = samples.astype(np.int16)
+            in_format = "s16" if self.channels == 1 else "s16p"
+
+        frame = av.AudioFrame.from_ndarray(samples, format=in_format, layout=in_layout)
         frame.sample_rate = self.sample_rate
 
         # Create resampler
         out_layout = "mono" if target_channels == 1 else "stereo"
         resampler = av.AudioResampler(
             format="s16", layout=out_layout, rate=target_sample_rate
         )
 
         # Resample the frame
         resampled_frames = resampler.resample(frame)
         if resampled_frames:
             resampled_frame = resampled_frames[0]
             resampled_samples = resampled_frame.to_ndarray()
 
             # AV returns (channels, samples), so for mono we want the first (and only) channel
             if len(resampled_samples.shape) > 1:
                 if target_channels == 1:
                     resampled_samples = resampled_samples[0]
 
             # Convert to int16
             resampled_samples = resampled_samples.astype(np.int16)
 
             return PcmData(
                 samples=resampled_samples,
                 sample_rate=target_sample_rate,
-                format=self.format,
+                format="s16",
                 pts=self.pts,
                 dts=self.dts,
                 time_base=self.time_base,
                 channels=target_channels,
             )
🧹 Nitpick comments (3)
plugins/cartesia/tests/test_tts.py (2)

16-21: Consider adding fixture teardown for resource cleanup.

The fixture creates a TTS instance but doesn't explicitly clean it up. If the Cartesia TTS maintains connections or other resources, consider using a yield pattern with teardown logic to ensure proper cleanup after each test.

Example:

     @pytest_asyncio.fixture
-    async def tts(self) -> cartesia.TTS:  # type: ignore[name-defined]
+    async def tts(self):
         api_key = os.environ.get("CARTESIA_API_KEY")
         if not api_key:
             pytest.skip("CARTESIA_API_KEY env var not set – skipping live API test.")
-        return cartesia.TTS(api_key=api_key)
+        tts_instance = cartesia.TTS(api_key=api_key)
+        yield tts_instance
+        # Add cleanup if needed, e.g.:
+        # await tts_instance.close()

Additionally, the # type: ignore[name-defined] comment suggests potential typing issues. If cartesia.TTS isn't properly exported or typed in the plugin module, consider addressing that or simplifying the type hint as shown above.


33-35: Add assertions to verify WAV file generation.

The manual_tts_to_wav helper returns the path to the generated WAV file, but this test doesn't verify the output. Even for a "manual test," automated validation would strengthen coverage.

     @pytest.mark.integration
     async def test_cartesia_tts_convert_text_to_audio_manual_test(self, tts):
-        await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
+        wav_path = await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
+        assert os.path.exists(wav_path), f"WAV file not created at {wav_path}"
+        assert os.path.getsize(wav_path) > 100, "WAV file appears empty or corrupted"
plugins/fish/tests/test_fish_tts.py (1)

14-16: Consider adding API key check to skip gracefully when credentials are missing.

The fixture instantiates fish.TTS() unconditionally. If FISH_API_KEY or FISH_AUDIO_API_KEY are not set, tests will fail rather than skip. The ElevenLabs tests (lines 13-17 in test_tts.py) demonstrate this pattern.

Apply this diff to add a skip check:

     @pytest_asyncio.fixture
     async def tts(self) -> fish.TTS:
+        import os
+        if not (os.environ.get("FISH_API_KEY") or os.environ.get("FISH_AUDIO_API_KEY")):
+            pytest.skip("FISH_API_KEY/FISH_AUDIO_API_KEY not set; skipping Fish TTS tests.")
         return fish.TTS()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ff9ebed and b7c57e9.

📒 Files selected for processing (16)
  • agents-core/vision_agents/core/agents/agents.py (5 hunks)
  • agents-core/vision_agents/core/edge/edge_transport.py (3 hunks)
  • agents-core/vision_agents/core/edge/types.py (6 hunks)
  • agents-core/vision_agents/core/tts/testing.py (1 hunks)
  • agents-core/vision_agents/core/tts/tts.py (5 hunks)
  • docs/ai/instructions/ai-tests.md (1 hunks)
  • docs/ai/instructions/ai-tts.md (1 hunks)
  • examples/01_simple_agent_example/simple_agent_example.py (1 hunks)
  • plugins/aws/tests/test_tts.py (1 hunks)
  • plugins/aws/vision_agents/plugins/aws/tts.py (1 hunks)
  • plugins/cartesia/tests/test_tts.py (1 hunks)
  • plugins/elevenlabs/tests/test_tts.py (1 hunks)
  • plugins/fish/tests/test_fish_tts.py (1 hunks)
  • plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (3 hunks)
  • plugins/openai/tests/test_tts_openai.py (1 hunks)
  • tests/test_tts_base.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • plugins/aws/tests/test_tts.py
  • plugins/aws/vision_agents/plugins/aws/tts.py
  • examples/01_simple_agent_example/simple_agent_example.py
  • plugins/openai/tests/test_tts_openai.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py
  • agents-core/vision_agents/core/agents/agents.py
  • tests/test_tts_base.py
  • plugins/fish/tests/test_fish_tts.py
  • agents-core/vision_agents/core/tts/tts.py
  • agents-core/vision_agents/core/edge/edge_transport.py
  • agents-core/vision_agents/core/tts/testing.py
  • plugins/elevenlabs/tests/test_tts.py
  • plugins/cartesia/tests/test_tts.py
  • agents-core/vision_agents/core/edge/types.py
tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

tests/**/*.py: Never use mocking utilities (e.g., unittest.mock, pytest-mock) in test files
Write tests using pytest (avoid unittest.TestCase or other frameworks)
Mark integration tests with @pytest.mark.integration
Do not use @pytest.mark.asyncio; async support is automatic

Files:

  • tests/test_tts_base.py
🧠 Learnings (1)
📚 Learning: 2025-10-20T19:23:41.259Z
Learnt from: CR
PR: GetStream/Vision-Agents#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-10-20T19:23:41.259Z
Learning: Applies to tests/**/*.py : Do not use pytest.mark.asyncio; async support is automatic

Applied to files:

  • tests/test_tts_base.py
🧬 Code graph analysis (10)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (2)
agents-core/vision_agents/core/edge/types.py (1)
  • OutputAudioTrack (47-55)
agents-core/vision_agents/core/edge/edge_transport.py (1)
  • create_audio_track (34-35)
agents-core/vision_agents/core/agents/agents.py (5)
agents-core/vision_agents/core/edge/types.py (2)
  • OutputAudioTrack (47-55)
  • write (53-53)
agents-core/vision_agents/core/tts/events.py (1)
  • TTSAudioEvent (10-21)
agents-core/vision_agents/core/edge/edge_transport.py (1)
  • create_audio_track (34-35)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (1)
  • create_audio_track (291-296)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (72-90)
tests/test_tts_base.py (4)
agents-core/vision_agents/core/tts/tts.py (4)
  • TTS (31-315)
  • stream_audio (163-186)
  • set_output_format (72-90)
  • send (202-303)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (58-526)
  • from_bytes (139-207)
agents-core/vision_agents/core/tts/testing.py (3)
  • TTSSession (23-81)
  • speeches (63-64)
  • errors (67-68)
agents-core/vision_agents/core/events/manager.py (1)
  • wait (470-484)
plugins/fish/tests/test_fish_tts.py (3)
plugins/aws/tests/test_tts.py (1)
  • tts (29-33)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (5)
  • TTSSession (23-81)
  • assert_tts_send_non_blocking (130-160)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/tts.py (3)
agents-core/vision_agents/core/events/base.py (2)
  • PluginClosedEvent (67-74)
  • AudioFormat (23-30)
agents-core/vision_agents/core/edge/types.py (6)
  • PcmData (58-526)
  • from_bytes (139-207)
  • resample (272-339)
  • to_bytes (341-357)
  • duration_ms (122-124)
  • close (42-43)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
  • TTSErrorEvent (51-64)
agents-core/vision_agents/core/edge/edge_transport.py (2)
agents-core/vision_agents/core/edge/types.py (2)
  • User (23-26)
  • OutputAudioTrack (47-55)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (2)
  • create_audio_track (291-296)
  • add_track_subscriber (301-304)
agents-core/vision_agents/core/tts/testing.py (2)
agents-core/vision_agents/core/tts/tts.py (3)
  • TTS (31-315)
  • set_output_format (72-90)
  • send (202-303)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSErrorEvent (51-64)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
plugins/elevenlabs/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/testing.py (5)
  • TTSSession (23-81)
  • assert_tts_send_non_blocking (130-160)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
plugins/cartesia/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (5)
  • TTSSession (23-81)
  • assert_tts_send_non_blocking (130-160)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/edge/types.py (3)
agents-core/vision_agents/core/agents/agents.py (1)
  • close (438-509)
agents-core/vision_agents/core/edge/edge_transport.py (1)
  • close (38-39)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (2)
  • close (40-41)
  • close (327-329)
🪛 LanguageTool
docs/ai/instructions/ai-tts.md

[style] ~27-~27: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...=1, format="s16") ``` - stop_audio can be a no-op ## init The plugin con...

(MISSING_IT_THERE)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
🔇 Additional comments (28)
plugins/cartesia/tests/test_tts.py (3)

1-12: LGTM—Integration test setup is clean.

The dotenv loading and imports align well with the shift to integration-style testing. Module-level load_dotenv() is appropriate for test files.


23-31: LGTM—Integration test correctly uses TTSSession pattern.

The test properly configures the output format, collects events via TTSSession, and validates both error conditions and audio generation. The 30-second timeout is appropriate for a real API call.


37-39: LGTM—Non-blocking test uses appropriate helper.

The test correctly delegates to assert_tts_send_non_blocking, which includes built-in assertions to verify that tts.send() doesn't block the event loop.

plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (2)

25-25: LGTM—Import and formatting align with protocol changes.

The OutputAudioTrack import and boolean check formatting support the PR's PCM-first refactor without altering behavior.

Also applies to: 107-107


291-296: LGTM—Signature matches abstract method and returns protocol-compliant track.

The multi-line signature and OutputAudioTrack return type align with the updated EdgeTransport interface.

docs/ai/instructions/ai-tests.md (1)

9-21: LGTM—Non-blocking test documentation is clear and follows guidelines.

The example correctly omits @pytest.mark.asyncio while keeping @pytest.mark.integration, per coding guidelines.

tests/test_tts_base.py (3)

8-48: LGTM—Stereo-to-mono test validates channel reduction correctly.

The dummy TTS creates interleaved stereo PCM; the test confirms mono output is approximately half the size, which aligns with expected behavior.


19-61: LGTM—Resample test validates downsampling correctly.

Downsampling from 16kHz to 8kHz should approximately halve the byte count; the assertion range accounts for resampling artifacts.


30-71: LGTM—Error handling test validates exception propagation and event emission.

The test confirms that errors raised in stream_audio both propagate to the caller and emit TTSErrorEvent.

docs/ai/instructions/ai-tts.md (1)

1-53: LGTM—TTS plugin guide is clear and accurate.

The documentation provides a comprehensive, well-structured guide for building TTS plugins. Past typos have been corrected.

agents-core/vision_agents/core/edge/edge_transport.py (1)

12-12: LGTM—Abstract interface updated to use OutputAudioTrack protocol.

Import and method signature changes align with the PCM-first refactor and are consistently implemented across concrete transport classes.

Also applies to: 34-35, 58-60

plugins/fish/tests/test_fish_tts.py (1)

18-36: LGTM—Tests follow integration patterns and use proper helpers.

The three tests correctly use manual_tts_to_wav, TTSSession, and assert_tts_send_non_blocking, aligning with the updated testing guidelines.

plugins/elevenlabs/tests/test_tts.py (1)

10-38: LGTM—ElevenLabs tests are well-structured and include proper credential checks.

The fixture gracefully skips when ELEVENLABS_API_KEY is absent, and all three integration tests use the recommended patterns (TTSSession, manual_tts_to_wav, assert_tts_send_non_blocking).

agents-core/vision_agents/core/tts/testing.py (3)

15-81: LGTM—TTSSession provides clean event-driven test helpers.

The session subscribes to key TTS events and exposes accumulated speeches/errors through properties. The wait_for_result timeout pattern ensures tests don't hang indefinitely.


84-127: LGTM—Event loop probe measures responsiveness effectively.

The ticker task counts intervals while the target coroutine runs, detecting blocking behavior. The finally block ensures cleanup even if the coroutine raises.
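
The probe technique is simple enough to sketch independently of the helper (a self-contained illustration, not the testing.py implementation):

import asyncio
import time

async def probe_event_loop(coro, interval: float = 0.01):
    # Count ticker wakeups while `coro` runs; a blocked event loop
    # suppresses the ticks, which is what the assertion later checks.
    ticks = 0

    async def _ticker():
        nonlocal ticks
        while True:
            await asyncio.sleep(interval)
            ticks += 1

    task = asyncio.create_task(_ticker())
    start = time.perf_counter()
    try:
        await coro
    finally:
        task.cancel()
    return ticks, time.perf_counter() - start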


130-160: LGTM—Non-blocking assertion provides robust detection of event loop blocking.

The helper asserts sufficient tick count only when the call duration justifies it, avoiding false positives for fast completions. The returned probe result allows tests to inspect metrics further.
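
Put together, a typical integration test built on these helpers looks roughly like this (assuming assert_tts_send_non_blocking takes the TTS instance, as the plugin tests in this PR do):

import pytest
from vision_agents.core.tts.testing import TTSSession, assert_tts_send_non_blocking

@pytest.mark.integration
async def test_tts_events_and_responsiveness(tts):
    tts.set_output_format(sample_rate=16000, channels=1)
    session = TTSSession(tts)

    await tts.send("Hello from the test suite")
    result = await session.wait_for_result(timeout=30.0)
    assert not result.errors
    assert len(result.speeches) > 0

    # Separately verify that send() keeps the event loop responsive.
    await assert_tts_send_non_blocking(tts)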

agents-core/vision_agents/core/agents/agents.py (2)

163-163: LGTM: Protocol-based typing improves decoupling.

The type change from a concrete aiortc.AudioStreamTrack to the OutputAudioTrack Protocol aligns well with the PR's decoupling objectives.


1029-1041: No action required—TTS resampling architecture is sound.

Verification confirms all six TTS provider implementations return compatible types (PcmData or iterators thereof). The base class properly normalizes these via _normalize_to_pcm() and resamples during emission using pcm.resample(self._desired_sample_rate, self._desired_channels). The hardcoded 48kHz stereo is a WebRTC standard, and any resampling failure will throw an exception rather than silently degrade. All TTS providers can handle the requested output format without compatibility issues.
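
That resampling step in isolation, for reference (raw_s16 is a placeholder for provider output; rates match the WebRTC defaults mentioned above):

from vision_agents.core.edge.types import PcmData

raw_s16 = b"\x00\x00" * 24000  # placeholder: 1 s of silence, s16 mono @ 24 kHz
pcm = PcmData.from_bytes(raw_s16, sample_rate=24000, channels=1, format="s16")
out = pcm.resample(48000, 2)  # 48 kHz stereo, the WebRTC default used by the agent
track_bytes = out.to_bytes()  # interleaved s16, ready for the output track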

agents-core/vision_agents/core/tts/tts.py (3)

111-127: LGTM: Comprehensive response normalization.

The _iter_pcm generator correctly handles multiple provider response shapes (single buffer, async/sync iterables) and avoids the pitfall of treating bytes as an iterable of integers.
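
Conceptually, the dispatch order matters most; a simplified sketch of the idea (provider_rate is an assumption here, since wrapping raw bytes needs the provider's native sample rate, which the real implementation knows):

from vision_agents.core.edge.types import PcmData

async def iter_pcm(response, provider_rate: int = 16000):
    # Normalize any provider response shape into a stream of PcmData.
    if isinstance(response, PcmData):
        yield response
    elif isinstance(response, (bytes, bytearray)):
        # bytes must be handled before generic iterables; iterating raw
        # bytes would otherwise yield individual integers.
        yield PcmData.from_bytes(
            bytes(response), sample_rate=provider_rate, channels=1, format="s16"
        )
    elif hasattr(response, "__aiter__"):
        async for item in response:
            async for pcm in iter_pcm(item, provider_rate):
                yield pcm
    elif hasattr(response, "__iter__"):
        for item in response:
            async for pcm in iter_pcm(item, provider_rate):
                yield pcm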


129-160: LGTM: Clean resampling and event emission.

The _emit_chunk method correctly resamples to the desired format, emits metrics, and returns both byte length and duration for accurate tracking.


283-303: LGTM: Comprehensive error handling and observability.

The error path correctly emits events, records metrics, and ensures latency is always tracked via the finally block, even on failure.

agents-core/vision_agents/core/edge/types.py (7)

46-55: LGTM: Clean Protocol definition for audio output.

The OutputAudioTrack Protocol with write and stop methods provides a clear, runtime-checkable interface for decoupling.


77-77: LGTM: Multi-channel support with correct duration calculation.

The channels field and updated duration property correctly handle 2D arrays with shape (channels, samples).

Also applies to: 92-96, 121-124


139-207: LGTM: Robust multi-channel PCM parsing.

The from_bytes method correctly aligns buffers, determines dtype from format, and converts interleaved multi-channel data to planar (channels, samples) representation with proper error handling.


209-270: LGTM: Flexible PcmData construction from multiple input types.

The from_data method handles bytes-like and numpy arrays with various shapes, normalizing to the canonical (channels, samples) representation with proper dtype alignment and fallback logic.


341-357: LGTM: Correct interleaving for multi-channel output.

The to_bytes method correctly transposes (channels, samples) to (samples, channels) and flattens to produce interleaved PCM bytes.
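
Both directions (interleaved bytes to planar arrays in from_bytes, and back in to_bytes) reduce to a transpose in numpy; a tiny standalone illustration:

import numpy as np

planar = np.array([[1, 2, 3], [10, 20, 30]], dtype=np.int16)  # (channels, samples)
interleaved = planar.T.reshape(-1)  # [1, 10, 2, 20, 3, 30]
raw = interleaved.tobytes()         # interleaved s16 PCM

# And back: interleaved bytes -> planar (channels, samples)
restored = np.frombuffer(raw, dtype=np.int16).reshape(-1, 2).T
assert np.array_equal(restored, planar)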


359-400: LGTM: Complete WAV export with proper format conversion.

The to_wav_bytes method handles format conversion (f32 → s16 with clipping), constructs proper WAV headers, and supports multi-channel output.
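
For comparison, the same f32-to-s16 clipping plus WAV framing can be written with the stdlib wave module; a minimal sketch independent of the PcmData implementation (multi-channel input is assumed to be interleaved already):

import io
import wave
import numpy as np

def f32_to_wav_bytes(samples: np.ndarray, sample_rate: int, channels: int) -> bytes:
    # Clip to [-1, 1] before scaling so out-of-range floats saturate
    # instead of wrapping around in int16.
    s16 = (np.clip(samples, -1.0, 1.0) * 32767.0).astype(np.int16)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # s16 = 2 bytes per sample
        w.setframerate(sample_rate)
        w.writeframes(s16.tobytes())
    return buf.getvalue()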


402-526: LGTM: Comprehensive provider response normalization.

The from_response factory method handles diverse provider response shapes (bytes, iterables, async iterables, PcmData, objects with .data) and includes proper frame alignment buffering with zero-padding for partial frames.
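
The frame-alignment buffering described here can be sketched in isolation (frame_size = bytes per sample times channels; this is an illustration, not the actual from_response code):

def align_frames(buffer: bytearray, frame_size: int, final: bool) -> bytes:
    # Emit only whole frames, keeping the remainder buffered; on the final
    # chunk, zero-pad a trailing partial frame instead of dropping it.
    cut = len(buffer) - (len(buffer) % frame_size)
    out = bytes(buffer[:cut])
    del buffer[:cut]
    if final and buffer:
        out += bytes(buffer) + b"\x00" * (frame_size - len(buffer))
        buffer.clear()
    return out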

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 16

♻️ Duplicate comments (4)
agents-core/vision_agents/core/tts/manual_test.py (1)

42-53: Audio may be truncated: waits only for first event.

TTSSession.wait_for_result() returns after the first audio/error event arrives (see its _first_event.wait() implementation). Writing the WAV immediately can produce partial audio because additional speech chunks may still be streaming in. The function should drain events until synthesis completes or no new chunks arrive for a brief window.

Consider implementing a drain loop as suggested in the previous review:

 async def manual_tts_to_wav(
     tts: TTS,
     *,
     sample_rate: int = 16000,
     channels: int = 1,
     text: str = "This is a manual TTS playback test.",
     outfile_path: Optional[str] = None,
     timeout_s: float = 20.0,
+    drain_s: float = 1.0,
 ) -> str:
@@
     tts.set_output_format(sample_rate=sample_rate, channels=channels)
     session = TTSSession(tts)
     await tts.send(text)
     result = await session.wait_for_result(timeout=timeout_s)
     if result.errors:
         raise RuntimeError(f"TTS errors: {result.errors}")
 
+    # Drain until quiet to collect full utterance
+    import asyncio
+    last_len = len(session.speeches)
+    idle_deadline = time.time() + drain_s
+    while time.time() < idle_deadline:
+        await asyncio.sleep(0.05)
+        if len(session.speeches) != last_len:
+            last_len = len(session.speeches)
+            idle_deadline = time.time() + drain_s
+
     # Convert captured audio to PcmData
-    pcm_bytes = b"".join(result.speeches)
+    pcm_bytes = b"".join(session.speeches)
     pcm = PcmData.from_bytes(
         pcm_bytes, sample_rate=sample_rate, channels=channels, format="s16"
     )
plugins/fish/tests/test_fish_tts.py (1)

1-7: Skip integration tests gracefully when API keys are absent (put check in the fixture).

Without FISH_API_KEY/FISH_AUDIO_API_KEY these tests will fail. Skip early in the fixture and import os.

Apply this diff:

@@
-import pytest
+import os
+import pytest
 import pytest_asyncio
@@
 class TestFishTTS:
     @pytest_asyncio.fixture
     async def tts(self) -> fish.TTS:
-        return fish.TTS()
+        if not (os.environ.get("FISH_API_KEY") or os.environ.get("FISH_AUDIO_API_KEY")):
+            pytest.skip("FISH_API_KEY/FISH_AUDIO_API_KEY not set; skipping integration tests.")
+        return fish.TTS()

Also applies to: 13-17

agents-core/vision_agents/core/tts/tts.py (1)

289-296: Mark the final streamed chunk with is_final_chunk=True.

Downstream can’t know when to close; add one‑element lookahead.

Apply this diff:

-            else:
-                async for pcm in self._iter_pcm(response):
-                    bytes_len, dur_ms = self._emit_chunk(
-                        pcm, chunk_index, False, synthesis_id, text, user
-                    )
-                    total_audio_bytes += bytes_len
-                    total_audio_ms += dur_ms
-                    chunk_index += 1
+            else:
+                ait = self._iter_pcm(response)
+                try:
+                    prev = await ait.__anext__()
+                except StopAsyncIteration:
+                    prev = None
+                while prev is not None:
+                    try:
+                        nxt = await ait.__anext__()
+                        is_final = False
+                    except StopAsyncIteration:
+                        nxt = None
+                        is_final = True
+                    bytes_len, dur_ms = self._emit_chunk(
+                        prev, chunk_index, is_final, synthesis_id, text, user
+                    )
+                    total_audio_bytes += bytes_len
+                    total_audio_ms += dur_ms
+                    chunk_index += 1
+                    prev = nxt
agents-core/vision_agents/core/edge/types.py (1)

322-352: resample: choose AV input format based on dtype; current code breaks on float32.

AudioFrame.from_ndarray(..., format="s16p") assumes int16; f32 inputs will misparse or fail. Detect dtype (s16 vs f32) and pick s16/s16p or flt/fltp accordingly.

Apply this diff:

-        # Prepare ndarray shape for AV input frame.
-        # Use planar input (s16p) with shape (channels, samples).
-        in_layout = "mono" if self.channels == 1 else "stereo"
+        # Prepare ndarray shape for AV input frame.
+        # Use planar input shape (channels, samples); pick format by dtype.
+        in_layout = "mono" if self.channels == 1 else "stereo"
         cmaj = self.samples
         if isinstance(cmaj, np.ndarray):
@@
-            cmaj = np.ascontiguousarray(cmaj)
-        frame = av.AudioFrame.from_ndarray(cmaj, format="s16p", layout=in_layout)
+            cmaj = np.ascontiguousarray(cmaj)
+        # Select AV input format matching dtype
+        if isinstance(cmaj, np.ndarray):
+            if cmaj.dtype == np.int16:
+                in_format = "s16" if self.channels == 1 else "s16p"
+            elif cmaj.dtype == np.float32:
+                in_format = "flt" if self.channels == 1 else "fltp"
+            else:
+                cmaj = cmaj.astype(np.int16)
+                in_format = "s16" if self.channels == 1 else "s16p"
+        else:
+            # bytes or other: assume s16 mono/stereo by channels
+            in_format = "s16" if self.channels == 1 else "s16p"
+        frame = av.AudioFrame.from_ndarray(cmaj, format=in_format, layout=in_layout)
🧹 Nitpick comments (8)
plugins/kokoro/tests/test_tts.py (1)

8-11: Consider using pytest.importorskip for cleaner imports.

The current try/except pattern works but pytest.importorskip provides a more idiomatic approach for conditional test skipping based on import availability, and avoids the broad Exception catch.

Apply this diff:

     def tts(self):  # returns kokoro TTS if available
-        try:
-            import kokoro  # noqa: F401
-        except Exception:
-            pytest.skip("kokoro package not installed; skipping manual playback test.")
+        pytest.importorskip("kokoro", reason="kokoro package not installed")
         from vision_agents.plugins import kokoro as kokoro_plugin
tests/test_resample_quality.py (1)

144-146: Remove unnecessary main block.

Pytest automatically discovers and runs test functions. The if __name__ == "__main__" block is unnecessary and bypasses pytest's fixture system (like tmp_path), potentially causing the tests to fail when run directly.

Apply this diff:

-
-if __name__ == "__main__":
-    test_compare_resampling_methods()
-    test_pyav_resampler_settings()

Run tests using: pytest tests/test_resample_quality.py

plugins/cartesia/tests/test_tts.py (1)

33-35: Consider adding assertions for the manual WAV test.

The test calls manual_tts_to_wav but doesn't verify the result. Consider asserting that the returned path exists and the file has non-zero size.

 @pytest.mark.integration
 async def test_cartesia_tts_convert_text_to_audio_manual_test(self, tts):
-    await manual_tts_to_wav(tts, sample_rate=48000, channels=2)
+    wav_path = await manual_tts_to_wav(tts, sample_rate=48000, channels=2)
+    assert os.path.exists(wav_path)
+    assert os.path.getsize(wav_path) > 0
agents-core/vision_agents/core/tts/manual_test.py (1)

55-64: Consider ensuring parent directory exists for custom paths.

If a user provides a custom outfile_path with non-existent parent directories, the write operation will fail. Adding directory creation would make the function more robust.

     # Generate a descriptive filename if not provided
     if outfile_path is None:
         tmpdir = tempfile.gettempdir()
         timestamp = int(time.time())
         outfile_path = os.path.join(
             tmpdir, f"tts_manual_test_{tts.__class__.__name__}_{timestamp}.wav"
         )
+    else:
+        # Ensure parent directory exists if custom path provided
+        parent_dir = os.path.dirname(outfile_path)
+        if parent_dir:
+            os.makedirs(parent_dir, exist_ok=True)
 
     # Use utility function to write WAV and optionally play
     return await play_pcm_with_ffplay(pcm, outfile_path=outfile_path, timeout_s=30.0)
plugins/elevenlabs/tests/test_tts.py (1)

31-34: Add assertions and prefer pytest output mechanisms over print.

The test lacks assertions to verify the WAV file was created successfully, and uses print() which may not appear in pytest output as expected.

Consider this refinement:

 @pytest.mark.integration
 async def test_elevenlabs_tts_convert_text_to_audio_manual_test(self, tts):
     path = await manual_tts_to_wav(tts, sample_rate=48000, channels=2)
-    print("ElevenLabs TTS audio written to:", path)
+    assert os.path.exists(path), f"WAV file not created at {path}"
+    assert os.path.getsize(path) > 0, f"WAV file is empty at {path}"
DEVELOPMENT.md (1)

171-178: Clarify optional playback behavior.

Mention that playback requires ffplay on PATH (already true) and is optional. Consider adding an env gate (e.g., FFPLAY=1) to avoid accidental audio during CI.

Would you like a small patch to gate playback behind an env var?

tests/test_pcm_data.py (1)

92-101: Minor: prefer pytest.approx and linspace endpoint handling.

Use pytest.approx for tolerances and np.linspace(..., endpoint=False) to avoid off-by-one artifacts in 1s signals.

Example:

- t = np.linspace(0, duration_sec, num_samples, dtype=np.float32)
+ t = np.linspace(0, duration_sec, num_samples, endpoint=False, dtype=np.float32)
@@
- assert abs(mono_duration - duration_sec) < 0.01
+ import pytest
+ assert mono_duration == pytest.approx(duration_sec, abs=0.01)

Also applies to: 118-122

agents-core/vision_agents/core/edge/types.py (1)

650-704: Optional: gate ffplay playback behind an env var to avoid accidental audio in CI.

Play only if FFPLAY=1 (or another opt‑in) in addition to ffplay presence.

Apply this diff:

-    # Optional playback with ffplay
-    if shutil.which("ffplay"):
+    # Optional playback with ffplay (enable by setting FFPLAY=1)
+    if os.environ.get("FFPLAY") == "1" and shutil.which("ffplay"):
         logger.info("Playing audio with ffplay...")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b7c57e9 and 353041b.

📒 Files selected for processing (15)
  • DEVELOPMENT.md (1 hunks)
  • agents-core/vision_agents/core/edge/types.py (6 hunks)
  • agents-core/vision_agents/core/tts/manual_test.py (1 hunks)
  • agents-core/vision_agents/core/tts/tts.py (5 hunks)
  • conftest.py (9 hunks)
  • plugins/aws/README.md (4 hunks)
  • plugins/aws/example/aws_polly_tts_example.py (1 hunks)
  • plugins/aws/tests/test_tts.py (1 hunks)
  • plugins/cartesia/tests/test_tts.py (1 hunks)
  • plugins/elevenlabs/tests/test_tts.py (1 hunks)
  • plugins/fish/tests/test_fish_tts.py (1 hunks)
  • plugins/kokoro/tests/test_tts.py (1 hunks)
  • plugins/openai/tests/test_tts_openai.py (1 hunks)
  • tests/test_pcm_data.py (1 hunks)
  • tests/test_resample_quality.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • conftest.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • plugins/aws/tests/test_tts.py
  • plugins/openai/tests/test_tts_openai.py
  • plugins/aws/example/aws_polly_tts_example.py
  • plugins/aws/README.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • plugins/cartesia/tests/test_tts.py
  • tests/test_pcm_data.py
  • agents-core/vision_agents/core/tts/manual_test.py
  • tests/test_resample_quality.py
  • plugins/fish/tests/test_fish_tts.py
  • plugins/kokoro/tests/test_tts.py
  • plugins/elevenlabs/tests/test_tts.py
  • agents-core/vision_agents/core/tts/tts.py
  • agents-core/vision_agents/core/edge/types.py
tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

tests/**/*.py: Never use mocking utilities (e.g., unittest.mock, pytest-mock) in test files
Write tests using pytest (avoid unittest.TestCase or other frameworks)
Mark integration tests with @pytest.mark.integration
Do not use @pytest.mark.asyncio; async support is automatic

Files:

  • tests/test_pcm_data.py
  • tests/test_resample_quality.py
🧬 Code graph analysis (9)
plugins/cartesia/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (12-64)
agents-core/vision_agents/core/tts/testing.py (5)
  • TTSSession (23-81)
  • assert_tts_send_non_blocking (130-160)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
tests/test_pcm_data.py (1)
agents-core/vision_agents/core/edge/types.py (4)
  • PcmData (63-647)
  • to_bytes (441-478)
  • resample (298-439)
  • duration (89-145)
agents-core/vision_agents/core/tts/manual_test.py (2)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/edge/types.py (3)
  • PcmData (63-647)
  • play_pcm_with_ffplay (650-704)
  • from_bytes (165-233)
tests/test_resample_quality.py (1)
agents-core/vision_agents/core/edge/types.py (3)
  • PcmData (63-647)
  • duration (89-145)
  • resample (298-439)
plugins/fish/tests/test_fish_tts.py (2)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (12-64)
agents-core/vision_agents/core/tts/testing.py (5)
  • TTSSession (23-81)
  • assert_tts_send_non_blocking (130-160)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
plugins/kokoro/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (12-64)
plugins/kokoro/vision_agents/plugins/kokoro/tts.py (1)
  • TTS (18-77)
plugins/elevenlabs/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/testing.py (5)
  • TTSSession (23-81)
  • assert_tts_send_non_blocking (130-160)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (12-64)
agents-core/vision_agents/core/tts/tts.py (3)
agents-core/vision_agents/core/events/base.py (2)
  • PluginClosedEvent (67-74)
  • AudioFormat (23-30)
agents-core/vision_agents/core/edge/types.py (5)
  • PcmData (63-647)
  • resample (298-439)
  • to_bytes (441-478)
  • duration_ms (148-150)
  • close (47-48)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
  • TTSErrorEvent (51-64)
agents-core/vision_agents/core/edge/types.py (3)
agents-core/vision_agents/core/edge/edge_transport.py (2)
  • close (38-39)
  • join (46-47)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (3)
  • close (40-41)
  • close (327-329)
  • join (237-289)
agents-core/vision_agents/core/agents/agents.py (2)
  • close (438-509)
  • join (342-414)
🪛 LanguageTool
DEVELOPMENT.md

[grammar] ~117-~117: Ensure spelling is correct
Context: ...ork with different PCM formats, usually 16khz mono 3. PCM data is always passed aroun...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~117-~117: Ensure spelling is correct
Context: ...ifferent PCM formats, usually 16khz mono 3. PCM data is always passed around using t...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
🔇 Additional comments (8)
plugins/kokoro/tests/test_tts.py (1)

1-2: LGTM: Clean imports.

Import organization follows conventions and complies with coding guidelines.

plugins/cartesia/tests/test_tts.py (3)

1-12: LGTM! Clean integration test setup.

The imports and environment loading are well-structured for the new integration testing approach.


15-21: LGTM! Proper environment-gated fixture.

The fixture correctly skips tests when the API key is unavailable, making the integration tests safe to run in CI without credentials.


23-31: LGTM! Well-structured integration tests.

These tests properly validate real API interaction and non-blocking behavior using the established testing utilities.

Also applies to: 37-39

agents-core/vision_agents/core/tts/manual_test.py (1)

1-10: LGTM! Imports are clean and necessary.

plugins/elevenlabs/tests/test_tts.py (3)

1-8: LGTM!

The imports are clean and appropriate for integration-style testing with pytest-asyncio fixtures.


20-29: LGTM!

The test correctly uses TTSSession to capture events and validate TTS behavior with appropriate assertions.


36-38: LGTM!

The non-blocking assertion properly verifies that the TTS send operation doesn't block the event loop.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
agents-core/vision_agents/core/tts/tts.py (2)

74-93: Format mismatch: events claim arbitrary format but pipeline only emits s16 bytes

The pipeline hardcodes format="s16" in the resampler (line 123), yet set_output_format accepts any AudioFormat and propagates it to events (line 197). When audio_format=AudioFormat.PCM_F32 is passed, TTSAudioEvent metadata will claim f32 while audio_data contains s16 bytes, breaking downstream parsers.

Clamp to PCM_S16 until f32 support is implemented:

     def set_output_format(
         self,
         sample_rate: int,
         channels: int = 1,
         audio_format: AudioFormat = AudioFormat.PCM_S16,
     ) -> None:
         """Set the desired output audio format for emitted events.
 
         The agent should call this with its output track properties so this
         TTS instance can resample and rechannel audio appropriately.
 
         Args:
             sample_rate: Desired sample rate in Hz (e.g., 48000)
             channels: Desired channel count (1 for mono, 2 for stereo)
             audio_format: Desired audio format (defaults to PCM S16)
         """
+        if audio_format != AudioFormat.PCM_S16:
+            logger.warning(
+                "Only PCM_S16 is currently supported; %s will be coerced to PCM_S16",
+                audio_format.value,
+            )
+            audio_format = AudioFormat.PCM_S16
         self._desired_sample_rate = int(sample_rate)
         self._desired_channels = int(channels)
         self._desired_format = audio_format

290-306: Streaming consumers never see the final chunk

All chunks are emitted with is_final_chunk=False (line 301). Downstream consumers waiting for finalization will hang or require timeouts.

Use one-element lookahead to mark the last chunk:

             else:
-                async for pcm in self._iter_pcm(response):
-                    bytes_len, dur_ms = self._emit_chunk(
-                        pcm, chunk_index, False, synthesis_id, text, user
-                    )
-                    total_audio_bytes += bytes_len
-                    total_audio_ms += dur_ms
-                    chunk_index += 1
+                ait = self._iter_pcm(response)
+                prev = None
+                try:
+                    prev = await ait.__anext__()
+                except StopAsyncIteration:
+                    pass
+                while prev is not None:
+                    try:
+                        nxt = await ait.__anext__()
+                        is_final = False
+                    except StopAsyncIteration:
+                        nxt = None
+                        is_final = True
+                    bytes_len, dur_ms = self._emit_chunk(
+                        prev, chunk_index, is_final, synthesis_id, text, user
+                    )
+                    total_audio_bytes += bytes_len
+                    total_audio_ms += dur_ms
+                    chunk_index += 1
+                    prev = nxt
🧹 Nitpick comments (3)
tests/test_utils.py (1)

367-373: Consider extracting duplicate array dimension handling.

The same array dimension logic appears in both test methods. While not critical, extracting this into a small helper function would reduce duplication and improve maintainability.

Example helper:

def get_sample_count(pcm_data: PcmData) -> int:
    """Extract sample count from PcmData, handling both 1D and 2D arrays."""
    return (
        pcm_data.samples.shape[-1]
        if pcm_data.samples.ndim > 1
        else len(pcm_data.samples)
    )

Then use it in both tests:

-        num_samples = (
-            resampled.samples.shape[-1]
-            if resampled.samples.ndim > 1
-            else len(resampled.samples)
-        )
+        num_samples = get_sample_count(resampled)

Also applies to: 388-394

agents-core/vision_agents/core/edge/types.py (2)

89-150: Duration calculation handles ambiguous array shapes defensively

The logic at lines 100-117 infers which dimension represents samples vs. channels by comparing shapes to self.channels. For ambiguous cases (e.g., 2×2 arrays), it picks the max dimension (line 115), which is a reasonable heuristic.

Consider documenting the shape assumption in the class docstring to clarify that the internal convention is (channels, samples) and that (samples, channels) is auto-detected.
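
For illustration, the convention could be spelled out alongside a sketch like this (infer_sample_count is a hypothetical helper name; the actual logic lives inside PcmData.duration):

    import numpy as np

    def infer_sample_count(samples: np.ndarray, channels: int) -> int:
        """Guess which axis of a 2D PCM array holds samples.

        Assumes the internal convention is (channels, samples); a
        (samples, channels) layout is detected by matching an axis to the
        declared channel count.
        """
        if samples.ndim == 1:
            return samples.shape[0]
        if samples.shape[0] == channels:
            return samples.shape[1]
        if samples.shape[1] == channels:
            return samples.shape[0]
        return max(samples.shape)  # ambiguous (e.g. 2x2): pick the larger axis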


658-712: Debug utility for audio playback is helpful but narrow in scope

The play_pcm_with_ffplay function writes WAV files and spawns ffplay for testing. The timeout handling (lines 704-708) prevents hangs.

Consider noting in the docstring that this is intended for local development/debugging only, as it relies on ffplay being in PATH and spawns uncontrolled subprocesses.
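
Beyond the docstring note, an explicit preflight check keeps the failure mode obvious. A minimal sketch (require_ffplay is a hypothetical helper, not part of the module):

    import shutil

    def require_ffplay() -> str:
        # Fail fast with a clear message instead of a cryptic spawn error.
        path = shutil.which("ffplay")
        if path is None:
            raise RuntimeError(
                "ffplay not found on PATH; install FFmpeg to use this debug helper"
            )
        return path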

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5261a5f and 2c00228.

📒 Files selected for processing (4)
  • agents-core/vision_agents/core/edge/types.py (6 hunks)
  • agents-core/vision_agents/core/tts/manual_test.py (1 hunks)
  • agents-core/vision_agents/core/tts/tts.py (5 hunks)
  • tests/test_utils.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • agents-core/vision_agents/core/tts/manual_test.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

tests/**/*.py: Never use mocking utilities (e.g., unittest.mock, pytest-mock) in test files
Write tests using pytest (avoid unittest.TestCase or other frameworks)
Mark integration tests with @pytest.mark.integration
Do not use @pytest.mark.asyncio; async support is automatic

Files:

  • tests/test_utils.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • tests/test_utils.py
  • agents-core/vision_agents/core/tts/tts.py
  • agents-core/vision_agents/core/edge/types.py
🧬 Code graph analysis (3)
tests/test_utils.py (2)
agents-core/vision_agents/core/utils/utils.py (2)
  • parse_instructions (41-90)
  • Instructions (17-21)
agents-core/vision_agents/core/edge/types.py (6)
  • PcmData (63-655)
  • from_bytes (165-233)
  • duration (89-145)
  • resample (298-447)
  • pts_seconds (153-156)
  • dts_seconds (159-162)
agents-core/vision_agents/core/tts/tts.py (7)
agents-core/vision_agents/core/events/base.py (2)
  • PluginClosedEvent (67-74)
  • AudioFormat (23-30)
agents-core/vision_agents/core/edge/types.py (5)
  • PcmData (63-655)
  • resample (298-447)
  • to_bytes (449-486)
  • duration_ms (148-150)
  • close (47-48)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
  • TTSErrorEvent (51-64)
tests/test_tts_base.py (3)
  • stream_audio (9-13)
  • stream_audio (20-24)
  • stream_audio (31-32)
plugins/aws/vision_agents/plugins/aws/tts.py (1)
  • stream_audio (61-94)
plugins/cartesia/vision_agents/plugins/cartesia/tts.py (1)
  • stream_audio (54-82)
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (1)
  • stream_audio (39-62)
agents-core/vision_agents/core/edge/types.py (3)
agents-core/vision_agents/core/edge/edge_transport.py (2)
  • close (38-39)
  • join (46-47)
agents-core/vision_agents/core/agents/agents.py (2)
  • close (438-509)
  • join (342-414)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (3)
  • close (40-41)
  • close (327-329)
  • join (237-289)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Ruff & mypy
🔇 Additional comments (17)
tests/test_utils.py (2)

28-39: LGTM! Test expectations correctly updated.

The test text and assertions have been updated to include @guide.md, which properly validates the enhanced parse_instructions behavior for collecting multiple markdown mentions.


192-326: File handling improvements look good.

The explicit use of encoding='utf-8' when opening text files is a best practice that ensures consistent behavior across platforms.
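
For reference, the endorsed pattern looks like this ("notes.md" is a placeholder path):

    from pathlib import Path

    # An explicit encoding decodes identically on every platform,
    # regardless of the locale default (e.g. cp1252 on Windows).
    text = Path("notes.md").read_text(encoding="utf-8")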

agents-core/vision_agents/core/tts/tts.py (5)

98-137: LGTM: Persistent resampler avoids audio artifacts

The persistent resampler pattern prevents clicking/discontinuities between chunks. The input layout detection and debug logging are helpful for troubleshooting.
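
For context, the pattern in rough form, assuming PyAV (ChunkResampler is an illustrative name, not the class used here):

    import av

    class ChunkResampler:
        """Reuse one av.AudioResampler so filter state carries across chunks."""

        def __init__(self, rate: int, layout: str = "stereo") -> None:
            self._resampler = av.AudioResampler(format="s16", layout=layout, rate=rate)

        def process(self, frame: av.AudioFrame) -> list[av.AudioFrame]:
            # A fresh resampler per chunk would reset this filter state and
            # produce audible clicks at chunk boundaries.
            return self._resampler.resample(frame)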


138-164: LGTM: Type safety checks prevent raw bytes from breaking downstream

The defensive isinstance(item, PcmData) checks at lines 147-150 and 158-161 ensure plugins return properly wrapped data, addressing the type-safety concern from the previous review.
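
The guard reduces to something like this sketch (_checked is a hypothetical wrapper; the real checks are inline in _iter_pcm):

    from typing import AsyncIterator

    from vision_agents.core.edge.types import PcmData

    async def _checked(stream: AsyncIterator[object]) -> AsyncIterator[PcmData]:
        async for item in stream:
            if not isinstance(item, PcmData):  # reject raw bytes and other shapes
                raise TypeError(
                    f"stream_audio must yield PcmData, got {type(item).__name__}"
                )
            yield item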


166-202: Approve resampling and emission logic

The chunk emission correctly resamples using the persistent resampler, serializes to bytes, records metrics, and emits events. The tuple return for accounting is clean.


257-351: Synthesis lifecycle and observability implementation looks solid

The send method correctly:

  • Resets resampler state per synthesis (lines 261-263)
  • Emits start/complete/error events with rich context
  • Tracks latency and error metrics using OpenTelemetry counters
  • Computes real-time factor from accumulated PCM durations (see the sketch below)
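
The last bullet is plain arithmetic; as a sketch:

    def real_time_factor(synthesis_seconds: float, audio_ms: float) -> float:
        # RTF < 1.0 means audio is synthesized faster than it plays back.
        return synthesis_seconds / (audio_ms / 1000.0)

    assert real_time_factor(0.5, 2000.0) == 0.25  # 0.5 s to produce 2 s of audio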

352-362: Clean plugin lifecycle with PluginClosedEvent

Emitting PluginClosedEvent on close provides observability for plugin shutdown and aligns with the broader event-driven architecture.

agents-core/vision_agents/core/edge/types.py (10)

2-22: LGTM: Import additions support new PCM utilities

The added imports (asyncio, os, shutil, tempfile, time, typing extensions) align with the new utilities for PCM handling, WAV conversion, and ffplay integration.


47-48: Abstract close method is appropriate for base class

The pass body is standard for an abstract/protocol method that subclasses will override with actual cleanup logic.


51-61: OutputAudioTrack protocol enables polymorphic audio output

The @runtime_checkable decorator allows isinstance() checks, and the minimalist protocol (write/stop) provides a clean abstraction for audio tracks across different transport implementations.
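
Its shape is roughly the following (exact signatures may differ in the source):

    from typing import Protocol, runtime_checkable

    @runtime_checkable
    class OutputAudioTrack(Protocol):
        async def write(self, data: bytes) -> None: ...

        async def stop(self) -> None: ...

    # @runtime_checkable enables structural isinstance() checks:
    # isinstance(track, OutputAudioTrack) verifies the methods exist,
    # though not their signatures or async-ness.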


82-86: Multi-channel support additions are straightforward

Adding the channels: int = 1 field and the stereo property extends PcmData for stereo use cases without breaking existing mono callers (see the sketch below).
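
A stand-in illustrating the addition (_Pcm is hypothetical; the real PcmData carries many more fields):

    from dataclasses import dataclass

    @dataclass
    class _Pcm:
        channels: int = 1  # the default keeps existing mono callers unchanged

        @property
        def stereo(self) -> bool:
            return self.channels == 2

    assert _Pcm(channels=2).stereo and not _Pcm().stereo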


164-233: from_bytes interleaving logic is robust

The method:

  • Aligns buffer to sample boundaries (lines 197-211)
  • Converts interleaved [L,R,L,R,...] to (channels, samples) via reshape and transpose (lines 224-226; see the sketch below)
  • Logs warnings on reshape failures (lines 228-230)
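
The interleaved-to-planar step, reduced to its numpy core (values illustrative):

    import numpy as np

    channels = 2
    raw = np.arange(8, dtype=np.int16).tobytes()   # interleaved: L0 R0 L1 R1 ...
    flat = np.frombuffer(raw, dtype=np.int16)
    frames = len(flat) // channels                 # drop any trailing partial frame
    planar = flat[: frames * channels].reshape(frames, channels).T
    assert planar.shape == (2, 4)                  # (channels, samples)
    assert planar[0].tolist() == [0, 2, 4, 6]      # left channel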

235-296: from_data factory provides flexible PcmData construction

Supporting both bytes-like and numpy arrays with automatic shape normalization (lines 261-286) reduces boilerplate for callers. The dtype alignment (lines 256-259) ensures consistency with the declared format.


298-447: Resample implementation handles PyAV quirks comprehensively

The method:

  • Normalizes input to (channels, samples) for PyAV (lines 322-350)
  • Uses provided or new resampler (lines 354-361)
  • Deinterleaves PyAV's packed stereo output at lines 375-389 (see the sketch after this list)
  • Handles various ndim cases defensively (lines 390-419)
  • Flattens mono to 1D for consistency (lines 422-427)
  • Returns format="s16" as the resampler always outputs s16 (line 439)

This addresses the dtype/format issue from the past review.
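
The deinterleave step in isolation (values illustrative):

    import numpy as np

    # PyAV hands back packed s16 stereo as shape (1, samples * channels).
    packed = np.array([[10, 20, 11, 21, 12, 22]], dtype=np.int16)
    stereo = packed.reshape(-1, 2).T  # -> (channels, samples)
    assert stereo.tolist() == [[10, 11, 12], [20, 21, 22]]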


449-487: to_bytes interleaving produces correct packed format

The explicit interleaving loop (lines 473-477) ensures [L0, R0, L1, R1, ...] order for multi-channel, avoiding stride-related issues. The shape normalization (lines 458-471) handles both (channels, samples) and (samples, channels) layouts.
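
The loop's effect in miniature (values illustrative):

    import numpy as np

    planar = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int16)  # (channels, samples)
    n_ch = planar.shape[0]
    interleaved = np.empty(planar.size, dtype=np.int16)
    for ch in range(n_ch):
        interleaved[ch::n_ch] = planar[ch]  # strided copy sidesteps layout surprises
    assert interleaved.tolist() == [1, 4, 2, 5, 3, 6]  # L0 R0 L1 R1 L2 R2
    packed = interleaved.tobytes()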


488-530: WAV serialization converts non-s16 formats correctly

Lines 499-518 convert float or non-int16 arrays to s16 by clipping to [-1.0, 1.0] and scaling to int16 range. The wave module writes a standard WAV header with proper channel/rate metadata.
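
The conversion amounts to a clip-and-scale step, roughly:

    import numpy as np

    def float_to_s16(x: np.ndarray) -> np.ndarray:
        # Clip first so out-of-range floats pin to the rails instead of
        # overflowing int16 during the cast.
        return (np.clip(x, -1.0, 1.0) * 32767.0).astype(np.int16)

    assert float_to_s16(np.array([0.0, 1.0, -2.0])).tolist() == [0, 32767, -32767]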


531-656: from_response handles diverse provider APIs comprehensively

The method:

  • Returns single PcmData for bytes-like or already-PcmData inputs
  • Wraps async iterators (lines 563-600) and sync iterators (lines 602-640) with buffering and frame alignment
  • Pads incomplete frames with zeros (lines 589-598, 629-638; see the padding sketch below)
  • Extracts .data attribute from response objects (lines 643-651)

This enables plugins to return various response shapes without callers needing custom unwrapping logic.
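
The padding step from the third bullet, as a standalone sketch (pad_to_frame is a hypothetical name):

    def pad_to_frame(buf: bytes, channels: int, sample_width: int = 2) -> bytes:
        # Zero-pad a trailing partial frame so fixed-size decoding never truncates.
        frame = channels * sample_width
        rem = len(buf) % frame
        return buf + b"\x00" * (frame - rem) if rem else buf

    assert len(pad_to_frame(b"\x01\x02\x03", channels=2)) == 4  # one s16 stereo frame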

@tbarbugli tbarbugli changed the title from "Simplify STT plugin and audio utils" to "Simplify TTS plugin and audio utils" Oct 24, 2025
@tbarbugli tbarbugli merged commit 3316908 into main Oct 24, 2025
5 checks passed
@tbarbugli tbarbugli deleted the stt-plugins branch October 24, 2025 21:48
Nash0x7E2 added a commit to Nash0x7E2/Vision-Agents that referenced this pull request Oct 28, 2025
commit ec32383
Author: Neevash Ramdial (Nash) <[email protected]>
Date:   Mon Oct 27 15:51:53 2025 -0600

    mypy clean up (GetStream#130)

commit c52fe4c
Author: Neevash Ramdial (Nash) <[email protected]>
Date:   Mon Oct 27 15:28:00 2025 -0600

    remove turn keeping from example (GetStream#129)

commit e1072e8
Merge: 5bcffa3 fea101a
Author: Yarik <[email protected]>
Date:   Mon Oct 27 14:28:05 2025 +0100

    Merge pull request GetStream#106 from tjirab/feat/20251017_gh-labeler

    feat: Github pull request labeler

commit 5bcffa3
Merge: 406673c bfe888f
Author: Thierry Schellenbach <[email protected]>
Date:   Sat Oct 25 10:56:27 2025 -0600

    Merge pull request GetStream#119 from GetStream/fix-screensharing

    Fix screensharing

commit bfe888f
Merge: 8019c14 406673c
Author: Thierry Schellenbach <[email protected]>
Date:   Sat Oct 25 10:56:15 2025 -0600

    Merge branch 'main' into fix-screensharing

commit 406673c
Author: Stefan Blos <[email protected]>
Date:   Sat Oct 25 03:03:10 2025 +0200

    Update README (GetStream#118)

    * Changed README to LaRaes version

    * Remove arrows from table

    * Add table with people & projects to follow

    * Update images and links in README.md

commit 3316908
Author: Tommaso Barbugli <[email protected]>
Date:   Fri Oct 24 23:48:06 2025 +0200

    Simplify TTS plugin and audio utils (GetStream#123)

    - Simplified TTS plugin
    - AWS Polly TTS plugin
    - OpenAI TTS plugin
    - Improved audio utils

commit 8019c14
Author: Max Kahan <[email protected]>
Date:   Fri Oct 24 17:32:26 2025 +0100

    remove video forwarder lazy init

commit ca62d37
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 16:44:03 2025 +0100

    use correct codec

commit 8cf8788
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 14:27:18 2025 +0100

    rename variable to fix convention

commit 33fd70d
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 14:24:42 2025 +0100

    unsubscribe from events

commit 3692131
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 14:19:53 2025 +0100

    remove nonexistent type

commit c5f68fe
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 14:10:07 2025 +0100

    cleanup tests to fit style

commit 8b3c61a
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 13:55:08 2025 +0100

    clean up resources when track cancelled

commit d8e08cb
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 13:24:55 2025 +0100

    fix track republishing in agent

commit 0f8e116
Author: Max Kahan <[email protected]>
Date:   Wed Oct 22 15:37:11 2025 +0100

    add tests

commit 08e6133
Author: Max Kahan <[email protected]>
Date:   Wed Oct 22 15:25:37 2025 +0100

    ensure video track dimensions are an even number

commit 6a725b0
Merge: 5f001e0 5088709
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 15:23:58 2025 -0600

    Merge pull request GetStream#122 from GetStream/cleanup_stt

    Cleanup STT

commit 5088709
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 15:23:34 2025 -0600

    cleanup of stt

commit f185120
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 15:08:42 2025 -0600

    more cleanup

commit 05ccbfd
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 14:51:48 2025 -0600

    cleanup

commit bb834ca
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 14:28:53 2025 -0600

    more cleanup for stt

commit 7a3f2d2
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 14:11:35 2025 -0600

    more test cleanup

commit ad7f4fe
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 14:10:57 2025 -0600

    cleanup test

commit 9e50cdd
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 14:03:45 2025 -0600

    large cleanup

commit 5f001e0
Merge: 95a03e4 5d204f3
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 12:01:52 2025 -0600

    Merge pull request GetStream#121 from GetStream/fish_stt

    [AI-201] Fish speech to text (partial)

commit 5d204f3
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 11:48:16 2025 -0600

    remove ugly tests

commit ee9a241
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 11:46:19 2025 -0600

    cleanup

commit 6eb8270
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 11:23:00 2025 -0600

    fix 48khz support

commit 3b90548
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 10:59:08 2025 -0600

    first attempt at fish stt, doesnt entirely work just yet

commit 95a03e4
Merge: b90c9e3 b4c0da8
Author: Tommaso Barbugli <[email protected]>
Date:   Thu Oct 23 10:11:39 2025 +0200

    Merge branch 'main' of github.com:GetStream/Vision-Agents

commit b90c9e3
Author: Tommaso Barbugli <[email protected]>
Date:   Wed Oct 22 23:28:28 2025 +0200

    remove print and double event handling

commit b4c0da8
Merge: 3d06446 a426bc2
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 15:08:51 2025 -0600

    Merge pull request GetStream#117 from GetStream/openrouter

    [AI-194] Openrouter

commit a426bc2
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 15:03:10 2025 -0600

    skip broken test

commit ba6c027
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 14:50:23 2025 -0600

    almost working openrouter

commit 0b1c873
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 14:47:12 2025 -0600

    almost working, just no instruction following

commit ce63233
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 14:35:53 2025 -0600

    working memory for openai

commit 149e886
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 13:32:43 2025 -0600

    todo

commit e0df1f6
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 13:20:38 2025 -0600

    first pass at adding openrouter

commit 3d06446
Merge: 4eb8ef4 ef55d66
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 13:20:11 2025 -0600

    Merge branch 'main' of github.com:GetStream/Vision-Agents

commit 4eb8ef4
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 13:20:01 2025 -0600

    cleanup ai plugin instructions

commit ef55d66
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 12:54:33 2025 -0600

    Add link to stash_pomichter for spatial memory

commit 9c9737f
Merge: c954409 390c45b
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 19:45:09 2025 -0600

    Merge pull request GetStream#115 from GetStream/fish

    [AI-195] Fish support

commit 390c45b
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 19:44:37 2025 -0600

    cleannup

commit 1cc1cf1
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 19:42:03 2025 -0600

    happy tests

commit 8163d32
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 19:39:21 2025 -0600

    fix gemini rule following

commit ada3ac9
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 19:20:18 2025 -0600

    fish tts

commit 61a26cf
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 16:44:03 2025 -0600

    attempt at fish

commit c954409
Merge: ab27e48 c71da10
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 14:18:15 2025 -0600

    Merge pull request GetStream#104 from GetStream/bedrock

    [AI-192] - Bedrock, AWS & Nova

commit c71da10
Author: Tommaso Barbugli <[email protected]>
Date:   Tue Oct 21 22:00:25 2025 +0200

    maybe

commit b5482da
Author: Tommaso Barbugli <[email protected]>
Date:   Tue Oct 21 21:46:15 2025 +0200

    debugging

commit 9a36e45
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 13:14:58 2025 -0600

    echo environment name

commit 6893968
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 12:53:58 2025 -0600

    more debugging

commit c35fc47
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 12:45:44 2025 -0600

    add some debug info

commit 0d6d3fd
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 12:03:13 2025 -0600

    run test fix

commit c3a31bd
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 11:52:25 2025 -0600

    log cache hit

commit 04554ae
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 11:48:03 2025 -0600

    fix glob

commit 7da96db
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 11:33:56 2025 -0600

    mypy

commit 186053f
Merge: 4b540c9 ab27e48
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 11:17:17 2025 -0600

    happy tests

commit 4b540c9
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 10:20:04 2025 -0600

    happy tests

commit b05a60a
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 09:17:45 2025 -0600

    add readme

commit 71affcc
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 09:13:01 2025 -0600

    rename to aws

commit d2eeba7
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 21:32:01 2025 -0600

    ai tts instructions

commit 98a4f9d
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 16:49:00 2025 -0600

    small edits

commit ab27e48
Author: Tommaso Barbugli <[email protected]>
Date:   Mon Oct 20 21:42:04 2025 +0200

    Ensure user agent is initialized before joining the call (GetStream#113)

    * ensure user agent is initialized before joining the call

    * wip

commit 3cb339b
Author: Tommaso Barbugli <[email protected]>
Date:   Mon Oct 20 21:22:57 2025 +0200

    New conversation API (GetStream#102)

    * trying to resurrect

    * test transcription events for openai

    * more tests for openai and gemini llm

    * more tests for openai and gemini llm

    * update py-client

    * wip

    * ruff

    * wip

    * ruff

    * snap

    * another way

    * another way, a better way

    * ruff

    * ruff

    * rev

    * ruffit

    * mypy everything

    * brief

    * tests

    * openai dep bump

    * snap - broken

    * nothingfuckingworks

    * message id

    * fix test

    * ruffit

commit cb6f00a
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 13:18:03 2025 -0600

    use qwen

commit f84b2ad
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 13:02:24 2025 -0600

    fix tests

commit e61acca
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 12:50:40 2025 -0600

    testing and linting

commit 5f4d353
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 12:34:14 2025 -0600

    working

commit c2a15a9
Merge: a310771 1025a42
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 11:40:00 2025 -0600

    Merge branch 'main' of github.com:GetStream/Vision-Agents into bedrock

commit a310771
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 11:39:48 2025 -0600

    wip

commit b4370f4
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 11:22:43 2025 -0600

    something isn't quite working

commit 2dac975
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 10:30:04 2025 -0600

    add the examples

commit 6885289
Author: Thierry Schellenbach <[email protected]>
Date:   Sun Oct 19 20:19:42 2025 -0600

    ai realtime docs

commit a0fa3cc
Author: Thierry Schellenbach <[email protected]>
Date:   Sun Oct 19 18:48:06 2025 -0600

    wip

commit b914fc3
Author: Thierry Schellenbach <[email protected]>
Date:   Sun Oct 19 18:40:22 2025 -0600

    fix ai llm

commit b5b00a7
Author: Thierry Schellenbach <[email protected]>
Date:   Sun Oct 19 17:11:26 2025 -0600

    work audio input

commit ac72260
Author: Thierry Schellenbach <[email protected]>
Date:   Sun Oct 19 16:47:19 2025 -0600

    fix model id

commit 2b5863c
Author: Thierry Schellenbach <[email protected]>
Date:   Sun Oct 19 16:32:54 2025 -0600

    wip on bedrock

commit 8bb4162
Author: Thierry Schellenbach <[email protected]>
Date:   Fri Oct 17 15:22:03 2025 -0600

    next up the connect method

commit 7a21e4e
Author: Thierry Schellenbach <[email protected]>
Date:   Fri Oct 17 14:12:00 2025 -0600

    nova progress

commit 16e8ba0
Author: Thierry Schellenbach <[email protected]>
Date:   Fri Oct 17 13:16:00 2025 -0600

    docs for bedrock nova

commit 1025a42
Author: Bart Schuijt <[email protected]>
Date:   Fri Oct 17 21:05:45 2025 +0200

    fix: Update .env.example for Gemini Live (GetStream#108)

commit e12112d
Author: Thierry Schellenbach <[email protected]>
Date:   Fri Oct 17 11:49:07 2025 -0600

    wip

commit fea101a
Author: Bart Schuijt <[email protected]>
Date:   Fri Oct 17 09:25:55 2025 +0200

    workflow file update

commit bb2d74c
Author: Bart Schuijt <[email protected]>
Date:   Fri Oct 17 09:22:33 2025 +0200

    initial commit

commit d2853cd
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 16 19:44:59 2025 -0600

    always remember pep 420

commit 30a8eca
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 16 19:36:58 2025 -0600

    start of bedrock branch

commit fc032bf
Author: Tommaso Barbugli <[email protected]>
Date:   Thu Oct 16 09:17:42 2025 +0200

    Remove cli handler from examples (GetStream#101)

commit 39a821d
Author: Dan Gusev <[email protected]>
Date:   Tue Oct 14 12:20:41 2025 +0200

    Update Deepgram plugin to use SDK v5.0.0 (GetStream#98)

    * Update Deepgram plugin to use SDK v5.0.0

    * Merge test_realtime and test_stt and update the remaining tests

    * Make deepgram.STT.start() idempotent

    * Clean up unused import

    * Use uv as the default package manager > pip

    ---------

    Co-authored-by: Neevash Ramdial (Nash) <[email protected]>

commit 2013be5
Author: Tommaso Barbugli <[email protected]>
Date:   Mon Oct 13 16:57:37 2025 +0200

    ensure chat works with default types (GetStream#99)