
@tbarbugli (Member) commented Oct 23, 2025

  • Simplified STT base class
  • AI docs for STT
  • AI docs for Audio utils
  • Human docs for STT and Audio utils
  • Cleanup of tests for all STT plugins
  • Base tests for the STT class
  • STT plugin using AI instructions (single-shot)
  • OpenAI TTS plugin
  • AWS Polly TTS plugin
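
For context, the two new plugins above construct roughly like this. This is a hedged sketch based only on the constructor descriptions in this PR: the keyword names (region, voice) and the import path for the AWS plugin are assumptions, not verified against the plugin code.

from vision_agents.plugins import aws, openai

# Defaults per the PR summary: model "gpt-4o-mini-tts", voice "alloy";
# api_key is optional (presumably read from the environment).
tts_openai = openai.TTS()

# The Polly constructor "configures region, voice, text type, engine,
# language code, lexicon names"; exact kwarg names are assumed here.
tts_polly = aws.TTS(region="us-east-1", voice="Joanna")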

Summary by CodeRabbit

Release Notes

  • New Features

    • Added AWS Polly text-to-speech support with customizable voices and audio formats.
    • Added OpenAI text-to-speech integration with configurable models and voices.
    • Enhanced TTS output format configuration with support for multiple sample rates and channels.
  • Improvements

    • Unified audio handling across all TTS providers for consistent format support.
    • Added audio testing utilities including WAV generation and non-blocking verification.
    • Improved multi-channel audio support with enhanced resampling capabilities.

@coderabbitai (bot) commented Oct 23, 2025

Walkthrough

Audio handling architecture refactored to use PCM-centric data streams with a new OutputAudioTrack protocol. Core PcmData class enhanced for multi-channel samples, resampling, and WAV serialization. TTS base class updated with set_output_format configuration and chunk-based audio event emission. Multiple TTS plugins (Cartesia, ElevenLabs, Fish, Kokoro) and new plugins (OpenAI, AWS Polly) adapted to return PcmData streams. Testing utilities introduced for session-based result collection and non-blocking verification.
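
As a sketch of the new contract, the protocol can be pictured as below. This is reconstructed from the walkthrough, not copied from agents-core/vision_agents/core/edge/types.py; the parameter names and whether stop() is async are assumptions.

from typing import Protocol


class OutputAudioTrack(Protocol):
    """Sink that the Agent routes synthesized PCM into."""

    async def write(self, data: bytes) -> None:
        """Accept a chunk of interleaved PCM bytes for playback."""
        ...

    async def stop(self) -> None:
        """Stop the track and drop any pending audio."""
        ...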

Changes

Cohort / File(s) — Summary
Core Audio Infrastructure
agents-core/vision_agents/core/edge/types.py
Added OutputAudioTrack protocol with async write() and stop() methods. Enhanced PcmData with channels field, stereo property, duration_ms property, multi-channel resampling, WAV serialization (to_wav_bytes, to_bytes), from_data classmethod, and from_response streaming factory. Added top-level play_pcm_with_ffplay() async utility for WAV playback.
TTS Base Class Refactor
agents-core/vision_agents/core/tts/tts.py
Added set_output_format(sample_rate, channels, audio_format) for output configuration. Introduced _iter_pcm() to normalize provider responses, _emit_chunk() for PCM resampling/serialization with event emission, and updated stream_audio() to return Union[bytes, Iterator[bytes], AsyncIterator[bytes], PcmData, Iterator[PcmData], AsyncIterator[PcmData]]. Added stop_audio() public method. Updated error handling and latency recording; removed PluginInitializedEvent, added PluginClosedEvent emission.
Agent RTC Integration
agents-core/vision_agents/core/agents/agents.py
Changed _audio_track type from Optional[aiortc.AudioStreamTrack] to Optional[OutputAudioTrack]. Updated _prepare_rtc to call TTS set_output_format() instead of set_output_track(). Added TTSAudioEvent import and event handler hook. Set default framerate to 48000 Hz and stereo to True when not in realtime mode.
TTS Testing Utilities
agents-core/vision_agents/core/tts/testing.py, agents-core/vision_agents/core/tts/manual_test.py
New TTSSession class for event-driven result collection (wait_for_result() with timeout). New TTSResult dataclass. New manual_tts_to_wav() async helper for TTS-to-WAV conversion with optional ffplay playback. New assert_tts_send_non_blocking() utility to verify event-loop responsiveness during TTS sends.
Observability Updates
agents-core/vision_agents/core/observability/metrics.py, agents-core/vision_agents/core/observability/__init__.py
Refactored OpenTelemetry initialization to defer provider configuration to application. Added tts_events_emitted counter metric. Updated tracer and meter to use fixed library identifiers via trace.get_tracer() and metrics.get_meter(). Removed CALL_ATTRS export.
Edge Transport Abstraction
agents-core/vision_agents/core/edge/edge_transport.py, plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py
Updated create_audio_track() return type to OutputAudioTrack. Added OutputAudioTrack import. Minor formatting adjustments for multi-line signatures.
Cartesia TTS Plugin
plugins/cartesia/vision_agents/plugins/cartesia/tts.py
Removed get_required_framerate(), get_required_stereo(), set_output_track(). Updated stream_audio() signature to return PcmData | Iterator[PcmData] | AsyncIterator[PcmData] via PcmData.from_response(). Changed stop_audio() to a logging no-op. Updated imports: added AsyncIterator, Iterator; removed AudioStreamTrack. (This shared plugin pattern is sketched after the table.)
ElevenLabs TTS Plugin
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py
Removed framerate/stereo methods (get_required_framerate(), get_required_stereo(), set_output_track()). Updated stream_audio() to wrap response via PcmData.from_response() with sample_rate=16000, channels=1, format="s16". Changed stop_audio() to no-op. Added PcmData and os imports.
Fish TTS Plugin
plugins/fish/vision_agents/plugins/fish/tts.py
Updated stream_audio() signature and return type to PcmData | Iterator[PcmData] | AsyncIterator[PcmData]. Changed default reference_id to concrete string. Added support for FISH_AUDIO_API_KEY env var. Refactored TTSRequest construction with **tts_request_kwargs. Wrapped stream via PcmData.from_response(). Changed stop_audio() to no-op. Added Iterator, PcmData imports; removed AudioStreamTrack.
Kokoro TTS Plugin
plugins/kokoro/vision_agents/plugins/kokoro/tts.py
Removed get_required_framerate(), get_required_stereo(), set_output_track(). Updated stream_audio() to yield PcmData.from_bytes() objects instead of raw bytes; return type now PcmData | Iterator[PcmData] | AsyncIterator[PcmData]. Changed stop_audio() to no-op. Updated typing imports.
OpenAI TTS Plugin (New)
plugins/openai/vision_agents/plugins/openai/tts.py, plugins/openai/vision_agents/plugins/openai/__init__.py, plugins/openai/tests/test_tts_openai.py
New TTS implementation for OpenAI. Constructor takes api_key, model (default "gpt-4o-mini-tts"), voice (default "alloy"), optional client. stream_audio() calls OpenAI API with PCM output, returns PcmData with sample_rate=24000, channels=1, format="s16". stop_audio() is no-op. Added __init__ export and TTS to __all__. New integration tests with fixture and manual WAV generation.
AWS Polly TTS Plugin (New)
plugins/aws/vision_agents/plugins/aws/tts.py, plugins/aws/vision_agents/plugins/aws/__init__.py, plugins/aws/tests/test_tts.py, plugins/aws/example/aws_polly_tts_example.py, plugins/aws/README.md
New TTS implementation for AWS Polly. Constructor configures region, voice, text type, engine, language code, lexicon names, optional client. stream_audio() calls Polly SynthesizeSpeech with 16kHz PCM output, returns PcmData. Added __init__ export and TTS to __all__. Updated README. Added example script and integration tests with credential detection and environment-based gating.
Test Refactoring
plugins/cartesia/tests/test_tts.py, plugins/elevenlabs/tests/test_tts.py, plugins/fish/tests/test_fish_tts.py, plugins/kokoro/tests/test_tts.py, tests/test_tts_base.py, tests/test_pcm_data.py
Removed unit-test mocking infrastructure; replaced with integration-focused tests using TTSSession, manual_tts_to_wav(), and assert_tts_send_non_blocking(). Tests now use pytest fixtures with environment-based credential gating (pytest.skip). Added comprehensive PcmData tests covering interleaving, resampling, duration preservation, and multi-channel handling. New test_tts_base.py validates PCM streaming, error propagation, and event emission for TTS base class.
Documentation & Examples
docs/ai/instructions/ai-tts.md, docs/ai/instructions/ai-tests.md, DEVELOPMENT.md, examples/01_simple_agent_example/simple_agent_example.py
Updated TTS implementation guide with emphasis on stream_audio() returning PcmData and usage of PcmData.from_bytes(). Added non-blocking checks documentation with assert_tts_send_non_blocking() example. New "Audio management" section in DEVELOPMENT.md detailing PCM-centric handling, WAV serialization, resampling, and playback. Updated simple agent example UI flow.
Infrastructure
conftest.py, tests/test_utils.py, plugins/aws/tests/test_aws.py
Minor formatting and stylistic adjustments in conftest.py. Updated test utilities to handle 1D and 2D numpy arrays in PcmData tests. Updated AWS Bedrock test fixture to skip with pytest.skip() when credentials are missing instead of raising.
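
The plugin rows above (Cartesia, ElevenLabs, Fish) share one pattern, sketched below under assumptions: the stream_audio signature is abbreviated, and the provider client and its stream() call are placeholders rather than a real SDK.

from typing import AsyncIterator, Iterator

from vision_agents.core.edge.types import PcmData
from vision_agents.core.tts.tts import TTS


class ExampleTTS(TTS):
    """Hypothetical plugin following the pattern described above."""

    async def stream_audio(
        self, text: str, *args, **kwargs
    ) -> PcmData | Iterator[PcmData] | AsyncIterator[PcmData]:
        # Placeholder provider call; a real plugin would create _client in
        # __init__ and call its SDK here.
        stream = self._client.stream(text)
        # Declare the provider-native format (16 kHz mono s16 here) and let
        # the base class resample to whatever set_output_format() requested.
        return PcmData.from_response(
            stream, sample_rate=16000, channels=1, format="s16"
        )

    async def stop_audio(self) -> None:
        # Playback is controlled by the Agent, so a no-op is acceptable.
        pass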

Sequence Diagram(s)

sequenceDiagram
    participant Agent
    participant TTS
    participant TTSProvider
    participant PcmData
    participant OutputAudioTrack
    
    Agent->>TTS: set_output_format(sample_rate, channels)
    activate TTS
    TTS->>TTS: store desired format
    deactivate TTS
    
    Agent->>TTS: send(text)
    activate TTS
    TTS->>TTSProvider: synthesize_speech(text)
    activate TTSProvider
    TTSProvider-->>TTS: audio stream (bytes/chunks)
    deactivate TTSProvider
    
    loop for each chunk
        TTS->>TTS: _iter_pcm(chunk)
        TTS->>PcmData: from_bytes(chunk, ...)
        TTS->>TTS: resample to output_format
        TTS->>TTS: _emit_chunk(pcm)
        TTS->>TTS: emit TTSAudioEvent
        TTS->>OutputAudioTrack: write(pcm_bytes)
        activate OutputAudioTrack
        OutputAudioTrack-->>Agent: audio routed to WebRTC
        deactivate OutputAudioTrack
    end
    
    TTS->>TTS: emit TTSSynthesisCompleteEvent
    deactivate TTS
sequenceDiagram
    participant Test
    participant TTSSession
    participant TTS
    participant EventBus
    
    Test->>TTS: set_output_format(sample_rate, channels)
    Test->>TTSSession: new TTSSession(tts)
    activate TTSSession
    TTSSession->>EventBus: subscribe(TTSSynthesisStartEvent, ...)
    TTSSession->>EventBus: subscribe(TTSAudioEvent, ...)
    TTSSession->>EventBus: subscribe(TTSErrorEvent, ...)
    TTSSession->>EventBus: subscribe(TTSSynthesisCompleteEvent, ...)
    deactivate TTSSession
    
    Test->>TTS: send(text)
    activate TTS
    TTS->>EventBus: emit TTSSynthesisStartEvent
    TTS->>EventBus: emit TTSAudioEvent (multiple)
    TTS->>EventBus: emit TTSSynthesisCompleteEvent
    deactivate TTS
    
    Test->>TTSSession: wait_for_result(timeout)
    activate TTSSession
    TTSSession->>TTSSession: await first relevant event or timeout
    TTSSession-->>Test: TTSResult(speeches, errors, started, completed)
    deactivate TTSSession
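The test flow in this diagram maps onto a short pytest sketch. The import path and the set_output_format/send/wait_for_result calls follow the summaries above; the `tts` fixture and the EXAMPLE_TTS_API_KEY variable are placeholders for a concrete plugin.

import os

import pytest

from vision_agents.core.tts.testing import TTSSession


@pytest.mark.integration
async def test_tts_produces_audio(tts):  # `tts` supplied by a plugin fixture
    if not os.environ.get("EXAMPLE_TTS_API_KEY"):  # placeholder variable
        pytest.skip("EXAMPLE_TTS_API_KEY not set")

    tts.set_output_format(sample_rate=48000, channels=2)
    session = TTSSession(tts)

    await tts.send("Hello from the integration test")
    result = await session.wait_for_result(timeout=15.0)

    assert not result.errors
    assert len(result.speeches) > 0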

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • PcmData multi-channel handling and resampling logic (agents-core/vision_agents/core/edge/types.py): Complex logic for shape normalization, 2D/3D sample layouts, interleaving, and resampling with dtype conversions. Validate correctness of channel count calculations and byte ordering (a standalone interleaving demo follows this list).
  • TTS base class PCM emission and event orchestration (agents-core/vision_agents/core/tts/tts.py): New _emit_chunk(), _iter_pcm() methods with latency recording and error handling require careful review of event sequencing, duration calculations, and completion semantics.
  • Agent RTC integration changes (agents-core/vision_agents/core/agents/agents.py): Verify that set_output_format() is called at the correct point in RTC preparation; ensure backward compatibility with realtime mode; validate that OutputAudioTrack protocol is correctly implemented by created tracks.
  • Plugin refactoring consistency: Multiple plugins follow similar patterns (remove framerate/stereo methods, wrap with PcmData.from_response()). Verify all plugins handle sample rates, channels, and formats consistently.
  • Test infrastructure migration: Verify that new TTSSession-based tests capture the same failure scenarios as the removed mock-based tests; ensure environment-based credential gating (pytest.skip) is consistent across all integration tests.
  • AWS Polly thread-pool execution: Validate that stream_audio() thread-pool call to synthesize_speech properly handles timeouts and cancellation without deadlocks.
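
For the first bullet, here is a standalone numpy demonstration of the planar-to-interleaved conversion at the heart of that logic. It is independent of the actual PcmData class (nothing from the repo is imported) and only illustrates the layout concern.

import numpy as np

# Two channels, four samples each: left = 1..4, right = 10..40
planar = np.array([[1, 2, 3, 4], [10, 20, 30, 40]], dtype=np.int16)

# (channels, samples) -> (samples, channels) -> flat interleaved L R L R ...
interleaved = np.ascontiguousarray(planar.T).reshape(-1)
assert interleaved.tolist() == [1, 10, 2, 20, 3, 30, 4, 40]

# Byte-count check: 4 frames * 2 channels * 2 bytes per s16 sample
assert len(interleaved.tobytes()) == 4 * 2 * 2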

Possibly related PRs

  • [AI-195] Fish support #115: Modifies the Fish TTS plugin implementation (plugins/fish/vision_agents/plugins/fish/tts.py), overlapping with this PR's Fish TTS refactoring to use PcmData and remove legacy framerate/stereo constraints.
  • [AI-201] Fish speech to text #121: Modifies agent event registration in agents-core/vision_agents/core/agents/agents.py by adding STT error event logging; this PR also modifies the same file to add TTSAudioEvent handling, creating potential merge conflicts or duplicated subscriber logic.

Suggested reviewers

  • Nash0x7E2
  • maxkahan
  • d3xvn

Poem

Bell jar of bytes descends—
Each PCM chunk resampled, interleaved, pressed
Into wire-thin audio tracks,
The TTS daemon speaks in stereo silence,
Formats standardized, no more guessing:
Forty-eight thousand hertz, the hum of engines,
Two channels deep—
the throat of the machine.

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 48.43%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The PR title "Simplify TTS plugin and audio utils" clearly relates to the changeset: the raw summary confirms the removal of public methods from TTS plugins (get_required_framerate, get_required_stereo, set_output_track), which matches the simplification theme, and the title matches the stated PR objectives ("Simplified TTS base class" plus multiple plugin cleanups). While it does not convey the underlying architectural shift to PCM-centric audio handling, it is concise, specific, and captures the main goal of reducing API complexity across the TTS plugins.


@coderabbitai (bot) left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
plugins/aws/tests/test_aws.py (1)

126-146: Apply consistent credential checking to all tests.

These tests create their own LLM instances and don't use the llm fixture, so they bypass the skip logic on line 41. Without AWS_BEARER_TOKEN_BEDROCK, they will attempt to run and likely fail, creating inconsistent test behavior.

Consider one of these solutions:

Solution 1: Add skip check to each test

 @pytest.mark.integration
 async def test_image_description(self, golf_swing_image):
+    if not os.environ.get("AWS_BEARER_TOKEN_BEDROCK"):
+        pytest.skip("AWS_BEARER_TOKEN_BEDROCK not set – skipping Bedrock tests")
     # Use a vision-capable model (Claude 3 Haiku supports images and is widely available)
     vision_llm = BedrockLLM(

Solution 2: Use the fixture and modify as needed

 @pytest.mark.integration
-async def test_image_description(self, golf_swing_image):
+async def test_image_description(self, llm: BedrockLLM, golf_swing_image):
     # Use a vision-capable model (Claude 3 Haiku supports images and is widely available)
-    vision_llm = BedrockLLM(
+    llm._model = "anthropic.claude-3-haiku-20240307-v1:0"
+    vision_llm = llm
-        model="anthropic.claude-3-haiku-20240307-v1:0", region_name="us-east-1"
-    )

Apply similar changes to test_instruction_following.

Also applies to: 149-161

agents-core/vision_agents/core/observability/metrics.py (1)

77-81: Do not emit spans at import time.

Creating spans during module import causes global side effects and unexpected traffic. Remove these calls; expose helpers to start spans in calling code instead.

-with tracer.start_as_current_span("stt.request", kind=trace.SpanKind.CLIENT) as span:
-    pass
-
-span = tracer.start_span("stt.request")
-span.end()
agents-core/vision_agents/core/agents/agents.py (1)

991-1004: Realtime warning condition is inconsistent with the message.

The second branch warns about “STT, TTS and Turn Detection” but only checks self.stt or self.turn_detection. Include self.tts for consistency.

-            if self.stt or self.turn_detection:
+            if self.stt or self.tts or self.turn_detection:
                 self.logger.warning(
                     "Realtime mode detected: STT, TTS and Turn Detection services will be ignored. "
                     "The Realtime model handles both speech-to-text, text-to-speech and turn detection internally."
                 )
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (1)

39-62: Update docstring to reflect actual return type; fix typo in __init__ docstring.

The SDK usage is correct—output_format="pcm_16000" is the proper format string for PCM at 16 kHz. However, two issues remain:

  1. Line 47 (stream_audio docstring): Change "An async iterator of audio chunks as bytes" to describe the actual return type PcmData | Iterator[PcmData] | AsyncIterator[PcmData].
  2. Line 23 (init docstring): Fix "ElvenLabs Client" → "ElevenLabs Client".
🧹 Nitpick comments (30)
plugins/aws/tests/test_aws.py (1)

43-43: Consider public API for test setup.

The tests access private attributes (_conversation) and methods (_set_instructions) directly. While common in testing, this couples tests to implementation details.

If BedrockLLM provides public methods to configure conversation state and instructions, prefer those. If not, consider adding public test helpers:

# In BedrockLLM class
def configure_for_testing(self, instructions: str = None, conversation = None):
    """Configure LLM for testing purposes."""
    if instructions:
        self._set_instructions(instructions)
    if conversation:
        self._conversation = conversation

Then in tests:

llm.configure_for_testing(conversation=InMemoryConversation("be friendly", []))

Also applies to: 154-154

docs/ai/instructions/ai-tts.md (3)

15-17: Clarify stream_audio return contract.

Current text says “return a single PcmData,” but plugins may return a PcmData or an (async) iterator of PcmData. Please update the guidance to accept both to match the base class behavior and existing plugins.


36-38: Avoid recommending buffering entire streams.

“Buffer streaming SDK audio into a single byte string” risks high memory usage for long utterances. Prefer emitting multiple PcmData chunks (or returning an iterator) and let the Agent handle resampling/assembly.
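
A minimal sketch of the chunked alternative, assuming an async SDK stream and the PcmData.from_bytes signature used elsewhere in this PR (the client object and its stream() method are placeholders):

from typing import AsyncIterator

from vision_agents.core.edge.types import PcmData


async def stream_audio_chunked(client, text: str) -> AsyncIterator[PcmData]:
    # Placeholder SDK streaming call; the point is per-chunk wrapping.
    async for chunk in client.stream(text):
        # Each chunk becomes a PcmData immediately, so the base class can
        # resample and emit TTSAudioEvent as audio arrives, keeping memory
        # bounded for long utterances.
        yield PcmData.from_bytes(
            chunk, sample_rate=16000, channels=1, format="s16"
        )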


80-84: Safer assertion in example.

Use assert result.speeches (or assert len(result.speeches) > 0) instead of indexing result.speeches[0] to avoid IndexError in edge cases.

agents-core/vision_agents/core/tts/manual_test.py (2)

25-31: Fix docstring inaccuracies (Google style).

The function receives a TTS instance; it does not create one via tts_factory(). Please remove that bullet to avoid confusion.

-    - Creates the TTS instance via `tts_factory()`.
     - Sets desired output format via `set_output_format(sample_rate, channels)`.

66-81: Ensure subprocess cleanup on timeout.

After proc.kill(), also await proc.wait() to reap the process.

         try:
             await asyncio.wait_for(proc.wait(), timeout=30.0)
         except asyncio.TimeoutError:
-            proc.kill()
+            proc.kill()
+            try:
+                await proc.wait()
+            except Exception:
+                pass
agents-core/vision_agents/core/observability/metrics.py (3)

35-39: Duplicate meter assignment.

meter is assigned twice (__name__ then "voice-agent.latency"). Keep one to avoid confusion.

-meter = metrics.get_meter(__name__)
-
-
-meter = metrics.get_meter("voice-agent.latency")
+meter = metrics.get_meter("voice-agent.latency")

12-13: Hard-coded OTLP endpoint.

Make OTLP_ENDPOINT configurable via env (e.g., OTLP_ENDPOINT = os.getenv("OTLP_ENDPOINT", "http://localhost:4317")) to work across environments.


69-75: Remove unused sample attrs or mark as example.

CALL_ATTRS appears unused; consider deleting or moving into examples to avoid dead code.

plugins/cartesia/tests/test_tts.py (1)

15-20: Avoid type: ignore by importing the symbol.

Import the concrete class for typing and return it from tts().

-from vision_agents.plugins import cartesia
+from vision_agents.plugins import cartesia
+from vision_agents.plugins.cartesia import TTS as CartesiaTTS
@@
-    def tts(self) -> cartesia.TTS:  # type: ignore[name-defined]
+    def tts(self) -> CartesiaTTS:
@@
-        return cartesia.TTS(api_key=api_key)
+        return CartesiaTTS(api_key=api_key)
plugins/kokoro/tests/test_tts.py (1)

16-18: LGTM overall; add a sanity assertion and optional cleanup.

Capture the returned path and assert it exists; optionally remove it to avoid temp buildup.

-    async def test_kokoro_tts_convert_text_to_audio_manual_test(self, tts):
-        await manual_tts_to_wav(tts, sample_rate=24000, channels=1)
+    async def test_kokoro_tts_convert_text_to_audio_manual_test(self, tts):
+        path = await manual_tts_to_wav(tts, sample_rate=24000, channels=1)
+        assert path and os.path.exists(path)
+        try:
+            os.remove(path)
+        except OSError:
+            pass
agents-core/vision_agents/core/agents/agents.py (3)

306-317: Guard against format mismatches when writing to the audio track.

You assume TTS honored set_output_format, but if a plugin misbehaves, bytes at the wrong rate/channels could hit the track. Log (or drop) mismatched chunks to prevent artifacts.

         async def _on_tts_audio(event: TTSAudioEvent):
             try:
-                if self._audio_track and event.audio_data:
-                    from typing import Any, cast
-
-                    track_any = cast(Any, self._audio_track)
-                    await track_any.write(event.audio_data)
+                if self._audio_track and event.audio_data:
+                    from typing import Any, cast
+                    # Optional: verify negotiated format
+                    try:
+                        expected_rate = getattr(self._audio_track, "framerate", None)
+                        expected_channels = 2 if getattr(self._audio_track, "stereo", False) else 1
+                        if (expected_rate and event.sample_rate != expected_rate) or (
+                            expected_channels and event.channels != expected_channels
+                        ):
+                            self.logger.warning(
+                                "Dropping TTS audio: format mismatch (got %s Hz/%sch, expected %s Hz/%sch)",
+                                event.sample_rate, event.channels, expected_rate, expected_channels,
+                            )
+                            return
+                    except Exception:
+                        # If track doesn’t expose props, proceed optimistically
+                        pass
+                    track_any = cast(Any, self._audio_track)
+                    await track_any.write(event.audio_data)
             except Exception as e:
                 self.logger.error(f"Error writing TTS audio to track: {e}")

1032-1047: Make 48k/stereo defaults configurable and reuse them for validation.

Expose framerate/stereo as Agent init kwargs or class constants, store on self for reuse (e.g., in _on_tts_audio validation). Keeps behavior flexible across environments.

-                framerate = 48000
-                stereo = True
+                framerate = getattr(self, "_audio_out_rate", 48000)
+                stereo = getattr(self, "_audio_out_stereo", True)
                 self._audio_track = self.edge.create_audio_track(
                     framerate=framerate, stereo=stereo
                 )
                 # Inform TTS of desired output format so it can resample accordingly
                 if self.tts:
                     channels = 2 if stereo else 1

311-314: Tiny nit: avoid re-importing typing inside the handler.

Import cast at module top to reduce per-call overhead and keep imports centralized.

plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (2)

37-38: Consider honoring desired sample rate to reduce resampling.

If the agent negotiates 48 kHz stereo, you’ll resample 16 kHz mono to match. If ElevenLabs supports multiple PCM rates, map self._desired_sample_rate to a supported output_format to minimize CPU work. Otherwise, keep current behavior.

Also applies to: 60-62


64-72: Doc says “Clears the queue and stops playing audio” but it’s a no‑op.

Either implement actual cancellation if the SDK supports it or update the docstring to reflect no-op behavior.

-        """
-        Clears the queue and stops playing audio.
-        This method can be used manually or under the hood in response to turn events.
-        ...
-        """
+        """
+        Stop request hook. ElevenLabs SDK streaming is pull-based here; there is no internal
+        playback/queue to flush, so this is a no-op by design.
+        """
plugins/elevenlabs/tests/test_tts.py (2)

29-30: Strengthen assertions to catch regressions early.

Also assert session start to ensure events flow.

-        assert not result.errors
-        assert len(result.speeches) > 0
+        assert not result.errors
+        assert result.started is True
+        assert len(result.speeches) > 0

33-35: Avoid print in tests; prefer logging or assertion of output.

Printing paths is noisy in CI. Use logging or silence by default; optionally assert file exists.

-        path = await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
-        print("ElevenLabs TTS audio written to:", path)
+        path = await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
+        assert os.path.exists(path)

Note: Consider enhancing manual_tts_to_wav to wait until synthesis completes before writing, otherwise it may write only the first chunk. Based on relevant helpers.

agents-core/vision_agents/core/tts/testing.py (2)

70-81: Provide an option to wait until synthesis completes.

wait_for_result returns after the first audio/error, which is great for smoke checks but truncates longer audio for callers like manual_tts_to_wav. Add a mode to wait for TTSSynthesisCompleteEvent (with timeout).

-    async def wait_for_result(self, timeout: float = 10.0) -> TTSResult:
+    async def wait_for_result(
+        self, timeout: float = 10.0, until_complete: bool = False
+    ) -> TTSResult:
         try:
-            await asyncio.wait_for(self._first_event.wait(), timeout=timeout)
+            if until_complete:
+                async def _wait_complete():
+                    # Fast-path if already completed
+                    if self._completed:
+                        return
+                    # Wait until completion toggles (events update the flag)
+                    while not self._completed and not self._errors and not self._speeches:
+                        await asyncio.sleep(0.01)
+                await asyncio.wait_for(_wait_complete(), timeout=timeout)
+            else:
+                await asyncio.wait_for(self._first_event.wait(), timeout=timeout)
         except asyncio.TimeoutError:
             # Return whatever we have so far
             pass
         return TTSResult(
             speeches=list(self._speeches),
             errors=list(self._errors),
             started=self._started,
             completed=self._completed,
         )

42-61: Add a simple teardown to avoid subscriber leaks in long-lived tests.

Store unsubscribe handles (if supported) or expose a close() to deregister callbacks.

 class TTSSession:
@@
-        @tts.events.subscribe
-        async def _on_start(ev: TTSSynthesisStartEvent):  # type: ignore[name-defined]
+        self._subs = []
+        @tts.events.subscribe
+        async def _on_start(ev: TTSSynthesisStartEvent):  # type: ignore[name-defined]
             self._started = True
+        self._subs.append(_on_start)
@@
-        @tts.events.subscribe
+        @tts.events.subscribe
         async def _on_complete(ev: TTSSynthesisCompleteEvent):  # type: ignore[name-defined]
             self._completed = True
+        self._subs.append(_on_complete)
+
+    def close(self) -> None:
+        for cb in getattr(self, "_subs", []):
+            try:
+                self._tts.events.unsubscribe(cb)  # if supported by EventManager
+            except Exception:
+                pass

If EventManager lacks unsubscribe, consider a no-op close() for API consistency. As per coding guidelines.

plugins/cartesia/vision_agents/plugins/cartesia/tts.py (2)

54-58: Docstring: clarify return shapes and native format.

Mention that response may be async iterator and that PcmData is s16 mono at self.sample_rate, to match base expectations.

-    ) -> PcmData | Iterator[PcmData] | AsyncIterator[PcmData]  # noqa: D401
-        """Generate speech and return a stream of PcmData."""
+    ) -> PcmData | Iterator[PcmData] | AsyncIterator[PcmData]:  # noqa: D401
+        """Generate speech and return PcmData stream (s16 mono at sample_rate)."""

80-82: Honor desired channel count if agent requests stereo.

If upstream calls set_output_format(..., channels=2), consider threading that into from_response so downstream resampling has correct provenance.

-        return PcmData.from_response(
-            response, sample_rate=self.sample_rate, channels=1, format="s16"
-        )
+        return PcmData.from_response(
+            response, sample_rate=self.sample_rate, channels=1, format="s16"
+        )

Alternatively, set self._native_channels = 1 in init for clarity; base class will rechannel to desired on emit.

plugins/fish/vision_agents/plugins/fish/tts.py (2)

25-26: Avoid hard-coding a reference voice by default.

A baked-in reference_id can break for users lacking access to that voice. Default to None and document how to set it via config/env.

-        reference_id: Optional[str] = "03397b4c4be74759b72533b663fbd001",
+        reference_id: Optional[str] = None,

86-90: Explicitly declare native format/channel for clarity.

Not required, but setting provider-native format helps future maintainers.

-        return PcmData.from_response(
-            stream, sample_rate=16000, channels=1, format="s16"
-        )
+        # Provider-native is 16kHz mono s16
+        return PcmData.from_response(stream, sample_rate=16000, channels=1, format="s16")
plugins/kokoro/vision_agents/plugins/kokoro/tts.py (2)

47-53: Use get_running_loop() in async context.

get_event_loop() is deprecated when a loop is running; prefer get_running_loop() to avoid warnings on 3.11+.

-        loop = asyncio.get_event_loop()
+        loop = asyncio.get_running_loop()

55-60: Minor: annotate generator return and keep PCM metadata close.

Inline the format/sample_rate once to avoid repetition.

         async def _aiter():
             for chunk in chunks:
-                yield PcmData.from_bytes(
-                    chunk, sample_rate=self.sample_rate, channels=1, format="s16"
-                )
+                yield PcmData.from_bytes(chunk, sample_rate=self.sample_rate, channels=1, format="s16")
agents-core/vision_agents/core/tts/tts.py (3)

125-142: Deduplicate normalization: delegate to PcmData.from_response; also handle memoryview correctly

re-implementing chunk normalization invites edge bugs. Use PcmData.from_response, which already aligns/aggregates and supports bytes/PcmData/iterators.

Apply this refactor:

-    async def _iter_pcm(self, resp: Any) -> AsyncGenerator[PcmData, None]:
-        """Yield PcmData chunks from a provider response of various shapes."""
-        # Single buffer or PcmData
-        if isinstance(resp, (bytes, bytearray, PcmData)):
-            yield self._normalize_to_pcm(resp)
-            return
-        # Async iterable
-        if hasattr(resp, "__aiter__"):
-            async for item in resp:
-                yield self._normalize_to_pcm(item)
-            return
-        # Sync iterable (avoid treating bytes-like as iterable of ints)
-        if hasattr(resp, "__iter__") and not isinstance(resp, (str, bytes, bytearray)):
-            for item in resp:
-                yield self._normalize_to_pcm(item)
-            return
-        raise TypeError(f"Unsupported return type from stream_audio: {type(resp)}")
+    async def _iter_pcm(self, resp: Any) -> AsyncGenerator[PcmData, None]:
+        """Yield PcmData chunks from arbitrary provider responses via PcmData.from_response."""
+        fmt = self._native_format.value if hasattr(self._native_format, "value") else "s16"
+        norm = PcmData.from_response(
+            resp,
+            sample_rate=self._native_sample_rate,
+            channels=self._native_channels,
+            format=fmt,
+        )
+        if isinstance(norm, PcmData):
+            yield norm
+            return
+        if hasattr(norm, "__aiter__"):
+            async for pcm in norm:
+                yield pcm
+            return
+        if hasattr(norm, "__iter__"):
+            for pcm in norm:
+                yield pcm
+            return
+        raise TypeError(f"Unsupported return type from stream_audio: {type(resp)}")

179-186: Update stream_audio docstring to mention PcmData variants

Return annotation includes PcmData types, but the docstring doesn’t. Clarify for implementers.

Apply this doc tweak:

-        Returns:
-            Audio data as bytes, an iterator of audio chunks, or an async iterator of audio chunks
+        Returns:
+            Audio as:
+            - bytes or (async) iterator[bytes], or
+            - PcmData or (async) iterator[PcmData].

As per coding guidelines.

Also applies to: 197-199


277-281: Compute real‑time factor using total send duration, not pre‑stream “setup” time

synthesis_time measures only until stream_audio returns, not the full emission. Use total elapsed before emitting the complete event.

Apply this adjustment:

-            real_time_factor = (
-                (synthesis_time * 1000) / estimated_audio_duration_ms
-                if estimated_audio_duration_ms > 0
-                else None
-            )
+            total_elapsed_ms = (time.time() - start_time) * 1000.0
+            real_time_factor = (
+                total_elapsed_ms / estimated_audio_duration_ms
+                if estimated_audio_duration_ms > 0
+                else None
+            )
@@
-                    synthesis_time_ms=synthesis_time * 1000,
+                    synthesis_time_ms=total_elapsed_ms,

If “synthesis_time_ms” is intended to reflect only provider latency, consider adding a second field (e.g., end_to_end_ms) instead of overloading.

Also applies to: 283-296, 313-317

agents-core/vision_agents/core/edge/types.py (1)

320-337: to_bytes: ensure interleaved view is contiguous before tobytes()

Transpose often creates non‑contiguous views. Make it explicit.

Apply:

-            if arr.ndim == 2:
-                # (channels, samples) -> interleaved (samples, channels)
-                interleaved = arr.T.reshape(-1)
-                return interleaved.tobytes()
+            if arr.ndim == 2:
+                # (channels, samples) -> interleaved (samples, channels)
+                interleaved = np.ascontiguousarray(arr.T).reshape(-1)
+                return interleaved.tobytes()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6a725b0 and d9f79b3.

📒 Files selected for processing (19)
  • agents-core/vision_agents/core/agents/agents.py (3 hunks)
  • agents-core/vision_agents/core/edge/types.py (6 hunks)
  • agents-core/vision_agents/core/observability/__init__.py (2 hunks)
  • agents-core/vision_agents/core/observability/metrics.py (1 hunks)
  • agents-core/vision_agents/core/tts/manual_test.py (1 hunks)
  • agents-core/vision_agents/core/tts/testing.py (1 hunks)
  • agents-core/vision_agents/core/tts/tts.py (5 hunks)
  • docs/ai/instructions/ai-tts.md (1 hunks)
  • examples/01_simple_agent_example/simple_agent_example.py (1 hunks)
  • plugins/aws/tests/test_aws.py (1 hunks)
  • plugins/cartesia/tests/test_tts.py (1 hunks)
  • plugins/cartesia/vision_agents/plugins/cartesia/tts.py (5 hunks)
  • plugins/elevenlabs/tests/test_tts.py (1 hunks)
  • plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (4 hunks)
  • plugins/fish/tests/test_tts.py (1 hunks)
  • plugins/fish/vision_agents/plugins/fish/tts.py (5 hunks)
  • plugins/kokoro/tests/test_tts.py (1 hunks)
  • plugins/kokoro/vision_agents/plugins/kokoro/tts.py (3 hunks)
  • tests/test_tts_base.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • agents-core/vision_agents/core/observability/__init__.py
  • tests/test_tts_base.py
  • plugins/kokoro/tests/test_tts.py
  • plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py
  • examples/01_simple_agent_example/simple_agent_example.py
  • plugins/elevenlabs/tests/test_tts.py
  • agents-core/vision_agents/core/agents/agents.py
  • plugins/kokoro/vision_agents/plugins/kokoro/tts.py
  • plugins/cartesia/vision_agents/plugins/cartesia/tts.py
  • agents-core/vision_agents/core/observability/metrics.py
  • plugins/aws/tests/test_aws.py
  • agents-core/vision_agents/core/edge/types.py
  • plugins/fish/vision_agents/plugins/fish/tts.py
  • plugins/cartesia/tests/test_tts.py
  • agents-core/vision_agents/core/tts/manual_test.py
  • agents-core/vision_agents/core/tts/tts.py
  • plugins/fish/tests/test_tts.py
  • agents-core/vision_agents/core/tts/testing.py
tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

tests/**/*.py: Never use mocking utilities (e.g., unittest.mock, pytest-mock) in test files
Write tests using pytest (avoid unittest.TestCase or other frameworks)
Mark integration tests with @pytest.mark.integration
Do not use @pytest.mark.asyncio; async support is automatic

Files:

  • tests/test_tts_base.py
🧬 Code graph analysis (15)
tests/test_tts_base.py (4)
agents-core/vision_agents/core/tts/tts.py (4)
  • TTS (32-329)
  • stream_audio (177-200)
  • set_output_format (81-99)
  • send (216-317)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSErrorEvent (51-64)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
agents-core/vision_agents/core/edge/types.py (3)
  • PcmData (37-505)
  • _agen (416-448)
  • from_bytes (118-186)
agents-core/vision_agents/core/events/manager.py (1)
  • wait (470-484)
plugins/kokoro/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
plugins/kokoro/vision_agents/plugins/kokoro/tts.py (1)
  • TTS (18-77)
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (6)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (37-505)
  • from_response (382-505)
agents-core/vision_agents/core/tts/tts.py (1)
  • stream_audio (177-200)
plugins/cartesia/vision_agents/plugins/cartesia/tts.py (1)
  • stream_audio (54-82)
plugins/fish/vision_agents/plugins/fish/tts.py (1)
  • stream_audio (56-90)
plugins/kokoro/vision_agents/plugins/kokoro/tts.py (1)
  • stream_audio (47-61)
tests/test_tts_base.py (6)
  • stream_audio (17-21)
  • stream_audio (28-37)
  • stream_audio (44-47)
  • stream_audio (54-58)
  • stream_audio (65-69)
  • stream_audio (76-77)
examples/01_simple_agent_example/simple_agent_example.py (1)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (1)
  • open_demo (329-406)
plugins/elevenlabs/tests/test_tts.py (3)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (1)
  • TTS (10-72)
agents-core/vision_agents/core/agents/agents.py (4)
agents-core/vision_agents/core/tts/events.py (1)
  • TTSAudioEvent (10-21)
agents-core/vision_agents/core/events/manager.py (1)
  • subscribe (299-368)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (1)
  • create_audio_track (291-294)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (81-99)
plugins/kokoro/vision_agents/plugins/kokoro/tts.py (2)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (37-505)
  • from_bytes (118-186)
agents-core/vision_agents/core/tts/tts.py (1)
  • stream_audio (177-200)
plugins/cartesia/vision_agents/plugins/cartesia/tts.py (3)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (37-505)
  • from_response (382-505)
agents-core/vision_agents/core/tts/tts.py (1)
  • stream_audio (177-200)
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (1)
  • stream_audio (39-62)
agents-core/vision_agents/core/edge/types.py (1)
tests/test_tts_base.py (1)
  • _agen (32-35)
plugins/fish/vision_agents/plugins/fish/tts.py (4)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (37-505)
  • from_response (382-505)
agents-core/vision_agents/core/tts/tts.py (3)
  • TTS (32-329)
  • stream_audio (177-200)
  • stop_audio (203-214)
plugins/cartesia/vision_agents/plugins/cartesia/tts.py (3)
  • TTS (18-92)
  • stream_audio (54-82)
  • stop_audio (84-92)
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (3)
  • TTS (10-72)
  • stream_audio (39-62)
  • stop_audio (64-72)
plugins/cartesia/tests/test_tts.py (3)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
plugins/cartesia/vision_agents/plugins/cartesia/tts.py (1)
  • TTS (18-92)
agents-core/vision_agents/core/tts/manual_test.py (2)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/edge/types.py (3)
  • PcmData (37-505)
  • from_bytes (118-186)
  • to_wav_bytes (338-379)
agents-core/vision_agents/core/tts/tts.py (8)
agents-core/vision_agents/core/events/base.py (3)
  • PluginInitializedEvent (56-63)
  • PluginClosedEvent (67-74)
  • AudioFormat (23-30)
agents-core/vision_agents/core/edge/types.py (6)
  • PcmData (37-505)
  • from_bytes (118-186)
  • resample (251-318)
  • to_bytes (320-336)
  • duration_ms (101-103)
  • close (33-34)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
  • TTSErrorEvent (51-64)
plugins/cartesia/vision_agents/plugins/cartesia/tts.py (1)
  • stream_audio (54-82)
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (1)
  • stream_audio (39-62)
plugins/fish/vision_agents/plugins/fish/tts.py (1)
  • stream_audio (56-90)
plugins/kokoro/vision_agents/plugins/kokoro/tts.py (1)
  • stream_audio (47-61)
tests/test_tts_base.py (6)
  • stream_audio (17-21)
  • stream_audio (28-37)
  • stream_audio (44-47)
  • stream_audio (54-58)
  • stream_audio (65-69)
  • stream_audio (76-77)
plugins/fish/tests/test_tts.py (3)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
plugins/fish/vision_agents/plugins/fish/tts.py (1)
  • TTS (12-102)
agents-core/vision_agents/core/tts/testing.py (3)
agents-core/vision_agents/core/tts/tts.py (1)
  • TTS (32-329)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSErrorEvent (51-64)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
tests/test_tts_base.py (8)
  • _on_start (92-93)
  • _on_audio (96-99)
  • _on_audio (126-128)
  • _on_audio (148-150)
  • _on_audio (167-169)
  • _on_audio (188-190)
  • _on_error (207-209)
  • _on_complete (102-103)
🪛 LanguageTool
docs/ai/instructions/ai-tts.md

[style] ~27-~27: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...=1, format="s16") ``` - stop_audio can be a no-op (the Agent controls playback...

(MISSING_IT_THERE)

🔇 Additional comments (14)
plugins/aws/tests/test_aws.py (2)

40-41: LGTM – Proper use of pytest.skip for missing credentials.

The fixture-based skip logic ensures that all tests depending on the llm fixture will be skipped when credentials are unavailable, which is the correct approach for integration tests.


1-161: AI summary inconsistency with file content.

The AI-generated summary describes TTS audio handling and TTS plugins, but this file tests the BedrockLLM language model. The summary appears to describe other files in the PR rather than this one.

examples/01_simple_agent_example/simple_agent_example.py (1)

55-56: Original review comment is not supported by codebase evidence.

The search reveals that the predominant pattern in the codebase (16 of 22 instances) is to call open_demo before the with await agent.join(call): context, not inside it. The review claims moving open_demo inside the join context is an improvement for "reducing race conditions," but this contradicts the established practice. While 6 examples do use the inside pattern, they are the minority. Additionally, no explicit error handling is observed around open_demo calls in any of these examples, so the error-handling concern in the original review is not validated by precedent.

If this change intentionally diverges from the common pattern, that architectural decision should be justified explicitly rather than framed as a general improvement.

Likely an incorrect or invalid review comment.
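
For reference, the two orderings side by side, as hypothetical helpers: the agent/call objects and the agent.edge.open_demo attribute path are assumed from the examples, not verified here.

from typing import Any


async def demo_before_join(agent: Any, call: Any) -> None:
    # Predominant pattern (16 of 22 instances): open the demo UI first.
    await agent.edge.open_demo(call)
    with await agent.join(call):
        ...  # session runs here


async def demo_inside_join(agent: Any, call: Any) -> None:
    # Minority pattern this example moved to: open the demo after joining.
    with await agent.join(call):
        await agent.edge.open_demo(call)
        ...  # session runs here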

agents-core/vision_agents/core/observability/__init__.py (1)

18-19: Export looks good.

tts_events_emitted is properly imported and added to __all__.

Also applies to: 34-35

plugins/cartesia/tests/test_tts.py (1)

21-31: Integration test flow looks solid.

Env‑guard, output format set, session wait, and assertions are appropriate.
If flakes occur, consider increasing timeout to match real API latency.

plugins/elevenlabs/tests/test_tts.py (1)

19-27: No changes needed — asyncio support is already properly configured.

The repository has asyncio_mode = auto configured in pytest.ini, which enables automatic async test execution. The test at lines 19-27 will run correctly with only the @pytest.mark.integration marker; adding @pytest.mark.asyncio is unnecessary and contradicts the established pattern of relying on auto mode.

agents-core/vision_agents/core/tts/tts.py (3)

63-79: Initialization/event plumbing looks solid

Sessioning, provider naming, and PluginInitializedEvent emission are consistent and minimal. No issues.


321-329: Graceful close event emission LGTM

PluginClosedEvent with plugin_type="TTS" is consistent.


143-175: Verification confirms field is properly defined

The user_metadata field is defined in the BaseEvent class (agents-core/vision_agents/core/events/base.py, line 41) as user_metadata: Optional[Participant] = None. Since TTSAudioEvent inherits from PluginBaseEvent, which extends BaseEvent, the field is available and the code at lines 143-175 is correct. No dataclass initialization errors will occur.

agents-core/vision_agents/core/edge/types.py (5)

56-76: Multi‑channel duration and duration_ms: 👍

Handles (channels, samples) correctly and exposes ms helper. Looks good.

Also applies to: 100-104


118-187: from_bytes: alignment + interleaving logic LGTM

Good trimming to sample width and channel‑multiple; returns (channels, samples) for multichannel.


188-250: from_data: pragmatic normalization

Covers bytes and ndarray shapes/dtypes well. Minor note: when ambiguous 2D, assuming first dim as channels is reasonable.


338-380: to_wav_bytes: sensible s16 conversion path

Converts non‑s16 to s16 and writes standard WAV headers. Looks good.


381-505: from_response: versatile and aligns chunks

Covers bytes/PcmData/(a)synchronous iterables and pads trailing partial frames. Good reuse across plugins.

Confirm target providers always return PCM (not compressed) when using this path. If not, gate by format and raise early.
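
A small usage sketch tying these helpers together. The module path follows the file list above, and the trimming assertion encodes the behavior described in the from_bytes comment; the exact semantics are an assumption.

import numpy as np

from vision_agents.core.edge.types import PcmData

# 100 ms of silence: 16 kHz mono s16 -> 1600 samples -> 3200 bytes
raw = np.zeros(1600, dtype=np.int16).tobytes()
pcm = PcmData.from_bytes(raw, sample_rate=16000, channels=1, format="s16")
assert pcm.duration_ms == 100

# Per the note above, a trailing partial sample should be trimmed away,
# leaving the duration unchanged (assumed semantics).
pcm2 = PcmData.from_bytes(raw + b"\x00", sample_rate=16000, channels=1, format="s16")
assert pcm2.duration_ms == 100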

@coderabbitai (bot) left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
plugins/fish/tests/test_fish_tts.py (1)

34-37: Consider capturing the result from wait_for_result for consistency.

While accessing session.errors and session.speeches works because the session maintains internal state, the idiomatic pattern shown in the TTSSession docstring suggests using the returned TTSResult object.

Apply this diff:

         await tts.send(text)
-        await session.wait_for_result(timeout=15.0)
+        result = await session.wait_for_result(timeout=15.0)
 
-        assert not session.errors
-        assert len(session.speeches) > 0
+        assert not result.errors
+        assert len(result.speeches) > 0
agents-core/vision_agents/core/agents/agents.py (2)

318-319: Type-safety bypass requires justification or stronger validation.

Casting to Any silences type checking entirely. If the track genuinely lacks proper type hints for write(), consider adding runtime validation or a comment explaining why the cast is necessary.

-                track_any = cast(Any, self._audio_track)
-                await track_any.write(event.audio_data)
+                # AudioStreamTrack.write() not in type stubs but exists at runtime
+                if not hasattr(self._audio_track, 'write'):
+                    self.logger.error("Audio track does not support write method")
+                    return
+                track_any = cast(Any, self._audio_track)
+                await track_any.write(event.audio_data)

1037-1042: Hardcoded audio format lacks configuration mechanism.

The comment mentions "unless configured differently," but framerate and stereo are hardcoded literals with no constructor parameter, config file, or environment-variable override. This reduces flexibility for deployments requiring different sample rates.

Consider adding constructor parameters:

def __init__(
    self,
    # ... existing params ...
    audio_output_sample_rate: int = 48000,
    audio_output_stereo: bool = True,
    # ... rest of params ...
):

Then reference self.audio_output_sample_rate and self.audio_output_stereo in _prepare_rtc().

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d9f79b3 and 451075f.

📒 Files selected for processing (3)
  • agents-core/vision_agents/core/agents/agents.py (3 hunks)
  • examples/01_simple_agent_example/simple_agent_example.py (1 hunks)
  • plugins/fish/tests/test_fish_tts.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • agents-core/vision_agents/core/agents/agents.py
  • examples/01_simple_agent_example/simple_agent_example.py
  • plugins/fish/tests/test_fish_tts.py
🧬 Code graph analysis (3)
agents-core/vision_agents/core/agents/agents.py (4)
agents-core/vision_agents/core/tts/events.py (1)
  • TTSAudioEvent (10-21)
agents-core/vision_agents/core/events/manager.py (1)
  • subscribe (299-368)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (1)
  • create_audio_track (291-294)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (81-99)
examples/01_simple_agent_example/simple_agent_example.py (1)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (1)
  • open_demo (329-406)
plugins/fish/tests/test_fish_tts.py (3)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (3)
  • TTSSession (23-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (81-99)
🔇 Additional comments (5)
plugins/fish/tests/test_fish_tts.py (3)

1-8: LGTM!

Clean imports that align with the new TTSSession-based testing pattern.


14-17: LGTM!

Straightforward fixture setup following pytest conventions.


19-25: LGTM!

Proper environment variable guard ensures graceful skip when API keys are unavailable.

agents-core/vision_agents/core/agents/agents.py (2)

35-35: LGTM: Import aligns with event-driven TTS architecture.

The TTSAudioEvent import supports the new PCM-based TTS output flow.


311-321: Metadata serves its purpose; handler design is correct.

The audio format metadata is not unused—it documents the converted audio format in the TTSAudioEvent. At tts.py:153, audio is resampled to self._desired_sample_rate and self._desired_channels, and the event is created with matching metadata (audio_format, sample_rate, channels set to the desired values). Since format conversion happens upstream in the TTS pipeline, the handler in agents.py correctly ignores the metadata and writes pre-converted bytes directly to the track.

@coderabbitai (bot) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (7)
plugins/elevenlabs/tests/test_tts.py (3)

10-18: Consider adding docstrings for test documentation.

The test class and fixture method lack docstrings. Adding brief documentation following the Google style guide would improve maintainability and help other developers understand the test setup.

Based on coding guidelines.


20-29: Solid integration test implementation.

The test correctly follows the TTSSession pattern, configures output format, and validates both error absence and audio generation. Consider adding a docstring to document the test's purpose per coding guidelines.

Based on coding guidelines.


31-34: Add assertions to validate the WAV output.

While manual_tts_to_wav handles internal error checking, this test lacks explicit assertions. Consider validating that the returned path exists and the file is non-empty to ensure the test fails appropriately in CI environments.

     async def test_elevenlabs_tts_convert_text_to_audio_manual_test(self, tts):
         path = await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
         print("ElevenLabs TTS audio written to:", path)
+        import os
+        assert os.path.exists(path), f"WAV file not created at {path}"
+        assert os.path.getsize(path) > 0, "WAV file is empty"

Also consider adding a docstring per coding guidelines.

Based on coding guidelines.

plugins/cartesia/tests/test_tts.py (1)

16-21: Consider using @pytest.fixture for synchronous fixtures.

The fixture returns a synchronous result but uses @pytest_asyncio.fixture. While this may work, the standard convention is to use @pytest.fixture for non-async fixtures and reserve @pytest_asyncio.fixture for async ones.

Apply this diff if you prefer strict convention adherence:

-    @pytest_asyncio.fixture
+    @pytest.fixture
     def tts(self) -> cartesia.TTS:  # type: ignore[name-defined]
plugins/openai/tests/test_tts_openai.py (1)

1-8: Consider adding dotenv for test environment consistency.

While environment variables can be set externally, the Cartesia and Fish test modules both use python-dotenv to load .env files, which improves developer experience.

To align with other TTS plugin tests, consider adding:

+from dotenv import load_dotenv
 import os
 import pytest
 import pytest_asyncio

 from vision_agents.plugins import openai as openai_plugin
 from vision_agents.core.tts.testing import TTSSession
 from vision_agents.core.tts.manual_test import manual_tts_to_wav

+# Load environment variables
+load_dotenv()
docs/ai/instructions/ai-tts.md (1)

27-28: Consider clarifying the sentence structure.

Static analysis suggests the sentence could be more complete, though the meaning is clear in context.

If you prefer a complete sentence:

-- `stop_audio` can be a no-op
+- `stop_audio` can be implemented as a no-op
plugins/fish/tests/test_fish_tts.py (1)

14-16: Add API key validation to skip gracefully when credentials are absent.

The fixture instantiates fish.TTS() without checking for required environment variables. If FISH_API_KEY or FISH_AUDIO_API_KEY is missing, tests will fail rather than skip gracefully.

Apply this diff:

     @pytest_asyncio.fixture
     def tts(self) -> fish.TTS:
+        if not (os.environ.get("FISH_API_KEY") or os.environ.get("FISH_AUDIO_API_KEY")):
+            pytest.skip("FISH_API_KEY/FISH_AUDIO_API_KEY not set")
         return fish.TTS()

Note: This addresses the same concern raised in previous review comments about the integration test.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 451075f and e5e0cf5.

📒 Files selected for processing (7)
  • docs/ai/instructions/ai-tts.md (1 hunks)
  • plugins/cartesia/tests/test_tts.py (1 hunks)
  • plugins/elevenlabs/tests/test_tts.py (1 hunks)
  • plugins/fish/tests/test_fish_tts.py (1 hunks)
  • plugins/openai/tests/test_tts_openai.py (1 hunks)
  • plugins/openai/vision_agents/plugins/openai/__init__.py (1 hunks)
  • plugins/openai/vision_agents/plugins/openai/tts.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • plugins/openai/tests/test_tts_openai.py
  • plugins/openai/vision_agents/plugins/openai/__init__.py
  • plugins/openai/vision_agents/plugins/openai/tts.py
  • plugins/fish/tests/test_fish_tts.py
  • plugins/cartesia/tests/test_tts.py
  • plugins/elevenlabs/tests/test_tts.py
🧬 Code graph analysis (6)
plugins/openai/tests/test_tts_openai.py (4)
agents-core/vision_agents/core/tts/testing.py (3)
  • TTSSession (23-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
plugins/openai/vision_agents/plugins/openai/tts.py (1)
  • TTS (10-51)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (81-99)
plugins/openai/vision_agents/plugins/openai/__init__.py (2)
plugins/openai/tests/test_tts_openai.py (1)
  • tts (12-16)
plugins/openai/vision_agents/plugins/openai/tts.py (1)
  • TTS (10-51)
plugins/openai/vision_agents/plugins/openai/tts.py (2)
plugins/openai/tests/test_tts_openai.py (1)
  • tts (12-16)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (37-505)
  • from_bytes (118-186)
plugins/fish/tests/test_fish_tts.py (4)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (3)
  • TTSSession (23-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (81-99)
conftest.py (1)
  • wait_for_result (54-67)
plugins/cartesia/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
plugins/elevenlabs/tests/test_tts.py (4)
plugins/cartesia/tests/test_tts.py (1)
  • tts (17-21)
plugins/openai/tests/test_tts_openai.py (1)
  • tts (12-16)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
🪛 LanguageTool
docs/ai/instructions/ai-tts.md

[style] ~27-~27: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...=1, format="s16") ``` - stop_audio can be a no-op ## init The plugin con...

(MISSING_IT_THERE)


[style] ~43-~43: It’s considered informal to use ‘a couple’ without the preposition ‘of’ before a noun.
Context: ... not necessary - Make to write at least a couple integration tests, use TTSSession to ...

(A_COUPLE_OF)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
🔇 Additional comments (13)
plugins/elevenlabs/tests/test_tts.py (1)

1-7: LGTM!

The imports are well-organized and align with the integration test pattern used in other plugins.

plugins/openai/vision_agents/plugins/openai/__init__.py (1)

3-5: LGTM!

The TTS export cleanly extends the public API and aligns with the plugin's module structure.

plugins/cartesia/tests/test_tts.py (3)

1-12: LGTM!

The imports and dotenv configuration follow the established pattern seen across TTS plugin tests.


23-31: LGTM!

The integration test properly exercises the real API with appropriate guards and assertions.


33-35: LGTM!

The manual WAV conversion test correctly delegates to the shared utility.

plugins/openai/tests/test_tts_openai.py (2)

10-16: LGTM!

The async fixture is properly decorated and handles missing credentials gracefully.


18-31: LGTM!

Both integration tests follow the established pattern with proper setup, execution, and assertions.

docs/ai/instructions/ai-tts.md (1)

1-11: LGTM!

The layout conventions align with the actual plugin structure and correctly reference PEP 420 namespace packages.

plugins/fish/tests/test_fish_tts.py (2)

18-20: LGTM!

The manual WAV test correctly delegates to the shared utility function.


22-32: LGTM!

The integration test follows the established pattern and will benefit from improved API key handling in the fixture.

plugins/openai/vision_agents/plugins/openai/tts.py (3)

1-8: LGTM!

The imports are well-organized and follow the project's conventions.


33-47: LGTM!

The stream_audio implementation correctly synthesizes speech to PCM format and returns a properly constructed PcmData buffer. The 24kHz sample rate aligns with OpenAI's TTS output specifications.
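
For orientation, the general shape of that contract; a hedged sketch rather than the actual plugin code (_synthesize_pcm is a hypothetical stand-in for the OpenAI client call):

from vision_agents.core.edge.types import PcmData

async def stream_audio(self, text: str, *_, **__) -> PcmData:
    # Fetch raw s16 PCM from the provider (elided), then wrap it so the
    # base class can resample, chunk, and emit TTSAudioEvents.
    pcm_bytes = await self._synthesize_pcm(text)  # hypothetical helper
    return PcmData.from_bytes(pcm_bytes, sample_rate=24000, channels=1, format="s16")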


49-51: LGTM!

The stop_audio no-op implementation is appropriate given that playback management is handled by the agent.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (5)
plugins/aws/example/aws_polly_tts_example.py (1)

9-12: Add a Google‑style docstring and allow basic env overrides for text/output.

Keeps the example self‑documenting and convenient without adding deps.

 async def main():
-    load_dotenv()
-    tts = TTS(voice_id=os.environ.get("AWS_POLLY_VOICE", "Joanna"))
-    await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
+    """Run AWS Polly TTS example.
+
+    Returns:
+        None
+    """
+    load_dotenv()
+    tts = TTS(voice_id=os.environ.get("AWS_POLLY_VOICE", "Joanna"))
+    text = os.environ.get("TTS_TEXT", "This is a manual TTS playback test.")
+    outfile = os.environ.get("TTS_OUTFILE")
+    await manual_tts_to_wav(
+        tts, sample_rate=16000, channels=1, text=text, outfile_path=outfile
+    )

As per coding guidelines.

plugins/aws/tests/test_tts.py (2)

35-45: Strengthen assertions to catch silent failures.

Also assert synthesis started; keeps failures crisp.

     async def test_aws_polly_tts_speech(self, tts: aws_plugin.TTS):
         tts.set_output_format(sample_rate=16000, channels=1)
         session = TTSSession(tts)
 
         await tts.send("Hello from AWS Polly TTS")
 
         result = await session.wait_for_result(timeout=30.0)
-        assert not result.errors
-        assert len(result.speeches) > 0
+        assert not result.errors
+        assert result.started
+        assert len(result.speeches) > 0

46-48: Avoid temp file leakage; validate WAV artifact.

Use pytest’s tmp_path and check file size.

-    async def test_aws_polly_tts_manual_wav(self, tts: aws_plugin.TTS):
-        await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
+    async def test_aws_polly_tts_manual_wav(self, tts: aws_plugin.TTS, tmp_path):
+        outfile = tmp_path / "polly.wav"
+        path = await manual_tts_to_wav(
+            tts, sample_rate=16000, channels=1, outfile_path=str(outfile)
+        )
+        assert os.path.exists(path)
+        # WAV header is 44 bytes; ensure non-empty audio payload.
+        assert os.path.getsize(path) > 44
plugins/aws/vision_agents/plugins/aws/tts.py (2)

53-57: Configure client timeouts and retries.

Prevents indefinite hangs under network issues.

+from botocore.config import Config
@@
     def client(self):
         if self._client is None:
-            self._client = boto3.client("polly", region_name=self.region_name)
+            cfg = Config(
+                read_timeout=20,
+                connect_timeout=5,
+                retries={"max_attempts": 3, "mode": "standard"},
+            )
+            self._client = boto3.client("polly", region_name=self.region_name, config=cfg)
         return self._client

62-66: Adopt Google‑style docstring for stream_audio.

-        """Synthesize the entire speech to a single PCM buffer.
-
-        Returns PcmData with s16 format and the configured sample rate.
-        """
+        """Synthesize text with Polly and return PCM audio.
+
+        Args:
+            text: Input text or SSML to synthesize.
+            *_, **__: Unused, reserved for BaseTTS compatibility.
+
+        Returns:
+            PcmData with s16 format and the selected sample rate.
+        """

As per coding guidelines.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e5e0cf5 and ff9ebed.

📒 Files selected for processing (5)
  • plugins/aws/README.md (4 hunks)
  • plugins/aws/example/aws_polly_tts_example.py (1 hunks)
  • plugins/aws/tests/test_tts.py (1 hunks)
  • plugins/aws/vision_agents/plugins/aws/__init__.py (1 hunks)
  • plugins/aws/vision_agents/plugins/aws/tts.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • plugins/aws/README.md
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • plugins/aws/tests/test_tts.py
  • plugins/aws/example/aws_polly_tts_example.py
  • plugins/aws/vision_agents/plugins/aws/__init__.py
  • plugins/aws/vision_agents/plugins/aws/tts.py
🧬 Code graph analysis (4)
plugins/aws/tests/test_tts.py (4)
agents-core/vision_agents/core/tts/testing.py (3)
  • TTSSession (23-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
plugins/aws/vision_agents/plugins/aws/tts.py (1)
  • TTS (10-92)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (81-99)
plugins/aws/example/aws_polly_tts_example.py (2)
plugins/aws/vision_agents/plugins/aws/tts.py (1)
  • TTS (10-92)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
plugins/aws/vision_agents/plugins/aws/__init__.py (2)
plugins/aws/tests/test_tts.py (1)
  • tts (29-33)
plugins/aws/vision_agents/plugins/aws/tts.py (1)
  • TTS (10-92)
plugins/aws/vision_agents/plugins/aws/tts.py (2)
plugins/aws/tests/test_tts.py (1)
  • tts (29-33)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (37-505)
  • from_bytes (118-186)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Test "not integration"
🔇 Additional comments (1)
plugins/aws/vision_agents/plugins/aws/__init__.py (1)

3-5: LGTM: export surface updated to include TTS.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
agents-core/vision_agents/core/tts/tts.py (1)

243-258: Critical: Streaming path never marks the final chunk.

All streamed chunks are emitted with is_final_chunk=False (line 254). Downstream consumers cannot detect stream completion, potentially causing audio playback to hang or buffer indefinitely.

Apply a one-element lookahead to mark the final chunk:

             else:
-                async for pcm in self._iter_pcm(response):
-                    bytes_len, dur_ms = self._emit_chunk(
-                        pcm, chunk_index, False, synthesis_id, text, user
-                    )
-                    total_audio_bytes += bytes_len
-                    total_audio_ms += dur_ms
-                    chunk_index += 1
+                ait = self._iter_pcm(response).__aiter__()
+                try:
+                    prev = await ait.__anext__()
+                except StopAsyncIteration:
+                    prev = None
+                while prev is not None:
+                    try:
+                        nxt = await ait.__anext__()
+                        is_final = False
+                    except StopAsyncIteration:
+                        nxt = None
+                        is_final = True
+                    bytes_len, dur_ms = self._emit_chunk(
+                        prev, chunk_index, is_final, synthesis_id, text, user
+                    )
+                    total_audio_bytes += bytes_len
+                    total_audio_ms += dur_ms
+                    chunk_index += 1
+                    prev = nxt
agents-core/vision_agents/core/edge/types.py (1)

272-339: Critical: dtype mismatch and incorrect output format in resample().

Two interrelated bugs:

  1. Input format mismatch (line 303): The AV format is hardcoded based on channel count only, ignoring self.samples.dtype. Passing float32 input will cause AudioFrame.from_ndarray() to fail because it expects int16 data when format is "s16".

  2. Output format inconsistency (line 331): The resampler always outputs s16 (line 311), but the returned PcmData preserves self.format, creating a mismatch between the format field and actual data.

Apply this fix to detect dtype and correct the output format:

-        # Prepare ndarray shape for AV.
-        # Our convention: (channels, samples) for multi-channel, (samples,) for mono.
-        samples = self.samples
-        if samples.ndim == 1:
-            # Mono: reshape to (1, samples) for AV
-            samples = samples.reshape(1, -1)
-        elif samples.ndim == 2:
-            # Already (channels, samples)
-            pass
-
-        # Create AV audio frame from the samples
-        in_layout = "mono" if self.channels == 1 else "stereo"
-        # For multi-channel, use planar format to avoid packed shape errors
-        in_format = "s16" if self.channels == 1 else "s16p"
-        samples = np.ascontiguousarray(samples)
-        frame = av.AudioFrame.from_ndarray(samples, format=in_format, layout=in_layout)
+        # Prepare ndarray shape for AV: (channels, samples)
+        samples = self.samples
+        if samples.ndim == 1:
+            samples = samples.reshape(1, -1)
+        elif samples.ndim != 2:
+            samples = samples.reshape(1, -1)
+        samples = np.ascontiguousarray(samples)
+
+        # Only mono/stereo currently supported
+        if self.channels not in (1, 2):
+            raise NotImplementedError("resample() supports mono or stereo input only")
+        if target_channels not in (1, 2):
+            raise NotImplementedError("resample() supports mono or stereo output only")
+
+        in_layout = "mono" if self.channels == 1 else "stereo"
+        # Pick AV input format based on dtype and planarity
+        if samples.dtype == np.int16:
+            in_format = "s16" if self.channels == 1 else "s16p"
+        elif samples.dtype == np.float32:
+            in_format = "flt" if self.channels == 1 else "fltp"
+        else:
+            samples = samples.astype(np.int16)
+            in_format = "s16" if self.channels == 1 else "s16p"
+
+        frame = av.AudioFrame.from_ndarray(samples, format=in_format, layout=in_layout)
         frame.sample_rate = self.sample_rate
 
         # Create resampler
         out_layout = "mono" if target_channels == 1 else "stereo"
         resampler = av.AudioResampler(
             format="s16", layout=out_layout, rate=target_sample_rate
         )
 
         # Resample the frame
         resampled_frames = resampler.resample(frame)
         if resampled_frames:
             resampled_frame = resampled_frames[0]
             resampled_samples = resampled_frame.to_ndarray()
 
             # AV returns (channels, samples), so for mono we want the first (and only) channel
             if len(resampled_samples.shape) > 1:
                 if target_channels == 1:
                     resampled_samples = resampled_samples[0]
 
             # Convert to int16
             resampled_samples = resampled_samples.astype(np.int16)
 
             return PcmData(
                 samples=resampled_samples,
                 sample_rate=target_sample_rate,
-                format=self.format,
+                format="s16",
                 pts=self.pts,
                 dts=self.dts,
                 time_base=self.time_base,
                 channels=target_channels,
             )
🧹 Nitpick comments (3)
plugins/cartesia/tests/test_tts.py (2)

16-21: Consider adding fixture teardown for resource cleanup.

The fixture creates a TTS instance but doesn't explicitly clean it up. If the Cartesia TTS maintains connections or other resources, consider using a yield pattern with teardown logic to ensure proper cleanup after each test.

Example:

     @pytest_asyncio.fixture
-    async def tts(self) -> cartesia.TTS:  # type: ignore[name-defined]
+    async def tts(self):
         api_key = os.environ.get("CARTESIA_API_KEY")
         if not api_key:
             pytest.skip("CARTESIA_API_KEY env var not set – skipping live API test.")
-        return cartesia.TTS(api_key=api_key)
+        tts_instance = cartesia.TTS(api_key=api_key)
+        yield tts_instance
+        # Add cleanup if needed, e.g.:
+        # await tts_instance.close()

Additionally, the # type: ignore[name-defined] comment suggests potential typing issues. If cartesia.TTS isn't properly exported or typed in the plugin module, consider addressing that or simplifying the type hint as shown above.


33-35: Add assertions to verify WAV file generation.

The manual_tts_to_wav helper returns the path to the generated WAV file, but this test doesn't verify the output. Even for a "manual test," automated validation would strengthen coverage.

     @pytest.mark.integration
     async def test_cartesia_tts_convert_text_to_audio_manual_test(self, tts):
-        await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
+        wav_path = await manual_tts_to_wav(tts, sample_rate=16000, channels=1)
+        assert os.path.exists(wav_path), f"WAV file not created at {wav_path}"
+        assert os.path.getsize(wav_path) > 100, "WAV file appears empty or corrupted"
plugins/fish/tests/test_fish_tts.py (1)

14-16: Consider adding API key check to skip gracefully when credentials are missing.

The fixture instantiates fish.TTS() unconditionally. If FISH_API_KEY or FISH_AUDIO_API_KEY are not set, tests will fail rather than skip. The ElevenLabs tests (lines 13-17 in test_tts.py) demonstrate this pattern.

Apply this diff to add a skip check:

     @pytest_asyncio.fixture
     async def tts(self) -> fish.TTS:
+        import os
+        if not (os.environ.get("FISH_API_KEY") or os.environ.get("FISH_AUDIO_API_KEY")):
+            pytest.skip("FISH_API_KEY/FISH_AUDIO_API_KEY not set; skipping Fish TTS tests.")
         return fish.TTS()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ff9ebed and b7c57e9.

📒 Files selected for processing (16)
  • agents-core/vision_agents/core/agents/agents.py (5 hunks)
  • agents-core/vision_agents/core/edge/edge_transport.py (3 hunks)
  • agents-core/vision_agents/core/edge/types.py (6 hunks)
  • agents-core/vision_agents/core/tts/testing.py (1 hunks)
  • agents-core/vision_agents/core/tts/tts.py (5 hunks)
  • docs/ai/instructions/ai-tests.md (1 hunks)
  • docs/ai/instructions/ai-tts.md (1 hunks)
  • examples/01_simple_agent_example/simple_agent_example.py (1 hunks)
  • plugins/aws/tests/test_tts.py (1 hunks)
  • plugins/aws/vision_agents/plugins/aws/tts.py (1 hunks)
  • plugins/cartesia/tests/test_tts.py (1 hunks)
  • plugins/elevenlabs/tests/test_tts.py (1 hunks)
  • plugins/fish/tests/test_fish_tts.py (1 hunks)
  • plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (3 hunks)
  • plugins/openai/tests/test_tts_openai.py (1 hunks)
  • tests/test_tts_base.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • plugins/aws/tests/test_tts.py
  • plugins/aws/vision_agents/plugins/aws/tts.py
  • examples/01_simple_agent_example/simple_agent_example.py
  • plugins/openai/tests/test_tts_openai.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py
  • agents-core/vision_agents/core/agents/agents.py
  • tests/test_tts_base.py
  • plugins/fish/tests/test_fish_tts.py
  • agents-core/vision_agents/core/tts/tts.py
  • agents-core/vision_agents/core/edge/edge_transport.py
  • agents-core/vision_agents/core/tts/testing.py
  • plugins/elevenlabs/tests/test_tts.py
  • plugins/cartesia/tests/test_tts.py
  • agents-core/vision_agents/core/edge/types.py
tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

tests/**/*.py: Never use mocking utilities (e.g., unittest.mock, pytest-mock) in test files
Write tests using pytest (avoid unittest.TestCase or other frameworks)
Mark integration tests with @pytest.mark.integration
Do not use @pytest.mark.asyncio; async support is automatic

Files:

  • tests/test_tts_base.py
🧠 Learnings (1)
📚 Learning: 2025-10-20T19:23:41.259Z
Learnt from: CR
PR: GetStream/Vision-Agents#0
File: .cursor/rules/python.mdc:0-0
Timestamp: 2025-10-20T19:23:41.259Z
Learning: Applies to tests/**/*.py : Do not use pytest.mark.asyncio; async support is automatic

Applied to files:

  • tests/test_tts_base.py
🧬 Code graph analysis (10)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (2)
agents-core/vision_agents/core/edge/types.py (1)
  • OutputAudioTrack (47-55)
agents-core/vision_agents/core/edge/edge_transport.py (1)
  • create_audio_track (34-35)
agents-core/vision_agents/core/agents/agents.py (5)
agents-core/vision_agents/core/edge/types.py (2)
  • OutputAudioTrack (47-55)
  • write (53-53)
agents-core/vision_agents/core/tts/events.py (1)
  • TTSAudioEvent (10-21)
agents-core/vision_agents/core/edge/edge_transport.py (1)
  • create_audio_track (34-35)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (1)
  • create_audio_track (291-296)
agents-core/vision_agents/core/tts/tts.py (1)
  • set_output_format (72-90)
tests/test_tts_base.py (4)
agents-core/vision_agents/core/tts/tts.py (4)
  • TTS (31-315)
  • stream_audio (163-186)
  • set_output_format (72-90)
  • send (202-303)
agents-core/vision_agents/core/edge/types.py (2)
  • PcmData (58-526)
  • from_bytes (139-207)
agents-core/vision_agents/core/tts/testing.py (3)
  • TTSSession (23-81)
  • speeches (63-64)
  • errors (67-68)
agents-core/vision_agents/core/events/manager.py (1)
  • wait (470-484)
plugins/fish/tests/test_fish_tts.py (3)
plugins/aws/tests/test_tts.py (1)
  • tts (29-33)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (5)
  • TTSSession (23-81)
  • assert_tts_send_non_blocking (130-160)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/tts.py (3)
agents-core/vision_agents/core/events/base.py (2)
  • PluginClosedEvent (67-74)
  • AudioFormat (23-30)
agents-core/vision_agents/core/edge/types.py (6)
  • PcmData (58-526)
  • from_bytes (139-207)
  • resample (272-339)
  • to_bytes (341-357)
  • duration_ms (122-124)
  • close (42-43)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
  • TTSErrorEvent (51-64)
agents-core/vision_agents/core/edge/edge_transport.py (2)
agents-core/vision_agents/core/edge/types.py (2)
  • User (23-26)
  • OutputAudioTrack (47-55)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (2)
  • create_audio_track (291-296)
  • add_track_subscriber (301-304)
agents-core/vision_agents/core/tts/testing.py (2)
agents-core/vision_agents/core/tts/tts.py (3)
  • TTS (31-315)
  • set_output_format (72-90)
  • send (202-303)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSErrorEvent (51-64)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
plugins/elevenlabs/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/testing.py (5)
  • TTSSession (23-81)
  • assert_tts_send_non_blocking (130-160)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
plugins/cartesia/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (13-82)
agents-core/vision_agents/core/tts/testing.py (5)
  • TTSSession (23-81)
  • assert_tts_send_non_blocking (130-160)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/edge/types.py (3)
agents-core/vision_agents/core/agents/agents.py (1)
  • close (438-509)
agents-core/vision_agents/core/edge/edge_transport.py (1)
  • close (38-39)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (2)
  • close (40-41)
  • close (327-329)
🪛 LanguageTool
docs/ai/instructions/ai-tts.md

[style] ~27-~27: To form a complete sentence, be sure to include a subject or ‘there’.
Context: ...=1, format="s16") ``` - stop_audio can be a no-op ## init The plugin con...

(MISSING_IT_THERE)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
🔇 Additional comments (28)
plugins/cartesia/tests/test_tts.py (3)

1-12: LGTM—Integration test setup is clean.

The dotenv loading and imports align well with the shift to integration-style testing. Module-level load_dotenv() is appropriate for test files.


23-31: LGTM—Integration test correctly uses TTSSession pattern.

The test properly configures the output format, collects events via TTSSession, and validates both error conditions and audio generation. The 30-second timeout is appropriate for a real API call.


37-39: LGTM—Non-blocking test uses appropriate helper.

The test correctly delegates to assert_tts_send_non_blocking, which includes built-in assertions to verify that tts.send() doesn't block the event loop.

plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (2)

25-25: LGTM—Import and formatting align with protocol changes.

The OutputAudioTrack import and boolean check formatting support the PR's PCM-first refactor without altering behavior.

Also applies to: 107-107


291-296: LGTM—Signature matches abstract method and returns protocol-compliant track.

The multi-line signature and OutputAudioTrack return type align with the updated EdgeTransport interface.

docs/ai/instructions/ai-tests.md (1)

9-21: LGTM—Non-blocking test documentation is clear and follows guidelines.

The example correctly omits @pytest.mark.asyncio while keeping @pytest.mark.integration, per coding guidelines.

tests/test_tts_base.py (3)

8-48: LGTM—Stereo-to-mono test validates channel reduction correctly.

The dummy TTS creates interleaved stereo PCM; the test confirms mono output is approximately half the size, which aligns with expected behavior.


19-61: LGTM—Resample test validates downsampling correctly.

Downsampling from 16kHz to 8kHz should approximately halve the byte count; the assertion range accounts for resampling artifacts.


30-71: LGTM—Error handling test validates exception propagation and event emission.

The test confirms that errors raised in stream_audio both propagate to the caller and emit TTSErrorEvent.

docs/ai/instructions/ai-tts.md (1)

1-53: LGTM—TTS plugin guide is clear and accurate.

The documentation provides a comprehensive, well-structured guide for building TTS plugins. Past typos have been corrected.

agents-core/vision_agents/core/edge/edge_transport.py (1)

12-12: LGTM—Abstract interface updated to use OutputAudioTrack protocol.

Import and method signature changes align with the PCM-first refactor and are consistently implemented across concrete transport classes.

Also applies to: 34-35, 58-60

plugins/fish/tests/test_fish_tts.py (1)

18-36: LGTM—Tests follow integration patterns and use proper helpers.

The three tests correctly use manual_tts_to_wav, TTSSession, and assert_tts_send_non_blocking, aligning with the updated testing guidelines.

plugins/elevenlabs/tests/test_tts.py (1)

10-38: LGTM—ElevenLabs tests are well-structured and include proper credential checks.

The fixture gracefully skips when ELEVENLABS_API_KEY is absent, and all three integration tests use the recommended patterns (TTSSession, manual_tts_to_wav, assert_tts_send_non_blocking).

agents-core/vision_agents/core/tts/testing.py (3)

15-81: LGTM—TTSSession provides clean event-driven test helpers.

The session subscribes to key TTS events and exposes accumulated speeches/errors through properties. The wait_for_result timeout pattern ensures tests don't hang indefinitely.


84-127: LGTM—Event loop probe measures responsiveness effectively.

The ticker task counts intervals while the target coroutine runs, detecting blocking behavior. The finally block ensures cleanup even if the coroutine raises.
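
The probe technique is simple enough to sketch independently of the helper (a self-contained illustration, not the testing.py implementation):

import asyncio
import time

async def probe_event_loop(coro, interval: float = 0.01):
    # Count ticker wakeups while `coro` runs; a blocked event loop
    # suppresses the ticks, which is what the assertion later checks.
    ticks = 0

    async def _ticker():
        nonlocal ticks
        while True:
            await asyncio.sleep(interval)
            ticks += 1

    task = asyncio.create_task(_ticker())
    start = time.perf_counter()
    try:
        await coro
    finally:
        task.cancel()
    return ticks, time.perf_counter() - start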


130-160: LGTM—Non-blocking assertion provides robust detection of event loop blocking.

The helper asserts sufficient tick count only when the call duration justifies it, avoiding false positives for fast completions. The returned probe result allows tests to inspect metrics further.
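
Put together, a typical integration test built on these helpers looks roughly like this (assuming assert_tts_send_non_blocking takes the TTS instance, as the plugin tests in this PR do):

import pytest
from vision_agents.core.tts.testing import TTSSession, assert_tts_send_non_blocking

@pytest.mark.integration
async def test_tts_events_and_responsiveness(tts):
    tts.set_output_format(sample_rate=16000, channels=1)
    session = TTSSession(tts)

    await tts.send("Hello from the test suite")
    result = await session.wait_for_result(timeout=30.0)
    assert not result.errors
    assert len(result.speeches) > 0

    # Separately verify that send() keeps the event loop responsive.
    await assert_tts_send_non_blocking(tts)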

agents-core/vision_agents/core/agents/agents.py (2)

163-163: LGTM: Protocol-based typing improves decoupling.

The type change from a concrete aiortc.AudioStreamTrack to the OutputAudioTrack Protocol aligns well with the PR's decoupling objectives.


1029-1041: No action required—TTS resampling architecture is sound.

Verification confirms all six TTS provider implementations return compatible types (PcmData or iterators thereof). The base class properly normalizes these via _normalize_to_pcm() and resamples during emission using pcm.resample(self._desired_sample_rate, self._desired_channels). The hardcoded 48kHz stereo is a WebRTC standard, and any resampling failure will throw an exception rather than silently degrade. All TTS providers can handle the requested output format without compatibility issues.
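
That resampling step in isolation, for reference (raw_s16 is a placeholder for provider output; rates match the WebRTC defaults mentioned above):

from vision_agents.core.edge.types import PcmData

raw_s16 = b"\x00\x00" * 24000  # placeholder: 1 s of silence, s16 mono @ 24 kHz
pcm = PcmData.from_bytes(raw_s16, sample_rate=24000, channels=1, format="s16")
out = pcm.resample(48000, 2)  # 48 kHz stereo, the WebRTC default used by the agent
track_bytes = out.to_bytes()  # interleaved s16, ready for the output track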

agents-core/vision_agents/core/tts/tts.py (3)

111-127: LGTM: Comprehensive response normalization.

The _iter_pcm generator correctly handles multiple provider response shapes (single buffer, async/sync iterables) and avoids the pitfall of treating bytes as an iterable of integers.
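
Conceptually, the dispatch order matters most; a simplified sketch of the idea (provider_rate is an assumption here, since wrapping raw bytes needs the provider's native sample rate, which the real implementation knows):

from vision_agents.core.edge.types import PcmData

async def iter_pcm(response, provider_rate: int = 16000):
    # Normalize any provider response shape into a stream of PcmData.
    if isinstance(response, PcmData):
        yield response
    elif isinstance(response, (bytes, bytearray)):
        # bytes must be handled before generic iterables; iterating raw
        # bytes would otherwise yield individual integers.
        yield PcmData.from_bytes(
            bytes(response), sample_rate=provider_rate, channels=1, format="s16"
        )
    elif hasattr(response, "__aiter__"):
        async for item in response:
            async for pcm in iter_pcm(item, provider_rate):
                yield pcm
    elif hasattr(response, "__iter__"):
        for item in response:
            async for pcm in iter_pcm(item, provider_rate):
                yield pcm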


129-160: LGTM: Clean resampling and event emission.

The _emit_chunk method correctly resamples to the desired format, emits metrics, and returns both byte length and duration for accurate tracking.


283-303: LGTM: Comprehensive error handling and observability.

The error path correctly emits events, records metrics, and ensures latency is always tracked via the finally block, even on failure.

agents-core/vision_agents/core/edge/types.py (7)

46-55: LGTM: Clean Protocol definition for audio output.

The OutputAudioTrack Protocol with write and stop methods provides a clear, runtime-checkable interface for decoupling.


77-77: LGTM: Multi-channel support with correct duration calculation.

The channels field and updated duration property correctly handle 2D arrays with shape (channels, samples).

Also applies to: 92-96, 121-124


139-207: LGTM: Robust multi-channel PCM parsing.

The from_bytes method correctly aligns buffers, determines dtype from format, and converts interleaved multi-channel data to planar (channels, samples) representation with proper error handling.


209-270: LGTM: Flexible PcmData construction from multiple input types.

The from_data method handles bytes-like and numpy arrays with various shapes, normalizing to the canonical (channels, samples) representation with proper dtype alignment and fallback logic.


341-357: LGTM: Correct interleaving for multi-channel output.

The to_bytes method correctly transposes (channels, samples) to (samples, channels) and flattens to produce interleaved PCM bytes.
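
Both directions (interleaved bytes to planar arrays in from_bytes, and back in to_bytes) reduce to a transpose in numpy; a tiny standalone illustration:

import numpy as np

planar = np.array([[1, 2, 3], [10, 20, 30]], dtype=np.int16)  # (channels, samples)
interleaved = planar.T.reshape(-1)  # [1, 10, 2, 20, 3, 30]
raw = interleaved.tobytes()         # interleaved s16 PCM

# And back: interleaved bytes -> planar (channels, samples)
restored = np.frombuffer(raw, dtype=np.int16).reshape(-1, 2).T
assert np.array_equal(restored, planar)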


359-400: LGTM: Complete WAV export with proper format conversion.

The to_wav_bytes method handles format conversion (f32 → s16 with clipping), constructs proper WAV headers, and supports multi-channel output.
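
For comparison, the same f32-to-s16 clipping plus WAV framing can be written with the stdlib wave module; a minimal sketch independent of the PcmData implementation (multi-channel input is assumed to be interleaved already):

import io
import wave
import numpy as np

def f32_to_wav_bytes(samples: np.ndarray, sample_rate: int, channels: int) -> bytes:
    # Clip to [-1, 1] before scaling so out-of-range floats saturate
    # instead of wrapping around in int16.
    s16 = (np.clip(samples, -1.0, 1.0) * 32767.0).astype(np.int16)
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)  # s16 = 2 bytes per sample
        w.setframerate(sample_rate)
        w.writeframes(s16.tobytes())
    return buf.getvalue()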


402-526: LGTM: Comprehensive provider response normalization.

The from_response factory method handles diverse provider response shapes (bytes, iterables, async iterables, PcmData, objects with .data) and includes proper frame alignment buffering with zero-padding for partial frames.
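
The frame-alignment buffering described here can be sketched in isolation (frame_size = bytes per sample times channels; this is an illustration, not the actual from_response code):

def align_frames(buffer: bytearray, frame_size: int, final: bool) -> bytes:
    # Emit only whole frames, keeping the remainder buffered; on the final
    # chunk, zero-pad a trailing partial frame instead of dropping it.
    cut = len(buffer) - (len(buffer) % frame_size)
    out = bytes(buffer[:cut])
    del buffer[:cut]
    if final and buffer:
        out += bytes(buffer) + b"\x00" * (frame_size - len(buffer))
        buffer.clear()
    return out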

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 16

♻️ Duplicate comments (4)
agents-core/vision_agents/core/tts/manual_test.py (1)

42-53: Audio may be truncated: waits only for first event.

TTSSession.wait_for_result() returns after the first audio/error event arrives (see its _first_event.wait() implementation). Writing the WAV immediately can produce partial audio because additional speech chunks may still be streaming in. The function should drain events until synthesis completes or no new chunks arrive for a brief window.

Consider implementing a drain loop as suggested in the previous review:

 async def manual_tts_to_wav(
     tts: TTS,
     *,
     sample_rate: int = 16000,
     channels: int = 1,
     text: str = "This is a manual TTS playback test.",
     outfile_path: Optional[str] = None,
     timeout_s: float = 20.0,
+    drain_s: float = 1.0,
 ) -> str:
@@
     tts.set_output_format(sample_rate=sample_rate, channels=channels)
     session = TTSSession(tts)
     await tts.send(text)
     result = await session.wait_for_result(timeout=timeout_s)
     if result.errors:
         raise RuntimeError(f"TTS errors: {result.errors}")
 
+    # Drain until quiet to collect full utterance
+    import asyncio
+    last_len = len(session.speeches)
+    idle_deadline = time.time() + drain_s
+    while time.time() < idle_deadline:
+        await asyncio.sleep(0.05)
+        if len(session.speeches) != last_len:
+            last_len = len(session.speeches)
+            idle_deadline = time.time() + drain_s
+
     # Convert captured audio to PcmData
-    pcm_bytes = b"".join(result.speeches)
+    pcm_bytes = b"".join(session.speeches)
     pcm = PcmData.from_bytes(
         pcm_bytes, sample_rate=sample_rate, channels=channels, format="s16"
     )
plugins/fish/tests/test_fish_tts.py (1)

1-7: Skip integration tests gracefully when API keys are absent (put check in the fixture).

Without FISH_API_KEY/FISH_AUDIO_API_KEY these tests will fail. Skip early in the fixture and import os.

Apply this diff:

@@
-import pytest
+import os
+import pytest
 import pytest_asyncio
@@
 class TestFishTTS:
     @pytest_asyncio.fixture
     async def tts(self) -> fish.TTS:
-        return fish.TTS()
+        if not (os.environ.get("FISH_API_KEY") or os.environ.get("FISH_AUDIO_API_KEY")):
+            pytest.skip("FISH_API_KEY/FISH_AUDIO_API_KEY not set; skipping integration tests.")
+        return fish.TTS()

Also applies to: 13-17

agents-core/vision_agents/core/tts/tts.py (1)

289-296: Mark the final streamed chunk with is_final_chunk=True.

Downstream can’t know when to close; add one‑element lookahead.

Apply this diff:

-            else:
-                async for pcm in self._iter_pcm(response):
-                    bytes_len, dur_ms = self._emit_chunk(
-                        pcm, chunk_index, False, synthesis_id, text, user
-                    )
-                    total_audio_bytes += bytes_len
-                    total_audio_ms += dur_ms
-                    chunk_index += 1
+            else:
+                ait = self._iter_pcm(response)
+                try:
+                    prev = await ait.__anext__()
+                except StopAsyncIteration:
+                    prev = None
+                while prev is not None:
+                    try:
+                        nxt = await ait.__anext__()
+                        is_final = False
+                    except StopAsyncIteration:
+                        nxt = None
+                        is_final = True
+                    bytes_len, dur_ms = self._emit_chunk(
+                        prev, chunk_index, is_final, synthesis_id, text, user
+                    )
+                    total_audio_bytes += bytes_len
+                    total_audio_ms += dur_ms
+                    chunk_index += 1
+                    prev = nxt
agents-core/vision_agents/core/edge/types.py (1)

322-352: resample: choose AV input format based on dtype; current code breaks on float32.

AudioFrame.from_ndarray(..., format="s16p") assumes int16; f32 inputs will misparse or fail. Detect dtype (s16 vs f32) and pick s16/s16p or flt/fltp accordingly.

Apply this diff:

-        # Prepare ndarray shape for AV input frame.
-        # Use planar input (s16p) with shape (channels, samples).
-        in_layout = "mono" if self.channels == 1 else "stereo"
+        # Prepare ndarray shape for AV input frame.
+        # Use planar input shape (channels, samples); pick format by dtype.
+        in_layout = "mono" if self.channels == 1 else "stereo"
         cmaj = self.samples
         if isinstance(cmaj, np.ndarray):
@@
-            cmaj = np.ascontiguousarray(cmaj)
-        frame = av.AudioFrame.from_ndarray(cmaj, format="s16p", layout=in_layout)
+            cmaj = np.ascontiguousarray(cmaj)
+        # Select AV input format matching dtype
+        if isinstance(cmaj, np.ndarray):
+            if cmaj.dtype == np.int16:
+                in_format = "s16" if self.channels == 1 else "s16p"
+            elif cmaj.dtype == np.float32:
+                in_format = "flt" if self.channels == 1 else "fltp"
+            else:
+                cmaj = cmaj.astype(np.int16)
+                in_format = "s16" if self.channels == 1 else "s16p"
+        else:
+            # bytes or other: assume s16 mono/stereo by channels
+            in_format = "s16" if self.channels == 1 else "s16p"
+        frame = av.AudioFrame.from_ndarray(cmaj, format=in_format, layout=in_layout)
🧹 Nitpick comments (8)
plugins/kokoro/tests/test_tts.py (1)

8-11: Consider using pytest.importorskip for cleaner imports.

The current try/except pattern works but pytest.importorskip provides a more idiomatic approach for conditional test skipping based on import availability, and avoids the broad Exception catch.

Apply this diff:

     def tts(self):  # returns kokoro TTS if available
-        try:
-            import kokoro  # noqa: F401
-        except Exception:
-            pytest.skip("kokoro package not installed; skipping manual playback test.")
+        pytest.importorskip("kokoro", reason="kokoro package not installed")
         from vision_agents.plugins import kokoro as kokoro_plugin
tests/test_resample_quality.py (1)

144-146: Remove unnecessary main block.

Pytest automatically discovers and runs test functions. The if __name__ == "__main__" block is unnecessary and bypasses pytest's fixture system (like tmp_path), potentially causing the tests to fail when run directly.

Apply this diff:

-
-if __name__ == "__main__":
-    test_compare_resampling_methods()
-    test_pyav_resampler_settings()

Run tests using: pytest tests/test_resample_quality.py

plugins/cartesia/tests/test_tts.py (1)

33-35: Consider adding assertions for the manual WAV test.

The test calls manual_tts_to_wav but doesn't verify the result. Consider asserting that the returned path exists and the file has non-zero size.

 @pytest.mark.integration
 async def test_cartesia_tts_convert_text_to_audio_manual_test(self, tts):
-    await manual_tts_to_wav(tts, sample_rate=48000, channels=2)
+    wav_path = await manual_tts_to_wav(tts, sample_rate=48000, channels=2)
+    assert os.path.exists(wav_path)
+    assert os.path.getsize(wav_path) > 0
agents-core/vision_agents/core/tts/manual_test.py (1)

55-64: Consider ensuring parent directory exists for custom paths.

If a user provides a custom outfile_path with non-existent parent directories, the write operation will fail. Adding directory creation would make the function more robust.

     # Generate a descriptive filename if not provided
     if outfile_path is None:
         tmpdir = tempfile.gettempdir()
         timestamp = int(time.time())
         outfile_path = os.path.join(
             tmpdir, f"tts_manual_test_{tts.__class__.__name__}_{timestamp}.wav"
         )
+    else:
+        # Ensure parent directory exists if custom path provided
+        parent_dir = os.path.dirname(outfile_path)
+        if parent_dir:
+            os.makedirs(parent_dir, exist_ok=True)
 
     # Use utility function to write WAV and optionally play
     return await play_pcm_with_ffplay(pcm, outfile_path=outfile_path, timeout_s=30.0)
plugins/elevenlabs/tests/test_tts.py (1)

31-34: Add assertions and prefer pytest output mechanisms over print.

The test lacks assertions to verify the WAV file was created successfully, and uses print() which may not appear in pytest output as expected.

Consider this refinement:

 @pytest.mark.integration
 async def test_elevenlabs_tts_convert_text_to_audio_manual_test(self, tts):
     path = await manual_tts_to_wav(tts, sample_rate=48000, channels=2)
-    print("ElevenLabs TTS audio written to:", path)
+    assert os.path.exists(path), f"WAV file not created at {path}"
+    assert os.path.getsize(path) > 0, f"WAV file is empty at {path}"
DEVELOPMENT.md (1)

171-178: Clarify optional playback behavior.

Mention that playback requires ffplay on PATH (already true) and is optional. Consider adding an env gate (e.g., FFPLAY=1) to avoid accidental audio during CI.

Would you like a small patch to gate playback behind an env var?

tests/test_pcm_data.py (1)

92-101: Minor: prefer pytest.approx and linspace endpoint handling.

Use pytest.approx for tolerances and np.linspace(..., endpoint=False) to avoid off-by-one artifacts in 1s signals.

Example:

- t = np.linspace(0, duration_sec, num_samples, dtype=np.float32)
+ t = np.linspace(0, duration_sec, num_samples, endpoint=False, dtype=np.float32)
@@
- assert abs(mono_duration - duration_sec) < 0.01
+ import pytest
+ assert mono_duration == pytest.approx(duration_sec, abs=0.01)

Also applies to: 118-122

agents-core/vision_agents/core/edge/types.py (1)

650-704: Optional: gate ffplay playback behind an env var to avoid accidental audio in CI.

Play only if FFPLAY=1 (or another opt‑in) in addition to ffplay presence.

Apply this diff:

-    # Optional playback with ffplay
-    if shutil.which("ffplay"):
+    # Optional playback with ffplay (enable by setting FFPLAY=1)
+    if os.environ.get("FFPLAY") == "1" and shutil.which("ffplay"):
         logger.info("Playing audio with ffplay...")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between b7c57e9 and 353041b.

📒 Files selected for processing (15)
  • DEVELOPMENT.md (1 hunks)
  • agents-core/vision_agents/core/edge/types.py (6 hunks)
  • agents-core/vision_agents/core/tts/manual_test.py (1 hunks)
  • agents-core/vision_agents/core/tts/tts.py (5 hunks)
  • conftest.py (9 hunks)
  • plugins/aws/README.md (4 hunks)
  • plugins/aws/example/aws_polly_tts_example.py (1 hunks)
  • plugins/aws/tests/test_tts.py (1 hunks)
  • plugins/cartesia/tests/test_tts.py (1 hunks)
  • plugins/elevenlabs/tests/test_tts.py (1 hunks)
  • plugins/fish/tests/test_fish_tts.py (1 hunks)
  • plugins/kokoro/tests/test_tts.py (1 hunks)
  • plugins/openai/tests/test_tts_openai.py (1 hunks)
  • tests/test_pcm_data.py (1 hunks)
  • tests/test_resample_quality.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • conftest.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • plugins/aws/tests/test_tts.py
  • plugins/openai/tests/test_tts_openai.py
  • plugins/aws/example/aws_polly_tts_example.py
  • plugins/aws/README.md
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • plugins/cartesia/tests/test_tts.py
  • tests/test_pcm_data.py
  • agents-core/vision_agents/core/tts/manual_test.py
  • tests/test_resample_quality.py
  • plugins/fish/tests/test_fish_tts.py
  • plugins/kokoro/tests/test_tts.py
  • plugins/elevenlabs/tests/test_tts.py
  • agents-core/vision_agents/core/tts/tts.py
  • agents-core/vision_agents/core/edge/types.py
tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

tests/**/*.py: Never use mocking utilities (e.g., unittest.mock, pytest-mock) in test files
Write tests using pytest (avoid unittest.TestCase or other frameworks)
Mark integration tests with @pytest.mark.integration
Do not use @pytest.mark.asyncio; async support is automatic

Files:

  • tests/test_pcm_data.py
  • tests/test_resample_quality.py
🧬 Code graph analysis (9)
plugins/cartesia/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (12-64)
agents-core/vision_agents/core/tts/testing.py (5)
  • TTSSession (23-81)
  • assert_tts_send_non_blocking (130-160)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
tests/test_pcm_data.py (1)
agents-core/vision_agents/core/edge/types.py (4)
  • PcmData (63-647)
  • to_bytes (441-478)
  • resample (298-439)
  • duration (89-145)
agents-core/vision_agents/core/tts/manual_test.py (2)
agents-core/vision_agents/core/tts/testing.py (4)
  • TTSSession (23-81)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/edge/types.py (3)
  • PcmData (63-647)
  • play_pcm_with_ffplay (650-704)
  • from_bytes (165-233)
tests/test_resample_quality.py (1)
agents-core/vision_agents/core/edge/types.py (3)
  • PcmData (63-647)
  • duration (89-145)
  • resample (298-439)
plugins/fish/tests/test_fish_tts.py (2)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (12-64)
agents-core/vision_agents/core/tts/testing.py (5)
  • TTSSession (23-81)
  • assert_tts_send_non_blocking (130-160)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
plugins/kokoro/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (12-64)
plugins/kokoro/vision_agents/plugins/kokoro/tts.py (1)
  • TTS (18-77)
plugins/elevenlabs/tests/test_tts.py (2)
agents-core/vision_agents/core/tts/testing.py (5)
  • TTSSession (23-81)
  • assert_tts_send_non_blocking (130-160)
  • wait_for_result (70-81)
  • errors (67-68)
  • speeches (63-64)
agents-core/vision_agents/core/tts/manual_test.py (1)
  • manual_tts_to_wav (12-64)
agents-core/vision_agents/core/tts/tts.py (3)
agents-core/vision_agents/core/events/base.py (2)
  • PluginClosedEvent (67-74)
  • AudioFormat (23-30)
agents-core/vision_agents/core/edge/types.py (5)
  • PcmData (63-647)
  • resample (298-439)
  • to_bytes (441-478)
  • duration_ms (148-150)
  • close (47-48)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
  • TTSErrorEvent (51-64)
agents-core/vision_agents/core/edge/types.py (3)
agents-core/vision_agents/core/edge/edge_transport.py (2)
  • close (38-39)
  • join (46-47)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (3)
  • close (40-41)
  • close (327-329)
  • join (237-289)
agents-core/vision_agents/core/agents/agents.py (2)
  • close (438-509)
  • join (342-414)
🪛 LanguageTool
DEVELOPMENT.md

[grammar] ~117-~117: Ensure spelling is correct
Context: ...ork with different PCM formats, usually 16khz mono 3. PCM data is always passed aroun...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~117-~117: Ensure spelling is correct
Context: ...ifferent PCM formats, usually 16khz mono 3. PCM data is always passed around using t...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Test "not integration"
  • GitHub Check: unit / Ruff & mypy
🔇 Additional comments (8)
plugins/kokoro/tests/test_tts.py (1)

1-2: LGTM: Clean imports.

Import organization follows conventions and complies with coding guidelines.

plugins/cartesia/tests/test_tts.py (3)

1-12: LGTM! Clean integration test setup.

The imports and environment loading are well-structured for the new integration testing approach.


15-21: LGTM! Proper environment-gated fixture.

The fixture correctly skips tests when the API key is unavailable, making the integration tests safe to run in CI without credentials.


23-31: LGTM! Well-structured integration tests.

These tests properly validate real API interaction and non-blocking behavior using the established testing utilities.

Also applies to: 37-39

agents-core/vision_agents/core/tts/manual_test.py (1)

1-10: LGTM! Imports are clean and necessary.

plugins/elevenlabs/tests/test_tts.py (3)

1-8: LGTM!

The imports are clean and appropriate for integration-style testing with pytest-asyncio fixtures.


20-29: LGTM!

The test correctly uses TTSSession to capture events and validate TTS behavior with appropriate assertions.


36-38: LGTM!

The non-blocking assertion properly verifies that the TTS send operation doesn't block the event loop.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
agents-core/vision_agents/core/tts/tts.py (2)

74-93: Format mismatch: events claim arbitrary format but pipeline only emits s16 bytes

The pipeline hardcodes format="s16" in the resampler (line 123), yet set_output_format accepts any AudioFormat and propagates it to events (line 197). When audio_format=AudioFormat.PCM_F32 is passed, TTSAudioEvent metadata will claim f32 while audio_data contains s16 bytes, breaking downstream parsers.

Clamp to PCM_S16 until f32 support is implemented:

     def set_output_format(
         self,
         sample_rate: int,
         channels: int = 1,
         audio_format: AudioFormat = AudioFormat.PCM_S16,
     ) -> None:
         """Set the desired output audio format for emitted events.
 
         The agent should call this with its output track properties so this
         TTS instance can resample and rechannel audio appropriately.
 
         Args:
             sample_rate: Desired sample rate in Hz (e.g., 48000)
             channels: Desired channel count (1 for mono, 2 for stereo)
             audio_format: Desired audio format (defaults to PCM S16)
         """
+        if audio_format != AudioFormat.PCM_S16:
+            logger.warning(
+                "Only PCM_S16 is currently supported; %s will be coerced to PCM_S16",
+                audio_format.value,
+            )
+            audio_format = AudioFormat.PCM_S16
         self._desired_sample_rate = int(sample_rate)
         self._desired_channels = int(channels)
         self._desired_format = audio_format

290-306: Streaming consumers never see the final chunk

All chunks are emitted with is_final_chunk=False (line 301). Downstream consumers waiting for finalization will hang or require timeouts.

Use one-element lookahead to mark the last chunk:

             else:
-                async for pcm in self._iter_pcm(response):
-                    bytes_len, dur_ms = self._emit_chunk(
-                        pcm, chunk_index, False, synthesis_id, text, user
-                    )
-                    total_audio_bytes += bytes_len
-                    total_audio_ms += dur_ms
-                    chunk_index += 1
+                ait = self._iter_pcm(response)
+                prev = None
+                try:
+                    prev = await ait.__anext__()
+                except StopAsyncIteration:
+                    pass
+                while prev is not None:
+                    try:
+                        nxt = await ait.__anext__()
+                        is_final = False
+                    except StopAsyncIteration:
+                        nxt = None
+                        is_final = True
+                    bytes_len, dur_ms = self._emit_chunk(
+                        prev, chunk_index, is_final, synthesis_id, text, user
+                    )
+                    total_audio_bytes += bytes_len
+                    total_audio_ms += dur_ms
+                    chunk_index += 1
+                    prev = nxt
🧹 Nitpick comments (3)
tests/test_utils.py (1)

367-373: Consider extracting duplicate array dimension handling.

The same array dimension logic appears in both test methods. While not critical, extracting this into a small helper function would reduce duplication and improve maintainability.

Example helper:

def get_sample_count(pcm_data: PcmData) -> int:
    """Extract sample count from PcmData, handling both 1D and 2D arrays."""
    return (
        pcm_data.samples.shape[-1]
        if pcm_data.samples.ndim > 1
        else len(pcm_data.samples)
    )

Then use it in both tests:

-        num_samples = (
-            resampled.samples.shape[-1]
-            if resampled.samples.ndim > 1
-            else len(resampled.samples)
-        )
+        num_samples = get_sample_count(resampled)

Also applies to: 388-394

agents-core/vision_agents/core/edge/types.py (2)

89-150: Duration calculation handles ambiguous array shapes defensively

The logic at lines 100-117 infers which dimension represents samples vs. channels by comparing shapes to self.channels. For ambiguous cases (e.g., 2×2 arrays), it picks the max dimension (line 115), which is a reasonable heuristic.

Consider documenting the shape assumption in the class docstring to clarify that the internal convention is (channels, samples) and that (samples, channels) is auto-detected.
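
For illustration, the convention could be spelled out alongside a sketch like this (infer_sample_count is a hypothetical helper name; the actual logic lives inside PcmData.duration):

    import numpy as np

    def infer_sample_count(samples: np.ndarray, channels: int) -> int:
        """Guess which axis of a 2D PCM array holds samples.

        Assumes the internal convention is (channels, samples); a
        (samples, channels) layout is detected by matching an axis to the
        declared channel count.
        """
        if samples.ndim == 1:
            return samples.shape[0]
        if samples.shape[0] == channels:
            return samples.shape[1]
        if samples.shape[1] == channels:
            return samples.shape[0]
        return max(samples.shape)  # ambiguous (e.g. 2x2): pick the larger axis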


658-712: Debug utility for audio playback is helpful but narrow in scope

The play_pcm_with_ffplay function writes WAV files and spawns ffplay for testing. The timeout handling (lines 704-708) prevents hangs.

Consider noting in the docstring that this is intended for local development/debugging only, as it relies on ffplay being in PATH and spawns uncontrolled subprocesses.
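
Beyond the docstring note, an explicit preflight check keeps the failure mode obvious. A minimal sketch (require_ffplay is a hypothetical helper, not part of the module):

    import shutil

    def require_ffplay() -> str:
        # Fail fast with a clear message instead of a cryptic spawn error.
        path = shutil.which("ffplay")
        if path is None:
            raise RuntimeError(
                "ffplay not found on PATH; install FFmpeg to use this debug helper"
            )
        return path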

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5261a5f and 2c00228.

📒 Files selected for processing (4)
  • agents-core/vision_agents/core/edge/types.py (6 hunks)
  • agents-core/vision_agents/core/tts/manual_test.py (1 hunks)
  • agents-core/vision_agents/core/tts/tts.py (5 hunks)
  • tests/test_utils.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • agents-core/vision_agents/core/tts/manual_test.py
🧰 Additional context used
📓 Path-based instructions (2)
tests/**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

tests/**/*.py: Never use mocking utilities (e.g., unittest.mock, pytest-mock) in test files
Write tests using pytest (avoid unittest.TestCase or other frameworks)
Mark integration tests with @pytest.mark.integration
Do not use @pytest.mark.asyncio; async support is automatic

Files:

  • tests/test_utils.py
**/*.py

📄 CodeRabbit inference engine (.cursor/rules/python.mdc)

**/*.py: Do not modify sys.path in Python code
Docstrings must follow the Google style guide

Files:

  • tests/test_utils.py
  • agents-core/vision_agents/core/tts/tts.py
  • agents-core/vision_agents/core/edge/types.py
🧬 Code graph analysis (3)
tests/test_utils.py (2)
agents-core/vision_agents/core/utils/utils.py (2)
  • parse_instructions (41-90)
  • Instructions (17-21)
agents-core/vision_agents/core/edge/types.py (6)
  • PcmData (63-655)
  • from_bytes (165-233)
  • duration (89-145)
  • resample (298-447)
  • pts_seconds (153-156)
  • dts_seconds (159-162)
agents-core/vision_agents/core/tts/tts.py (7)
agents-core/vision_agents/core/events/base.py (2)
  • PluginClosedEvent (67-74)
  • AudioFormat (23-30)
agents-core/vision_agents/core/edge/types.py (5)
  • PcmData (63-655)
  • resample (298-447)
  • to_bytes (449-486)
  • duration_ms (148-150)
  • close (47-48)
agents-core/vision_agents/core/tts/events.py (4)
  • TTSAudioEvent (10-21)
  • TTSSynthesisStartEvent (25-33)
  • TTSSynthesisCompleteEvent (37-47)
  • TTSErrorEvent (51-64)
tests/test_tts_base.py (3)
  • stream_audio (9-13)
  • stream_audio (20-24)
  • stream_audio (31-32)
plugins/aws/vision_agents/plugins/aws/tts.py (1)
  • stream_audio (61-94)
plugins/cartesia/vision_agents/plugins/cartesia/tts.py (1)
  • stream_audio (54-82)
plugins/elevenlabs/vision_agents/plugins/elevenlabs/tts.py (1)
  • stream_audio (39-62)
agents-core/vision_agents/core/edge/types.py (3)
agents-core/vision_agents/core/edge/edge_transport.py (2)
  • close (38-39)
  • join (46-47)
agents-core/vision_agents/core/agents/agents.py (2)
  • close (438-509)
  • join (342-414)
plugins/getstream/vision_agents/plugins/getstream/stream_edge_transport.py (3)
  • close (40-41)
  • close (327-329)
  • join (237-289)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: unit / Ruff & mypy
  • GitHub Check: unit / Ruff & mypy
🔇 Additional comments (17)
tests/test_utils.py (2)

28-39: LGTM! Test expectations correctly updated.

The test text and assertions have been updated to include @guide.md, which properly validates the enhanced parse_instructions behavior for collecting multiple markdown mentions.


192-326: File handling improvements look good.

The explicit use of encoding='utf-8' when opening text files is a best practice that ensures consistent behavior across platforms.
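
For reference, the endorsed pattern looks like this ("notes.md" is a placeholder path):

    from pathlib import Path

    # An explicit encoding decodes identically on every platform,
    # regardless of the locale default (e.g. cp1252 on Windows).
    text = Path("notes.md").read_text(encoding="utf-8")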

agents-core/vision_agents/core/tts/tts.py (5)

98-137: LGTM: Persistent resampler avoids audio artifacts

The persistent resampler pattern prevents clicking/discontinuities between chunks. The input layout detection and debug logging are helpful for troubleshooting.
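
For context, the pattern in rough form, assuming PyAV (ChunkResampler is an illustrative name, not the class used here):

    import av

    class ChunkResampler:
        """Reuse one av.AudioResampler so filter state carries across chunks."""

        def __init__(self, rate: int, layout: str = "stereo") -> None:
            self._resampler = av.AudioResampler(format="s16", layout=layout, rate=rate)

        def process(self, frame: av.AudioFrame) -> list[av.AudioFrame]:
            # A fresh resampler per chunk would reset this filter state and
            # produce audible clicks at chunk boundaries.
            return self._resampler.resample(frame)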


138-164: LGTM: Type safety checks prevent raw bytes from breaking downstream

The defensive isinstance(item, PcmData) checks at lines 147-150 and 158-161 ensure plugins return properly wrapped data, addressing the type-safety concern from the previous review.
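
The guard reduces to something like this sketch (_checked is a hypothetical wrapper; the real checks are inline in _iter_pcm):

    from typing import AsyncIterator

    from vision_agents.core.edge.types import PcmData

    async def _checked(stream: AsyncIterator[object]) -> AsyncIterator[PcmData]:
        async for item in stream:
            if not isinstance(item, PcmData):  # reject raw bytes and other shapes
                raise TypeError(
                    f"stream_audio must yield PcmData, got {type(item).__name__}"
                )
            yield item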


166-202: Approve resampling and emission logic

The chunk emission correctly resamples using the persistent resampler, serializes to bytes, records metrics, and emits events. The tuple return for accounting is clean.


257-351: Synthesis lifecycle and observability implementation looks solid

The send method correctly:

  • Resets resampler state per synthesis (lines 261-263)
  • Emits start/complete/error events with rich context
  • Tracks latency and error metrics using OpenTelemetry counters
  • Computes real-time factor from accumulated PCM durations (see the sketch below)
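
The last bullet is plain arithmetic; as a sketch:

    def real_time_factor(synthesis_seconds: float, audio_ms: float) -> float:
        # RTF < 1.0 means audio is synthesized faster than it plays back.
        return synthesis_seconds / (audio_ms / 1000.0)

    assert real_time_factor(0.5, 2000.0) == 0.25  # 0.5 s to produce 2 s of audio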

352-362: Clean plugin lifecycle with PluginClosedEvent

Emitting PluginClosedEvent on close provides observability for plugin shutdown and aligns with the broader event-driven architecture.

agents-core/vision_agents/core/edge/types.py (10)

2-22: LGTM: Import additions support new PCM utilities

The added imports (asyncio, os, shutil, tempfile, time, typing extensions) align with the new utilities for PCM handling, WAV conversion, and ffplay integration.


47-48: Abstract close method is appropriate for base class

The pass body is standard for an abstract/protocol method that subclasses will override with actual cleanup logic.


51-61: OutputAudioTrack protocol enables polymorphic audio output

The @runtime_checkable decorator allows isinstance() checks, and the minimalist protocol (write/stop) provides a clean abstraction for audio tracks across different transport implementations.
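
Its shape is roughly the following (exact signatures may differ in the source):

    from typing import Protocol, runtime_checkable

    @runtime_checkable
    class OutputAudioTrack(Protocol):
        async def write(self, data: bytes) -> None: ...

        async def stop(self) -> None: ...

    # @runtime_checkable enables structural isinstance() checks:
    # isinstance(track, OutputAudioTrack) verifies the methods exist,
    # though not their signatures or async-ness.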


82-86: Multi-channel support additions are straightforward

Adding the channels: int = 1 field and the stereo property extends PcmData for stereo use cases without breaking existing mono callers (see the sketch below).
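
A stand-in illustrating the addition (_Pcm is hypothetical; the real PcmData carries many more fields):

    from dataclasses import dataclass

    @dataclass
    class _Pcm:
        channels: int = 1  # the default keeps existing mono callers unchanged

        @property
        def stereo(self) -> bool:
            return self.channels == 2

    assert _Pcm(channels=2).stereo and not _Pcm().stereo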


164-233: from_bytes interleaving logic is robust

The method:

  • Aligns buffer to sample boundaries (lines 197-211)
  • Converts interleaved [L,R,L,R,...] to (channels, samples) via reshape and transpose (lines 224-226; see the sketch below)
  • Logs warnings on reshape failures (lines 228-230)
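
The interleaved-to-planar step, reduced to its numpy core (values illustrative):

    import numpy as np

    channels = 2
    raw = np.arange(8, dtype=np.int16).tobytes()   # interleaved: L0 R0 L1 R1 ...
    flat = np.frombuffer(raw, dtype=np.int16)
    frames = len(flat) // channels                 # drop any trailing partial frame
    planar = flat[: frames * channels].reshape(frames, channels).T
    assert planar.shape == (2, 4)                  # (channels, samples)
    assert planar[0].tolist() == [0, 2, 4, 6]      # left channel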

235-296: from_data factory provides flexible PcmData construction

Supporting both bytes-like and numpy arrays with automatic shape normalization (lines 261-286) reduces boilerplate for callers. The dtype alignment (lines 256-259) ensures consistency with the declared format.


298-447: Resample implementation handles PyAV quirks comprehensively

The method:

  • Normalizes input to (channels, samples) for PyAV (lines 322-350)
  • Uses provided or new resampler (lines 354-361)
  • Deinterleaves PyAV's packed stereo output at lines 375-389 (see the sketch after this list)
  • Handles various ndim cases defensively (lines 390-419)
  • Flattens mono to 1D for consistency (lines 422-427)
  • Returns format="s16" as the resampler always outputs s16 (line 439)

This addresses the dtype/format issue from the past review.
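
The deinterleave step in isolation (values illustrative):

    import numpy as np

    # PyAV hands back packed s16 stereo as shape (1, samples * channels).
    packed = np.array([[10, 20, 11, 21, 12, 22]], dtype=np.int16)
    stereo = packed.reshape(-1, 2).T  # -> (channels, samples)
    assert stereo.tolist() == [[10, 11, 12], [20, 21, 22]]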


449-487: to_bytes interleaving produces correct packed format

The explicit interleaving loop (lines 473-477) ensures [L0, R0, L1, R1, ...] order for multi-channel, avoiding stride-related issues. The shape normalization (lines 458-471) handles both (channels, samples) and (samples, channels) layouts.
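
The loop's effect in miniature (values illustrative):

    import numpy as np

    planar = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int16)  # (channels, samples)
    n_ch = planar.shape[0]
    interleaved = np.empty(planar.size, dtype=np.int16)
    for ch in range(n_ch):
        interleaved[ch::n_ch] = planar[ch]  # strided copy sidesteps layout surprises
    assert interleaved.tolist() == [1, 4, 2, 5, 3, 6]  # L0 R0 L1 R1 L2 R2
    packed = interleaved.tobytes()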


488-530: WAV serialization converts non-s16 formats correctly

Lines 499-518 convert float or non-int16 arrays to s16 by clipping to [-1.0, 1.0] and scaling to int16 range. The wave module writes a standard WAV header with proper channel/rate metadata.
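
The conversion amounts to a clip-and-scale step, roughly:

    import numpy as np

    def float_to_s16(x: np.ndarray) -> np.ndarray:
        # Clip first so out-of-range floats pin to the rails instead of
        # overflowing int16 during the cast.
        return (np.clip(x, -1.0, 1.0) * 32767.0).astype(np.int16)

    assert float_to_s16(np.array([0.0, 1.0, -2.0])).tolist() == [0, 32767, -32767]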


531-656: from_response handles diverse provider APIs comprehensively

The method:

  • Returns single PcmData for bytes-like or already-PcmData inputs
  • Wraps async iterators (lines 563-600) and sync iterators (lines 602-640) with buffering and frame alignment
  • Pads incomplete frames with zeros (lines 589-598, 629-638; see the padding sketch below)
  • Extracts .data attribute from response objects (lines 643-651)

This enables plugins to return various response shapes without callers needing custom unwrapping logic.
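
The padding step from the third bullet, as a standalone sketch (pad_to_frame is a hypothetical name):

    def pad_to_frame(buf: bytes, channels: int, sample_width: int = 2) -> bytes:
        # Zero-pad a trailing partial frame so fixed-size decoding never truncates.
        frame = channels * sample_width
        rem = len(buf) % frame
        return buf + b"\x00" * (frame - rem) if rem else buf

    assert len(pad_to_frame(b"\x01\x02\x03", channels=2)) == 4  # one s16 stereo frame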

@tbarbugli tbarbugli changed the title from "Simplify STT plugin and audio utils" to "Simplify TTS plugin and audio utils" Oct 24, 2025
@tbarbugli tbarbugli merged commit 3316908 into main Oct 24, 2025
5 checks passed
@tbarbugli tbarbugli deleted the stt-plugins branch October 24, 2025 21:48
Nash0x7E2 added a commit to Nash0x7E2/Vision-Agents that referenced this pull request Oct 28, 2025
commit ec32383
Author: Neevash Ramdial (Nash) <[email protected]>
Date:   Mon Oct 27 15:51:53 2025 -0600

    mypy clean up (GetStream#130)

commit c52fe4c
Author: Neevash Ramdial (Nash) <[email protected]>
Date:   Mon Oct 27 15:28:00 2025 -0600

    remove turn keeping from example (GetStream#129)

commit e1072e8
Merge: 5bcffa3 fea101a
Author: Yarik <[email protected]>
Date:   Mon Oct 27 14:28:05 2025 +0100

    Merge pull request GetStream#106 from tjirab/feat/20251017_gh-labeler

    feat: Github pull request labeler

commit 5bcffa3
Merge: 406673c bfe888f
Author: Thierry Schellenbach <[email protected]>
Date:   Sat Oct 25 10:56:27 2025 -0600

    Merge pull request GetStream#119 from GetStream/fix-screensharing

    Fix screensharing

commit bfe888f
Merge: 8019c14 406673c
Author: Thierry Schellenbach <[email protected]>
Date:   Sat Oct 25 10:56:15 2025 -0600

    Merge branch 'main' into fix-screensharing

commit 406673c
Author: Stefan Blos <[email protected]>
Date:   Sat Oct 25 03:03:10 2025 +0200

    Update README (GetStream#118)

    * Changed README to LaRaes version

    * Remove arrows from table

    * Add table with people & projects to follow

    * Update images and links in README.md

commit 3316908
Author: Tommaso Barbugli <[email protected]>
Date:   Fri Oct 24 23:48:06 2025 +0200

    Simplify TTS plugin and audio utils (GetStream#123)

    - Simplified TTS plugin
    - AWS Polly TTS plugin
    - OpenAI TTS plugin
    - Improved audio utils

commit 8019c14
Author: Max Kahan <[email protected]>
Date:   Fri Oct 24 17:32:26 2025 +0100

    remove video forwarder lazy init

commit ca62d37
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 16:44:03 2025 +0100

    use correct codec

commit 8cf8788
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 14:27:18 2025 +0100

    rename variable to fix convention

commit 33fd70d
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 14:24:42 2025 +0100

    unsubscribe from events

commit 3692131
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 14:19:53 2025 +0100

    remove nonexistent type

commit c5f68fe
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 14:10:07 2025 +0100

    cleanup tests to fit style

commit 8b3c61a
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 13:55:08 2025 +0100

    clean up resources when track cancelled

commit d8e08cb
Author: Max Kahan <[email protected]>
Date:   Thu Oct 23 13:24:55 2025 +0100

    fix track republishing in agent

commit 0f8e116
Author: Max Kahan <[email protected]>
Date:   Wed Oct 22 15:37:11 2025 +0100

    add tests

commit 08e6133
Author: Max Kahan <[email protected]>
Date:   Wed Oct 22 15:25:37 2025 +0100

    ensure video track dimensions are an even number

commit 6a725b0
Merge: 5f001e0 5088709
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 15:23:58 2025 -0600

    Merge pull request GetStream#122 from GetStream/cleanup_stt

    Cleanup STT

commit 5088709
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 15:23:34 2025 -0600

    cleanup of stt

commit f185120
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 15:08:42 2025 -0600

    more cleanup

commit 05ccbfd
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 14:51:48 2025 -0600

    cleanup

commit bb834ca
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 14:28:53 2025 -0600

    more cleanup for stt

commit 7a3f2d2
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 14:11:35 2025 -0600

    more test cleanup

commit ad7f4fe
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 14:10:57 2025 -0600

    cleanup test

commit 9e50cdd
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 14:03:45 2025 -0600

    large cleanup

commit 5f001e0
Merge: 95a03e4 5d204f3
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 12:01:52 2025 -0600

    Merge pull request GetStream#121 from GetStream/fish_stt

    [AI-201] Fish speech to text (partial)

commit 5d204f3
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 11:48:16 2025 -0600

    remove ugly tests

commit ee9a241
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 11:46:19 2025 -0600

    cleanup

commit 6eb8270
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 11:23:00 2025 -0600

    fix 48khz support

commit 3b90548
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 23 10:59:08 2025 -0600

    first attempt at fish stt, doesnt entirely work just yet

commit 95a03e4
Merge: b90c9e3 b4c0da8
Author: Tommaso Barbugli <[email protected]>
Date:   Thu Oct 23 10:11:39 2025 +0200

    Merge branch 'main' of github.com:GetStream/Vision-Agents

commit b90c9e3
Author: Tommaso Barbugli <[email protected]>
Date:   Wed Oct 22 23:28:28 2025 +0200

    remove print and double event handling

commit b4c0da8
Merge: 3d06446 a426bc2
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 15:08:51 2025 -0600

    Merge pull request GetStream#117 from GetStream/openrouter

    [AI-194] Openrouter

commit a426bc2
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 15:03:10 2025 -0600

    skip broken test

commit ba6c027
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 14:50:23 2025 -0600

    almost working openrouter

commit 0b1c873
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 14:47:12 2025 -0600

    almost working, just no instruction following

commit ce63233
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 14:35:53 2025 -0600

    working memory for openai

commit 149e886
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 13:32:43 2025 -0600

    todo

commit e0df1f6
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 13:20:38 2025 -0600

    first pass at adding openrouter

commit 3d06446
Merge: 4eb8ef4 ef55d66
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 13:20:11 2025 -0600

    Merge branch 'main' of github.com:GetStream/Vision-Agents

commit 4eb8ef4
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 13:20:01 2025 -0600

    cleanup ai plugin instructions

commit ef55d66
Author: Thierry Schellenbach <[email protected]>
Date:   Wed Oct 22 12:54:33 2025 -0600

    Add link to stash_pomichter for spatial memory

commit 9c9737f
Merge: c954409 390c45b
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 19:45:09 2025 -0600

    Merge pull request GetStream#115 from GetStream/fish

    [AI-195] Fish support

commit 390c45b
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 19:44:37 2025 -0600

    cleannup

commit 1cc1cf1
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 19:42:03 2025 -0600

    happy tests

commit 8163d32
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 19:39:21 2025 -0600

    fix gemini rule following

commit ada3ac9
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 19:20:18 2025 -0600

    fish tts

commit 61a26cf
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 16:44:03 2025 -0600

    attempt at fish

commit c954409
Merge: ab27e48 c71da10
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 14:18:15 2025 -0600

    Merge pull request GetStream#104 from GetStream/bedrock

    [AI-192] - Bedrock, AWS & Nova

commit c71da10
Author: Tommaso Barbugli <[email protected]>
Date:   Tue Oct 21 22:00:25 2025 +0200

    maybe

commit b5482da
Author: Tommaso Barbugli <[email protected]>
Date:   Tue Oct 21 21:46:15 2025 +0200

    debugging

commit 9a36e45
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 13:14:58 2025 -0600

    echo environment name

commit 6893968
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 12:53:58 2025 -0600

    more debugging

commit c35fc47
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 12:45:44 2025 -0600

    add some debug info

commit 0d6d3fd
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 12:03:13 2025 -0600

    run test fix

commit c3a31bd
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 11:52:25 2025 -0600

    log cache hit

commit 04554ae
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 11:48:03 2025 -0600

    fix glob

commit 7da96db
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 11:33:56 2025 -0600

    mypy

commit 186053f
Merge: 4b540c9 ab27e48
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 11:17:17 2025 -0600

    happy tests

commit 4b540c9
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 10:20:04 2025 -0600

    happy tests

commit b05a60a
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 09:17:45 2025 -0600

    add readme

commit 71affcc
Author: Thierry Schellenbach <[email protected]>
Date:   Tue Oct 21 09:13:01 2025 -0600

    rename to aws

commit d2eeba7
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 21:32:01 2025 -0600

    ai tts instructions

commit 98a4f9d
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 16:49:00 2025 -0600

    small edits

commit ab27e48
Author: Tommaso Barbugli <[email protected]>
Date:   Mon Oct 20 21:42:04 2025 +0200

    Ensure user agent is initialized before joining the call (GetStream#113)

    * ensure user agent is initialized before joining the call

    * wip

commit 3cb339b
Author: Tommaso Barbugli <[email protected]>
Date:   Mon Oct 20 21:22:57 2025 +0200

    New conversation API (GetStream#102)

    * trying to resurrect

    * test transcription events for openai

    * more tests for openai and gemini llm

    * more tests for openai and gemini llm

    * update py-client

    * wip

    * ruff

    * wip

    * ruff

    * snap

    * another way

    * another way, a better way

    * ruff

    * ruff

    * rev

    * ruffit

    * mypy everything

    * brief

    * tests

    * openai dep bump

    * snap - broken

    * nothingfuckingworks

    * message id

    * fix test

    * ruffit

commit cb6f00a
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 13:18:03 2025 -0600

    use qwen

commit f84b2ad
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 13:02:24 2025 -0600

    fix tests

commit e61acca
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 12:50:40 2025 -0600

    testing and linting

commit 5f4d353
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 12:34:14 2025 -0600

    working

commit c2a15a9
Merge: a310771 1025a42
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 11:40:00 2025 -0600

    Merge branch 'main' of github.com:GetStream/Vision-Agents into bedrock

commit a310771
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 11:39:48 2025 -0600

    wip

commit b4370f4
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 11:22:43 2025 -0600

    something isn't quite working

commit 2dac975
Author: Thierry Schellenbach <[email protected]>
Date:   Mon Oct 20 10:30:04 2025 -0600

    add the examples

commit 6885289
Author: Thierry Schellenbach <[email protected]>
Date:   Sun Oct 19 20:19:42 2025 -0600

    ai realtime docs

commit a0fa3cc
Author: Thierry Schellenbach <[email protected]>
Date:   Sun Oct 19 18:48:06 2025 -0600

    wip

commit b914fc3
Author: Thierry Schellenbach <[email protected]>
Date:   Sun Oct 19 18:40:22 2025 -0600

    fix ai llm

commit b5b00a7
Author: Thierry Schellenbach <[email protected]>
Date:   Sun Oct 19 17:11:26 2025 -0600

    work audio input

commit ac72260
Author: Thierry Schellenbach <[email protected]>
Date:   Sun Oct 19 16:47:19 2025 -0600

    fix model id

commit 2b5863c
Author: Thierry Schellenbach <[email protected]>
Date:   Sun Oct 19 16:32:54 2025 -0600

    wip on bedrock

commit 8bb4162
Author: Thierry Schellenbach <[email protected]>
Date:   Fri Oct 17 15:22:03 2025 -0600

    next up the connect method

commit 7a21e4e
Author: Thierry Schellenbach <[email protected]>
Date:   Fri Oct 17 14:12:00 2025 -0600

    nova progress

commit 16e8ba0
Author: Thierry Schellenbach <[email protected]>
Date:   Fri Oct 17 13:16:00 2025 -0600

    docs for bedrock nova

commit 1025a42
Author: Bart Schuijt <[email protected]>
Date:   Fri Oct 17 21:05:45 2025 +0200

    fix: Update .env.example for Gemini Live (GetStream#108)

commit e12112d
Author: Thierry Schellenbach <[email protected]>
Date:   Fri Oct 17 11:49:07 2025 -0600

    wip

commit fea101a
Author: Bart Schuijt <[email protected]>
Date:   Fri Oct 17 09:25:55 2025 +0200

    workflow file update

commit bb2d74c
Author: Bart Schuijt <[email protected]>
Date:   Fri Oct 17 09:22:33 2025 +0200

    initial commit

commit d2853cd
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 16 19:44:59 2025 -0600

    always remember pep 420

commit 30a8eca
Author: Thierry Schellenbach <[email protected]>
Date:   Thu Oct 16 19:36:58 2025 -0600

    start of bedrock branch

commit fc032bf
Author: Tommaso Barbugli <[email protected]>
Date:   Thu Oct 16 09:17:42 2025 +0200

    Remove cli handler from examples (GetStream#101)

commit 39a821d
Author: Dan Gusev <[email protected]>
Date:   Tue Oct 14 12:20:41 2025 +0200

    Update Deepgram plugin to use SDK v5.0.0 (GetStream#98)

    * Update Deepgram plugin to use SDK v5.0.0

    * Merge test_realtime and test_stt and update the remaining tests

    * Make deepgram.STT.start() idempotent

    * Clean up unused import

    * Use uv as the default package manager > pip

    ---------

    Co-authored-by: Neevash Ramdial (Nash) <[email protected]>

commit 2013be5
Author: Tommaso Barbugli <[email protected]>
Date:   Mon Oct 13 16:57:37 2025 +0200

    ensure chat works with default types (GetStream#99)