forked from strands-agents/sdk-python
feat: Add types and model providers #8
Open: mkmeral wants to merge 18 commits into mehtarac:strands_bidi from mkmeral:strands_bidi_combined
Conversation
feat(bidirectional_streaming): Add experimental bidirectional streaming MVP POC implementation
Sync fork with main branch of sdk-python
- Add input_audio_transcription and output_audio_transcription parameter pass-through in _build_live_config()
- These parameters enable real-time transcription of both user speech (input) and model audio responses (output)
- Remove debug logging and temporary debug files (gemini_live_events.jsonl, debug_transcripts.py)
- Clean up unused json import

The transcription parameters were being set in the test configuration but weren't being passed through to the SDK because _build_live_config() only handled specific parameters. Now transcription events will be properly emitted via the transcript event type.
Instead of cherry-picking specific parameters, just pass through all config from params directly to the SDK. This is simpler and more flexible - users can configure any Gemini Live API parameter without us having to explicitly handle each one. The previous approach was unnecessarily complicated with manual parameter filtering.
- Add proper error logging in close() method
- Remove empty line in send_tool_result() try block
- Add newline at end of file
- Improve code consistency
- Add GeminiLiveBidirectionalModel and GeminiLiveSession to models __init__.py
- Add ImageInputEvent and TranscriptEvent to types __init__.py
- Ensures new types and model are properly exported for external use
…Gemini Live)
- OpenAI Realtime model provider with function calling fixes
- Gemini Live model provider with video support
- Combined event types: UsageMetricsEvent, VoiceActivityEvent, TranscriptEvent, ImageInputEvent
- All three providers now available: NovaSonic, OpenAI Realtime, Gemini Live
… OpenAI)
- Supports switching between providers via --provider flag
- Includes video/camera support for Gemini Live
- Command-line arguments for duration and camera control
- Unified event handling for all provider types

Usage:
python test_bidirectional_streaming.py --provider gemini
python test_bidirectional_streaming.py --provider nova
python test_bidirectional_streaming.py --provider openai --no-camera
- Created EventLogger utility for structured event logging
- Logs both incoming (from provider) and outgoing (to provider) events
- Truncates long strings (base64 audio/images) to 100 chars for readability
- Saves events to JSONL files in event_logs/ directory
- Added logging to all three providers:
  - Gemini Live: raw events + audio/text/image inputs
  - Nova Sonic: raw events + audio/text inputs
  - OpenAI Realtime: raw events + audio/text inputs

Event logs include:
- Timestamp and sequence number
- Provider name and direction (incoming/outgoing)
- Event type and truncated data
- Useful for comparing event structures across providers
Explains how to use event logs to compare provider event structures
… and provider fixes
Bidirectional Streaming Event System
Introduction
This document specifies a unified event system for bidirectional streaming in Strands Agents. Bidirectional streaming enables real-time, two-way audio conversations with AI models, supporting use cases like voice assistants, live customer service, and interactive audio applications.
Currently, Strands supports three bidirectional streaming providers:
- Nova Sonic
- OpenAI Realtime
- Gemini Live
Each provider has different event names, structures, and capabilities. This proposal defines a common event system that works consistently across all providers while preserving provider-specific capabilities through optional fields.
Goals
- A single send() method for all input types instead of separate methods per modality
- A discriminated type field for better IDE support and runtime validation

Scope
This proposal covers:
Out of scope for initial implementation:
Event Summary
Input Events (3 types)
Sent via await session.send(event):
- AudioInputEvent
- ImageInputEvent
- ToolResultEvent (reuses strands.types._events.ToolResultEvent)

Output Events (10 types)
Received via async for event in session.receive_events():
- SessionStartEvent
- TurnStartEvent
- AudioStreamEvent
- TranscriptStreamEvent
- ToolUseStreamEvent (reuses strands.types._events.ToolUseStreamEvent)
- InterruptionEvent
- TurnCompleteEvent
- MultimodalUsage (TypedEvent with Usage fields)
- SessionEndEvent
- ErrorEvent (follows the strands.types._events.ForceStopEvent pattern)

Base Classes
All bidirectional events extend strands.types._events.TypedEvent for consistency with the core Strands event system.

Input Events
All input events are sent via:
await session.send(event)

AudioInputEvent
Send audio data to the model for processing.
Type Definition:
Parameters:
- Format:
  - pcm: Raw PCM (Pulse Code Modulation) - uncompressed audio
  - wav: WAV file format (typically contains PCM)
  - opus: Opus codec - compressed, good for speech
  - mp3: MP3 codec - compressed, widely supported
- Sample rate:
  - 16000: 16kHz - standard for speech recognition
  - 24000: 24kHz - higher quality speech
  - 48000: 48kHz - high quality audio
- Channels:
  - 1: Mono (single channel) - recommended for speech
  - 2: Stereo (dual channel) - for spatial audio

Example:
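A minimal sketch of sending microphone audio. The import path and constructor fields (audio, format, sample_rate, channels) are assumptions drawn from the parameter list above, not a confirmed API:

```python
# Assumed import path for the experimental bidirectional streaming package.
from strands.experimental.bidirectional_streaming.types import AudioInputEvent

async def stream_microphone(session, pcm_chunks):
    """Send raw PCM chunks to the model over an open bidirectional session."""
    async for chunk in pcm_chunks:
        event = AudioInputEvent(
            audio=chunk,        # raw PCM bytes; field name is illustrative
            format="pcm",       # pcm | wav | opus | mp3
            sample_rate=16000,  # 16kHz is standard for speech recognition
            channels=1,         # mono is recommended for speech
        )
        await session.send(event)
```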
Provider Implementation:
- OpenAI Realtime: input_audio_buffer.append
- Gemini Live: send_realtime_input(audio=Blob(...))
- Nova Sonic: audioInput event, requires contentStart before the first chunk

ImageInputEvent
Send image data to the model.
Type Definition:
Parameters:
- MIME type:
  - image/jpeg: JPEG format
  - image/png: PNG format
  - image/gif: GIF format
  - image/webp: WebP format
- Encoding:
  - raw: Raw bytes (binary data)
  - base64: Base64-encoded string

Example:
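A hedged sketch of sending a camera frame. The field names (image, mime_type, encoding) mirror the parameters above but are assumptions; per the provider list below, only Gemini Live accepts image input today:

```python
import base64

# Assumed import path for the experimental bidirectional streaming package.
from strands.experimental.bidirectional_streaming.types import ImageInputEvent

async def send_camera_frame(session, jpeg_bytes: bytes):
    event = ImageInputEvent(
        image=base64.b64encode(jpeg_bytes).decode("ascii"),
        mime_type="image/jpeg",  # image/jpeg | image/png | image/gif | image/webp
        encoding="base64",       # raw | base64
    )
    await session.send(event)
```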
Provider Implementation:
- OpenAI Realtime: raises NotImplementedError
- Gemini Live: send() with inline_data
- Nova Sonic: raises NotImplementedError

ToolResultEvent
Reuses: strands.types._events.ToolResultEvent

Send tool execution result back to the model.
Type Definition:
Parameters:
Example:
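Because ToolResultEvent is reused from core Strands, sending a result might look like the sketch below. The ToolResult dict follows the core strands.types.tools.ToolResult shape; the weather payload is purely illustrative:

```python
from strands.types._events import ToolResultEvent

async def send_weather_result(session, tool_use_id: str):
    # Core Strands ToolResult shape: toolUseId, status, and a content list.
    result = {
        "toolUseId": tool_use_id,
        "status": "success",
        "content": [{"text": "72F and sunny"}],
    }
    await session.send(ToolResultEvent(result))
```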
Provider Implementation:
- OpenAI Realtime: conversation.item.create with function_call_output, using the call_id field name
- Gemini Live: send_tool_response(function_responses=[...])
- Nova Sonic: toolResult event with contentStart wrapper

Note: This reuses the existing Strands ToolResultEvent for consistency with the core agent system.

Output Events
All output events are received via:
async for event in session.receive_events()

SessionStartEvent
Session established and ready for interaction.
Type Definition:
Parameters:
Example:
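A sketch of the receive loop reacting to session start; the import path is an assumption:

```python
from strands.experimental.bidirectional_streaming.types import SessionStartEvent  # assumed path

async def wait_until_ready(session):
    async for event in session.receive_events():
        if isinstance(event, SessionStartEvent):
            # The session is established; it is now safe to start sending audio.
            return
```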
Provider Implementation:
- OpenAI Realtime: session.created
- Nova Sonic: sessionStart

TurnStartEvent
Model starts generating a response.
Type Definition:
Parameters:
Example:
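An illustrative handler that resets per-turn state when the model begins responding; the class name is from this proposal, the import path is assumed:

```python
from strands.experimental.bidirectional_streaming.types import TurnStartEvent  # assumed path

def on_event(event, state: dict) -> None:
    if isinstance(event, TurnStartEvent):
        # Start fresh buffers for this turn's audio and transcript output.
        state["turn_audio"] = bytearray()
        state["turn_transcript"] = []
```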
Provider Implementation:
- OpenAI Realtime: response.created
- Nova Sonic: completionStart

AudioStreamEvent
Streaming audio output from the model.
Type Definition:
Parameters:
Example:
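A hedged playback sketch: the event's audio field name is an assumption, and PyAudio stands in for whatever audio output library the application uses:

```python
import pyaudio  # any audio output library works; PyAudio is just for illustration

from strands.experimental.bidirectional_streaming.types import AudioStreamEvent  # assumed path

async def play_model_audio(session):
    speaker = pyaudio.PyAudio().open(
        format=pyaudio.paInt16, channels=1, rate=24000, output=True
    )
    async for event in session.receive_events():
        if isinstance(event, AudioStreamEvent):
            speaker.write(event.audio)  # assumed field carrying raw PCM bytes
```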
Provider Implementation:
- OpenAI Realtime: response.audio.delta
- Gemini Live: server_content.model_turn
- Nova Sonic: audioOutput

TranscriptStreamEvent
Audio transcription of speech (user or assistant).
Type Definition:
Parameters:
- Role:
  - user: Transcription of user's speech input
  - assistant: Transcription of model's audio output
- Final flag:
  - true: Complete, final transcription
  - false: Partial/incremental transcription (more may follow)

Example:
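A sketch of rendering transcripts as they stream in. The role, text, and is_final fields mirror the parameters above, but the exact names are assumptions:

```python
from strands.experimental.bidirectional_streaming.types import TranscriptStreamEvent  # assumed path

def on_transcript(event) -> None:
    if isinstance(event, TranscriptStreamEvent):
        suffix = "" if event.is_final else " [partial]"
        print(f"{event.role}: {event.text}{suffix}")
```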
Provider Implementation:
- OpenAI Realtime: response.audio_transcript.delta (assistant), conversation.item.input_audio_transcription.delta (user)
- Gemini Live: server_content.turn_complete (user), server_content.model_turn (assistant)
- Nova Sonic: textOutput events

Important: Nova Sonic does not return separate text responses. The textOutput events are transcripts of the audio conversation. For consistency, all providers use transcript.chunk for text representations of speech.

ToolUseStreamEvent
Reuses: strands.types._events.ToolUseStreamEvent

Model requests tool execution, streamed incrementally.
Type Definition:
Parameters:
Example:
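The note after the provider table explains that streaming providers emit multiple deltas while others emit one complete delta. A hedged accumulator, assuming each delta carries a JSON string fragment of the arguments:

```python
import json

from strands.types._events import ToolUseStreamEvent

def accumulate_tool_use(events) -> dict:
    """Collect streamed argument deltas into one tool call (delta shape assumed)."""
    fragments = []
    for event in events:
        if isinstance(event, ToolUseStreamEvent):
            fragments.append(event.delta)  # assumed: a JSON string fragment
    return json.loads("".join(fragments)) if fragments else {}
```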
Provider Implementation:
- OpenAI Realtime: response.function_call_arguments.delta
- Gemini Live: message.tool_call.function_calls
- Nova Sonic: toolUse

Note: This reuses the existing Strands ToolUseStreamEvent directly. For providers that stream (OpenAI), multiple events are emitted with incremental deltas. For providers that don't stream (Gemini, Nova Sonic), a single event is emitted with the complete tool use as the delta. This provides a unified streaming interface while accommodating different provider behaviors.

InterruptionEvent
Model generation was interrupted.
Type Definition:
Parameters:
- Reason:
  - user_speech: User started speaking (detected by VAD)
  - error: Interruption due to an error condition

Example:
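A sketch of barge-in handling. The reason values come from the list above; the field name and playback buffer are illustrative:

```python
from strands.experimental.bidirectional_streaming.types import InterruptionEvent  # assumed path

def on_event(event, playback_buffer: bytearray) -> None:
    if isinstance(event, InterruptionEvent):
        # Drop any queued assistant audio so playback stops immediately.
        playback_buffer.clear()
        if event.reason == "user_speech":
            print("user barged in; awaiting new input")
```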
Provider Implementation:
- OpenAI Realtime: input_audio_buffer.speech_started
- Gemini Live: server_content.interrupted
- Nova Sonic: stopReason: "INTERRUPTED" in contentEnd

Note: Manual interruption requests are deferred to P1. For now, interruptions are detected automatically by provider VAD systems.
TurnCompleteEvent
Model finished generating response.
Type Definition:
Parameters:
- Stop reason:
  - complete: Model finished generating naturally
  - interrupted: Turn was interrupted
  - tool_use: Model is requesting tool execution
  - error: Turn ended due to an error

Example:
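A sketch of branching on the turn's stop reason; the stop_reason field name is an assumption, the values come from the list above:

```python
from strands.experimental.bidirectional_streaming.types import TurnCompleteEvent  # assumed path

def on_turn_complete(event) -> None:
    if isinstance(event, TurnCompleteEvent):
        if event.stop_reason == "tool_use":
            print("model is waiting on a tool result")
        elif event.stop_reason == "complete":
            print("model finished its turn naturally")
```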
Provider Implementation:
- OpenAI Realtime: response.done
- Gemini Live: server_content.turn_complete or generation_complete
- Nova Sonic: completionEnd

MultimodalUsage
Extends: strands.types._events.TypedEvent and strands.types.event_loop.Usage

Token usage event with modality breakdown for multimodal streaming.
Type Definition:
Parameters:
- modality_details: Per-modality token breakdown (text, audio, image, cached)

Example:
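A sketch of aggregating usage across a session. The inputTokens and outputTokens names come from the core Usage structure; the nested modality_details shape is an assumption:

```python
from strands.experimental.bidirectional_streaming.types import MultimodalUsage  # assumed path

def on_usage(event, totals: dict) -> None:
    if isinstance(event, MultimodalUsage):
        # Core Usage field names are reused (inputTokens, outputTokens, totalTokens).
        totals["input"] = totals.get("input", 0) + event.inputTokens
        totals["output"] = totals.get("output", 0) + event.outputTokens
        # modality_details may be empty; assumed to map modality name to a count.
        audio_tokens = event.modality_details.get("audio")
        if audio_tokens is not None:
            totals["audio"] = totals.get("audio", 0) + audio_tokens
```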
Provider Implementation:
- OpenAI Realtime: rate_limits.updated or response usage
- Nova Sonic: usageEvent

Notes:
- Uses type: "multimodal_usage" for event discrimination
- Compatible with the core Usage structure (same field names)
- modality_details may be empty

SessionEndEvent
Session terminated.
Type Definition:
Parameters:
- Reason:
  - client_disconnect: Client closed the connection
  - timeout: Session timed out
  - error: Session ended due to error
  - complete: Session completed normally

Example:
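A sketch of exiting the receive loop cleanly when the provider ends the session; the reason field name is an assumption:

```python
from strands.experimental.bidirectional_streaming.types import SessionEndEvent  # assumed path

async def run_until_closed(session):
    async for event in session.receive_events():
        if isinstance(event, SessionEndEvent):
            print(f"session ended: {event.reason}")  # reason values listed above
            break
```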
Provider Implementation:
- Nova Sonic: sessionEnd

ErrorEvent
Extends: the strands.types._events.ForceStopEvent pattern

Error occurred during the session.
Type Definition:
Parameters:
Example:
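A sketch of surfacing provider errors as exceptions, mirroring the ForceStopEvent pattern; the error attribute name is an assumption:

```python
from strands.experimental.bidirectional_streaming.types import ErrorEvent  # assumed path

def on_event(event) -> None:
    if isinstance(event, ErrorEvent):
        # Mirroring ForceStopEvent, the event is assumed to carry the exception.
        raise event.error
```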
Provider Implementation:
- OpenAI Realtime: error event

Note: Follows the pattern of ForceStopEvent, which accepts exceptions, maintaining consistency with core Strands error handling.

Deferred Features
The following features exist in some providers but are deferred to future work:
Text Input (P1)
Direct text input without audio. Deferred to focus on audio-first interactions.
Manual Interruption (P1)
Client-initiated interruption via an InterruptRequest event. For now, relying on automatic VAD-based interruption.

Voice Activity Detection Events (P2)
Explicit VAD events (voice.activity) for speech start/stop detection. Available in OpenAI and Gemini.

Conversation Management (P2)
Conversation item management (conversation.item.* events). Available in OpenAI only.

Rate Limiting (P2)
Rate limit information (rate.limit events). Available in OpenAI only.

Thinking Mode (P2)
Thinking mode events (thinking.*). Available in Gemini only.

MCP Support (P2)
Model Context Protocol events (mcp.*). Available in OpenAI only.

Content Lifecycle (P2)
Hierarchical content structure events (content.block.*). Available in Nova Sonic only.

Implementation Notes
Type Safety
All events extend the strands.types._events.TypedEvent base class and use discriminated unions:
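A hedged sketch of what such a union might look like: the alias name is invented and the experimental import path is an assumption, but the member list matches the output-event inventory above:

```python
from typing import Union

from strands.types._events import ToolUseStreamEvent
# Assumed path for the new bidirectional event classes proposed above.
from strands.experimental.bidirectional_streaming.types import (
    AudioStreamEvent, ErrorEvent, InterruptionEvent, MultimodalUsage,
    SessionEndEvent, SessionStartEvent, TranscriptStreamEvent,
    TurnCompleteEvent, TurnStartEvent,
)

# Hypothetical alias: one union covering every output event type.
BidirectionalOutputEvent = Union[
    SessionStartEvent, TurnStartEvent, AudioStreamEvent, TranscriptStreamEvent,
    ToolUseStreamEvent, InterruptionEvent, TurnCompleteEvent, MultimodalUsage,
    SessionEndEvent, ErrorEvent,
]
```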
This enables type narrowing with isinstance() checks and provides clean property-based access.

Reused Components
The following components are reused from core Strands:
- TypedEvent (strands.types._events.TypedEvent) - Base class for all events
- ToolResultEvent (strands.types._events.ToolResultEvent) - Tool result handling
- Usage (strands.types.event_loop.Usage) - Field names reused in the MultimodalUsage event
- ToolResult (strands.types.tools.ToolResult) - Tool result structure
- Error handling - Follows the ForceStopEvent approach

New Bidirectional-Specific Events
The following events are unique to bidirectional streaming:
- AudioInputEvent
- ImageInputEvent
- SessionStartEvent
- TurnStartEvent
- AudioStreamEvent
- TranscriptStreamEvent
- InterruptionEvent
- TurnCompleteEvent
- MultimodalUsage
- SessionEndEvent
- ErrorEvent
Constants