Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Adds Audio Capability to Conversable Agents #2098

Closed
wants to merge 16 commits into from

Conversation

WaelKarkoub
Copy link
Contributor

@WaelKarkoub WaelKarkoub commented Mar 20, 2024

Why are these changes needed?

This PR introduces the SpeechToText and TextToSpeech capabilities. These new capabilities allow the agent to process audio inputs from users by transcribing the audio into text or produce audio outputs by synthesizing text into speech.

How does it work?

An agent can pass an audio tag to a message to indicate there's an audio file that needs transcription, or a text_file/text that needs to be converted to an audio file.

Transcription example:

  • Agent1 (sender): Hello what does this audio file say? <audio file_path=''hello_autogen.mp3' task='transcribe'>
  • Agent2 (receiver): It says hello AutoGen!

Generation example:

  • Agent1 (sender): Can you synthesize this audio? <audio text_file=''hello_autogen.txt' task='generate'>
  • Agent2 (receiver): The audio file is generated and can be found in output.mp3

Design choices

I had to make some interesting design choices that were not used in the rest of the codebase. It was mainly guided by the need to handle a variety of failure modes gracefully.

  • Pydantic for Configuration Validation: Used Pydantic models to validate the audio transcriber and generator configurations provided by agents. This ensures LLM agents don't hallucinate random parameters and allows us to fail fast when the configuration is invalid. Reduces the risk of runtime errors due to bad configurations.

  • Returning None for Exception Handling: In many places in the code, there’s a risk of exceptions being raised. To handle these situations gracefully, the functions are designed to return None when an exception occurs. This approach allows the agent to continue operating even when one part of the system encounters an error. It’s up to the calling code to check for a None return value and handle it appropriately.

  • Modular Design for Audio Processing Components: Initially, I had conceptualized a single class, AudioCapability, to handle all audio-related functionalities. However, as the development progressed, I noticed that this approach led to an unwieldy class with too many responsibilities. I decided to break down the AudioCapability into smaller classes, namely SpeechToText and TextToSpeech. This also opens up possibilities for future enhancements, such as adding translation capabilities.

Related issue number

Checks

Tasks

Preview Give feedback

@WaelKarkoub WaelKarkoub requested a review from BeibinLi March 20, 2024 10:18
@codecov-commenter
Copy link

codecov-commenter commented Mar 20, 2024

Codecov Report

Attention: Patch coverage is 0% with 311 lines in your changes are missing coverage. Please review.

Project coverage is 48.04%. Comparing base (89f3cfd) to head (7422236).
Report is 1 commits behind head on main.

Files Patch % Lines
...agentchat/contrib/capabilities/audio_capability.py 0.00% 152 Missing ⚠️
...entchat/contrib/capabilities/audio_transcribers.py 0.00% 80 Missing and 1 partial ⚠️
...agentchat/contrib/capabilities/audio_generators.py 0.00% 77 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2098      +/-   ##
==========================================
+ Coverage   38.17%   48.04%   +9.87%     
==========================================
  Files          78       81       +3     
  Lines        7812     8123     +311     
  Branches     1669     1865     +196     
==========================================
+ Hits         2982     3903     +921     
+ Misses       4581     3894     -687     
- Partials      249      326      +77     
Flag Coverage Δ
unittest 13.77% <0.00%> (?)
unittests 47.05% <0.00%> (+8.89%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@WaelKarkoub WaelKarkoub added enhancement multimodal language + vision, speech etc. labels Mar 20, 2024
@rickyloynd-microsoft
Copy link
Contributor

Amazing PR!!

@WaelKarkoub WaelKarkoub changed the title [Feature] Adds Audio Capability to ConversAble Agents [Feature] Adds Audio Capability to Conversable Agents Mar 20, 2024
@afourney
Copy link
Member

#1929 also introduces audio transcription in the context of documents in the websurfer. I'm willing to adjust that PR to use the same libraries or configurations or helper methods, but we should coordinate.

I'm also super leery of adding an HTML-like prompting DSL right now, as explained here #2026 (comment)

@WaelKarkoub
Copy link
Contributor Author

WaelKarkoub commented Mar 28, 2024

@afourney collaboration is definitely a good idea for adding audio support. I'll message you on Discord as soon as I go through #1929

@afourney
Copy link
Member

afourney commented Mar 28, 2024

@afourney collaboration is definitely a good idea for adding audio support. I'll message you on Discord as soon as I go through #1929

Sounds good. 1929 is big, and slightly unwieldly. Here's the relevant bit:

class MediaConverter(DocumentConverter):
def _get_metadata(self, local_path):
exiftool = shutil.which("exiftool")
if not exiftool:
return None
else:
try:
result = subprocess.run([exiftool, "-json", local_path], capture_output=True, text=True).stdout
return json.loads(result)[0]
except:
return None
class WavConverter(MediaConverter):
def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
# Bail if not a XLSX
extension = kwargs.get("file_extension", "")
if extension.lower() != ".wav":
return None
md_content = ""
# Add metadata
metadata = self._get_metadata(local_path)
if metadata:
for f in [
"Title",
"Artist",
"Author",
"Band",
"Album",
"Genre",
"Track",
"DateTimeOriginal",
"CreateDate",
"Duration",
]:
if f in metadata:
md_content += f"{f}: {metadata[f]}\n"
# Transcribe
if IS_AUDIO_TRANSCRIPTION_CAPABLE:
try:
transcript = self._transcribe_audio(local_path)
md_content += "\n\n### Audio Transcript:\n" + (
"[No speech detected]" if transcript == "" else transcript
)
except:
md_content += "\n\n### Audio Transcript:\nError. Could not transcribe this audio."
return DocumentConverterResult(
title=None,
text_content=md_content.strip(),
)
def _transcribe_audio(self, local_path) -> str:
recognizer = sr.Recognizer()
with sr.AudioFile(local_path) as source:
audio = recognizer.record(source)
return recognizer.recognize_google(audio).strip()
class Mp3Converter(WavConverter):
def convert(self, local_path, **kwargs) -> Union[None, DocumentConverterResult]:
# Bail if not a MP3
extension = kwargs.get("file_extension", "")
if extension.lower() != ".mp3":
return None
md_content = ""
# Add metadata
metadata = self._get_metadata(local_path)
if metadata:
for f in [
"Title",
"Artist",
"Author",
"Band",
"Album",
"Genre",
"Track",
"DateTimeOriginal",
"CreateDate",
"Duration",
]:
if f in metadata:
md_content += f"{f}: {metadata[f]}\n"
# Transcribe
if IS_AUDIO_TRANSCRIPTION_CAPABLE:
handle, temp_path = tempfile.mkstemp(suffix=".wav")
os.close(handle)
try:
sound = pydub.AudioSegment.from_mp3(local_path)
sound.export(temp_path, format="wav")
_args = dict()
_args.update(kwargs)
_args["file_extension"] = ".wav"
try:
transcript = super()._transcribe_audio(temp_path).strip()
md_content += "\n\n### Audio Transcript:\n" + (
"[No speech detected]" if transcript == "" else transcript
)
except:
md_content += "\n\n### Audio Transcript:\nError. Could not transcribe this audio."
finally:
os.unlink(temp_path)
# Return the result
return DocumentConverterResult(
title=None,
text_content=md_content.strip(),
)

I'm not super-attached to this solution, so happy to adopt any other approach that works reasonably well. I've found this transcription to better than Whisper etc. For me, file format support is very important, but I can handle any conversions needed.

Copy link

gitguardian bot commented Jul 20, 2024

⚠️ GitGuardian has uncovered 96 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

Since your pull request originates from a forked repository, GitGuardian is not able to associate the secrets uncovered with secret incidents on your GitGuardian dashboard.
Skipping this check run and merging your pull request will create secret incidents on your GitGuardian dashboard.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
12853598 Triggered Generic High Entropy Secret 79dbb7b test/oai/test_utils.py View secret
10404693 Triggered Generic High Entropy Secret e43a86c test/oai/test_utils.py View secret
10404693 Triggered Generic High Entropy Secret bdb40d7 test/oai/test_utils.py View secret
10404693 Triggered Generic High Entropy Secret 954ca45 test/oai/test_utils.py View secret
10404693 Triggered Generic High Entropy Secret 79dbb7b test/oai/test_utils.py View secret
10404662 Triggered Generic CLI Secret eff19ac .github/workflows/dotnet-release.yml View secret
10404662 Triggered Generic CLI Secret 06a0a5d .github/workflows/dotnet-release.yml View secret
10404662 Triggered Generic CLI Secret 0524c77 .github/workflows/dotnet-release.yml View secret
10404662 Triggered Generic CLI Secret d7ea410 .github/workflows/dotnet-release.yml View secret
10404662 Triggered Generic CLI Secret e43a86c .github/workflows/dotnet-build.yml View secret
10404662 Triggered Generic CLI Secret 841ed31 .github/workflows/dotnet-release.yml View secret
10404662 Triggered Generic CLI Secret 802f099 .github/workflows/dotnet-release.yml View secret
10404662 Triggered Generic CLI Secret 9a484d8 .github/workflows/dotnet-build.yml View secret
10404662 Triggered Generic CLI Secret e973ac3 .github/workflows/dotnet-release.yml View secret
10404662 Triggered Generic CLI Secret 89650e7 .github/workflows/dotnet-release.yml View secret
10404662 Triggered Generic CLI Secret e07b06b .github/workflows/dotnet-release.yml View secret
10404662 Triggered Generic CLI Secret abe4c41 .github/workflows/dotnet-build.yml View secret
10404662 Triggered Generic CLI Secret 7362fb9 .github/workflows/dotnet-release.yml View secret
12853599 Triggered Generic High Entropy Secret 79dbb7b test/oai/test_utils.py View secret
10404694 Triggered Generic High Entropy Secret e43a86c test/oai/test_utils.py View secret
10404694 Triggered Generic High Entropy Secret 954ca45 test/oai/test_utils.py View secret
10404694 Triggered Generic High Entropy Secret bdb40d7 test/oai/test_utils.py View secret
10404695 Triggered Generic High Entropy Secret abad9ff test/oai/test_utils.py View secret
10404695 Triggered Generic High Entropy Secret 954ca45 test/oai/test_utils.py View secret
10404695 Triggered Generic High Entropy Secret c7bb588 test/oai/test_utils.py View secret
10404695 Triggered Generic High Entropy Secret b97b99d test/oai/test_utils.py View secret
10404695 Triggered Generic High Entropy Secret e43a86c test/oai/test_utils.py View secret
12853600 Triggered Generic High Entropy Secret 79dbb7b test/oai/test_utils.py View secret
12853601 Triggered Generic High Entropy Secret 79dbb7b test/oai/test_utils.py View secret
10493810 Triggered Generic Password 49e8053 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password 501610b notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password 49e8053 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password 501610b notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password d422c63 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password 97fa339 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password 49e8053 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password d422c63 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password 97fa339 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password d422c63 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password 97fa339 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password 501610b notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10404696 Triggered Generic High Entropy Secret 954ca45 test/oai/test_utils.py View secret
10404696 Triggered Generic High Entropy Secret bdb40d7 test/oai/test_utils.py View secret
10404696 Triggered Generic High Entropy Secret 79dbb7b test/oai/test_utils.py View secret
10404696 Triggered Generic High Entropy Secret e43a86c test/oai/test_utils.py View secret
10422482 Triggered Generic High Entropy Secret 79dbb7b test/oai/test_utils.py View secret
10422482 Triggered Generic High Entropy Secret bdb40d7 test/oai/test_utils.py View secret
12853602 Triggered Generic High Entropy Secret 79dbb7b test/oai/test_utils.py View secret
11616921 Triggered Generic High Entropy Secret a86d0fd notebook/agentchat_agentops.ipynb View secret
11616921 Triggered Generic High Entropy Secret 394561b notebook/agentchat_agentops.ipynb View secret
11616921 Triggered Generic High Entropy Secret 3eac646 notebook/agentchat_agentops.ipynb View secret
11616921 Triggered Generic High Entropy Secret f45b553 notebook/agentchat_agentops.ipynb View secret
11616921 Triggered Generic High Entropy Secret 6563248 notebook/agentchat_agentops.ipynb View secret
12853598 Triggered Generic High Entropy Secret 2b3a9ae test/oai/test_utils.py View secret
12853598 Triggered Generic High Entropy Secret c03558f test/oai/test_utils.py View secret
10404693 Triggered Generic High Entropy Secret c03558f test/oai/test_utils.py View secret
10404693 Triggered Generic High Entropy Secret 2b3a9ae test/oai/test_utils.py View secret
10404693 Triggered Generic High Entropy Secret 0a3c6c4 test/oai/test_utils.py View secret
10404693 Triggered Generic High Entropy Secret 76f5f5a test/oai/test_utils.py View secret
10404662 Triggered Generic CLI Secret 954ca45 .github/workflows/dotnet-build.yml View secret
12853599 Triggered Generic High Entropy Secret 2b3a9ae test/oai/test_utils.py View secret
12853599 Triggered Generic High Entropy Secret c03558f test/oai/test_utils.py View secret
10404694 Triggered Generic High Entropy Secret 76f5f5a test/oai/test_utils.py View secret
10404694 Triggered Generic High Entropy Secret 0a3c6c4 test/oai/test_utils.py View secret
10404695 Triggered Generic High Entropy Secret 3b79cc6 test/oai/test_utils.py View secret
10404695 Triggered Generic High Entropy Secret 11baa52 test/oai/test_utils.py View secret
12853600 Triggered Generic High Entropy Secret c03558f test/oai/test_utils.py View secret
12853600 Triggered Generic High Entropy Secret 2b3a9ae test/oai/test_utils.py View secret
12853601 Triggered Generic High Entropy Secret c03558f test/oai/test_utils.py View secret
12853601 Triggered Generic High Entropy Secret 2b3a9ae test/oai/test_utils.py View secret
10493810 Triggered Generic Password 3b79cc6 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password 11baa52 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password 11baa52 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10493810 Triggered Generic Password 3b79cc6 notebook/agentchat_pgvector_RetrieveChat.ipynb View secret
10404696 Triggered Generic High Entropy Secret 0a3c6c4 test/oai/test_utils.py View secret
10404696 Triggered Generic High Entropy Secret 76f5f5a test/oai/test_utils.py View secret
10404696 Triggered Generic High Entropy Secret c03558f test/oai/test_utils.py View secret
10404696 Triggered Generic High Entropy Secret 2b3a9ae test/oai/test_utils.py View secret
10422482 Triggered Generic High Entropy Secret 2b3a9ae test/oai/test_utils.py View secret
10422482 Triggered Generic High Entropy Secret c03558f test/oai/test_utils.py View secret

and 16 others.

🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@gagb gagb closed this Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multimodal language + vision, speech etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants