
Adding a notebook demonstrating video transcript translation with Whisper in AutoGen #881

Merged 22 commits on Dec 7, 2023
e2b85b8
add agentchat_video_transcript_translate.ipynb
silver233jpg Nov 29, 2023
addcb5a
finish the agentchat_video_transcript_translate.ipynb file notebook
silver233jpg Nov 30, 2023
2672c60
modify the recognize_transcript_from_video function
silver233jpg Nov 30, 2023
76107cc
run the script and add the output to the notebook
silver233jpg Nov 30, 2023
a85e781
implement the notebook
silver233jpg Dec 1, 2023
03f8f28
Merge branch 'microsoft:main' into main
chengxuan233 Dec 1, 2023
57e6339
add the link to the video clip
silver233jpg Dec 5, 2023
bbec82a
rename the file and add the version requirement of each packages
silver233jpg Dec 5, 2023
d3cd707
Merge branch 'main' of https://github.com/chengxuan233/autogen
silver233jpg Dec 5, 2023
ac9b8a0
add the new notebook path
silver233jpg Dec 5, 2023
eaa551d
add the notebook path to Example.md
silver233jpg Dec 5, 2023
fe841db
add the new notebook path to the new example.md
silver233jpg Dec 5, 2023
2403c59
add the instruction of FFmpeg and video download
silver233jpg Dec 5, 2023
efc223e
Update Examples.md
chengxuan233 Dec 5, 2023
f471a63
Update Examples.md
chengxuan233 Dec 5, 2023
f9e6da5
Update Examples.md
chengxuan233 Dec 5, 2023
85fa019
Update Examples.md
chengxuan233 Dec 5, 2023
e35a8d7
Merge branch 'main' into main
chengxuan233 Dec 5, 2023
92aa068
Delete notebook/agentchat_video_transcript_translate.ipynb
chengxuan233 Dec 5, 2023
ac11a8f
Merge branch 'microsoft:main' into main
chengxuan233 Dec 6, 2023
7c10a9e
Merge branch 'microsoft:main' into main
chengxuan233 Dec 7, 2023
50524b1
Update Examples.md and add the link
chengxuan233 Dec 7, 2023
393 changes: 393 additions & 0 deletions notebook/agentchat_video_transcript_translate_with_whisper.ipynb
@@ -0,0 +1,393 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e4fccaaa-fda5-4f99-a4c5-c463c5c890f5",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/microsoft/autogen/blob/main/notebook/agentchat_video_transcript_translate_with_whisper.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"id": "a5b4540e-4987-4774-9305-764c3133e953",
"metadata": {},
"source": [
"<a id=\"toc\"></a>\n",
"# Auto Generated Agent Chat: Translating Video Audio Using Whisper and GPT-3.5-turbo\n",
"In this notebook, we demonstrate how to use Whisper and GPT-3.5-turbo with `AssistantAgent` and `UserProxyAgent` to recognize and translate\n",
"the speech in a video file and add timestamps, like a subtitle file, based on [agentchat_function_call.ipynb](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_function_call.ipynb).\n"
]
},
{
"cell_type": "markdown",
"id": "4fd644cc-2b14-4700-8b1d-959fb2e9acb0",
"metadata": {},
"source": [
"## Requirements\n",
"AutoGen requires `Python>=3.8`. To run this notebook example, please install `openai`, `pyautogen`, `openai-whisper`, and `moviepy`:\n",
"```bash\n",
"pip install openai\n",
"pip install openai-whisper\n",
"pip install moviepy\n",
"pip install pyautogen\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc4600b8-c6df-49dd-945d-ce69f30a65cc",
"metadata": {},
"outputs": [],
"source": [
"%%capture --no-stderr\n",
"# %pip install moviepy~=1.0.3\n",
"# %pip install openai-whisper~=20230918\n",
"# %pip install openai~=1.3.5\n",
"# %pip install pyautogen~=0.2.0b4"
]
},
{
"cell_type": "markdown",
"id": "18bdeb0b-c4b6-4dec-97d2-d84f09cffa00",
"metadata": {},
"source": [
"## Set your API Endpoint\n",
"It is recommended to store your OpenAI API key in an environment variable, for example `OPENAI_API_KEY`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "26d1ae87-f007-4286-a56a-dcf68abf9393",
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"import whisper\n",
"import autogen\n",
"from moviepy.editor import VideoFileClip\n",
"import os\n",
"\n",
"config_list = [\n",
" {\n",
" 'model': 'gpt-4',\n",
" 'api_key': os.getenv(\"OPENAI_API_KEY\"),\n",
" }\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "324fec65-ab23-45db-a7a8-0aaf753fe19c",
"metadata": {},
"source": [
"## Example and Output\n",
"Below is an example of speech recognition from a [Peppa Pig cartoon video clip](https://drive.google.com/file/d/1QY0naa2acHw2FuH7sY3c-g2sBLtC2Sv4/view?usp=drive_link) originally in English and translated into Chinese.\n",
"FFmpeg does not read remote files, so to run the code on the example video you need to download it locally first. Change `your_file_path` to your local video file path."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ed549b75-b4ea-4ec5-8c0b-a15e93ffd618",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33muser_proxy\u001b[0m (to chatbot):\n",
"\n",
"For the video located in E:\\pythonProject\\gpt_detection\\peppa pig.mp4, recognize the speech and transfer it into a script file, then translate from English text to a Chinese video subtitle text. \n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mchatbot\u001b[0m (to user_proxy):\n",
"\n",
"\u001b[32m***** Suggested function Call: recognize_transcript_from_video *****\u001b[0m\n",
"Arguments: \n",
"{\n",
"\"audio_filepath\": \"E:\\\\pythonProject\\\\gpt_detection\\\\peppa pig.mp4\"\n",
"}\n",
"\u001b[32m********************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[35m\n",
">>>>>>>> EXECUTING FUNCTION recognize_transcript_from_video...\u001b[0m\n",
"Detecting language using up to the first 30 seconds. Use `--language` to specify the language\n",
"Detected language: English\n",
"[00:00.000 --> 00:03.000] This is my little brother George.\n",
"[00:03.000 --> 00:05.000] This is Mummy Pig.\n",
"[00:05.000 --> 00:07.000] And this is Daddy Pig.\n",
"[00:07.000 --> 00:09.000] Pee-pah Pig.\n",
"[00:09.000 --> 00:11.000] Desert Island.\n",
"[00:11.000 --> 00:14.000] Pepper and George are at Danny Dog's house.\n",
"[00:14.000 --> 00:17.000] Captain Dog is telling stories of when he was a sailor.\n",
"[00:17.000 --> 00:20.000] I sailed all around the world.\n",
"[00:20.000 --> 00:22.000] And then I came home again.\n",
"[00:22.000 --> 00:25.000] But now I'm back for good.\n",
"[00:25.000 --> 00:27.000] I'll never forget you.\n",
"[00:27.000 --> 00:29.000] Daddy, do you miss the sea?\n",
"[00:29.000 --> 00:31.000] Well, sometimes.\n",
"[00:31.000 --> 00:36.000] It is Grandad Dog, Grandpa Pig and Grumpy Rabbit.\n",
"[00:36.000 --> 00:37.000] Hello.\n",
"[00:37.000 --> 00:40.000] Can Captain Dog come out to play?\n",
"[00:40.000 --> 00:43.000] What? We are going on a fishing trip.\n",
"[00:43.000 --> 00:44.000] On a boat?\n",
"[00:44.000 --> 00:45.000] On the sea!\n",
"[00:45.000 --> 00:47.000] OK, let's go.\n",
"[00:47.000 --> 00:51.000] But Daddy, you said you'd never get on a boat again.\n",
"[00:51.000 --> 00:54.000] I'm not going to get on a boat again.\n",
"[00:54.000 --> 00:57.000] You said you'd never get on a boat again.\n",
"[00:57.000 --> 01:00.000] Oh, yes. So I did.\n",
"[01:00.000 --> 01:02.000] OK, bye-bye.\n",
"[01:02.000 --> 01:03.000] Bye.\n",
"\u001b[33muser_proxy\u001b[0m (to chatbot):\n",
"\n",
"\u001b[32m***** Response from calling function \"recognize_transcript_from_video\" *****\u001b[0m\n",
"[{'sentence': 'This is my little brother George..', 'timestamp_start': 0, 'timestamp_end': 3.0}, {'sentence': 'This is Mummy Pig..', 'timestamp_start': 3.0, 'timestamp_end': 5.0}, {'sentence': 'And this is Daddy Pig..', 'timestamp_start': 5.0, 'timestamp_end': 7.0}, {'sentence': 'Pee-pah Pig..', 'timestamp_start': 7.0, 'timestamp_end': 9.0}, {'sentence': 'Desert Island..', 'timestamp_start': 9.0, 'timestamp_end': 11.0}, {'sentence': \"Pepper and George are at Danny Dog's house..\", 'timestamp_start': 11.0, 'timestamp_end': 14.0}, {'sentence': 'Captain Dog is telling stories of when he was a sailor..', 'timestamp_start': 14.0, 'timestamp_end': 17.0}, {'sentence': 'I sailed all around the world..', 'timestamp_start': 17.0, 'timestamp_end': 20.0}, {'sentence': 'And then I came home again..', 'timestamp_start': 20.0, 'timestamp_end': 22.0}, {'sentence': \"But now I'm back for good..\", 'timestamp_start': 22.0, 'timestamp_end': 25.0}, {'sentence': \"I'll never forget you..\", 'timestamp_start': 25.0, 'timestamp_end': 27.0}, {'sentence': 'Daddy, do you miss the sea?.', 'timestamp_start': 27.0, 'timestamp_end': 29.0}, {'sentence': 'Well, sometimes..', 'timestamp_start': 29.0, 'timestamp_end': 31.0}, {'sentence': 'It is Grandad Dog, Grandpa Pig and Grumpy Rabbit..', 'timestamp_start': 31.0, 'timestamp_end': 36.0}, {'sentence': 'Hello..', 'timestamp_start': 36.0, 'timestamp_end': 37.0}, {'sentence': 'Can Captain Dog come out to play?.', 'timestamp_start': 37.0, 'timestamp_end': 40.0}, {'sentence': 'What? 
We are going on a fishing trip..', 'timestamp_start': 40.0, 'timestamp_end': 43.0}, {'sentence': 'On a boat?.', 'timestamp_start': 43.0, 'timestamp_end': 44.0}, {'sentence': 'On the sea!.', 'timestamp_start': 44.0, 'timestamp_end': 45.0}, {'sentence': \"OK, let's go..\", 'timestamp_start': 45.0, 'timestamp_end': 47.0}, {'sentence': \"But Daddy, you said you'd never get on a boat again..\", 'timestamp_start': 47.0, 'timestamp_end': 51.0}, {'sentence': \"I'm not going to get on a boat again..\", 'timestamp_start': 51.0, 'timestamp_end': 54.0}, {'sentence': \"You said you'd never get on a boat again..\", 'timestamp_start': 54.0, 'timestamp_end': 57.0}, {'sentence': 'Oh, yes. So I did..', 'timestamp_start': 57.0, 'timestamp_end': 60.0}, {'sentence': 'OK, bye-bye..', 'timestamp_start': 60.0, 'timestamp_end': 62.0}, {'sentence': 'Bye..', 'timestamp_start': 62.0, 'timestamp_end': 63.0}]\n",
"\u001b[32m****************************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mchatbot\u001b[0m (to user_proxy):\n",
"\n",
"\u001b[32m***** Suggested function Call: translate_transcript *****\u001b[0m\n",
"Arguments: \n",
"{\n",
"\"source_language\": \"en\",\n",
"\"target_language\": \"zh\"\n",
"}\n",
"\u001b[32m*********************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[35m\n",
">>>>>>>> EXECUTING FUNCTION translate_transcript...\u001b[0m\n",
"\u001b[33muser_proxy\u001b[0m (to chatbot):\n",
"\n",
"\u001b[32m***** Response from calling function \"translate_transcript\" *****\u001b[0m\n",
"0s to 3.0s: 这是我小弟弟乔治。\n",
"3.0s to 5.0s: 这是妈妈猪。\n",
"5.0s to 7.0s: 这位是猪爸爸..\n",
"7.0s to 9.0s: 'Peppa Pig...' (皮皮猪)\n",
"9.0s to 11.0s: \"荒岛..\"\n",
"11.0s to 14.0s: 胡椒和乔治在丹尼狗的家里。\n",
"14.0s to 17.0s: 船长狗正在讲述他作为一名海员时的故事。\n",
"17.0s to 20.0s: 我环游了全世界。\n",
"20.0s to 22.0s: 然后我又回到了家。。\n",
"22.0s to 25.0s: \"但现在我回来了,永远地回来了...\"\n",
"25.0s to 27.0s: \"我永远不会忘记你...\"\n",
"27.0s to 29.0s: \"爸爸,你想念大海吗?\"\n",
"29.0s to 31.0s: 嗯,有时候...\n",
"31.0s to 36.0s: 这是大爷狗、爷爷猪和脾气暴躁的兔子。\n",
"36.0s to 37.0s: 你好。\n",
"37.0s to 40.0s: \"船长狗可以出来玩吗?\"\n",
"40.0s to 43.0s: 什么?我们要去钓鱼了。。\n",
"43.0s to 44.0s: 在船上?\n",
"44.0s to 45.0s: 在海上!\n",
"45.0s to 47.0s: 好的,我们走吧。\n",
"47.0s to 51.0s: \"但是爸爸,你说过你再也不会上船了…\"\n",
"51.0s to 54.0s: \"我不会再上船了..\"\n",
"54.0s to 57.0s: \"你说过再也不会上船了...\"\n",
"57.0s to 60.0s: 哦,是的。所以我做了。\n",
"60.0s to 62.0s: 好的,再见。\n",
"62.0s to 63.0s: 再见。。\n",
"\u001b[32m*****************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mchatbot\u001b[0m (to user_proxy):\n",
"\n",
"TERMINATE\n",
"\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"def recognize_transcript_from_video(audio_filepath):\n",
" try:\n",
" # Load model\n",
" model = whisper.load_model(\"small\")\n",
"\n",
" # Transcribe audio with detailed timestamps\n",
" result = model.transcribe(audio_filepath, verbose=True)\n",
"\n",
" # Initialize variables for transcript\n",
" transcript = []\n",
" sentence = \"\"\n",
" start_time = 0\n",
"\n",
" # Iterate through the segments in the result\n",
" for segment in result['segments']:\n",
" # If new sentence starts, save the previous one and reset variables\n",
" if segment['start'] != start_time and sentence:\n",
" transcript.append({\n",
" \"sentence\": sentence.strip() + \".\",\n",
" \"timestamp_start\": start_time,\n",
" \"timestamp_end\": segment['start']\n",
" })\n",
" sentence = \"\"\n",
" start_time = segment['start']\n",
"\n",
" # Add the word to the current sentence\n",
" sentence += segment['text'] + \" \"\n",
"\n",
" # Add the final sentence\n",
" if sentence:\n",
" transcript.append({\n",
" \"sentence\": sentence.strip() + \".\",\n",
" \"timestamp_start\": start_time,\n",
" \"timestamp_end\": result['segments'][-1]['end']\n",
" })\n",
"\n",
" # Save the transcript to a file\n",
" with open(\"transcription.txt\", \"w\") as file:\n",
" for item in transcript:\n",
" sentence = item[\"sentence\"]\n",
" start_time, end_time = item[\"timestamp_start\"], item[\"timestamp_end\"]\n",
" file.write(f\"{start_time}s to {end_time}s: {sentence}\\n\")\n",
"\n",
" return transcript\n",
"\n",
" except FileNotFoundError:\n",
" return \"The specified audio file could not be found.\"\n",
" except Exception as e:\n",
" return f\"An unexpected error occurred: {str(e)}\"\n",
"\n",
"\n",
"\n",
"def translate_text(input_text, source_language, target_language):\n",
" client = OpenAI(api_key=key)\n",
"\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-3.5-turbo\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
" {\"role\": \"user\",\n",
" \"content\": f\"Directly translate the following {source_language} text to a pure {target_language} \"\n",
" f\"video subtitle text without additional explanation.: '{input_text}'\"},\n",
" ],\n",
" max_tokens=1500\n",
" )\n",
"\n",
" # Correctly accessing the response content\n",
" translated_text = response.choices[0].message.content if response.choices else None\n",
" return translated_text\n",
"\n",
"\n",
"def translate_transcript(source_language, target_language):\n",
" with open(\"transcription.txt\", \"r\") as f:\n",
" lines = f.readlines()\n",
"\n",
" translated_transcript = []\n",
"\n",
" for line in lines:\n",
" # Split each line into timestamp and text parts\n",
" parts = line.strip().split(': ')\n",
" if len(parts) == 2:\n",
" timestamp, text = parts[0], parts[1]\n",
" # Translate only the text part\n",
" translated_text = translate_text(text, source_language, target_language)\n",
" # Reconstruct the line with the translated text and the preserved timestamp\n",
" translated_line = f\"{timestamp}: {translated_text}\"\n",
" translated_transcript.append(translated_line)\n",
" else:\n",
" # If the line doesn't contain a timestamp, add it as is\n",
" translated_transcript.append(line.strip())\n",
"\n",
" return '\\n'.join(translated_transcript)\n",
"\n",
"\n",
"llm_config = {\n",
" \"functions\": [\n",
" {\n",
" \"name\": \"recognize_transcript_from_video\",\n",
" \"description\": \"Recognize speech from a video file and save the transcript to a text file\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"audio_filepath\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"path of the video file\",\n",
" }\n",
" },\n",
" \"required\": [\"audio_filepath\"],\n",
" },\n",
" },\n",
" {\n",
" \"name\": \"translate_transcript\",\n",
" \"description\": \"Use the translate_text function to translate the transcript\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"source_language\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"source language\",\n",
" },\n",
" \"target_language\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"target language\",\n",
" }\n",
" },\n",
" \"required\": [\"source_language\", \"target_language\"],\n",
" },\n",
" },\n",
" ],\n",
" \"config_list\": config_list,\n",
" \"timeout\": 120,\n",
"}\n",
"source_language = \"English\"\n",
"target_language = \"Chinese\"\n",
"key = os.getenv(\"OPENAI_API_KEY\")\n",
"target_video = \"your_file_path\"\n",
"\n",
"chatbot = autogen.AssistantAgent(\n",
" name=\"chatbot\",\n",
" system_message=\"For coding tasks, only use the functions you have been provided with. Reply TERMINATE when the task is done.\",\n",
" llm_config=llm_config,\n",
")\n",
"\n",
"user_proxy = autogen.UserProxyAgent(\n",
" name=\"user_proxy\",\n",
" is_termination_msg=lambda x: x.get(\"content\", \"\") and x.get(\"content\", \"\").rstrip().endswith(\"TERMINATE\"),\n",
" human_input_mode=\"NEVER\",\n",
" max_consecutive_auto_reply=10,\n",
" code_execution_config={\"work_dir\": \"coding_2\"},\n",
")\n",
"\n",
"user_proxy.register_function(\n",
" function_map={\n",
" \"recognize_transcript_from_video\": recognize_transcript_from_video,\n",
" \"translate_transcript\": translate_transcript,\n",
" }\n",
")\n",
"user_proxy.initiate_chat(\n",
" chatbot,\n",
" message=f\"For the video located in {target_video}, recognize the speech and transfer it into a script file, \"\n",
" f\"then translate from {source_language} text to a {target_language} video subtitle text. \",\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
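The two registered functions in the notebook hand the transcript to each other through a plain text file, `transcription.txt`, with one `<start>s to <end>s: <sentence>` line per entry: `recognize_transcript_from_video` writes it, and `translate_transcript` parses each line back apart with `line.strip().split(': ')`. A minimal, self-contained sketch of that round trip (the function names below are illustrative, not part of the notebook; unlike the notebook, this version splits only on the first `": "` so colons inside the sentence survive):

```python
def format_transcript_line(start, end, sentence):
    # Render one entry the way the notebook writes transcription.txt.
    return f"{start}s to {end}s: {sentence}"

def parse_transcript_line(line):
    # Split a line into (timestamp, text). Splitting only once tolerates
    # colons inside the sentence itself; a line with no ": " separator is
    # returned with a None timestamp, mirroring the notebook's pass-through
    # branch for lines that do not match the timestamp format.
    parts = line.strip().split(": ", 1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return None, line.strip()

line = format_transcript_line(0, 3.0, "This is my little brother George.")
timestamp, text = parse_transcript_line(line)
print(timestamp)  # 0s to 3.0s
print(text)       # This is my little brother George.
```

Because only the text part is sent to `translate_text`, the timestamps pass through translation untouched, which is what keeps the translated subtitles aligned with the original video.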