
Commit 7c49000

Merge branch 'main' into fix_unstructured
2 parents 429ab8b + 1b4fb8f commit 7c49000

File tree

3 files changed: +440 -1 lines changed

@@ -0,0 +1,393 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e4fccaaa-fda5-4f99-a4c5-c463c5c890f5",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/microsoft/autogen/blob/main/notebook/agentchat_video_transcript_translate_with_whisper.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5b4540e-4987-4774-9305-764c3133e953",
   "metadata": {},
   "source": [
    "<a id=\"toc\"></a>\n",
    "# Auto Generated Agent Chat: Translating Video Audio using Whisper and GPT-3.5-turbo\n",
    "In this notebook, we demonstrate how to use Whisper and GPT-3.5-turbo with `AssistantAgent` and `UserProxyAgent` to recognize and translate speech from a video file, adding timestamps like a subtitle file. It is based on [agentchat_function_call.ipynb](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_function_call.ipynb).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4fd644cc-2b14-4700-8b1d-959fb2e9acb0",
   "metadata": {},
   "source": [
    "## Requirements\n",
    "AutoGen requires `Python>=3.8`. To run this notebook example, please install `openai`, `pyautogen`, `openai-whisper`, and `moviepy`:\n",
    "```bash\n",
    "pip install openai\n",
    "pip install openai-whisper\n",
    "pip install moviepy\n",
    "pip install pyautogen\n",
    "```"
   ]
  },
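  {
   "cell_type": "markdown",
   "id": "ffmpeg-check-md",
   "metadata": {},
   "source": [
    "Whisper and `moviepy` decode audio through the `ffmpeg` binary, which must be installed separately and be available on your `PATH`. The cell below is an optional sanity check (not part of the original example) for that; `shutil.which` is from the Python standard library.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ffmpeg-check-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional sanity check: Whisper shells out to ffmpeg to decode audio,\n",
    "# so fail early with a clear message if the binary is missing.\n",
    "import shutil\n",
    "\n",
    "if shutil.which(\"ffmpeg\") is None:\n",
    "    raise RuntimeError(\"ffmpeg not found on PATH; install it first (e.g. `sudo apt install ffmpeg`)\")"
   ]
  },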
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc4600b8-c6df-49dd-945d-ce69f30a65cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "# %pip install moviepy~=1.0.3\n",
    "# %pip install openai-whisper~=20230918\n",
    "# %pip install openai~=1.3.5\n",
    "# %pip install pyautogen~=0.2.0b4"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18bdeb0b-c4b6-4dec-97d2-d84f09cffa00",
   "metadata": {},
   "source": [
    "## Set your API Endpoint\n",
    "It is recommended to store your OpenAI API key in an environment variable, for example `OPENAI_API_KEY`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "26d1ae87-f007-4286-a56a-dcf68abf9393",
   "metadata": {},
   "outputs": [],
   "source": [
    "from openai import OpenAI\n",
    "import whisper\n",
    "import autogen\n",
    "from moviepy.editor import VideoFileClip\n",
    "import os\n",
    "\n",
    "config_list = [\n",
    "    {\n",
    "        'model': 'gpt-4',\n",
    "        'api_key': os.getenv(\"OPENAI_API_KEY\"),\n",
    "    }\n",
    "]"
   ]
  },
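  {
   "cell_type": "markdown",
   "id": "config-list-alt-md",
   "metadata": {},
   "source": [
    "As an alternative to building `config_list` inline, AutoGen can load the same configuration from an environment variable or a JSON file with `autogen.config_list_from_json`. The commented cell below is a sketch of that pattern; the name `OAI_CONFIG_LIST` is the conventional default, not something this example requires.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "config-list-alt-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: load the model configuration from the OAI_CONFIG_LIST env var or file\n",
    "# and keep only gpt-4 entries, equivalent to the inline config_list above.\n",
    "# config_list = autogen.config_list_from_json(\n",
    "#     \"OAI_CONFIG_LIST\",\n",
    "#     filter_dict={\"model\": [\"gpt-4\"]},\n",
    "# )"
   ]
  },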
  {
   "cell_type": "markdown",
   "id": "324fec65-ab23-45db-a7a8-0aaf753fe19c",
   "metadata": {},
   "source": [
    "## Example and Output\n",
    "Below is an example of speech recognition from a [Peppa Pig cartoon video clip](https://drive.google.com/file/d/1QY0naa2acHw2FuH7sY3c-g2sBLtC2Sv4/view?usp=drive_link), originally in English and translated into Chinese.\n",
    "`FFmpeg` does not support online files, so to run the code on the example video you need to download it locally first, and change `your_file_path` to your local video file path."
   ]
  },
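  {
   "cell_type": "markdown",
   "id": "audio-extract-md",
   "metadata": {},
   "source": [
    "Whisper can consume the `.mp4` directly, so the example below passes the video path straight to `model.transcribe`. If you prefer to work from a separate audio track (presumably what the `VideoFileClip` import above is for), the commented cell below sketches extracting one first; the file names are illustrative, not part of the original example.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "audio-extract-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: extract the audio track from the video with moviepy before transcribing.\n",
    "# File names here are illustrative.\n",
    "# clip = VideoFileClip(\"peppa pig.mp4\")\n",
    "# clip.audio.write_audiofile(\"extracted_audio.mp3\")\n",
    "# clip.close()"
   ]
  },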
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "ed549b75-b4ea-4ec5-8c0b-a15e93ffd618",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33muser_proxy\u001b[0m (to chatbot):\n",
      "\n",
      "For the video located in E:\\\\pythonProject\\\\gpt_detection\\\\peppa pig.mp4, recognize the speech and transfer it into a script file, then translate from English text to a Chinese video subtitle text. \n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[33mchatbot\u001b[0m (to user_proxy):\n",
      "\n",
      "\u001b[32m***** Suggested function Call: recognize_transcript_from_video *****\u001b[0m\n",
      "Arguments: \n",
      "{\n",
      "\"audio_filepath\": \"E:\\\\pythonProject\\\\gpt_detection\\\\peppa pig.mp4\"\n",
      "}\n",
      "\u001b[32m********************************************************************\u001b[0m\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[35m\n",
      ">>>>>>>> EXECUTING FUNCTION recognize_transcript_from_video...\u001b[0m\n",
      "Detecting language using up to the first 30 seconds. Use `--language` to specify the language\n",
      "Detected language: English\n",
      "[00:00.000 --> 00:03.000] This is my little brother George.\n",
      "[00:03.000 --> 00:05.000] This is Mummy Pig.\n",
      "[00:05.000 --> 00:07.000] And this is Daddy Pig.\n",
      "[00:07.000 --> 00:09.000] Pee-pah Pig.\n",
      "[00:09.000 --> 00:11.000] Desert Island.\n",
      "[00:11.000 --> 00:14.000] Pepper and George are at Danny Dog's house.\n",
      "[00:14.000 --> 00:17.000] Captain Dog is telling stories of when he was a sailor.\n",
      "[00:17.000 --> 00:20.000] I sailed all around the world.\n",
      "[00:20.000 --> 00:22.000] And then I came home again.\n",
      "[00:22.000 --> 00:25.000] But now I'm back for good.\n",
      "[00:25.000 --> 00:27.000] I'll never forget you.\n",
      "[00:27.000 --> 00:29.000] Daddy, do you miss the sea?\n",
      "[00:29.000 --> 00:31.000] Well, sometimes.\n",
      "[00:31.000 --> 00:36.000] It is Grandad Dog, Grandpa Pig and Grumpy Rabbit.\n",
      "[00:36.000 --> 00:37.000] Hello.\n",
      "[00:37.000 --> 00:40.000] Can Captain Dog come out to play?\n",
      "[00:40.000 --> 00:43.000] What? We are going on a fishing trip.\n",
      "[00:43.000 --> 00:44.000] On a boat?\n",
      "[00:44.000 --> 00:45.000] On the sea!\n",
      "[00:45.000 --> 00:47.000] OK, let's go.\n",
      "[00:47.000 --> 00:51.000] But Daddy, you said you'd never get on a boat again.\n",
      "[00:51.000 --> 00:54.000] I'm not going to get on a boat again.\n",
      "[00:54.000 --> 00:57.000] You said you'd never get on a boat again.\n",
      "[00:57.000 --> 01:00.000] Oh, yes. So I did.\n",
      "[01:00.000 --> 01:02.000] OK, bye-bye.\n",
      "[01:02.000 --> 01:03.000] Bye.\n",
      "\u001b[33muser_proxy\u001b[0m (to chatbot):\n",
      "\n",
      "\u001b[32m***** Response from calling function \"recognize_transcript_from_video\" *****\u001b[0m\n",
      "[{'sentence': 'This is my little brother George..', 'timestamp_start': 0, 'timestamp_end': 3.0}, {'sentence': 'This is Mummy Pig..', 'timestamp_start': 3.0, 'timestamp_end': 5.0}, {'sentence': 'And this is Daddy Pig..', 'timestamp_start': 5.0, 'timestamp_end': 7.0}, {'sentence': 'Pee-pah Pig..', 'timestamp_start': 7.0, 'timestamp_end': 9.0}, {'sentence': 'Desert Island..', 'timestamp_start': 9.0, 'timestamp_end': 11.0}, {'sentence': \"Pepper and George are at Danny Dog's house..\", 'timestamp_start': 11.0, 'timestamp_end': 14.0}, {'sentence': 'Captain Dog is telling stories of when he was a sailor..', 'timestamp_start': 14.0, 'timestamp_end': 17.0}, {'sentence': 'I sailed all around the world..', 'timestamp_start': 17.0, 'timestamp_end': 20.0}, {'sentence': 'And then I came home again..', 'timestamp_start': 20.0, 'timestamp_end': 22.0}, {'sentence': \"But now I'm back for good..\", 'timestamp_start': 22.0, 'timestamp_end': 25.0}, {'sentence': \"I'll never forget you..\", 'timestamp_start': 25.0, 'timestamp_end': 27.0}, {'sentence': 'Daddy, do you miss the sea?.', 'timestamp_start': 27.0, 'timestamp_end': 29.0}, {'sentence': 'Well, sometimes..', 'timestamp_start': 29.0, 'timestamp_end': 31.0}, {'sentence': 'It is Grandad Dog, Grandpa Pig and Grumpy Rabbit..', 'timestamp_start': 31.0, 'timestamp_end': 36.0}, {'sentence': 'Hello..', 'timestamp_start': 36.0, 'timestamp_end': 37.0}, {'sentence': 'Can Captain Dog come out to play?.', 'timestamp_start': 37.0, 'timestamp_end': 40.0}, {'sentence': 'What? We are going on a fishing trip..', 'timestamp_start': 40.0, 'timestamp_end': 43.0}, {'sentence': 'On a boat?.', 'timestamp_start': 43.0, 'timestamp_end': 44.0}, {'sentence': 'On the sea!.', 'timestamp_start': 44.0, 'timestamp_end': 45.0}, {'sentence': \"OK, let's go..\", 'timestamp_start': 45.0, 'timestamp_end': 47.0}, {'sentence': \"But Daddy, you said you'd never get on a boat again..\", 'timestamp_start': 47.0, 'timestamp_end': 51.0}, {'sentence': \"I'm not going to get on a boat again..\", 'timestamp_start': 51.0, 'timestamp_end': 54.0}, {'sentence': \"You said you'd never get on a boat again..\", 'timestamp_start': 54.0, 'timestamp_end': 57.0}, {'sentence': 'Oh, yes. So I did..', 'timestamp_start': 57.0, 'timestamp_end': 60.0}, {'sentence': 'OK, bye-bye..', 'timestamp_start': 60.0, 'timestamp_end': 62.0}, {'sentence': 'Bye..', 'timestamp_start': 62.0, 'timestamp_end': 63.0}]\n",
      "\u001b[32m****************************************************************************\u001b[0m\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[33mchatbot\u001b[0m (to user_proxy):\n",
      "\n",
      "\u001b[32m***** Suggested function Call: translate_transcript *****\u001b[0m\n",
      "Arguments: \n",
      "{\n",
      "\"source_language\": \"en\",\n",
      "\"target_language\": \"zh\"\n",
      "}\n",
      "\u001b[32m*********************************************************\u001b[0m\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[35m\n",
      ">>>>>>>> EXECUTING FUNCTION translate_transcript...\u001b[0m\n",
      "\u001b[33muser_proxy\u001b[0m (to chatbot):\n",
      "\n",
      "\u001b[32m***** Response from calling function \"translate_transcript\" *****\u001b[0m\n",
      "0s to 3.0s: 这是我小弟弟乔治。\n",
      "3.0s to 5.0s: 这是妈妈猪。\n",
      "5.0s to 7.0s: 这位是猪爸爸..\n",
      "7.0s to 9.0s: 'Peppa Pig...' (皮皮猪)\n",
      "9.0s to 11.0s: \"荒岛..\"\n",
      "11.0s to 14.0s: 胡椒和乔治在丹尼狗的家里。\n",
      "14.0s to 17.0s: 船长狗正在讲述他作为一名海员时的故事。\n",
      "17.0s to 20.0s: 我环游了全世界。\n",
      "20.0s to 22.0s: 然后我又回到了家。。\n",
      "22.0s to 25.0s: \"但现在我回来了,永远地回来了...\"\n",
      "25.0s to 27.0s: \"我永远不会忘记你...\"\n",
      "27.0s to 29.0s: \"爸爸,你想念大海吗?\"\n",
      "29.0s to 31.0s: 嗯,有时候...\n",
      "31.0s to 36.0s: 这是大爷狗、爷爷猪和脾气暴躁的兔子。\n",
      "36.0s to 37.0s: 你好。\n",
      "37.0s to 40.0s: \"船长狗可以出来玩吗?\"\n",
      "40.0s to 43.0s: 什么?我们要去钓鱼了。。\n",
      "43.0s to 44.0s: 在船上?\n",
      "44.0s to 45.0s: 在海上!\n",
      "45.0s to 47.0s: 好的,我们走吧。\n",
      "47.0s to 51.0s: \"但是爸爸,你说过你再也不会上船了…\"\n",
      "51.0s to 54.0s: \"我不会再上船了..\"\n",
      "54.0s to 57.0s: \"你说过再也不会上船了...\"\n",
      "57.0s to 60.0s: 哦,是的。所以我做了。\n",
      "60.0s to 62.0s: 好的,再见。\n",
      "62.0s to 63.0s: 再见。。\n",
      "\u001b[32m*****************************************************************\u001b[0m\n",
      "\n",
      "--------------------------------------------------------------------------------\n",
      "\u001b[33mchatbot\u001b[0m (to user_proxy):\n",
      "\n",
      "TERMINATE\n",
      "\n",
      "--------------------------------------------------------------------------------\n"
     ]
    }
   ],
   "source": [
    "def recognize_transcript_from_video(audio_filepath):\n",
    "    try:\n",
    "        # Load the Whisper speech-recognition model\n",
    "        model = whisper.load_model(\"small\")\n",
    "\n",
    "        # Transcribe the audio with detailed timestamps\n",
    "        result = model.transcribe(audio_filepath, verbose=True)\n",
    "\n",
    "        # Initialize variables for the transcript\n",
    "        transcript = []\n",
    "        sentence = \"\"\n",
    "        start_time = 0\n",
    "\n",
    "        # Iterate through the segments in the result\n",
    "        for segment in result['segments']:\n",
    "            # If a new segment starts, save the previous sentence and reset variables.\n",
    "            # Note: Whisper segments usually already end with punctuation, so the\n",
    "            # appended \".\" can double up, as visible in the sample output.\n",
    "            if segment['start'] != start_time and sentence:\n",
    "                transcript.append({\n",
    "                    \"sentence\": sentence.strip() + \".\",\n",
    "                    \"timestamp_start\": start_time,\n",
    "                    \"timestamp_end\": segment['start']\n",
    "                })\n",
    "                sentence = \"\"\n",
    "                start_time = segment['start']\n",
    "\n",
    "            # Append the segment text to the current sentence\n",
    "            sentence += segment['text'] + \" \"\n",
    "\n",
    "        # Add the final sentence\n",
    "        if sentence:\n",
    "            transcript.append({\n",
    "                \"sentence\": sentence.strip() + \".\",\n",
    "                \"timestamp_start\": start_time,\n",
    "                \"timestamp_end\": result['segments'][-1]['end']\n",
    "            })\n",
    "\n",
    "        # Save the transcript to a file, one \"<start>s to <end>s: <sentence>\" line per entry\n",
    "        with open(\"transcription.txt\", \"w\") as file:\n",
    "            for item in transcript:\n",
    "                sentence = item[\"sentence\"]\n",
    "                start_time, end_time = item[\"timestamp_start\"], item[\"timestamp_end\"]\n",
    "                file.write(f\"{start_time}s to {end_time}s: {sentence}\\n\")\n",
    "\n",
    "        return transcript\n",
    "\n",
    "    except FileNotFoundError:\n",
    "        return \"The specified audio file could not be found.\"\n",
    "    except Exception as e:\n",
    "        return f\"An unexpected error occurred: {str(e)}\"\n",
    "\n",
    "\n",
    "def translate_text(input_text, source_language, target_language):\n",
    "    # `key` is defined at module level below, before the chat is initiated\n",
    "    client = OpenAI(api_key=key)\n",
    "\n",
    "    response = client.chat.completions.create(\n",
    "        model=\"gpt-3.5-turbo\",\n",
    "        messages=[\n",
    "            {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
    "            {\"role\": \"user\",\n",
    "             \"content\": f\"Directly translate the following {source_language} text to a pure {target_language} \"\n",
    "                        f\"video subtitle text without additional explanation.: '{input_text}'\"},\n",
    "        ],\n",
    "        max_tokens=1500\n",
    "    )\n",
    "\n",
    "    # Take the message content of the first choice, if any\n",
    "    translated_text = response.choices[0].message.content if response.choices else None\n",
    "    return translated_text\n",
    "\n",
    "\n",
    "def translate_transcript(source_language, target_language):\n",
    "    with open(\"transcription.txt\", \"r\") as f:\n",
    "        lines = f.readlines()\n",
    "\n",
    "    translated_transcript = []\n",
    "\n",
    "    for line in lines:\n",
    "        # Split each line into its timestamp and text parts\n",
    "        parts = line.strip().split(': ')\n",
    "        if len(parts) == 2:\n",
    "            timestamp, text = parts[0], parts[1]\n",
    "            # Translate only the text part\n",
    "            translated_text = translate_text(text, source_language, target_language)\n",
    "            # Reconstruct the line with the translated text and the preserved timestamp\n",
    "            translated_line = f\"{timestamp}: {translated_text}\"\n",
    "            translated_transcript.append(translated_line)\n",
    "        else:\n",
    "            # If the line doesn't contain a timestamp, add it as is\n",
    "            translated_transcript.append(line.strip())\n",
    "\n",
    "    return '\\n'.join(translated_transcript)\n",
    "\n",
    "\n",
    "llm_config = {\n",
    "    \"functions\": [\n",
    "        {\n",
    "            \"name\": \"recognize_transcript_from_video\",\n",
    "            \"description\": \"recognize the speech from a video file and write it into a txt file\",\n",
    "            \"parameters\": {\n",
    "                \"type\": \"object\",\n",
    "                \"properties\": {\n",
    "                    \"audio_filepath\": {\n",
    "                        \"type\": \"string\",\n",
    "                        \"description\": \"path of the video file\",\n",
    "                    }\n",
    "                },\n",
    "                \"required\": [\"audio_filepath\"],\n",
    "            },\n",
    "        },\n",
    "        {\n",
    "            \"name\": \"translate_transcript\",\n",
    "            \"description\": \"use the translate_text function to translate the transcript\",\n",
    "            \"parameters\": {\n",
    "                \"type\": \"object\",\n",
    "                \"properties\": {\n",
    "                    \"source_language\": {\n",
    "                        \"type\": \"string\",\n",
    "                        \"description\": \"source language\",\n",
    "                    },\n",
    "                    \"target_language\": {\n",
    "                        \"type\": \"string\",\n",
    "                        \"description\": \"target language\",\n",
    "                    }\n",
    "                },\n",
    "                \"required\": [\"source_language\", \"target_language\"],\n",
    "            },\n",
    "        },\n",
    "    ],\n",
    "    \"config_list\": config_list,\n",
    "    \"timeout\": 120,\n",
    "}\n",
    "source_language = \"English\"\n",
    "target_language = \"Chinese\"\n",
    "key = os.getenv(\"OPENAI_API_KEY\")\n",
    "target_video = \"your_file_path\"\n",
    "\n",
    "chatbot = autogen.AssistantAgent(\n",
    "    name=\"chatbot\",\n",
    "    system_message=\"For coding tasks, only use the functions you have been provided with. Reply TERMINATE when the task is done.\",\n",
    "    llm_config=llm_config,\n",
    ")\n",
    "\n",
    "user_proxy = autogen.UserProxyAgent(\n",
    "    name=\"user_proxy\",\n",
    "    is_termination_msg=lambda x: x.get(\"content\", \"\") and x.get(\"content\", \"\").rstrip().endswith(\"TERMINATE\"),\n",
    "    human_input_mode=\"NEVER\",\n",
    "    max_consecutive_auto_reply=10,\n",
    "    code_execution_config={\"work_dir\": \"coding_2\"},\n",
    ")\n",
    "\n",
    "user_proxy.register_function(\n",
    "    function_map={\n",
    "        \"recognize_transcript_from_video\": recognize_transcript_from_video,\n",
    "        \"translate_transcript\": translate_transcript,\n",
    "    }\n",
    ")\n",
    "user_proxy.initiate_chat(\n",
    "    chatbot,\n",
    "    message=f\"For the video located in {target_video}, recognize the speech and transfer it into a script file, \"\n",
    "            f\"then translate from {source_language} text to a {target_language} video subtitle text. \",\n",
    ")"
   ]
  },
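  {
   "cell_type": "markdown",
   "id": "srt-export-md",
   "metadata": {},
   "source": [
    "The translated transcript keeps the plain `<start>s to <end>s: <text>` layout produced above. If you want a file that video players accept directly, the sketch below (not part of the original example; the helper names and output file name are illustrative) converts lines in that layout into SubRip (`.srt`) format.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "srt-export-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "\n",
    "def _srt_timestamp(seconds):\n",
    "    # SubRip timestamps use the HH:MM:SS,mmm format\n",
    "    ms = int(round(float(seconds) * 1000))\n",
    "    h, ms = divmod(ms, 3600000)\n",
    "    m, ms = divmod(ms, 60000)\n",
    "    s, ms = divmod(ms, 1000)\n",
    "    return f\"{h:02d}:{m:02d}:{s:02d},{ms:03d}\"\n",
    "\n",
    "\n",
    "def transcript_to_srt(transcript_text, srt_filepath):\n",
    "    # Parse lines shaped like \"0s to 3.0s: some text\" and write numbered SRT blocks\n",
    "    pattern = re.compile(r\"^([\\d.]+)s to ([\\d.]+)s: (.*)$\")\n",
    "    with open(srt_filepath, \"w\", encoding=\"utf-8\") as srt:\n",
    "        index = 1\n",
    "        for line in transcript_text.splitlines():\n",
    "            match = pattern.match(line.strip())\n",
    "            if not match:\n",
    "                continue\n",
    "            start, end, text = match.groups()\n",
    "            srt.write(f\"{index}\\n{_srt_timestamp(start)} --> {_srt_timestamp(end)}\\n{text}\\n\\n\")\n",
    "            index += 1\n",
    "\n",
    "\n",
    "# Example (illustrative): transcript_to_srt(translate_transcript(\"en\", \"zh\"), \"peppa_pig_zh.srt\")"
   ]
  }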
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
