
Adding a notebook demonstrating video transcript translation with Whisper in AutoGen #881

Merged 22 commits on Dec 7, 2023
e2b85b8
add agentchat_video_transcript_translate.ipynb
silver233jpg Nov 29, 2023
addcb5a
finish the agentchat_video_transcript_translate.ipynb file notebook
silver233jpg Nov 30, 2023
2672c60
modify the recognize_transcript_from_video function
silver233jpg Nov 30, 2023
76107cc
run the script and add the output to the notebook
silver233jpg Nov 30, 2023
a85e781
implement the notebook
silver233jpg Dec 1, 2023
03f8f28
Merge branch 'microsoft:main' into main
chengxuan233 Dec 1, 2023
57e6339
add the link to the video clip
silver233jpg Dec 5, 2023
bbec82a
rename the file and add the version requirement of each packages
silver233jpg Dec 5, 2023
d3cd707
Merge branch 'main' of https://github.com/chengxuan233/autogen
silver233jpg Dec 5, 2023
ac9b8a0
add the new notebook path
silver233jpg Dec 5, 2023
eaa551d
add the notebook path to Example.md
silver233jpg Dec 5, 2023
fe841db
add the new notebook path to the new example.md
silver233jpg Dec 5, 2023
2403c59
add the instruction of FFmpeg and video download
silver233jpg Dec 5, 2023
efc223e
Update Examples.md
chengxuan233 Dec 5, 2023
f471a63
Update Examples.md
chengxuan233 Dec 5, 2023
f9e6da5
Update Examples.md
chengxuan233 Dec 5, 2023
85fa019
Update Examples.md
chengxuan233 Dec 5, 2023
e35a8d7
Merge branch 'main' into main
chengxuan233 Dec 5, 2023
92aa068
Delete notebook/agentchat_video_transcript_translate.ipynb
chengxuan233 Dec 5, 2023
ac11a8f
Merge branch 'microsoft:main' into main
chengxuan233 Dec 6, 2023
7c10a9e
Merge branch 'microsoft:main' into main
chengxuan233 Dec 7, 2023
50524b1
Update Examples.md and add the link
chengxuan233 Dec 7, 2023
393 changes: 393 additions & 0 deletions notebook/agentchat_video_transcript_translate_with_whisper.ipynb
@@ -0,0 +1,393 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e4fccaaa-fda5-4f99-a4c5-c463c5c890f5",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/microsoft/autogen/blob/main/notebook/agentchat_video_transcript_translate_with_whisper.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"id": "a5b4540e-4987-4774-9305-764c3133e953",
"metadata": {},
"source": [
"<a id=\"toc\"></a>\n",
"# Auto Generated Agent Chat: Translating Video Audio Using Whisper and GPT-3.5-turbo\n",
"In this notebook, we demonstrate how to use Whisper and GPT-3.5-turbo with `AssistantAgent` and `UserProxyAgent` to recognize and translate\n",
"the speech in a video file and add timestamps, like a subtitle file, based on [agentchat_function_call.ipynb](https://github.com/microsoft/autogen/blob/main/notebook/agentchat_function_call.ipynb).\n"
]
},
{
"cell_type": "markdown",
"id": "4fd644cc-2b14-4700-8b1d-959fb2e9acb0",
"metadata": {},
"source": [
"## Requirements\n",
"AutoGen requires `Python>=3.8`. To run this notebook example, please install `openai`, `pyautogen`, `openai-whisper`, and `moviepy`:\n",
"```bash\n",
"pip install openai\n",
"pip install openai-whisper\n",
"pip install moviepy\n",
"pip install pyautogen\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bc4600b8-c6df-49dd-945d-ce69f30a65cc",
"metadata": {},
"outputs": [],
"source": [
"%%capture --no-stderr\n",
"# %pip install moviepy~=1.0.3\n",
"# %pip install openai-whisper~=20230918\n",
"# %pip install openai~=1.3.5\n",
"# %pip install pyautogen~=0.2.0b4"
]
},
{
"cell_type": "markdown",
"id": "18bdeb0b-c4b6-4dec-97d2-d84f09cffa00",
"metadata": {},
"source": [
"## Set your API Endpoint\n",
"It is recommended to store your OpenAI API key in an environment variable, for example `OPENAI_API_KEY`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "26d1ae87-f007-4286-a56a-dcf68abf9393",
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"import whisper\n",
"import autogen\n",
"from moviepy.editor import VideoFileClip\n",
"import os\n",
"\n",
"config_list = [\n",
" {\n",
" 'model': 'gpt-4',\n",
" 'api_key': os.getenv(\"OPENAI_API_KEY\"),\n",
" }\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "324fec65-ab23-45db-a7a8-0aaf753fe19c",
"metadata": {},
"source": [
"## Example and Output\n",
"Below is an example of speech recognition from a [Peppa Pig cartoon video clip](https://drive.google.com/file/d/1QY0naa2acHw2FuH7sY3c-g2sBLtC2Sv4/view?usp=drive_link) originally in English and translated into Chinese.\n",
"FFmpeg does not read remote files, so to run the code on the example video you need to download it locally first. Change `your_file_path` to your local video file path."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "ed549b75-b4ea-4ec5-8c0b-a15e93ffd618",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33muser_proxy\u001b[0m (to chatbot):\n",
"\n",
"For the video located in E:\\pythonProject\\gpt_detection\\peppa pig.mp4, recognize the speech and transfer it into a script file, then translate from English text to a Chinese video subtitle text. \n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mchatbot\u001b[0m (to user_proxy):\n",
"\n",
"\u001b[32m***** Suggested function Call: recognize_transcript_from_video *****\u001b[0m\n",
"Arguments: \n",
"{\n",
"\"audio_filepath\": \"E:\\\\pythonProject\\\\gpt_detection\\\\peppa pig.mp4\"\n",
"}\n",
"\u001b[32m********************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[35m\n",
">>>>>>>> EXECUTING FUNCTION recognize_transcript_from_video...\u001b[0m\n",
"Detecting language using up to the first 30 seconds. Use `--language` to specify the language\n",
"Detected language: English\n",
"[00:00.000 --> 00:03.000] This is my little brother George.\n",
"[00:03.000 --> 00:05.000] This is Mummy Pig.\n",
"[00:05.000 --> 00:07.000] And this is Daddy Pig.\n",
"[00:07.000 --> 00:09.000] Pee-pah Pig.\n",
"[00:09.000 --> 00:11.000] Desert Island.\n",
"[00:11.000 --> 00:14.000] Pepper and George are at Danny Dog's house.\n",
"[00:14.000 --> 00:17.000] Captain Dog is telling stories of when he was a sailor.\n",
"[00:17.000 --> 00:20.000] I sailed all around the world.\n",
"[00:20.000 --> 00:22.000] And then I came home again.\n",
"[00:22.000 --> 00:25.000] But now I'm back for good.\n",
"[00:25.000 --> 00:27.000] I'll never forget you.\n",
"[00:27.000 --> 00:29.000] Daddy, do you miss the sea?\n",
"[00:29.000 --> 00:31.000] Well, sometimes.\n",
"[00:31.000 --> 00:36.000] It is Grandad Dog, Grandpa Pig and Grumpy Rabbit.\n",
"[00:36.000 --> 00:37.000] Hello.\n",
"[00:37.000 --> 00:40.000] Can Captain Dog come out to play?\n",
"[00:40.000 --> 00:43.000] What? We are going on a fishing trip.\n",
"[00:43.000 --> 00:44.000] On a boat?\n",
"[00:44.000 --> 00:45.000] On the sea!\n",
"[00:45.000 --> 00:47.000] OK, let's go.\n",
"[00:47.000 --> 00:51.000] But Daddy, you said you'd never get on a boat again.\n",
"[00:51.000 --> 00:54.000] I'm not going to get on a boat again.\n",
"[00:54.000 --> 00:57.000] You said you'd never get on a boat again.\n",
"[00:57.000 --> 01:00.000] Oh, yes. So I did.\n",
"[01:00.000 --> 01:02.000] OK, bye-bye.\n",
"[01:02.000 --> 01:03.000] Bye.\n",
"\u001b[33muser_proxy\u001b[0m (to chatbot):\n",
"\n",
"\u001b[32m***** Response from calling function \"recognize_transcript_from_video\" *****\u001b[0m\n",
"[{'sentence': 'This is my little brother George..', 'timestamp_start': 0, 'timestamp_end': 3.0}, {'sentence': 'This is Mummy Pig..', 'timestamp_start': 3.0, 'timestamp_end': 5.0}, {'sentence': 'And this is Daddy Pig..', 'timestamp_start': 5.0, 'timestamp_end': 7.0}, {'sentence': 'Pee-pah Pig..', 'timestamp_start': 7.0, 'timestamp_end': 9.0}, {'sentence': 'Desert Island..', 'timestamp_start': 9.0, 'timestamp_end': 11.0}, {'sentence': \"Pepper and George are at Danny Dog's house..\", 'timestamp_start': 11.0, 'timestamp_end': 14.0}, {'sentence': 'Captain Dog is telling stories of when he was a sailor..', 'timestamp_start': 14.0, 'timestamp_end': 17.0}, {'sentence': 'I sailed all around the world..', 'timestamp_start': 17.0, 'timestamp_end': 20.0}, {'sentence': 'And then I came home again..', 'timestamp_start': 20.0, 'timestamp_end': 22.0}, {'sentence': \"But now I'm back for good..\", 'timestamp_start': 22.0, 'timestamp_end': 25.0}, {'sentence': \"I'll never forget you..\", 'timestamp_start': 25.0, 'timestamp_end': 27.0}, {'sentence': 'Daddy, do you miss the sea?.', 'timestamp_start': 27.0, 'timestamp_end': 29.0}, {'sentence': 'Well, sometimes..', 'timestamp_start': 29.0, 'timestamp_end': 31.0}, {'sentence': 'It is Grandad Dog, Grandpa Pig and Grumpy Rabbit..', 'timestamp_start': 31.0, 'timestamp_end': 36.0}, {'sentence': 'Hello..', 'timestamp_start': 36.0, 'timestamp_end': 37.0}, {'sentence': 'Can Captain Dog come out to play?.', 'timestamp_start': 37.0, 'timestamp_end': 40.0}, {'sentence': 'What? 
We are going on a fishing trip..', 'timestamp_start': 40.0, 'timestamp_end': 43.0}, {'sentence': 'On a boat?.', 'timestamp_start': 43.0, 'timestamp_end': 44.0}, {'sentence': 'On the sea!.', 'timestamp_start': 44.0, 'timestamp_end': 45.0}, {'sentence': \"OK, let's go..\", 'timestamp_start': 45.0, 'timestamp_end': 47.0}, {'sentence': \"But Daddy, you said you'd never get on a boat again..\", 'timestamp_start': 47.0, 'timestamp_end': 51.0}, {'sentence': \"I'm not going to get on a boat again..\", 'timestamp_start': 51.0, 'timestamp_end': 54.0}, {'sentence': \"You said you'd never get on a boat again..\", 'timestamp_start': 54.0, 'timestamp_end': 57.0}, {'sentence': 'Oh, yes. So I did..', 'timestamp_start': 57.0, 'timestamp_end': 60.0}, {'sentence': 'OK, bye-bye..', 'timestamp_start': 60.0, 'timestamp_end': 62.0}, {'sentence': 'Bye..', 'timestamp_start': 62.0, 'timestamp_end': 63.0}]\n",
"\u001b[32m****************************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mchatbot\u001b[0m (to user_proxy):\n",
"\n",
"\u001b[32m***** Suggested function Call: translate_transcript *****\u001b[0m\n",
"Arguments: \n",
"{\n",
"\"source_language\": \"en\",\n",
"\"target_language\": \"zh\"\n",
"}\n",
"\u001b[32m*********************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[35m\n",
">>>>>>>> EXECUTING FUNCTION translate_transcript...\u001b[0m\n",
"\u001b[33muser_proxy\u001b[0m (to chatbot):\n",
"\n",
"\u001b[32m***** Response from calling function \"translate_transcript\" *****\u001b[0m\n",
"0s to 3.0s: 这是我小弟弟乔治。\n",
"3.0s to 5.0s: 这是妈妈猪。\n",
"5.0s to 7.0s: 这位是猪爸爸..\n",
"7.0s to 9.0s: 'Peppa Pig...' (皮皮猪)\n",
"9.0s to 11.0s: \"荒岛..\"\n",
"11.0s to 14.0s: 胡椒和乔治在丹尼狗的家里。\n",
"14.0s to 17.0s: 船长狗正在讲述他作为一名海员时的故事。\n",
"17.0s to 20.0s: 我环游了全世界。\n",
"20.0s to 22.0s: 然后我又回到了家。。\n",
"22.0s to 25.0s: \"但现在我回来了,永远地回来了...\"\n",
"25.0s to 27.0s: \"我永远不会忘记你...\"\n",
"27.0s to 29.0s: \"爸爸,你想念大海吗?\"\n",
"29.0s to 31.0s: 嗯,有时候...\n",
"31.0s to 36.0s: 这是大爷狗、爷爷猪和脾气暴躁的兔子。\n",
"36.0s to 37.0s: 你好。\n",
"37.0s to 40.0s: \"船长狗可以出来玩吗?\"\n",
"40.0s to 43.0s: 什么?我们要去钓鱼了。。\n",
"43.0s to 44.0s: 在船上?\n",
"44.0s to 45.0s: 在海上!\n",
"45.0s to 47.0s: 好的,我们走吧。\n",
"47.0s to 51.0s: \"但是爸爸,你说过你再也不会上船了…\"\n",
"51.0s to 54.0s: \"我不会再上船了..\"\n",
"54.0s to 57.0s: \"你说过再也不会上船了...\"\n",
"57.0s to 60.0s: 哦,是的。所以我做了。\n",
"60.0s to 62.0s: 好的,再见。\n",
"62.0s to 63.0s: 再见。。\n",
"\u001b[32m*****************************************************************\u001b[0m\n",
"\n",
"--------------------------------------------------------------------------------\n",
"\u001b[33mchatbot\u001b[0m (to user_proxy):\n",
"\n",
"TERMINATE\n",
"\n",
"--------------------------------------------------------------------------------\n"
]
}
],
"source": [
"def recognize_transcript_from_video(audio_filepath):\n",
" try:\n",
" # Load model\n",
" model = whisper.load_model(\"small\")\n",
"\n",
" # Transcribe audio with detailed timestamps\n",
" result = model.transcribe(audio_filepath, verbose=True)\n",
"\n",
" # Initialize variables for transcript\n",
" transcript = []\n",
" sentence = \"\"\n",
" start_time = 0\n",
"\n",
" # Iterate through the segments in the result\n",
" for segment in result['segments']:\n",
" # If new sentence starts, save the previous one and reset variables\n",
" if segment['start'] != start_time and sentence:\n",
" transcript.append({\n",
" \"sentence\": sentence.strip() + \".\",\n",
" \"timestamp_start\": start_time,\n",
" \"timestamp_end\": segment['start']\n",
" })\n",
" sentence = \"\"\n",
" start_time = segment['start']\n",
"\n",
" # Add the word to the current sentence\n",
" sentence += segment['text'] + \" \"\n",
"\n",
" # Add the final sentence\n",
" if sentence:\n",
" transcript.append({\n",
" \"sentence\": sentence.strip() + \".\",\n",
" \"timestamp_start\": start_time,\n",
" \"timestamp_end\": result['segments'][-1]['end']\n",
" })\n",
"\n",
" # Save the transcript to a file\n",
" with open(\"transcription.txt\", \"w\") as file:\n",
" for item in transcript:\n",
" sentence = item[\"sentence\"]\n",
" start_time, end_time = item[\"timestamp_start\"], item[\"timestamp_end\"]\n",
" file.write(f\"{start_time}s to {end_time}s: {sentence}\\n\")\n",
"\n",
" return transcript\n",
"\n",
" except FileNotFoundError:\n",
" return \"The specified audio file could not be found.\"\n",
" except Exception as e:\n",
" return f\"An unexpected error occurred: {str(e)}\"\n",
"\n",
"\n",
"\n",
"def translate_text(input_text, source_language, target_language):\n",
" client = OpenAI(api_key=key)\n",
"\n",
" response = client.chat.completions.create(\n",
" model=\"gpt-3.5-turbo\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
" {\"role\": \"user\",\n",
" \"content\": f\"Directly translate the following {source_language} text to a pure {target_language} \"\n",
" f\"video subtitle text without additional explanation.: '{input_text}'\"},\n",
" ],\n",
" max_tokens=1500\n",
" )\n",
"\n",
" # Correctly accessing the response content\n",
" translated_text = response.choices[0].message.content if response.choices else None\n",
" return translated_text\n",
"\n",
"\n",
"def translate_transcript(source_language, target_language):\n",
" with open(\"transcription.txt\", \"r\") as f:\n",
" lines = f.readlines()\n",
"\n",
" translated_transcript = []\n",
"\n",
" for line in lines:\n",
" # Split each line into timestamp and text parts\n",
" parts = line.strip().split(': ')\n",
" if len(parts) == 2:\n",
" timestamp, text = parts[0], parts[1]\n",
" # Translate only the text part\n",
" translated_text = translate_text(text, source_language, target_language)\n",
" # Reconstruct the line with the translated text and the preserved timestamp\n",
" translated_line = f\"{timestamp}: {translated_text}\"\n",
" translated_transcript.append(translated_line)\n",
" else:\n",
" # If the line doesn't contain a timestamp, add it as is\n",
" translated_transcript.append(line.strip())\n",
"\n",
" return '\\n'.join(translated_transcript)\n",
"\n",
"\n",
"llm_config = {\n",
" \"functions\": [\n",
" {\n",
" \"name\": \"recognize_transcript_from_video\",\n",
" \"description\": \"Recognize speech from a video file and save the transcript to a text file\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"audio_filepath\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"path of the video file\",\n",
" }\n",
" },\n",
" \"required\": [\"audio_filepath\"],\n",
" },\n",
" },\n",
" {\n",
" \"name\": \"translate_transcript\",\n",
" \"description\": \"Use the translate_text function to translate the transcript\",\n",
" \"parameters\": {\n",
" \"type\": \"object\",\n",
" \"properties\": {\n",
" \"source_language\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"source language\",\n",
" },\n",
" \"target_language\": {\n",
" \"type\": \"string\",\n",
" \"description\": \"target language\",\n",
" }\n",
" },\n",
" \"required\": [\"source_language\", \"target_language\"],\n",
" },\n",
" },\n",
" ],\n",
" \"config_list\": config_list,\n",
" \"timeout\": 120,\n",
"}\n",
"source_language = \"English\"\n",
"target_language = \"Chinese\"\n",
"key = os.getenv(\"OPENAI_API_KEY\")\n",
"target_video = \"your_file_path\"\n",
"\n",
"chatbot = autogen.AssistantAgent(\n",
" name=\"chatbot\",\n",
" system_message=\"For coding tasks, only use the functions you have been provided with. Reply TERMINATE when the task is done.\",\n",
" llm_config=llm_config,\n",
")\n",
"\n",
"user_proxy = autogen.UserProxyAgent(\n",
" name=\"user_proxy\",\n",
" is_termination_msg=lambda x: x.get(\"content\", \"\") and x.get(\"content\", \"\").rstrip().endswith(\"TERMINATE\"),\n",
" human_input_mode=\"NEVER\",\n",
" max_consecutive_auto_reply=10,\n",
" code_execution_config={\"work_dir\": \"coding_2\"},\n",
")\n",
"\n",
"user_proxy.register_function(\n",
" function_map={\n",
" \"recognize_transcript_from_video\": recognize_transcript_from_video,\n",
" \"translate_transcript\": translate_transcript,\n",
" }\n",
")\n",
"user_proxy.initiate_chat(\n",
" chatbot,\n",
" message=f\"For the video located in {target_video}, recognize the speech and transfer it into a script file, \"\n",
" f\"then translate from {source_language} text to a {target_language} video subtitle text. \",\n",
")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
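The two registered functions in the notebook hand the transcript to each other through a plain text file, `transcription.txt`, with one `<start>s to <end>s: <sentence>` line per entry: `recognize_transcript_from_video` writes it, and `translate_transcript` parses each line back apart with `line.strip().split(': ')`. A minimal, self-contained sketch of that round trip (the function names below are illustrative, not part of the notebook; unlike the notebook, this version splits only on the first `": "` so colons inside the sentence survive):

```python
def format_transcript_line(start, end, sentence):
    # Render one entry the way the notebook writes transcription.txt.
    return f"{start}s to {end}s: {sentence}"

def parse_transcript_line(line):
    # Split a line into (timestamp, text). Splitting only once tolerates
    # colons inside the sentence itself; a line with no ": " separator is
    # returned with a None timestamp, mirroring the notebook's pass-through
    # branch for lines that do not match the timestamp format.
    parts = line.strip().split(": ", 1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return None, line.strip()

line = format_transcript_line(0, 3.0, "This is my little brother George.")
timestamp, text = parse_transcript_line(line)
print(timestamp)  # 0s to 3.0s
print(text)       # This is my little brother George.
```

Because only the text part is sent to `translate_text`, the timestamps pass through translation untouched, which is what keeps the translated subtitles aligned with the original video.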