diff --git a/notebooks/unit1/requirements-unit1.txt b/notebooks/unit1/requirements-unit1.txt index fd27dfc1..ebdaddb0 100644 --- a/notebooks/unit1/requirements-unit1.txt +++ b/notebooks/unit1/requirements-unit1.txt @@ -1,4 +1,4 @@ -stable-baselines3==2.0.0a5 +stable-baselines3 swig gymnasium[box2d] huggingface_sb3 diff --git a/notebooks/unit1/unit1.ipynb b/notebooks/unit1/unit1.ipynb index 06d62b08..bec20280 100644 --- a/notebooks/unit1/unit1.ipynb +++ b/notebooks/unit1/unit1.ipynb @@ -1,1182 +1,1180 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "njb_ProuHiOe" - }, - "source": [ - "# Unit 1: Train your first Deep Reinforcement Learning Agent ๐Ÿค–\n", - "\n", - "![Cover](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/thumbnail.jpg)\n", - "\n", - "In this notebook, you'll train your **first Deep Reinforcement Learning agent** a Lunar Lander agent that will learn to **land correctly on the Moon ๐ŸŒ•**. Using [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/) a Deep Reinforcement Learning library, share them with the community, and experiment with different configurations\n", - "\n", - "โฌ‡๏ธ Here is an example of what **you will achieve in just a couple of minutes.** โฌ‡๏ธ\n", - "\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "PF46MwbZD00b" - }, - "outputs": [], - "source": [ - "%%html\n", - "" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "x7oR6R-ZIbeS" - }, - "source": [ - "### The environment ๐ŸŽฎ\n", - "\n", - "- [LunarLander-v2](https://gymnasium.farama.org/environments/box2d/lunar_lander/)\n", - "\n", - "### The library used ๐Ÿ“š\n", - "\n", - "- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "OwEcFHe9RRZW" - }, - "source": [ - "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues)." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "4i6tjI2tHQ8j" - }, - "source": [ - "## Objectives of this notebook ๐Ÿ†\n", - "\n", - "At the end of the notebook, you will:\n", - "\n", - "- Be able to use **Gymnasium**, the environment library.\n", - "- Be able to use **Stable-Baselines3**, the deep reinforcement learning library.\n", - "- Be able to **push your trained agent to the Hub** with a nice video replay and an evaluation score ๐Ÿ”ฅ.\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Ff-nyJdzJPND" - }, - "source": [ - "## This notebook is from Deep Reinforcement Learning Course\n", - "\n", - "\"Deep" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "6p5HnEefISCB" - }, - "source": [ - "In this free course, you will:\n", - "\n", - "- ๐Ÿ“– Study Deep Reinforcement Learning in **theory and practice**.\n", - "- ๐Ÿง‘โ€๐Ÿ’ป Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n", - "- ๐Ÿค– Train **agents in unique environments**\n", - "- ๐ŸŽ“ **Earn a certificate of completion** by completing 80% of the assignments.\n", - "\n", - "And more!\n", - "\n", - "Check ๐Ÿ“š the syllabus ๐Ÿ‘‰ https://simoninithomas.github.io/deep-rl-course\n", - "\n", - "Donโ€™t forget to **sign up to the course** (we are collecting your email to be able toย **send you the links when each Unit is published and give you information about the challenges and updates).**\n", - "\n", - "The best way to keep in touch and ask questions is **to join our discord server** to exchange with the community and with us ๐Ÿ‘‰๐Ÿป https://discord.gg/ydHrjt3WP5" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Y-mo_6rXIjRi" - }, - "source": [ - "## Prerequisites ๐Ÿ—๏ธ\n", - "\n", - "Before diving into the notebook, you need to:\n", - "\n", - "๐Ÿ”ฒ ๐Ÿ“ **[Read Unit 0](https://huggingface.co/deep-rl-course/unit0/introduction)** that gives you all the **information about the course and helps you to onboard** ๐Ÿค—\n", - "\n", - "๐Ÿ”ฒ ๐Ÿ“š **Develop an understanding of the foundations of Reinforcement learning** (RL process, Rewards hypothesis...) by [reading Unit 1](https://huggingface.co/deep-rl-course/unit1/introduction)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HoeqMnr5LuYE" - }, - "source": [ - "## A small recap of Deep Reinforcement Learning ๐Ÿ“š\n", - "\n", - "\"The" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xcQYx9ynaFMD" - }, - "source": [ - "Let's do a small recap on what we learned in the first Unit:\n", - "\n", - "- Reinforcement Learning is a **computational approach to learning from actions**. 
We build an agent that learns from the environment by **interacting with it through trial and error** and receiving rewards (negative or positive) as feedback.\n", - "\n", - "- The goal of any RL agent is to **maximize its expected cumulative reward** (also called expected return) because RL is based on the _reward hypothesis_, which is that all goals can be described as the maximization of an expected cumulative reward.\n", - "\n", - "- The RL process is a **loop that outputs a sequence of state, action, reward, and next state**.\n", - "\n", - "- To calculate the expected cumulative reward (expected return), **we discount the rewards**: the rewards that come sooner (at the beginning of the game) are more probable to happen since they are more predictable than the long-term future reward.\n", - "\n", - "- To solve an RL problem, you want to **find an optimal policy**; the policy is the \"brain\" of your AI that will tell us what action to take given a state. The optimal one is the one that gives you the actions that max the expected return.\n", - "\n", - "There are **two** ways to find your optimal policy:\n", - "\n", - "- By **training your policy directly**: policy-based methods.\n", - "- By **training a value function** that tells us the expected return the agent will get at each state and use this function to define our policy: value-based methods.\n", - "\n", - "- Finally, we spoke about Deep RL because **we introduce deep neural networks to estimate the action to take (policy-based) or to estimate the value of a state (value-based) hence the name \"deep.\"**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qDploC3jSH99" - }, - "source": [ - "# Let's train our first Deep Reinforcement Learning agent and upload it to the Hub ๐Ÿš€\n", - "\n", - "## Get a certificate ๐ŸŽ“\n", - "\n", - "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained model to the Hub and **get a result of >= 200**.\n", - "\n", - "To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n", - "\n", - "For more information about the certification process, check this section ๐Ÿ‘‰ https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HqzznTzhNfAC" - }, - "source": [ - "## Set the GPU ๐Ÿ’ช\n", - "\n", - "- To **accelerate the agent's training, we'll use a GPU**. To do that, go to `Runtime > Change Runtime type`\n", - "\n", - "\"GPU" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "38HBd3t1SHJ8" - }, - "source": [ - "- `Hardware Accelerator > GPU`\n", - "\n", - "\"GPU" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jeDAH0h0EBiG" - }, - "source": [ - "## Install dependencies and create a virtual screen ๐Ÿ”ฝ\n", - "\n", - "The first step is to install the dependencies, weโ€™ll install multiple ones.\n", - "\n", - "- `gymnasium[box2d]`: Contains the LunarLander-v2 environment ๐ŸŒ›\n", - "- `stable-baselines3[extra]`: The deep reinforcement learning library.\n", - "- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face ๐Ÿค— Hub.\n", - "\n", - "To make things easier, we created a script to install all these dependencies." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "yQIGLPDkGhgG" - }, - "outputs": [], - "source": [ - "!apt install swig cmake" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "9XaULfDZDvrC" - }, - "outputs": [], - "source": [ - "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BEKeXQJsQCYm" - }, - "source": [ - "During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).\n", - "\n", - "Hence the following cell will install virtual screen libraries and create and run a virtual screen ๐Ÿ–ฅ" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "j5f2cGkdP-mb" - }, - "outputs": [], - "source": [ - "!sudo apt-get update\n", - "!sudo apt-get install -y python3-opengl\n", - "!apt install ffmpeg\n", - "!apt install xvfb\n", - "!pip3 install pyvirtualdisplay" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TCwBTAwAW9JJ" - }, - "source": [ - "To make sure the new installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "cYvkbef7XEMi" - }, - "outputs": [], - "source": [ - "import os\n", - "os.kill(os.getpid(), 9)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "BE5JWP5rQIKf" - }, - "outputs": [], - "source": [ - "# Virtual display\n", - "from pyvirtualdisplay import Display\n", - "\n", - "virtual_display = Display(visible=0, size=(1400, 900))\n", - "virtual_display.start()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wrgpVFqyENVf" - }, - "source": [ - "## Import the packages ๐Ÿ“ฆ\n", - "\n", - "One additional library we import is huggingface_hub **to be able to upload and download trained models from the hub**.\n", - "\n", - "\n", - "The Hugging Face Hub ๐Ÿค— works as a central place where anyone can share and explore models and datasets. 
It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.\n", - "\n", - "You can see here all the Deep reinforcement Learning models available here๐Ÿ‘‰ https://huggingface.co/models?pipeline_tag=reinforcement-learning&sort=downloads\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "cygWLPGsEQ0m" - }, - "outputs": [], - "source": [ - "import gymnasium\n", - "\n", - "from huggingface_sb3 import load_from_hub, package_to_hub\n", - "from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.\n", - "\n", - "from stable_baselines3 import PPO\n", - "from stable_baselines3.common.env_util import make_vec_env\n", - "from stable_baselines3.common.evaluation import evaluate_policy\n", - "from stable_baselines3.common.monitor import Monitor" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "MRqRuRUl8CsB" - }, - "source": [ - "## Understand Gymnasium and how it works ๐Ÿค–\n", - "\n", - "๐Ÿ‹ The library containing our environment is called Gymnasium.\n", - "**You'll use Gymnasium a lot in Deep Reinforcement Learning.**\n", - "\n", - "Gymnasium is the **new version of Gym library** [maintained by the Farama Foundation](https://farama.org/).\n", - "\n", - "The Gymnasium library provides two things:\n", - "\n", - "- An interface that allows you to **create RL environments**.\n", - "- A **collection of environments** (gym-control, atari, box2D...).\n", - "\n", - "Let's look at an example, but first let's recall the RL loop.\n", - "\n", - "\"The" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "-TzNN0bQ_j-3" - }, - "source": [ - "At each step:\n", - "- Our Agent receivesย a **state (S0)**ย from theย **Environment**ย โ€” we receive the first frame of our game (Environment).\n", - "- Based on thatย **state (S0),**ย the Agent takes anย **action (A0)**ย โ€” our Agent will move to the right.\n", - "- The environment transitions to aย **new**ย **state (S1)**ย โ€” new frame.\n", - "- The environment gives someย **reward (R1)**ย to the Agent โ€” weโ€™re not deadย *(Positive Reward +1)*.\n", - "\n", - "\n", - "With Gymnasium:\n", - "\n", - "1๏ธโƒฃ We create our environment using `gymnasium.make()`\n", - "\n", - "2๏ธโƒฃ We reset the environment to its initial state with `observation = env.reset()`\n", - "\n", - "At each step:\n", - "\n", - "3๏ธโƒฃ Get an action using our model (in our example we take a random action)\n", - "\n", - "4๏ธโƒฃ Using `env.step(action)`, we perform this action in the environment and get\n", - "- `observation`: The new state (st+1)\n", - "- `reward`: The reward we get after executing the action\n", - "- `terminated`: Indicates if the episode terminated (agent reach the terminal state)\n", - "- `truncated`: Introduced with this new version, it indicates a timelimit or if an agent go out of bounds of the environment for instance.\n", - "- `info`: A dictionary that provides additional information (depends on the environment).\n", - "\n", - "For more explanations check this ๐Ÿ‘‰ https://gymnasium.farama.org/api/env/#gymnasium.Env.step\n", - "\n", - "If the episode is terminated:\n", - "- We reset the environment to its initial state with `observation = env.reset()`\n", - "\n", - "**Let's look at an example!** Make sure to read the code\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "w7vOFlpA_ONz" - }, - "outputs": [], - "source": [ - "import gymnasium as 
gym\n", - "\n", - "# First, we create our environment called LunarLander-v2\n", - "env = gym.make(\"LunarLander-v2\")\n", - "\n", - "# Then we reset this environment\n", - "observation, info = env.reset()\n", - "\n", - "for _ in range(20):\n", - " # Take a random action\n", - " action = env.action_space.sample()\n", - " print(\"Action taken:\", action)\n", - "\n", - " # Do this action in the environment and get\n", - " # next_state, reward, terminated, truncated and info\n", - " observation, reward, terminated, truncated, info = env.step(action)\n", - "\n", - " # If the game is terminated (in our case we land, crashed) or truncated (timeout)\n", - " if terminated or truncated:\n", - " # Reset the environment\n", - " print(\"Environment is reset\")\n", - " observation, info = env.reset()\n", - "\n", - "env.close()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XIrKGGSlENZB" - }, - "source": [ - "## Create the LunarLander environment ๐ŸŒ› and understand how it works\n", - "\n", - "### [The environment ๐ŸŽฎ](https://gymnasium.farama.org/environments/box2d/lunar_lander/)\n", - "\n", - "In this first tutorial, weโ€™re going to train our agent, a [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/), **to land correctly on the moon**. To do that, the agent needs to learn **to adapt its speed and position (horizontal, vertical, and angular) to land correctly.**\n", - "\n", - "---\n", - "\n", - "\n", - "๐Ÿ’ก A good habit when you start to use an environment is to check its documentation\n", - "\n", - "๐Ÿ‘‰ https://gymnasium.farama.org/environments/box2d/lunar_lander/\n", - "\n", - "---\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "poLBgRocF9aT" - }, - "source": [ - "Let's see what the Environment looks like:\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ZNPG0g_UGCfh" - }, - "outputs": [], - "source": [ - "# We create our environment with gym.make(\"\")\n", - "env = gym.make(\"LunarLander-v2\")\n", - "env.reset()\n", - "print(\"_____OBSERVATION SPACE_____ \\n\")\n", - "print(\"Observation Space Shape\", env.observation_space.shape)\n", - "print(\"Sample observation\", env.observation_space.sample()) # Get a random observation" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "2MXc15qFE0M9" - }, - "source": [ - "We see with `Observation Space Shape (8,)` that the observation is a vector of size 8, where each value contains different information about the lander:\n", - "- Horizontal pad coordinate (x)\n", - "- Vertical pad coordinate (y)\n", - "- Horizontal speed (x)\n", - "- Vertical speed (y)\n", - "- Angle\n", - "- Angular speed\n", - "- If the left leg contact point has touched the land (boolean)\n", - "- If the right leg contact point has touched the land (boolean)\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "We5WqOBGLoSm" - }, - "outputs": [], - "source": [ - "print(\"\\n _____ACTION SPACE_____ \\n\")\n", - "print(\"Action Space Shape\", env.action_space.n)\n", - "print(\"Action Space Sample\", env.action_space.sample()) # Take a random action" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "MyxXwkI2Magx" - }, - "source": [ - "The action space (the set of possible actions the agent can take) is discrete with 4 actions available ๐ŸŽฎ:\n", - "\n", - "- Action 0: Do nothing,\n", - "- Action 1: Fire left orientation engine,\n", - "- Action 2: Fire the main engine,\n", - "- Action 3: Fire right orientation 
engine.\n", - "\n", - "Reward function (the function that will give a reward at each timestep) ๐Ÿ’ฐ:\n", - "\n", - "After every step a reward is granted. The total reward of an episode is the **sum of the rewards for all the steps within that episode**.\n", - "\n", - "For each step, the reward:\n", - "\n", - "- Is increased/decreased the closer/further the lander is to the landing pad.\n", - "- Is increased/decreased the slower/faster the lander is moving.\n", - "- Is decreased the more the lander is tilted (angle not horizontal).\n", - "- Is increased by 10 points for each leg that is in contact with the ground.\n", - "- Is decreased by 0.03 points each frame a side engine is firing.\n", - "- Is decreased by 0.3 points each frame the main engine is firing.\n", - "\n", - "The episode receive an **additional reward of -100 or +100 points for crashing or landing safely respectively.**\n", - "\n", - "An episode is **considered a solution if it scores at least 200 points.**" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "dFD9RAFjG8aq" - }, - "source": [ - "#### Vectorized Environment\n", - "\n", - "- We create a vectorized environment (a method for stacking multiple independent environments into a single environment) of 16 environments, this way, **we'll have more diverse experiences during the training.**" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "99hqQ_etEy1N" - }, - "outputs": [], - "source": [ - "# Create the environment\n", - "env = make_vec_env('LunarLander-v2', n_envs=16)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "VgrE86r5E5IK" - }, - "source": [ - "## Create the Model ๐Ÿค–\n", - "- We have studied our environment and we understood the problem: **being able to land the Lunar Lander to the Landing Pad correctly by controlling left, right and main orientation engine**. Now let's build the algorithm we're going to use to solve this Problem ๐Ÿš€.\n", - "\n", - "- To do so, we're going to use our first Deep RL library, [Stable Baselines3 (SB3)](https://stable-baselines3.readthedocs.io/en/master/).\n", - "\n", - "- SB3 is a set of **reliable implementations of reinforcement learning algorithms in PyTorch**.\n", - "\n", - "---\n", - "\n", - "๐Ÿ’ก A good habit when using a new library is to dive first on the documentation: https://stable-baselines3.readthedocs.io/en/master/ and then try some tutorials.\n", - "\n", - "----" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HLlClRW37Q7e" - }, - "source": [ - "\"Stable" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HV4yiUM_9_Ka" - }, - "source": [ - "To solve this problem, we're going to use SB3 **PPO**. [PPO (aka Proximal Policy Optimization) is one of the SOTA (state of the art) Deep Reinforcement Learning algorithms that you'll study during this course](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#example%5D).\n", - "\n", - "PPO is a combination of:\n", - "- *Value-based reinforcement learning method*: learning an action-value function that will tell us the **most valuable action to take given a state and action**.\n", - "- *Policy-based reinforcement learning method*: learning a policy that will **give us a probability distribution over actions**." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "5qL_4HeIOrEJ" - }, - "source": [ - "Stable-Baselines3 is easy to set up:\n", - "\n", - "1๏ธโƒฃ You **create your environment** (in our case it was done above)\n", - "\n", - "2๏ธโƒฃ You define the **model you want to use and instantiate this model** `model = PPO(\"MlpPolicy\")`\n", - "\n", - "3๏ธโƒฃ You **train the agent** with `model.learn` and define the number of training timesteps\n", - "\n", - "```\n", - "# Create environment\n", - "env = gym.make('LunarLander-v2')\n", - "\n", - "# Instantiate the agent\n", - "model = PPO('MlpPolicy', env, verbose=1)\n", - "# Train the agent\n", - "model.learn(total_timesteps=int(2e5))\n", - "```\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "nxI6hT1GE4-A" - }, - "outputs": [], - "source": [ - "# TODO: Define a PPO MlpPolicy architecture\n", - "# We use MultiLayerPerceptron (MLPPolicy) because the input is a vector,\n", - "# if we had frames as input we would use CnnPolicy\n", - "model =" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QAN7B0_HCVZC" - }, - "source": [ - "#### Solution" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "543OHYDfcjK4" - }, - "outputs": [], - "source": [ - "# SOLUTION\n", - "# We added some parameters to accelerate the training\n", - "model = PPO(\n", - " policy = 'MlpPolicy',\n", - " env = env,\n", - " n_steps = 1024,\n", - " batch_size = 64,\n", - " n_epochs = 4,\n", - " gamma = 0.999,\n", - " gae_lambda = 0.98,\n", - " ent_coef = 0.01,\n", - " verbose=1)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ClJJk88yoBUi" - }, - "source": [ - "## Train the PPO agent ๐Ÿƒ\n", - "- Let's train our agent for 1,000,000 timesteps, don't forget to use GPU on Colab. 
It will take approximately ~20min, but you can use fewer timesteps if you just want to try it out.\n", - "- During the training, take a โ˜• break you deserved it ๐Ÿค—" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "qKnYkNiVp89p" - }, - "outputs": [], - "source": [ - "# TODO: Train it for 1,000,000 timesteps\n", - "\n", - "# TODO: Specify file name for model and save the model to file\n", - "model_name = \"ppo-LunarLander-v2\"\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1bQzQ-QcE3zo" - }, - "source": [ - "#### Solution" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "poBCy9u_csyR" - }, - "outputs": [], - "source": [ - "# SOLUTION\n", - "# Train it for 1,000,000 timesteps\n", - "model.learn(total_timesteps=1000000)\n", - "# Save the model\n", - "model_name = \"ppo-LunarLander-v2\"\n", - "model.save(model_name)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BY_HuedOoISR" - }, - "source": [ - "## Evaluate the agent ๐Ÿ“ˆ\n", - "- Remember to wrap the environment in a [Monitor](https://stable-baselines3.readthedocs.io/en/master/common/monitor.html).\n", - "- Now that our Lunar Lander agent is trained ๐Ÿš€, we need to **check its performance**.\n", - "- Stable-Baselines3 provides a method to do that: `evaluate_policy`.\n", - "- To fill that part you need to [check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#basic-usage-training-saving-loading)\n", - "- In the next step, we'll see **how to automatically evaluate and share your agent to compete in a leaderboard, but for now let's do it ourselves**\n", - "\n", - "\n", - "๐Ÿ’ก When you evaluate your agent, you should not use your training environment but create an evaluation environment." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "yRpno0glsADy" - }, - "outputs": [], - "source": [ - "# TODO: Evaluate the agent\n", - "# Create a new environment for evaluation\n", - "eval_env =\n", - "\n", - "# Evaluate the model with 10 evaluation episodes and deterministic=True\n", - "mean_reward, std_reward =\n", - "\n", - "# Print the results\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BqPKw3jt_pG5" - }, - "source": [ - "#### Solution" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "zpz8kHlt_a_m" - }, - "outputs": [], - "source": [ - "#@title\n", - "eval_env = Monitor(gym.make(\"LunarLander-v2\", render_mode='rgb_array'))\n", - "mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)\n", - "print(f\"mean_reward={mean_reward:.2f} +/- {std_reward}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "reBhoODwcXfr" - }, - "source": [ - "- In my case, I got a mean reward of `200.20 +/- 20.80` after training for 1 million steps, which means that our lunar lander agent is ready to land on the moon ๐ŸŒ›๐Ÿฅณ." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "IK_kR78NoNb2" - }, - "source": [ - "## Publish our trained model on the Hub ๐Ÿ”ฅ\n", - "Now that we saw we got good results after the training, we can publish our trained model on the hub ๐Ÿค— with one line of code.\n", - "\n", - "๐Ÿ“š The libraries documentation ๐Ÿ‘‰ https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20\n", - "\n", - "Here's an example of a Model Card (with Space Invaders):" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Gs-Ew7e1gXN3" - }, - "source": [ - "By using `package_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.\n", - "\n", - "This way:\n", - "- You can **showcase our work** ๐Ÿ”ฅ\n", - "- You can **visualize your agent playing** ๐Ÿ‘€\n", - "- You can **share with the community an agent that others can use** ๐Ÿ’พ\n", - "- You can **access a leaderboard ๐Ÿ† to see how well your agent is performing compared to your classmates** ๐Ÿ‘‰ https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "JquRrWytA6eo" - }, - "source": [ - "To be able to share your model with the community there are three more steps to follow:\n", - "\n", - "1๏ธโƒฃ (If it's not already done) create an account on Hugging Face โžก https://huggingface.co/join\n", - "\n", - "2๏ธโƒฃ Sign in and then, you need to store your authentication token from the Hugging Face website.\n", - "- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n", - "\n", - "\"Create\n", - "\n", - "- Copy the token\n", - "- Run the cell below and paste the token" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "GZiFBBlzxzxY" - }, - "outputs": [], - "source": [ - "notebook_login()\n", - "!git config --global credential.helper store" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_tsf2uv0g_4p" - }, - "source": [ - "If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FGNh9VsZok0i" - }, - "source": [ - "3๏ธโƒฃ We're now ready to push our trained agent to the ๐Ÿค— Hub ๐Ÿ”ฅ using `package_to_hub()` function" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Ay24l6bqFF18" - }, - "source": [ - "Let's fill the `package_to_hub` function:\n", - "- `model`: our trained model.\n", - "- `model_name`: the name of the trained model that we defined in `model_save`\n", - "- `model_architecture`: the model architecture we used, in our case PPO\n", - "- `env_id`: the name of the environment, in our case `LunarLander-v2`\n", - "- `eval_env`: the evaluation environment defined in eval_env\n", - "- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `(repo_id = {username}/{repo_name})`\n", - "\n", - "๐Ÿ’ก **A good name is {username}/{model_architecture}-{env_id}**\n", - "\n", - "- `commit_message`: message of the commit" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "JPG7ofdGIHN8" - }, - "outputs": [], - "source": [ - "import gymnasium as gym\n", - "from stable_baselines3.common.vec_env import DummyVecEnv\n", - "from stable_baselines3.common.env_util import make_vec_env\n", - "\n", - "from huggingface_sb3 import package_to_hub\n", - "\n", - "## TODO: Define a 
repo_id\n", - "## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v2\n", - "repo_id =\n", - "\n", - "# TODO: Define the name of the environment\n", - "env_id =\n", - "\n", - "# Create the evaluation env and set the render_mode=\"rgb_array\"\n", - "eval_env = DummyVecEnv([lambda: Monitor(gym.make(env_id, render_mode=\"rgb_array\"))])\n", - "\n", - "\n", - "# TODO: Define the model architecture we used\n", - "model_architecture = \"\"\n", - "\n", - "## TODO: Define the commit message\n", - "commit_message = \"\"\n", - "\n", - "# method save, evaluate, generate a model card and record a replay video of your agent before pushing the repo to the hub\n", - "package_to_hub(model=model, # Our trained model\n", - " model_name=model_name, # The name of our trained model\n", - " model_architecture=model_architecture, # The model architecture we used: in our case PPO\n", - " env_id=env_id, # Name of the environment\n", - " eval_env=eval_env, # Evaluation Environment\n", - " repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v2\n", - " commit_message=commit_message)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Avf6gufJBGMw" - }, - "source": [ - "#### Solution\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "I2E--IJu8JYq" - }, - "outputs": [], - "source": [ - "import gymnasium as gym\n", - "\n", - "from stable_baselines3 import PPO\n", - "from stable_baselines3.common.vec_env import DummyVecEnv\n", - "from stable_baselines3.common.env_util import make_vec_env\n", - "\n", - "from huggingface_sb3 import package_to_hub\n", - "\n", - "# PLACE the variables you've just defined two cells above\n", - "# Define the name of the environment\n", - "env_id = \"LunarLander-v2\"\n", - "\n", - "# TODO: Define the model architecture we used\n", - "model_architecture = \"PPO\"\n", - "\n", - "## Define a repo_id\n", - "## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v2\n", - "## CHANGE WITH YOUR REPO ID\n", - "repo_id = \"ThomasSimonini/ppo-LunarLander-v2\" # Change with your repo id, you can't push with mine ๐Ÿ˜„\n", - "\n", - "## Define the commit message\n", - "commit_message = \"Upload PPO LunarLander-v2 trained agent\"\n", - "\n", - "# Create the evaluation env and set the render_mode=\"rgb_array\"\n", - "eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode=\"rgb_array\")])\n", - "\n", - "# PLACE the package_to_hub function you've just filled here\n", - "package_to_hub(model=model, # Our trained model\n", - " model_name=model_name, # The name of our trained model\n", - " model_architecture=model_architecture, # The model architecture we used: in our case PPO\n", - " env_id=env_id, # Name of the environment\n", - " eval_env=eval_env, # Evaluation Environment\n", - " repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v2\n", - " commit_message=commit_message)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "T79AEAWEFIxz" - }, - "source": [ - "Congrats ๐Ÿฅณ you've just trained and uploaded your first Deep Reinforcement Learning agent. 
The script above should have displayed a link to a model repository such as https://huggingface.co/osanseviero/test_sb3. When you go to this link, you can:\n", - "* See a video preview of your agent at the right.\n", - "* Click \"Files and versions\" to see all the files in the repository.\n", - "* Click \"Use in stable-baselines3\" to get a code snippet that shows how to load the model.\n", - "* A model card (`README.md` file) which gives a description of the model\n", - "\n", - "Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent.\n", - "\n", - "Compare the results of your LunarLander-v2 with your classmates using the leaderboard ๐Ÿ† ๐Ÿ‘‰ https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9nWnuQHRfFRa" - }, - "source": [ - "## Load a saved LunarLander model from the Hub ๐Ÿค—\n", - "Thanks to [ironbar](https://github.com/ironbar) for the contribution.\n", - "\n", - "Loading a saved model from the Hub is really easy.\n", - "\n", - "You go to https://huggingface.co/models?library=stable-baselines3 to see the list of all the Stable-baselines3 saved models.\n", - "1. You select one and copy its repo_id\n", - "\n", - "\"Copy-id\"/" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "hNPLJF2bfiUw" - }, - "source": [ - "2. Then we just need to use load_from_hub with:\n", - "- The repo_id\n", - "- The filename: the saved model inside the repo and its extension (*.zip)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bhb9-NtsinKB" - }, - "source": [ - "Because the model I download from the Hub was trained with Gym (the former version of Gymnasium) we need to install shimmy a API conversion tool that will help us to run the environment correctly.\n", - "\n", - "Shimmy Documentation: https://github.com/Farama-Foundation/Shimmy" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "03WI-bkci1kH" - }, - "outputs": [], - "source": [ - "!pip install shimmy" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "oj8PSGHJfwz3" - }, - "outputs": [], - "source": [ - "from huggingface_sb3 import load_from_hub\n", - "repo_id = \"Classroom-workshop/assignment2-omar\" # The repo_id\n", - "filename = \"ppo-LunarLander-v2.zip\" # The model filename.zip\n", - "\n", - "# When the model was trained on Python 3.8 the pickle protocol is 5\n", - "# But Python 3.6, 3.7 use protocol 4\n", - "# In order to get compatibility we need to:\n", - "# 1. Install pickle5 (we done it at the beginning of the colab)\n", - "# 2. 
Create a custom empty object we pass as parameter to PPO.load()\n", - "custom_objects = {\n", - " \"learning_rate\": 0.0,\n", - " \"lr_schedule\": lambda _: 0.0,\n", - " \"clip_range\": lambda _: 0.0,\n", - "}\n", - "\n", - "checkpoint = load_from_hub(repo_id, filename)\n", - "model = PPO.load(checkpoint, custom_objects=custom_objects, print_system_info=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Fs0Y-qgPgLUf" - }, - "source": [ - "Let's evaluate this agent:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "PAEVwK-aahfx" - }, - "outputs": [], - "source": [ - "#@title\n", - "eval_env = Monitor(gym.make(\"LunarLander-v2\"))\n", - "mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)\n", - "print(f\"mean_reward={mean_reward:.2f} +/- {std_reward}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BQAwLnYFPk-s" - }, - "source": [ - "## Some additional challenges ๐Ÿ†\n", - "The best way to learn **is to try things by your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!\n", - "\n", - "In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n", - "\n", - "Here are some ideas to achieve so:\n", - "* Train more steps\n", - "* Try different hyperparameters for `PPO`. You can see them at https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#parameters.\n", - "* Check the [Stable-Baselines3 documentation](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) and try another model such as DQN.\n", - "* **Push your new trained model** on the Hub ๐Ÿ”ฅ\n", - "\n", - "**Compare the results of your LunarLander-v2 with your classmates** using the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) ๐Ÿ†\n", - "\n", - "Is moon landing too boring for you? Try to **change the environment**, why not use MountainCar-v0, CartPole-v1 or CarRacing-v0? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun ๐ŸŽ‰." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "9lM95-dvmif8" - }, - "source": [ - "________________________________________________________________________\n", - "Congrats on finishing this chapter! That was the biggest one, **and there was a lot of information.**\n", - "\n", - "If youโ€™re still feel confused with all these elements...it's totally normal! **This was the same for me and for all people who studied RL.**\n", - "\n", - "Take time to really **grasp the material before continuing and try the additional challenges**. 
Itโ€™s important to master these elements and have a solid foundations.\n", - "\n", - "Naturally, during the course, weโ€™re going to dive deeper into these concepts but **itโ€™s better to have a good understanding of them now before diving into the next chapters.**\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BjLhT70TEZIn" - }, - "source": [ - "Next time, in the bonus unit 1, you'll train Huggy the Dog to fetch the stick.\n", - "\n", - "\"Huggy\"/\n", - "\n", - "## Keep learning, stay awesome ๐Ÿค—" - ] - } - ], - "metadata": { - "accelerator": "GPU", - "colab": { - "collapsed_sections": [ - "QAN7B0_HCVZC", - "BqPKw3jt_pG5" - ], - "private_outputs": true, - "provenance": [] - }, - "gpuClass": "standard", - "kernelspec": { - "display_name": "Python 3.9.7", - "language": "python", - "name": "python3" - }, - "language_info": { - "name": "python", - "version": "3.9.7" - }, - "vscode": { - "interpreter": { - "hash": "ed7f8024e43d3b8f5ca3c5e1a8151ab4d136b3ecee1e3fd59e0766ccc55e1b10" - } - } - }, - "nbformat": 4, - "nbformat_minor": 0 + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "njb_ProuHiOe" + }, + "source": [ + "# Unit 1: Train your first Deep Reinforcement Learning Agent ๐Ÿค–\n", + "\n", + "![Cover](https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit1/thumbnail.jpg)\n", + "\n", + "In this notebook, you'll train your **first Deep Reinforcement Learning agent**: a Lunar Lander agent that will learn to **land correctly on the Moon ๐ŸŒ•**. Using [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/), a Deep Reinforcement Learning library, you'll train the agent, share it with the community, and experiment with different configurations.\n", + "\n", + "โฌ‡๏ธ Here is an example of what **you will achieve in just a couple of minutes.** โฌ‡๏ธ\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "x7oR6R-ZIbeS" + }, + "source": [ + "### The environment ๐ŸŽฎ\n", + "\n", + "- [LunarLander-v3](https://gymnasium.farama.org/environments/box2d/lunar_lander/)\n", + "\n", + "### The library used ๐Ÿ“š\n", + "\n", + "- [Stable-Baselines3](https://stable-baselines3.readthedocs.io/en/master/)" + ] + }, + { + "metadata": { + "id": "OwEcFHe9RRZW" + }, + "cell_type": "markdown", + "source": "We're constantly trying to improve our tutorials, so **if you find some issues in this notebook**, please [open an issue on the Github Repo](https://github.com/huggingface/deep-rl-class/issues)."
+ }, + { + "cell_type": "markdown", + "metadata": { + "id": "4i6tjI2tHQ8j" + }, + "source": [ + "## Objectives of this notebook ๐Ÿ†\n", + "\n", + "At the end of the notebook, you will:\n", + "\n", + "- Be able to use **Gymnasium**, the environment library.\n", + "- Be able to use **Stable-Baselines3**, the deep reinforcement learning library.\n", + "- Be able to **push your trained agent to the Hub** with a nice video replay and an evaluation score ๐Ÿ”ฅ.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ff-nyJdzJPND" + }, + "source": [ + "## This notebook is from Deep Reinforcement Learning Course\n", + "\n", + "\"Deep" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6p5HnEefISCB" + }, + "source": [ + "In this free course, you will:\n", + "\n", + "- ๐Ÿ“– Study Deep Reinforcement Learning in **theory and practice**.\n", + "- ๐Ÿง‘โ€๐Ÿ’ป Learn to **use famous Deep RL libraries** such as Stable Baselines3, RL Baselines3 Zoo, CleanRL and Sample Factory 2.0.\n", + "- ๐Ÿค– Train **agents in unique environments**\n", + "- ๐ŸŽ“ **Earn a certificate of completion** by completing 80% of the assignments.\n", + "\n", + "And more!\n", + "\n", + "Check ๐Ÿ“š the syllabus ๐Ÿ‘‰ https://simoninithomas.github.io/deep-rl-course\n", + "\n", + "Donโ€™t forget to **sign up to the course** (we are collecting your email to be able toย **send you the links when each Unit is published and give you information about the challenges and updates).**\n", + "\n", + "The best way to keep in touch and ask questions is **to join our discord server** to exchange with the community and with us ๐Ÿ‘‰๐Ÿป https://discord.gg/ydHrjt3WP5" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Y-mo_6rXIjRi" + }, + "source": [ + "## Prerequisites ๐Ÿ—๏ธ\n", + "\n", + "Before diving into the notebook, you need to:\n", + "\n", + "๐Ÿ”ฒ ๐Ÿ“ **[Read Unit 0](https://huggingface.co/deep-rl-course/unit0/introduction)** that gives you all the **information about the course and helps you to onboard** ๐Ÿค—\n", + "\n", + "๐Ÿ”ฒ ๐Ÿ“š **Develop an understanding of the foundations of Reinforcement learning** (RL process, Rewards hypothesis...) by [reading Unit 1](https://huggingface.co/deep-rl-course/unit1/introduction)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HoeqMnr5LuYE" + }, + "source": [ + "## A small recap of Deep Reinforcement Learning ๐Ÿ“š\n", + "\n", + "\"The" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xcQYx9ynaFMD" + }, + "source": [ + "Let's do a small recap on what we learned in the first Unit:\n", + "\n", + "- Reinforcement Learning is a **computational approach to learning from actions**. 
We build an agent that learns from the environment by **interacting with it through trial and error** and receiving rewards (negative or positive) as feedback.\n", + "\n", + "- The goal of any RL agent is to **maximize its expected cumulative reward** (also called expected return) because RL is based on the _reward hypothesis_, which is that all goals can be described as the maximization of an expected cumulative reward.\n", + "\n", + "- The RL process is a **loop that outputs a sequence of state, action, reward, and next state**.\n", + "\n", + "- To calculate the expected cumulative reward (expected return), **we discount the rewards**: the rewards that come sooner (at the beginning of the game) are more probable to happen since they are more predictable than the long-term future reward.\n", + "\n", + "- To solve an RL problem, you want to **find an optimal policy**; the policy is the \"brain\" of your AI that will tell us what action to take given a state. The optimal one is the one that gives you the actions that max the expected return.\n", + "\n", + "There are **two** ways to find your optimal policy:\n", + "\n", + "- By **training your policy directly**: policy-based methods.\n", + "- By **training a value function** that tells us the expected return the agent will get at each state and use this function to define our policy: value-based methods.\n", + "\n", + "- Finally, we spoke about Deep RL because **we introduce deep neural networks to estimate the action to take (policy-based) or to estimate the value of a state (value-based) hence the name \"deep.\"**" + ] + }, + { + "metadata": { + "id": "PF46MwbZD00b" + }, + "cell_type": "code", + "outputs": [], + "execution_count": null, + "source": [ + "%%html\n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qDploC3jSH99" + }, + "source": [ + "# Let's train our first Deep Reinforcement Learning agent and upload it to the Hub ๐Ÿš€\n", + "\n", + "## Get a certificate ๐ŸŽ“\n", + "\n", + "To validate this hands-on for the [certification process](https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process), you need to push your trained model to the Hub and **get a result of >= 200**.\n", + "\n", + "To find your result, go to the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) and find your model, **the result = mean_reward - std of reward**\n", + "\n", + "For more information about the certification process, check this section ๐Ÿ‘‰ https://huggingface.co/deep-rl-course/en/unit0/introduction#certification-process" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HqzznTzhNfAC" + }, + "source": [ + "## Set the GPU ๐Ÿ’ช\n", + "\n", + "- To **accelerate the agent's training, we'll use a GPU**. 
To do that, go to `Runtime > Change Runtime type`\n", + "\n", + "\"GPU" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "38HBd3t1SHJ8" + }, + "source": [ + "- `Hardware Accelerator > GPU`\n", + "\n", + "\"GPU" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jeDAH0h0EBiG" + }, + "source": [ + "## Install dependencies and create a virtual screen ๐Ÿ”ฝ\n", + "\n", + "The first step is to install the dependencies, weโ€™ll install multiple ones.\n", + "\n", + "- `gymnasium[box2d]`: Contains the LunarLander-v3 environment ๐ŸŒ›\n", + "- `stable-baselines3[extra]`: The deep reinforcement learning library.\n", + "- `huggingface_sb3`: Additional code for Stable-baselines3 to load and upload models from the Hugging Face ๐Ÿค— Hub.\n", + "\n", + "To make things easier, we created a script to install all these dependencies." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yQIGLPDkGhgG" + }, + "outputs": [], + "source": [ + "!apt install swig cmake" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9XaULfDZDvrC" + }, + "outputs": [], + "source": [ + "!pip install -r https://raw.githubusercontent.com/huggingface/deep-rl-class/main/notebooks/unit1/requirements-unit1.txt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BEKeXQJsQCYm" + }, + "source": [ + "During the notebook, we'll need to generate a replay video. To do so, with colab, **we need to have a virtual screen to be able to render the environment** (and thus record the frames).\n", + "\n", + "Hence the following cell will install virtual screen libraries and create and run a virtual screen ๐Ÿ–ฅ" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "j5f2cGkdP-mb" + }, + "outputs": [], + "source": [ + "!sudo apt-get update\n", + "!sudo apt-get install -y python3-opengl\n", + "!apt install ffmpeg\n", + "!apt install xvfb\n", + "!pip3 install pyvirtualdisplay" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TCwBTAwAW9JJ" + }, + "source": [ + "To make sure the new installed libraries are used, **sometimes it's required to restart the notebook runtime**. The next cell will force the **runtime to crash, so you'll need to connect again and run the code starting from here**. Thanks to this trick, **we will be able to run our virtual screen.**" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cYvkbef7XEMi" + }, + "outputs": [], + "source": [ + "import os\n", + "os.kill(os.getpid(), 9)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BE5JWP5rQIKf" + }, + "outputs": [], + "source": [ + "# Virtual display\n", + "from pyvirtualdisplay import Display\n", + "\n", + "virtual_display = Display(visible=0, size=(1400, 900))\n", + "virtual_display.start()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wrgpVFqyENVf" + }, + "source": [ + "## Import the packages ๐Ÿ“ฆ\n", + "\n", + "One additional library we import is huggingface_hub **to be able to upload and download trained models from the hub**.\n", + "\n", + "\n", + "The Hugging Face Hub ๐Ÿค— works as a central place where anyone can share and explore models and datasets. 
It has versioning, metrics, visualizations and other features that will allow you to easily collaborate with others.\n", + "\n", + "You can see all the Deep Reinforcement Learning models available here ๐Ÿ‘‰ https://huggingface.co/models?pipeline_tag=reinforcement-learning&sort=downloads\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cygWLPGsEQ0m" + }, + "outputs": [], + "source": [ + "import gymnasium\n", + "\n", + "from huggingface_sb3 import load_from_hub, package_to_hub\n", + "from huggingface_hub import notebook_login # To log to our Hugging Face account to be able to upload models to the Hub.\n", + "\n", + "from stable_baselines3 import PPO\n", + "from stable_baselines3.common.env_util import make_vec_env\n", + "from stable_baselines3.common.evaluation import evaluate_policy\n", + "from stable_baselines3.common.monitor import Monitor" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MRqRuRUl8CsB" + }, + "source": [ + "## Understand Gymnasium and how it works ๐Ÿค–\n", + "\n", + "๐Ÿ‹ The library containing our environment is called Gymnasium.\n", + "**You'll use Gymnasium a lot in Deep Reinforcement Learning.**\n", + "\n", + "Gymnasium is the **new version of Gym library** [maintained by the Farama Foundation](https://farama.org/).\n", + "\n", + "The Gymnasium library provides two things:\n", + "\n", + "- An interface that allows you to **create RL environments**.\n", + "- A **collection of environments** (gym-control, atari, box2D...).\n", + "\n", + "Let's look at an example, but first let's recall the RL loop.\n", + "\n", + "\"The" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-TzNN0bQ_j-3" + }, + "source": [ + "At each step:\n", + "- Our Agent receivesย a **state (S0)**ย from theย **Environment**ย โ€” we receive the first frame of our game (Environment).\n", + "- Based on thatย **state (S0),**ย the Agent takes anย **action (A0)**ย โ€” our Agent will move to the right.\n", + "- The environment transitions to aย **new**ย **state (S1)**ย โ€” new frame.\n", + "- The environment gives someย **reward (R1)**ย to the Agent โ€” weโ€™re not deadย *(Positive Reward +1)*.\n", + "\n", + "\n", + "With Gymnasium:\n", + "\n", + "1๏ธโƒฃ We create our environment using `gymnasium.make()`\n", + "\n", + "2๏ธโƒฃ We reset the environment to its initial state with `observation = env.reset()`\n", + "\n", + "At each step:\n", + "\n", + "3๏ธโƒฃ Get an action using our model (in our example we take a random action)\n", + "\n", + "4๏ธโƒฃ Using `env.step(action)`, we perform this action in the environment and get\n", + "- `observation`: The new state (st+1)\n", + "- `reward`: The reward we get after executing the action\n", + "- `terminated`: Indicates if the episode terminated (the agent reached the terminal state)\n", + "- `truncated`: Introduced with this new version, it indicates a time limit or, for instance, that the agent went out of bounds of the environment.\n", + "- `info`: A dictionary that provides additional information (depends on the environment).\n", + "\n", + "For more explanations check this ๐Ÿ‘‰ https://gymnasium.farama.org/api/env/#gymnasium.Env.step\n", + "\n", + "If the episode is terminated:\n", + "- We reset the environment to its initial state with `observation = env.reset()`\n", + "\n", + "**Let's look at an example!** Make sure to read the code\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "w7vOFlpA_ONz" + }, + "outputs": [], + "source": [ + "import gymnasium as 
gym\n", + "\n", + "# First, we create our environment called LunarLander-v3\n", + "env = gym.make(\"LunarLander-v3\")\n", + "\n", + "# Then we reset this environment\n", + "observation, info = env.reset()\n", + "\n", + "for _ in range(20):\n", + " # Take a random action\n", + " action = env.action_space.sample()\n", + " print(\"Action taken:\", action)\n", + "\n", + " # Do this action in the environment and get\n", + " # next_state, reward, terminated, truncated and info\n", + " observation, reward, terminated, truncated, info = env.step(action)\n", + "\n", + " # If the game is terminated (in our case we land, crashed) or truncated (timeout)\n", + " if terminated or truncated:\n", + " # Reset the environment\n", + " print(\"Environment is reset\")\n", + " observation, info = env.reset()\n", + "\n", + "env.close()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XIrKGGSlENZB" + }, + "source": [ + "## Create the LunarLander environment ๐ŸŒ› and understand how it works\n", + "\n", + "### [The environment ๐ŸŽฎ](https://gymnasium.farama.org/environments/box2d/lunar_lander/)\n", + "\n", + "In this first tutorial, weโ€™re going to train our agent, a [Lunar Lander](https://gymnasium.farama.org/environments/box2d/lunar_lander/), **to land correctly on the moon**. To do that, the agent needs to learn **to adapt its speed and position (horizontal, vertical, and angular) to land correctly.**\n", + "\n", + "---\n", + "\n", + "\n", + "๐Ÿ’ก A good habit when you start to use an environment is to check its documentation\n", + "\n", + "๐Ÿ‘‰ https://gymnasium.farama.org/environments/box2d/lunar_lander/\n", + "\n", + "---\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "poLBgRocF9aT" + }, + "source": [ + "Let's see what the Environment looks like:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZNPG0g_UGCfh" + }, + "outputs": [], + "source": [ + "# We create our environment with gym.make(\"\")\n", + "env = gym.make(\"LunarLander-v3\")\n", + "env.reset()\n", + "print(\"_____OBSERVATION SPACE_____ \\n\")\n", + "print(\"Observation Space Shape\", env.observation_space.shape)\n", + "print(\"Sample observation\", env.observation_space.sample()) # Get a random observation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2MXc15qFE0M9" + }, + "source": [ + "We see with `Observation Space Shape (8,)` that the observation is a vector of size 8, where each value contains different information about the lander:\n", + "- Horizontal pad coordinate (x)\n", + "- Vertical pad coordinate (y)\n", + "- Horizontal speed (x)\n", + "- Vertical speed (y)\n", + "- Angle\n", + "- Angular speed\n", + "- If the left leg contact point has touched the land (boolean)\n", + "- If the right leg contact point has touched the land (boolean)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "We5WqOBGLoSm" + }, + "outputs": [], + "source": [ + "print(\"\\n _____ACTION SPACE_____ \\n\")\n", + "print(\"Action Space Shape\", env.action_space.n)\n", + "print(\"Action Space Sample\", env.action_space.sample()) # Take a random action" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MyxXwkI2Magx" + }, + "source": [ + "The action space (the set of possible actions the agent can take) is discrete with 4 actions available ๐ŸŽฎ:\n", + "\n", + "- Action 0: Do nothing,\n", + "- Action 1: Fire left orientation engine,\n", + "- Action 2: Fire the main engine,\n", + "- Action 3: Fire right orientation 
+ "\n", + "Reward function (the function that will give a reward at each timestep) ๐Ÿ’ฐ:\n", + "\n", + "After every step a reward is granted. The total reward of an episode is the **sum of the rewards for all the steps within that episode**.\n", + "\n", + "For each step, the reward:\n", + "\n", + "- Is increased/decreased the closer/further the lander is to the landing pad.\n", + "- Is increased/decreased the slower/faster the lander is moving.\n", + "- Is decreased the more the lander is tilted (angle not horizontal).\n", + "- Is increased by 10 points for each leg that is in contact with the ground.\n", + "- Is decreased by 0.03 points each frame a side engine is firing.\n", + "- Is decreased by 0.3 points each frame the main engine is firing.\n", + "\n", + "The episode receives an **additional reward of -100 or +100 points for crashing or landing safely respectively.**\n", + "\n", + "An episode is **considered solved if it scores at least 200 points.**" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "dFD9RAFjG8aq" + }, + "source": [ + "#### Vectorized Environment\n", + "\n", + "- We create a vectorized environment (a method for stacking multiple independent environments into a single environment) of 16 environments. This way, **we'll have more diverse experiences during the training.**" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "99hqQ_etEy1N" + }, + "outputs": [], + "source": [ + "# Create the environment\n", + "env = make_vec_env('LunarLander-v3', n_envs=16)" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "VgrE86r5E5IK" + }, + "source": [ + "## Create the Model ๐Ÿค–\n", + "- We have studied our environment and we understood the problem: **being able to land the Lunar Lander on the Landing Pad correctly by controlling the left, right and main engines**. Now let's build the algorithm we're going to use to solve this problem ๐Ÿš€.\n", + "\n", + "- To do so, we're going to use our first Deep RL library, [Stable Baselines3 (SB3)](https://stable-baselines3.readthedocs.io/en/master/).\n", + "\n", + "- SB3 is a set of **reliable implementations of reinforcement learning algorithms in PyTorch**.\n", + "\n", + "---\n", + "\n", + "๐Ÿ’ก A good habit when using a new library is to dive into the documentation first: https://stable-baselines3.readthedocs.io/en/master/ and then try some tutorials.\n", + "\n", + "----" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "HLlClRW37Q7e" + }, + "source": [ + "\"Stable" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "HV4yiUM_9_Ka" + }, + "source": [ + "To solve this problem, we're going to use SB3 **PPO**. [PPO (aka Proximal Policy Optimization) is one of the SOTA (state of the art) Deep Reinforcement Learning algorithms that you'll study during this course](https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#example).\n", + "\n", + "PPO is a combination of:\n", + "- *Value-based reinforcement learning method*: learning a value function that will tell us **how valuable it is to be in a given state**.\n", + "- *Policy-based reinforcement learning method*: learning a policy that will **give us a probability distribution over actions**.\n",
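+ "\n", + "The snippet below is an optional sketch and not part of the original exercise: once the `model` variable is created in the cells that follow, printing its policy shows the two halves PPO combines in a single network, an actor head (`action_net`) that produces the action distribution and a critic head (`value_net`) that produces the state-value estimate.\n", + "\n", + "```python\n", + "# Optional sketch (assumes the PPO `model` created in the cells below already exists):\n", + "# printing the policy shows the shared MLP plus the actor and critic heads.\n", + "print(model.policy)\n", + "```"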
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5qL_4HeIOrEJ" + }, + "source": [ + "Stable-Baselines3 is easy to set up:\n", + "\n", + "1๏ธโƒฃ You **create your environment** (in our case it was done above)\n", + "\n", + "2๏ธโƒฃ You define the **model you want to use and instantiate this model** `model = PPO(\"MlpPolicy\")`\n", + "\n", + "3๏ธโƒฃ You **train the agent** with `model.learn` and define the number of training timesteps\n", + "\n", + "```\n", + "# Create environment\n", + "env = gym.make('LunarLander-v3')\n", + "\n", + "# Instantiate the agent\n", + "model = PPO('MlpPolicy', env, verbose=1)\n", + "# Train the agent\n", + "model.learn(total_timesteps=int(2e5))\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nxI6hT1GE4-A" + }, + "outputs": [], + "source": [ + "# TODO: Define a PPO MlpPolicy architecture\n", + "# We use MultiLayerPerceptron (MLPPolicy) because the input is a vector,\n", + "# if we had frames as input we would use CnnPolicy\n", + "model =" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QAN7B0_HCVZC" + }, + "source": [ + "#### Solution" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "543OHYDfcjK4" + }, + "outputs": [], + "source": [ + "# SOLUTION\n", + "# We added some parameters to accelerate the training\n", + "model = PPO(\n", + " policy = 'MlpPolicy',\n", + " env = env,\n", + " n_steps = 1024,\n", + " batch_size = 64,\n", + " n_epochs = 4,\n", + " gamma = 0.999,\n", + " gae_lambda = 0.98,\n", + " ent_coef = 0.01,\n", + " verbose=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ClJJk88yoBUi" + }, + "source": [ + "## Train the PPO agent ๐Ÿƒ\n", + "- Let's train our agent for 1,000,000 timesteps, don't forget to use GPU on Colab. 
It will take approximately ~20min, but you can use fewer timesteps if you just want to try it out.\n", + "- During the training, take a โ˜• break, you deserved it ๐Ÿค—" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qKnYkNiVp89p" + }, + "outputs": [], + "source": [ + "# TODO: Train it for 1,000,000 timesteps\n", + "\n", + "# TODO: Specify file name for model and save the model to file\n", + "model_name = \"ppo-LunarLander-v3\"\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "1bQzQ-QcE3zo" + }, + "source": [ + "#### Solution" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "poBCy9u_csyR" + }, + "outputs": [], + "source": [ + "# SOLUTION\n", + "# Train it for 1,000,000 timesteps\n", + "model.learn(total_timesteps=1000000)\n", + "# Save the model\n", + "model_name = \"ppo-LunarLander-v3\"\n", + "model.save(model_name)" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "BY_HuedOoISR" + }, + "source": [ + "## Evaluate the agent ๐Ÿ“ˆ\n", + "- Remember to wrap the environment in a [Monitor](https://stable-baselines3.readthedocs.io/en/master/common/monitor.html).\n", + "- Now that our Lunar Lander agent is trained ๐Ÿš€, we need to **check its performance**.\n", + "- Stable-Baselines3 provides a method to do that: `evaluate_policy`.\n", + "- To fill in that part you need to [check the documentation](https://stable-baselines3.readthedocs.io/en/master/guide/examples.html#basic-usage-training-saving-loading)\n", + "- In the next step, we'll see **how to automatically evaluate and share your agent to compete in a leaderboard, but for now let's do it ourselves**.\n", + "\n", + "\n", + "๐Ÿ’ก When you evaluate your agent, you should not use your training environment but create an evaluation environment." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yRpno0glsADy" + }, + "outputs": [], + "source": [ + "# TODO: Evaluate the agent\n", + "# Create a new environment for evaluation\n", + "eval_env =\n", + "\n", + "# Evaluate the model with 10 evaluation episodes and deterministic=True\n", + "mean_reward, std_reward =\n", + "\n", + "# Print the results\n", + "\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "BqPKw3jt_pG5" + }, + "source": [ + "#### Solution" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zpz8kHlt_a_m" + }, + "outputs": [], + "source": [ + "#@title\n", + "eval_env = Monitor(gym.make(\"LunarLander-v3\", render_mode='rgb_array'))\n", + "mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)\n", + "print(f\"mean_reward={mean_reward:.2f} +/- {std_reward}\")" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "reBhoODwcXfr" + }, + "source": [ + "- In my case, I got a mean reward of `200.20 +/- 20.80` after training for 1 million steps, which means that our lunar lander agent is ready to land on the moon ๐ŸŒ›๐Ÿฅณ.\n",
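+ "\n", + "If you want to double-check that result from the saved file rather than the in-memory model, here is a minimal optional sketch (it assumes you ran the training cells above, so the `ppo-LunarLander-v3.zip` file exists):\n", + "\n", + "```python\n", + "import gymnasium as gym\n", + "from stable_baselines3 import PPO\n", + "from stable_baselines3.common.evaluation import evaluate_policy\n", + "from stable_baselines3.common.monitor import Monitor\n", + "\n", + "# Reload the agent we saved above and evaluate it on a fresh environment\n", + "reloaded_model = PPO.load(\"ppo-LunarLander-v3\")\n", + "eval_env = Monitor(gym.make(\"LunarLander-v3\", render_mode=\"rgb_array\"))\n", + "mean_reward, std_reward = evaluate_policy(reloaded_model, eval_env, n_eval_episodes=10, deterministic=True)\n", + "print(f\"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}\")\n", + "```"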
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IK_kR78NoNb2" + }, + "source": [ + "## Publish our trained model on the Hub ๐Ÿ”ฅ\n", + "Now that we saw we got good results after the training, we can publish our trained model on the hub ๐Ÿค— with one line of code.\n", + "\n", + "๐Ÿ“š The libraries documentation ๐Ÿ‘‰ https://github.com/huggingface/huggingface_sb3/tree/main#hugging-face--x-stable-baselines3-v20\n", + "\n", + "Here's an example of a Model Card (with Space Invaders):" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gs-Ew7e1gXN3" + }, + "source": [ + "By using `package_to_hub` **you evaluate, record a replay, generate a model card of your agent and push it to the hub**.\n", + "\n", + "This way:\n", + "- You can **showcase our work** ๐Ÿ”ฅ\n", + "- You can **visualize your agent playing** ๐Ÿ‘€\n", + "- You can **share with the community an agent that others can use** ๐Ÿ’พ\n", + "- You can **access a leaderboard ๐Ÿ† to see how well your agent is performing compared to your classmates** ๐Ÿ‘‰ https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JquRrWytA6eo" + }, + "source": [ + "To be able to share your model with the community there are three more steps to follow:\n", + "\n", + "1๏ธโƒฃ (If it's not already done) create an account on Hugging Face โžก https://huggingface.co/join\n", + "\n", + "2๏ธโƒฃ Sign in and then, you need to store your authentication token from the Hugging Face website.\n", + "- Create a new token (https://huggingface.co/settings/tokens) **with write role**\n", + "\n", + "\"Create\n", + "\n", + "- Copy the token\n", + "- Run the cell below and paste the token" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "GZiFBBlzxzxY" + }, + "outputs": [], + "source": [ + "notebook_login()\n", + "!git config --global credential.helper store" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_tsf2uv0g_4p" + }, + "source": [ + "If you don't want to use a Google Colab or a Jupyter Notebook, you need to use this command instead: `huggingface-cli login`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FGNh9VsZok0i" + }, + "source": [ + "3๏ธโƒฃ We're now ready to push our trained agent to the ๐Ÿค— Hub ๐Ÿ”ฅ using `package_to_hub()` function" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ay24l6bqFF18" + }, + "source": [ + "Let's fill the `package_to_hub` function:\n", + "- `model`: our trained model.\n", + "- `model_name`: the name of the trained model that we defined in `model_save`\n", + "- `model_architecture`: the model architecture we used, in our case PPO\n", + "- `env_id`: the name of the environment, in our case `LunarLander-v3`\n", + "- `eval_env`: the evaluation environment defined in eval_env\n", + "- `repo_id`: the name of the Hugging Face Hub Repository that will be created/updated `(repo_id = {username}/{repo_name})`\n", + "\n", + "๐Ÿ’ก **A good name is {username}/{model_architecture}-{env_id}**\n", + "\n", + "- `commit_message`: message of the commit" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JPG7ofdGIHN8" + }, + "outputs": [], + "source": [ + "import gymnasium as gym\n", + "from stable_baselines3.common.vec_env import DummyVecEnv\n", + "from stable_baselines3.common.env_util import make_vec_env\n", + "\n", + "from huggingface_sb3 import package_to_hub\n", + "\n", + "## TODO: Define a 
repo_id\n", + "## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v3\n", + "repo_id =\n", + "\n", + "# TODO: Define the name of the environment\n", + "env_id =\n", + "\n", + "# Create the evaluation env and set the render_mode=\"rgb_array\"\n", + "eval_env = DummyVecEnv([lambda: Monitor(gym.make(env_id, render_mode=\"rgb_array\"))])\n", + "\n", + "\n", + "# TODO: Define the model architecture we used\n", + "model_architecture = \"\"\n", + "\n", + "## TODO: Define the commit message\n", + "commit_message = \"\"\n", + "\n", + "# method save, evaluate, generate a model card and record a replay video of your agent before pushing the repo to the hub\n", + "package_to_hub(model=model, # Our trained model\n", + " model_name=model_name, # The name of our trained model\n", + " model_architecture=model_architecture, # The model architecture we used: in our case PPO\n", + " env_id=env_id, # Name of the environment\n", + " eval_env=eval_env, # Evaluation Environment\n", + " repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v3\n", + " commit_message=commit_message)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Avf6gufJBGMw" + }, + "source": [ + "#### Solution\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "I2E--IJu8JYq" + }, + "outputs": [], + "source": [ + "import gymnasium as gym\n", + "\n", + "from stable_baselines3 import PPO\n", + "from stable_baselines3.common.vec_env import DummyVecEnv\n", + "from stable_baselines3.common.env_util import make_vec_env\n", + "\n", + "from huggingface_sb3 import package_to_hub\n", + "\n", + "# PLACE the variables you've just defined two cells above\n", + "# Define the name of the environment\n", + "env_id = \"LunarLander-v3\"\n", + "\n", + "# TODO: Define the model architecture we used\n", + "model_architecture = \"PPO\"\n", + "\n", + "## Define a repo_id\n", + "## repo_id is the id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v3\n", + "## CHANGE WITH YOUR REPO ID\n", + "repo_id = \"ThomasSimonini/ppo-LunarLander-v3\" # Change with your repo id, you can't push with mine ๐Ÿ˜„\n", + "\n", + "## Define the commit message\n", + "commit_message = \"Upload PPO LunarLander-v3 trained agent\"\n", + "\n", + "# Create the evaluation env and set the render_mode=\"rgb_array\"\n", + "eval_env = DummyVecEnv([lambda: gym.make(env_id, render_mode=\"rgb_array\")])\n", + "\n", + "# PLACE the package_to_hub function you've just filled here\n", + "package_to_hub(model=model, # Our trained model\n", + " model_name=model_name, # The name of our trained model\n", + " model_architecture=model_architecture, # The model architecture we used: in our case PPO\n", + " env_id=env_id, # Name of the environment\n", + " eval_env=eval_env, # Evaluation Environment\n", + " repo_id=repo_id, # id of the model repository from the Hugging Face Hub (repo_id = {organization}/{repo_name} for instance ThomasSimonini/ppo-LunarLander-v3\n", + " commit_message=commit_message)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T79AEAWEFIxz" + }, + "source": [ + "Congrats ๐Ÿฅณ you've just trained and uploaded your first Deep Reinforcement Learning agent. 
The script above should have displayed a link to a model repository such as https://huggingface.co/osanseviero/test_sb3. When you go to this link, you can:\n", + "* See a video preview of your agent at the right.\n", + "* Click \"Files and versions\" to see all the files in the repository.\n", + "* Click \"Use in stable-baselines3\" to get a code snippet that shows how to load the model.\n", + "* Read the model card (`README.md` file), which gives a description of the model.\n", + "\n", + "Under the hood, the Hub uses git-based repositories (don't worry if you don't know what git is), which means you can update the model with new versions as you experiment and improve your agent.\n", + "\n", + "Compare the results of your LunarLander-v3 with your classmates using the leaderboard ๐Ÿ† ๐Ÿ‘‰ https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "9nWnuQHRfFRa" + }, + "source": [ + "## Load a saved LunarLander model from the Hub ๐Ÿค—\n", + "Thanks to [ironbar](https://github.com/ironbar) for the contribution.\n", + "\n", + "Loading a saved model from the Hub is really easy.\n", + "\n", + "You go to https://huggingface.co/models?library=stable-baselines3 to see the list of all the Stable-baselines3 saved models.\n", + "1. You select one and copy its repo_id\n", + "\n", + "\"Copy-id\"/" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "hNPLJF2bfiUw" + }, + "source": [ + "2. Then we just need to use `load_from_hub` with:\n", + "- The repo_id\n", + "- The filename: the saved model inside the repo and its extension (*.zip)" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "bhb9-NtsinKB" + }, + "source": [ + "Because the model we download from the Hub was trained with Gym (the former version of Gymnasium), we need to install Shimmy, an API conversion tool that will help us run the environment correctly.\n", + "\n", + "Shimmy Documentation: https://github.com/Farama-Foundation/Shimmy" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "03WI-bkci1kH" + }, + "outputs": [], + "source": [ + "!pip install shimmy" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oj8PSGHJfwz3" + }, + "outputs": [], + "source": [ + "from huggingface_sb3 import load_from_hub\n", + "repo_id = \"Classroom-workshop/assignment2-omar\" # The repo_id\n", + "filename = \"ppo-LunarLander-v2.zip\" # The model filename.zip\n", + "\n", + "# When the model was trained on Python 3.8 the pickle protocol is 5\n", + "# But Python 3.6, 3.7 use protocol 4\n", + "# In order to get compatibility we need to:\n", + "# 1. Install pickle5 (we did it at the beginning of the colab)\n", + "# 2. Create a custom_objects dictionary that we pass as a parameter to PPO.load()\n", + "custom_objects = {\n", + " \"learning_rate\": 0.0,\n", + " \"lr_schedule\": lambda _: 0.0,\n", + " \"clip_range\": lambda _: 0.0,\n", + "}\n", + "\n", + "checkpoint = load_from_hub(repo_id, filename)\n", + "model = PPO.load(checkpoint, custom_objects=custom_objects, print_system_info=True)" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "Fs0Y-qgPgLUf" + }, + "source": [ + "Let's evaluate this agent:" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PAEVwK-aahfx" + }, + "outputs": [], + "source": [ + "#@title\n", + "eval_env = Monitor(gym.make(\"LunarLander-v2\"))\n", + "mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)\n", + "print(f\"mean_reward={mean_reward:.2f} +/- {std_reward}\")" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "BQAwLnYFPk-s" + }, + "source": [ + "## Some additional challenges ๐Ÿ†\n", + "The best way to learn **is to try things on your own**! As you saw, the current agent is not doing great. As a first suggestion, you can train for more steps. With 1,000,000 steps, we saw some great results!\n", + "\n", + "In the [Leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) you will find your agents. Can you get to the top?\n", + "\n", + "Here are some ideas to do so:\n", + "* Train for more steps\n", + "* Try different hyperparameters for `PPO`. You can see them at https://stable-baselines3.readthedocs.io/en/master/modules/ppo.html#parameters.\n", + "* Check the [Stable-Baselines3 documentation](https://stable-baselines3.readthedocs.io/en/master/modules/dqn.html) and try another model such as DQN.\n", + "* **Push your new trained model** on the Hub ๐Ÿ”ฅ\n", + "\n", + "**Compare the results of your LunarLander-v3 with your classmates** using the [leaderboard](https://huggingface.co/spaces/huggingface-projects/Deep-Reinforcement-Learning-Leaderboard) ๐Ÿ†\n", + "\n", + "Is moon landing too boring for you? Try to **change the environment**: why not use MountainCar-v0, CartPole-v1 or CarRacing-v0? Check how they work [using the gym documentation](https://www.gymlibrary.dev/) and have fun ๐ŸŽ‰." + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "9lM95-dvmif8" + }, + "source": [ + "________________________________________________________________________\n", + "Congrats on finishing this chapter! That was the biggest one, **and there was a lot of information.**\n", + "\n", + "If you still feel confused by all these elements... it's totally normal! **This was the same for me and for everyone who has studied RL.**\n", + "\n", + "Take time to really **grasp the material before continuing and try the additional challenges**. Itโ€™s important to master these elements and have a solid foundation.\n", + "\n", + "Naturally, during the course, weโ€™re going to dive deeper into these concepts but **itโ€™s better to have a good understanding of them now before diving into the next chapters.**\n", + "\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "id": "BjLhT70TEZIn" + }, + "source": [ + "Next time, in the bonus unit 1, you'll train Huggy the Dog to fetch the stick.\n", + "\n", + "\"Huggy\"/\n", + "\n", + "## Keep learning, stay awesome ๐Ÿค—" + ] + }
+ ], + "metadata": { + "accelerator": "GPU", + "colab": { + "collapsed_sections": [ + "QAN7B0_HCVZC", + "BqPKw3jt_pG5" + ], + "private_outputs": true, + "provenance": [] + }, + "gpuClass": "standard", + "kernelspec": { + "display_name": "Python 3.9.7", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.9.7" + }, + "vscode": { + "interpreter": { + "hash": "ed7f8024e43d3b8f5ca3c5e1a8151ab4d136b3ecee1e3fd59e0766ccc55e1b10" + } + } + }, + "nbformat": 4, + "nbformat_minor": 0 }