Commit c3baba0 — Robertboy18/MinAtar-Faster
(This repository has been archived by the owner on Sep 4, 2024. It is now read-only.)

final commit of all changes and new code as well as agents

Robertboy18 committed Aug 31, 2022
1 parent fb6e987 · commit c3baba0
Showing 53 changed files with 11,704 additions and 2 deletions.

README.md (190 additions, 2 deletions)
MinAtar is a testbed for AI agents which implements miniaturized versions of several Atari 2600 games.
<img src="img/space_invaders.gif" width="200" />
</p>

## Standard Quick Start
To use MinAtar, you need Python 3 installed and an up-to-date version of pip. To run the included `DQN` and `AC_lambda` examples, you also need `PyTorch`. To install MinAtar, follow the steps below:

1. Clone the repo:
```bash
git clone https://github.com/Robertboy18/MinAtar-Faster.git
```
If you prefer running MinAtar in a virtualenv, you can do the following before step 2:
```bash
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
```
2. Install MinAtar:
```bash
pip install .
pip install -r requirements.txt
```
If you have any issues with automatic dependency installation, you can instead install the necessary dependencies manually and run
```bash
pip install . --no-deps
```

Use the arrow keys to move and space bar to fire. Also, press q to quit and r to reset.

Also included in the examples directory are example implementations of DQN (dqn.py) and online actor-critic with eligibility traces (AC_lambda.py).

## Using the Optimized Code and Agents

To run your first experiment:
```
python3 main.py --agent-json config/agent/SAC.json --env-json config/environment/AcrobotContinuous-v1.json --index 0
```

## Usage
The file main.py trains an agent for a specified number of runs, based on an environment configuration file and an agent configuration file found in config/environment/ and config/agent/, respectively. The data is saved in the results directory, with a name derived from the environment and agent names.

For more information on how to use the main.py program, see the `--help` option:
```
Usage: main.py [OPTIONS]
Given agent and environment configuration files, run the experiment defined
by the configuration files
Options:
--env-json TEXT Path to the environment json configuration file
[required]
--agent-json TEXT Path to the agent json configuration file [required]
--index INTEGER The index of the hyperparameter to run
-m, --monitor Whether or not to render the scene as the agent trains.
-a, --after INTEGER How many timesteps (training) should pass before
rendering the scene
--save-dir TEXT Which directory to save the results file in
--help Show this message and exit.
```

Example:
```
./main.py --env-json config/environment/MountainCarContinuous-v0.json --agent-json config/agent/linearAC.json --index 0 --monitor --after 1000
```
will run the experiment using the linear-Gaussian actor-critic agent on the
mountain car environment. The experiment runs serially in a single process,
the scene is rendered after 1000 timesteps of training, and only the
hyperparameter setting with index 0 is run.

## Hyperparameter Settings
The hyperparameter settings are laid out in the agent configuration files.
The files are laid out such that each setting is a list of values, and the
total number of hyperparameter settings is the product of the lengths of each
of these lists. For example, if the agent config file looks like:
```
{
    "agent_name": "linearAC",
    "parameters":
    {
        "decay": [0.5],
        "critic_lr": [0.005, 0.1, 0.3],
        "actor_lr": [0.005, 0.1, 0.3],
        "avg_reward_lr": [0.1, 0.3, 0.5, 0.9],
        "scaled": [true],
        "clip_stddev": [1000]
    }
}
```
then, there are `1 x 3 x 3 x 4 x 1 x 1 = 36` different hyperparameter
settings. Each hyperparameter setting is given a specific index. For example,
hyperparameter setting index `1` would have the following hyperparameters:
```
{
    "agent_name": "linearAC",
    "parameters":
    {
        "decay": 0.5,
        "critic_lr": 0.005,
        "actor_lr": 0.005,
        "avg_reward_lr": 0.1,
        "scaled": true,
        "clip_stddev": 1000
    }
}
```
The hyperparameter setting indices are implemented `mod x`, where `x` is the
total number of hyperparameter settings (in the example above, `36`). So the
hyperparameter settings with indices `1, 37, 73, ...` all refer to the same
hyperparameter setting, since `1 = 37 = 73 = ... (mod 36)`. The difference is
that consecutive indices use different seeds: every run with hyperparameter
setting `1` uses the same seed, while hyperparameter setting `37` uses the
same hyperparameters as setting `1` but a different seed, and that seed is
likewise the same on every run with setting `37`. This is what Martha and her
students have done with their Actor-Expert implementation, and I find that it
works nicely for hyperparameter sweeps.
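
To make the indexing concrete, here is a small illustrative sketch of the index-to-(setting, seed) mapping described above. It is not the repository's actual code: the function name, the iteration order over parameter combinations, and the exact index offset are assumptions, so the setting produced for a given index may differ from what main.py selects.
```python
from itertools import product

# Illustrative sketch only -- the repository's own utilities may enumerate
# combinations in a different order or with a different index offset.
agent_config = {
    "agent_name": "linearAC",
    "parameters": {
        "decay": [0.5],
        "critic_lr": [0.005, 0.1, 0.3],
        "actor_lr": [0.005, 0.1, 0.3],
        "avg_reward_lr": [0.1, 0.3, 0.5, 0.9],
        "scaled": [True],
        "clip_stddev": [1000],
    },
}

def setting_from_index(config, index):
    names = list(config["parameters"])
    combos = list(product(*config["parameters"].values()))  # 1*3*3*4*1*1 = 36
    num_settings = len(combos)
    setting = dict(zip(names, combos[index % num_settings]))
    seed = index // num_settings  # indices 1, 37, 73, ... share a setting but get seeds 0, 1, 2, ...
    return setting, seed

setting_a, seed_a = setting_from_index(agent_config, 1)
setting_b, seed_b = setting_from_index(agent_config, 37)
assert setting_a == setting_b and seed_a != seed_b  # same setting, different seed
```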


## Saved Data
Each experiment saves all the data as a Python dictionary. The dictionary is
designed so that we store all information about the experiment, including all
agent hyperparameters and environment settings so that the experiment is
exactly reproducible.

If the data dictionary is called `data`, then the main data for the experiment
is stored in `data["experiment_data"]`, which is a dictionary mapping from
hyperparameter settings indices to agent parameters and experiment runs.
`data["experiment_data"][i]["agent_params"]` is a dictionary storing the
agent's hyperparameters (hyperparameter settings index `i`) for the experiment.
`data["experiment_data"][i]["runs"]` is a list storing the runs for the
`i-th` hyperparameter setting. Each element of the list is a dictionary, giving
all the information for that run and hyperparameter setting. For example,
`data["experiment_data"][i]["runs"][j]` will give all the information on
the `j-th` run of hyperparameter settings `i`.

Below is a tree diagram of the data structure:
```
data
├─── "experiment"
│    ├─── "environment": environment configuration file
│    └─── "agent": agent configuration file
└─── "experiment_data": dictionary of hyperparameter setting *index* to runs
     ├─── "agent_params": the hyperparameter settings
     └─── "runs": a list containing all the runs for this hyperparameter setting (each run is a dictionary of elements)
          └─── index i: information on the ith run
               ├─── "run_number": the run number
               ├─── "random_seed": the random seed used for the run
               ├─── "total_timesteps": the total number of timesteps in the run
               ├─── "eval_interval_timesteps": the interval of timesteps to pass before running offline evaluation
               ├─── "episodes_per_eval": the number of episodes run at each offline evaluation
               ├─── "eval_episode_rewards": list of the returns (np.array) from each evaluation episode; if there are 10 episodes per eval,
               │    then this will be a list of np.arrays where each np.array has 10 elements (one per eval episode)
               ├─── "eval_episode_steps": the number of timesteps per evaluation episode, with the same form as "eval_episode_rewards"
               ├─── "timesteps_at_eval": the number of training steps that passed at each evaluation. For example, if there were 10
               │    offline evaluations, then this will be a list of 10 integers, each stating how many training steps passed before each
               │    evaluation.
               ├─── "train_episode_rewards": the return seen for each training episode
               ├─── "train_episode_steps": the number of timesteps passed for each training episode
               ├─── "train_time": the total amount of training time in seconds
               ├─── "eval_time": the total amount of evaluation time in seconds
               └─── "total_train_episodes": the total number of training episodes for the run
```

For example, here is `data["experiment_data"][i]["runs"][j]` for a mock run
of the Linear-Gaussian Actor-Critic agent on MountainCarContinuous-v0:
```
{'random_seed': 0,
'total_timesteps': 1000,
'eval_interval_timesteps': 500,
'episodes_per_eval': 10,
'eval_episode_rewards': array([[-200., -200., -200., -200., -200., -200., -200., -200., -200., -200.],
                               [-200., -200., -200., -200., -200., -200., -200., -200., -200., -200.]]),
'eval_episode_steps': array([[200, 200, 200, 200, 200, 200, 200, 200, 200, 200],
                             [200, 200, 200, 200, 200, 200, 200, 200, 200, 200]]),
'timesteps_at_eval': array([ 0, 600]),
'train_episode_steps': array([200, 200, 200, 200, 200]),
'train_episode_rewards': array([-200., -200., -200., -200., -200.]),
'train_time': 0.12098526954650879,
'eval_time': 0.044415950775146484,
'total_train_episodes': 5,
...}
```
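
As a quick illustration of how this structure can be consumed, the sketch below loads a results file and prints the mean offline return at each evaluation for every run of one hyperparameter setting. The filename and the use of pickle are assumptions; adjust them to match how your results were actually saved.
```python
import pickle

import numpy as np

# Assumed filename and serialization format -- check the results/ directory
# for the actual name produced by your run.
with open("results/MountainCarContinuous-v0_linearAC_data.pkl", "rb") as f:
    data = pickle.load(f)

hyper_index = 0
for run in data["experiment_data"][hyper_index]["runs"]:
    # One row per offline evaluation, one column per evaluation episode.
    rewards = np.asarray(run["eval_episode_rewards"])
    print(run["random_seed"], run["timesteps_at_eval"], rewards.mean(axis=1))
```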

## Configuration Files
Each configuration file is a JSON file with a few properties. Templates for
these files are provided in each configuration directory.

### Environment Configuration File
```
{
    "env_name": "environment filename without .json, all files refer to this as env_name",
    "total_timesteps": "int - total timesteps for the entire run",
    "steps_per_episode": "int - max number of steps per episode",
    "eval_interval_timesteps": "int - interval of timesteps at which offline evaluation should be done",
    "eval_episodes": "int - the number of offline episodes per evaluation",
    "gamma": "float - the discount factor"
}
```

### Agent Configuration File
The agent configuration file is more general; the template is below. Since the
included agents already have configuration files, there is usually no need to
add new ones: it suffices to alter the existing configuration files. Note that
each agent has very different configurations and hyperparameters, so the
config files differ substantially from agent to agent.
```
{
    "agent_name": "filename without .json, all code refers to this as agent_name",
    "parameters":
    {
        "parameter name": "list of values"
    }
}
```

## OpenAI Gym Wrapper
MinAtar now includes an OpenAI Gym plugin using the Gym plugin system. If a sufficiently recent version of OpenAI Gym is installed (`pip install gym==0.21.0` is known to work), this plugin should be automatically available after installing MinAtar as normal, and a Gym environment can then be constructed with `gym.make`.
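
As an illustration of constructing a MinAtar Gym environment, the snippet below is a sketch only: the environment ID shown is an assumption, so check the names registered by your installed MinAtar version.
```python
import gym

# Assumed environment ID -- verify the names registered by the MinAtar plugin.
env = gym.make("MinAtar/Breakout-v0")

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())  # gym 0.21 step API
```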
agent/Random.py (new file, 83 additions)
#!/usr/bin/env python3

# Adapted from https://github.com/pranz24/pytorch-soft-actor-critic

# Import modules
import torch
import numpy as np
from agent.baseAgent import BaseAgent


class Random(BaseAgent):
    """
    Random implements a random policy.
    """
    def __init__(self, action_space, seed):
        super().__init__()
        self.batch = False

        self.action_dims = len(action_space.high)
        self.action_low = action_space.low
        self.action_high = action_space.high

        # Set the seed for all random number generators, this includes
        # everything used by PyTorch, including setting the initial weights
        # of networks. PyTorch prefers seeds with many non-zero binary units
        self.torch_rng = torch.manual_seed(seed)
        self.rng = np.random.default_rng(seed)

        self.policy = torch.distributions.Uniform(
            torch.Tensor(action_space.low), torch.Tensor(action_space.high))

    def sample_action(self, _):
        """
        Samples an action from the agent

        Parameters
        ----------
        _ : np.array
            The state feature vector

        Returns
        -------
        array_like of float
            The action to take
        """
        action = self.policy.sample()

        return action.detach().cpu().numpy()

    def sample_action_(self, _, size):
        """
        sample_action_ is like sample_action, except the rng for
        action selection in the environment is not affected by running
        this function.
        """
        return self.rng.uniform(self.action_low, self.action_high,
                                size=(size, self.action_dims))

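    # The random policy has no parameters to learn or serialize, so the
    # remaining agent-interface methods below are no-ops.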
    def update(self, _, _1, _2, _3, _4):
        pass

    def reset(self):
        """
        Resets the agent between episodes
        """
        pass

    def eval(self):
        pass

    def train(self):
        pass

    # Save model parameters
    def save_model(self, _, _1="", _2=None, _3=None):
        pass

    # Load model parameters
    def load_model(self, _, _1):
        pass

    def get_parameters(self):
        pass
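
Not part of the committed file, but as a rough usage sketch (assuming the repository root is on the Python path and a Gym version with the old `reset()`/`step()` API, such as gym 0.21):
```python
import gym

from agent.Random import Random

env = gym.make("MountainCarContinuous-v0")
agent = Random(env.action_space, seed=0)

state = env.reset()
action = agent.sample_action(state)           # single action from the torch Uniform policy
batch = agent.sample_action_(state, size=32)  # (32, action_dims) array from the numpy RNG
print(action, batch.shape)
```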