Source code for the paper "Beyond Optimism: Exploration With Partially Observable Rewards" (arXiv, NeurIPS).
To install and use our environments, run

```bash
pip install -r requirements.txt
cd src/gym_gridworlds
pip install -e .
```
Run `python` and then

```python
import gymnasium
import gym_gridworlds

env = gymnasium.make("Gym-Gridworlds/Penalty-3x3-v0", render_mode="human")
env.reset()
env.step(1)  # DOWN
env.step(4)  # STAY
env.render()
```
to render the Penalty-3x3-v0 (left figure), and
```python
import gymnasium
import gym_gridworlds

env = gymnasium.make("Gym-Gridworlds/Full-5x5-v0", render_mode="human")
env.reset()
env.step(1)  # DOWN
env.render()
```
to render the Full-5x5-v0 (right figure).
- Black tiles are empty,
- Black tiles with gray arrows are tiles where the agent can move only in one direction (other actions will fail),
- Red tiles give negative rewards,
- Green tiles give positive rewards (the brighter, the higher),
- Yellow tiles are quicksand, where all actions fail with 90% probability,
- The agent is the blue circle,
- The orange arrow denotes the agent's last action,
- The orange dot denotes that the agent did not try to move with its last action.
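To get a feel for the dynamics, here is a minimal random-rollout sketch (it assumes only the standard Gymnasium API shown above):

```python
import gymnasium
import gym_gridworlds

env = gymnasium.make("Gym-Gridworlds/Penalty-3x3-v0", render_mode="human")
env.reset(seed=42)
for _ in range(20):
    action = env.action_space.sample()  # sample a random action
    _, reward, terminated, truncated, _ = env.step(action)
    env.render()
    if terminated or truncated:  # start a new episode when the current one ends
        env.reset()
```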
It is also possible to add noise to the transition and the reward functions. For example, the following environment
```python
import gymnasium
import gym_gridworlds

env = gymnasium.make("Gym-Gridworlds/Full-5x5-v0", random_action_prob=0.1, reward_noise_std=0.05)
```
- Performs a random action with 10% probability (regardless of what the agent wants to do),
- Adds Gaussian noise with 0.05 standard deviation to the reward.
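As a quick sanity check, the sketch below (an illustration, not part of the repo) repeats the same action in the noisy environment and prints the perturbed rewards:

```python
import gymnasium
import gym_gridworlds

env = gymnasium.make(
    "Gym-Gridworlds/Full-5x5-v0",
    random_action_prob=0.1,  # 10% chance the intended action is replaced by a random one
    reward_noise_std=0.05,   # zero-mean Gaussian noise with std 0.05 added to rewards
)
env.reset(seed=0)
for _ in range(5):
    _, reward, terminated, truncated, _ = env.step(1)  # DOWN
    print(reward)
    if terminated or truncated:
        env.reset()
```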
We use Hydra to configure our experiments. Hyperparameters and other settings are defined in YAML files in the `configs/` folder.
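A Hydra entry point typically follows the pattern in the minimal sketch below (an illustration of the pattern, not the repo's `main.py`):

```python
import hydra
from omegaconf import DictConfig, OmegaConf

@hydra.main(config_path="configs", config_name="default", version_base=None)
def main(cfg: DictConfig) -> None:
    # Hydra composes default.yaml with any command-line overrides
    print(OmegaConf.to_yaml(cfg))

if __name__ == "__main__":
    main()
```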
Most of the configuration is self-explanatory. Some keys you may need to change are the following:
- WandB settings and Hydra log directories in `configs/default.yaml`,
- Folder `experiment.datadir` in `configs/default.yaml` (where `npy` data is saved),
- Folder `experiment.debugdir` in `configs/default.yaml` (where agent pics are saved),
- Learning rate, epsilon decay, Q-function initial values, and other agent parameters in `configs/agent/`,
- Training/testing setup in `configs/environment/`.
For example, run

```bash
python main.py experiment.debugdir=debug environment=penalty monitor=full
```

This will save pics to easily debug a run.
Everything will be saved in `debug/`, in subfolders depending on the Git commit and the environment name.
For example, you will find these two heatmaps, representing the state-action visitation count and the Q-function.
- States are on the y-axis,
- Actions are on the x-axis,
- Cells outlined in red mark the action with the highest value in each state.
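To draw this kind of heatmap yourself, a minimal matplotlib sketch could look like this (not the repo's plotting code; the shape of `Q` is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

Q = np.random.rand(9, 5)  # illustrative: 9 states x 5 actions
fig, ax = plt.subplots()
ax.imshow(Q)
for s in range(Q.shape[0]):
    a_star = Q[s].argmax()
    # outline the highest-valued action in each state (row) in red
    ax.add_patch(plt.Rectangle((a_star - 0.5, s - 0.5), 1, 1,
                               fill=False, edgecolor="red", linewidth=2))
ax.set_xlabel("action")
ax.set_ylabel("state")
plt.show()
```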
Run the following script to see the optimal Q-function:

```bash
python fqi.py experiment.debugdir=debug environment=penalty monitor=full
```
For a sweep over multiple jobs in parallel with Joblib, run

```bash
python main.py -m hydra/launcher=joblib hydra/sweeper=test
```
Custom sweeps are defined in `configs/hydra/sweeper/`.
You can further customize a sweep via the command line. For example,

```bash
python main.py -m hydra/launcher=joblib hydra/sweeper=test experiment.rng_seed="range(0, 10)" hydra.launcher.verbose=1000
```
Configs in `configs/hydra/sweeper/` hide the agent's training progress bar, so we suggest passing `hydra.launcher.verbose=1000` to show the progress of the sweep.
If you have access to a SLURM-based cluster, you can submit multiple jobs, each running a chunk of the sweep with Joblib. Refer to `submitit_jobs.py` for an example.
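The general pattern resembles the hedged sketch below (hypothetical SLURM settings and seed chunking; see `submitit_jobs.py` for the actual script):

```python
import subprocess
import submitit

executor = submitit.AutoExecutor(folder="slurm_logs")  # where submitit writes job logs
executor.update_parameters(timeout_min=120, slurm_partition="cpu")  # hypothetical partition

def run_chunk(first_seed: int, last_seed: int) -> None:
    # each SLURM job runs a chunk of the seed sweep locally with Joblib
    subprocess.run(
        [
            "python", "main.py", "-m",
            "hydra/launcher=joblib", "hydra/sweeper=test",
            f"experiment.rng_seed=range({first_seed}, {last_seed})",
        ],
        check=True,
    )

jobs = [executor.submit(run_chunk, i, i + 5) for i in range(0, 10, 5)]
```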
Experiments will save the expected discounted return of the ε-greedy (training) and greedy (testing) policies, along with other stats, in an `npz` file (the default directory is `data/`, followed by the hash of the Git commit).
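To inspect the saved results in Python, a minimal sketch like this works (the filename is hypothetical; the exact keys depend on the run):

```python
import numpy as np

data = np.load("data/GIT_HASH/run_0.npz")  # hypothetical path
print(data.files)  # names of the saved arrays (returns and other stats)
```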
To archive all the `npz` data into a single compressed file, run

```bash
find data -type f -name "*.npz" -print0 | tar -czvf data.tar.gz --null -T -
```
To plot expected return curves, use `plot_results.py`. This script takes two arguments:

- `-c` is the config file that defines where to save plots, axes limits, axes ticks, what algorithms to show, and so on. Default configs are located in `configs/plots/`.
- `-f` is the folder where data from the sweep is located.

A separate script, `plot_legend.py`, saves the legend in a separate pic.
For example, running

```bash
python plot_results.py -c configs/plots/test.py -f data_test/GIT_HASH/
python plot_legend.py
```

will generate many plots like these and save them in the same folder passed with `-f`.