Leaderboard

This page tracks the performance of user algorithms on various gym tasks. Previously, users could submit their scores directly to gym.openai.com/envs, but a simpler wiki was judged to be a more efficient way to do this.

This wiki page is community-driven: anyone can edit it and add to it. We encourage you to add your scores, along with links to your write-ups and to code that reproduces your results. We also encourage you to add new tasks that use the gym interface but are not part of the core gym library (such as roboschool).

Links to videos are optional, but encouraged. Videos can be hosted on YouTube, Instagram, Twitter, or any other public link. Write-ups should explain how to reproduce the result, and can take the form of a simple gist link, blog post, or GitHub repo.

We have begun copying the performance scores and write-up links over from the previous page. This is an ongoing effort, and we could use some help.

Environments

Classic control

CartPole-v0

A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track. The system is controlled by applying a force of +1 or -1 to the cart. The pendulum starts upright, and the goal is to prevent it from falling over. A reward of +1 is provided for every timestep that the pole remains upright. The episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center.

Environment Details
CartPole-v0 defines "solving" as getting average reward of 195.0 over 100 consecutive trials.
This environment corresponds to the version of the cart-pole problem described by Barto, Sutton, and Anderson [Barto83].

User	Episodes before solve	Write-up	Video
Zhiqing Xiao	0 (use closed-form preset policy)	writeup
Henry Jia	0 (use closed-form PID policy)	code/writeup
Keavnn	0	writeup
Shakti Kumar	0	writeup	Video
Mathias Åsberg 🔥	0	writeup	Video
iRyanBell	2	writeup
Adam	3 (36)	writeup
Daniel Sallander	4	writeup
Kapil Chauhan	4	writeup
Ritika Kapoor	7 (use genetic algorithm)	writeup
Ben Harris	12	writeup	video
Tiger37	14 (0)	writeup	video
Blake Richey	20	writeup
LukaszFuszara	22	writeup	video
MisterTea, econti	24	writeup
Roald Brønstad	24	writeup
yingzwang	32	writeup
sharvar	33	writeup
nuggfr	38	writeup
SurenderHarsha	40	writeup
Chrispresso	45	writeup
n1try	85	writeup
khev	96	writeup	video
ceteke	99	writeup
manikanta	100	writeup	video
BS Haney	100	Write-up	YouTube
Trevor McInroe	130	writeup
JamesUnicomb	145	writeup	video
Nihal T Rao	184	writeup	video
Harshit Singh Lodha	265	writeup	gif
XYTriste	286	writeup
mbalunovic	306	writeup
onimaru	355	writeup	video
Google Search "M Kunthe"	382	writeup

MountainCar-v0

A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum.

Environment details
MountainCar-v0 defines "solving" as getting average reward of -110.0 over 100 consecutive trials.
This problem was first described by Andrew Moore in his PhD thesis [Moore90].

User	Episodes before solve	Write-up	Video
Zhiqing Xiao	0 (use closed-form preset policy)	writeup
Leocus	10 (1150)	writeup
Keavnn	47	writeup
Zhiqing Xiao	75	writeup	video
Mohith Sakthivel	90	writeup
Tiger37	224	writeup	video
Anas Mohamed	341	Link	Link
Harshit Singh Lodha	643	writeup	gif
Colin M	944	writeup	gif
jing582	1119
DaveLeongSingapore	1967
Pechckin	30	writeup
Amit	1000-1200	writeup	video
Gleb I	100	writeup

MountainCarContinuous-v0

A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain on the right; however, the car's engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum. This is the continuous version of the task: here, the reward is greater if you spend less energy to reach the goal.

Environment details
MountainCarContinuous-v0 defines "solving" as getting average reward of 90.0 over 100 consecutive trials.
This problem was first described by Andrew Moore in his PhD thesis [Moore90].

User	Episodes before solve	Write-up	Video
Zhiqing Xiao	0 (use closed-form preset policy)	writeup
Ashioto	1	writeup
timurgepard	5 (Symphony🎹 ver 2.0)	writeup	video
Mathias Åsberg 🤖	9	writeup	Video
Keavnn	11	writeup
camigord	18	writeup
Tobias Steidle	32	writeup	video
lirnli	33	writeup
khev	130	writeup	video
Sanket Thakur	140	writeup	video
Pechckin	1	writeup
Nikhil Barhate	200 (HAC)	writeup	gif

Pendulum-v0

The inverted pendulum swingup problem is a classic problem in the control literature. In this version of the problem, the pendulum starts in a random position, and the goal is to swing it up so it stays upright.

Environment details
Pendulum-v0 is an unsolved environment, which means it does not have a specified reward threshold at which it's considered solved.
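For unsolved environments the tables report a best 100-episode performance. A minimal sketch of how such a figure could be computed from a list of per-episode rewards is shown below. The function name `best_window_performance` is hypothetical, and the page does not say what the ± value denotes; this sketch assumes it is the population standard deviation of the rewards inside the best window.

```python
import statistics
from typing import Sequence, Tuple


def best_window_performance(
    rewards: Sequence[float], window: int = 100
) -> Tuple[float, float]:
    """Return (mean, std) of the contiguous `window`-episode run with the highest mean.

    Assumption: "std" here is the population standard deviation; the
    leaderboard does not specify how its ± values were obtained.
    """
    if len(rewards) < window:
        raise ValueError("need at least `window` episodes")
    running = sum(rewards[:window])
    best_sum, best_start = running, 0
    # Slide the window one episode at a time, keeping the best sum seen so far.
    for start in range(1, len(rewards) - window + 1):
        running += rewards[start + window - 1] - rewards[start - 1]
        if running > best_sum:
            best_sum, best_start = running, start
    best = list(rewards[best_start:best_start + window])
    return statistics.fmean(best), statistics.pstdev(best)
```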

User	Best 100-episode performance	Write-up	Video
KanishkNavale	-106.9528	MultiAgent Policy
msinto93	-123.11 ± 6.86	D4PG
msinto93	-123.79 ± 6.90	DDPG
heerad	-134.48 ± 9.07	writeup
BS Haney	-135	Write-up	YouTube
ThyrixYang	-136.16 ± 11.97	writeup
MaelFrancesc	-146.4 (mean 900 ep)	writeup
lirnli	-152.24 ± 10.87	writeup

Acrobot-v1

The acrobot system includes two joints and two links, where the joint between the two links is actuated. Initially, the links are hanging downwards, and the goal is to swing the end of the lower link up to a given height.

Acrobot-v1 is an unsolved environment, which means it does not have a specified reward threshold at which it's considered solved.
Control of Acrobot around equilibrium was described by J. Hauser and R. Murray in ACC 1990. Swing-up control of Acrobot is in M. W. Spong, IEEE Control Systems Magazine, 1995.
Learning control on Acrobot was first described by Sutton [Sutton96]. We are using the version from RLPy [Geramiford15], which uses Runge-Kutta integration for better accuracy.

User	Best 100-episode performance	Write-up	Video
mallochio	-42.37 ± 4.83		taken down
marunowskia	-59.31 ± 1.23
MontrealAI	-60.82 ± 0.06
BS Haney	-61.8	Write-up	YouTube
Felix Nica	-63.13 ± 2.65	Write-up	YouTube
Nick Kaparinos	-64.30 ± 4.10	Write-up	gif
Daniel Barbosa	-67.18	writeup
Mahmood Khordoo	-68.63	writeup	gif
lirnli	-72.09 ± 1.15
Tiger37	-74.49 ± 10.87	writeup
tsdaemon	-77.87 ± 1.54
a7b23	-80.68 ± 1.18
Tzoof Avny Brosh	-80.73	writeup
DaveLeongSingapore	-84.02 ± 1.46
Sanket Thakur	-89.29	writeup	video
loicmarie	-99.18 ± 2.60
simonoso	-113.66 ± 5.15
alebac	-427.26 ± 15.02
mehdimerai	-500.00 ± 0.00

Box2D

LunarLander-v2

The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to the landing pad at zero speed is about 100 to 140 points. If the lander moves away from the landing pad, it loses that reward again. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg ground contact is worth +10. Firing the main engine costs 0.3 points per frame. Solved is 200 points. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions are available: do nothing, fire left orientation engine, fire main engine, fire right orientation engine.

LunarLander-v2 defines "solving" as getting average reward of 200 over 100 consecutive trials.
by @olegklimov

User	Episodes before solve	Write-up	Video
Keavnn	16	writeup
liu	29 (Average:100)	Write-up
Ash Bellett	101	Write-up	Video
Mathias Åsberg 🔥	133	writeup	Video
Aman Arora	141	Write-up	[Under progress]
A. Myachin and R. Potemin	231	Write-up	GIF
Daniel T. Plop	295	Write-up	GIF
Nick Kaparinos	420	Write-up	gif
Sanket Thakur	454	Write-up	Video
Mahmood Khordoo	602	Write-up	gif
Christoph Powazny	658	writeup	gif
Daniel Barbosa	674	writeup	gif
Xinli Yu	805	writeup	gif
Ruslan Miftakhov	814	writeup	gif
Ollie Graham	987	writeup	gif
Leocus	1000 (21000)	writeup
Nikhil Barhate	1500	writeup	gif
Udacity DRLND Team	1504	writeup	gif
Sigve Rokenes	1590	writeup	gif
JamesUnicomb	2100	writeup	video
ksankar	2148	Working on it
koltafrickenfer	499474	writeup	youtube

LunarLanderContinuous-v2

The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to the landing pad at zero speed is about 100 to 140 points. If the lander moves away from the landing pad, it loses that reward again. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg ground contact is worth +10. Firing the main engine costs 0.3 points per frame. Solved is 200 points. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. The action is a vector of two real values from -1 to +1. The first value controls the main engine: -1..0 is off, and 0..+1 throttles from 50% to 100% power (the engine can't work with less than 50% power). The second value controls the side engines: -1.0..-0.5 fires the left engine, +0.5..+1.0 fires the right engine, and -0.5..0.5 is off.
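The action mapping described above can be sketched as a small decoding function. This is an illustration of the text only, not the environment's own code. The name `decode_action` is hypothetical, and the handling of the exact boundary values (0 for the main engine, ±0.5 for the side engines) is an assumption, since the description leaves it open.

```python
from typing import Optional, Tuple


def decode_action(main: float, lateral: float) -> Tuple[float, Optional[str]]:
    """Map a two-value action to (main engine power fraction, side engine fired).

    Assumptions: values are clipped to [-1, 1]; exactly 0 counts as main
    engine off; exactly +/-0.5 counts as side engines off.
    """
    main = max(-1.0, min(1.0, main))
    lateral = max(-1.0, min(1.0, lateral))

    # Main engine: off at or below 0, otherwise 50%..100% power, scaled linearly.
    power = 0.0 if main <= 0.0 else 0.5 + 0.5 * main

    # Side engines: left below -0.5, right above +0.5, otherwise off.
    if lateral < -0.5:
        side: Optional[str] = "left"
    elif lateral > 0.5:
        side = "right"
    else:
        side = None
    return power, side
```

For example, `decode_action(0.5, 0.9)` gives 75% main engine power with the right engine firing, while any action with a non-positive first value leaves the main engine off.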

LunarLanderContinuous-v2 defines "solving" as getting average reward of 200 over 100 consecutive trials.

User	Episodes before solve	Write-up	Video
Keavnn	30	writeup
liu	57 (Average:100)	Write-up
timurgepard	90 (Symphony🎹 ver 2.1)	writeup	video
BS Haney	100	Write-up	YouTube
timurgepard	140 (Symphony🎹 ver 2.0)	writeup	video
Mathias Åsberg 🔥	178	writeup	Video
Nick Kaparinos	300	Write-up	gif
shnippi	422	writeup
Nandino Cakar	474	writeup
Felix Nica	556	Write-up	YouTube
Nikhil Barhate	1500	Write-up	GIF
Jootten	2472	Write-up	YouTube
Tom	5000	Write-up	YouTube
Sigve Rokenes	5300	Write-up	GIF

BipedalWalker-v2 and BipedalWalker-v3

Reward is given for moving forward, totaling 300+ points up to the far end. If the robot falls, it gets -100. Applying motor torque costs a small number of points, so a more optimal agent will get a better score. The state consists of hull angle speed, angular velocity, horizontal speed, vertical speed, position of joints and joints angular speed, legs contact with ground, and 10 lidar rangefinder measurements. There are no coordinates in the state vector.

Environment Details
BipedalWalker-v2 defines "solving" as getting average reward of 300 over 100 consecutive trials.
by @olegklimov

User	Version	Episodes before solve	Write-up	Video
timurgepard	3.0	27 (Symphony🎹 ver 3.0, no ep step limit)	writeup	video
timurgepard	3.0	40 (Symphony🎹 ver 2.0, no ep step limit)	writeup	video
Benjamin & Thor	3.0	57 (TRPO with OU action noise)	writeup
timurgepard	3.0	100 (Monte-Carlo🌊 & Temporal Difference🔥)	writeup
Lauren	2.0	110	writeup	Video
Mathias Åsberg 😎	2.0	164	writeup	Video
liu	2.0	200 (AverageEpRet:338)	writeup
Nandino Cakar	3.0	474	writeup
Yoggi Voltbro	3.0	696	write-up	video
Nikhil Barhate	2.0	800	writeup	gif
Nick Kaparinos	3.0	800	Write-up	gif
Vinit & Abhimanyu	2.0	910	writeup	Video
shnippi	3.0	925	writeup
M	2.0	960	writeup	Video
mayurmadnani	2.0	1000	Write-up	Youtube
Rafael1s	2.0	1795	Write-up	Youtube
chitianqilin	2.0	47956	writeup	Youtube
ZhiqingXiao	3.0	0 (use closed-form preset policy)	writeup
koltafrickenfer	2.0	N/A	writeup	youtube
alirezamika	2.0	N/A	writeup
404akhan	2.0	N/A	writeup
Udacity DRLND Team	2.0	N/A	writeup	gif

BipedalWalkerHardcore-v2 and BipedalWalkerHardcore-v3

Hardcore version with ladders, stumps, and pitfalls. The time limit is increased due to the obstacles. Reward is given for moving forward, totaling 300+ points up to the far end. If the robot falls, it gets -100. Applying motor torque costs a small number of points, so a more optimal agent will get a better score. The state consists of hull angle speed, angular velocity, horizontal speed, vertical speed, position of joints and joints angular speed, legs contact with ground, and 10 lidar rangefinder measurements. There are no coordinates in the state vector.

BipedalWalkerHardcore-v2 defines "solving" as getting average reward of 300 over 100 consecutive trials.

User	Version	Episodes before solve	100-Episode Average Score	Write-up	Video
honghaow	3.0	3593	312.10	write-up	video
Yoggi Voltbro	3.0	7280	302.92 ± 10.82	write-up	video
Nick Kaparinos	3.0	15500	305.40 ± 21.35	Write-up	gif
liu	2.0	N/A	319 (average of 10000 trials)	writeup
DollarAkshay	2.0	N/A	N/A	writeup
ryogrid	2.0	N/A	N/A	writeup
dgriff777	2.0	N/A	300	writeup	video
lerrytang and hardmaru	2.0	N/A	300	writeup	video
hardmaru	2.0	N/A	313 ± 53	writeup	video
Alister Maguire	3.0	N/A	313	Write-up	gif

CarRacing-v0

The easiest continuous control task to learn from pixels: a top-down racing environment. Discrete control is reasonable in this environment as well; on/off discretisation is fine. The state consists of 96x96 pixels. The reward is -0.1 every frame and +1000/N for every track tile visited, where N is the total number of tiles in the track. For example, if you have finished in 732 frames, your reward is 1000 - 0.1*732 = 926.8 points. The episode finishes when all tiles are visited. Some indicators are shown at the bottom of the window and in the state RGB buffer. From left to right: true speed, four ABS sensors, steering wheel position, gyroscope.
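The reward arithmetic in the description can be written out as a short function. This is a sketch of the stated formula only; the function name `car_racing_return` and the track size of 300 tiles used in the example are hypothetical.

```python
def car_racing_return(frames: int, tiles_visited: int, total_tiles: int) -> float:
    """Episode return under the stated rule: -0.1 per frame, +1000/N per visited tile."""
    if total_tiles <= 0 or not 0 <= tiles_visited <= total_tiles:
        raise ValueError("tiles_visited must lie between 0 and total_tiles")
    return 1000.0 * tiles_visited / total_tiles - 0.1 * frames


# Visiting every tile of a (hypothetical) 300-tile track in 732 frames
# reproduces the worked example from the description.
print(round(car_racing_return(732, 300, 300), 1))  # 926.8
```

Because the tile bonus always sums to 1000 when the whole track is covered, the 900-point threshold amounts to finishing a full lap in fewer than 1000 frames.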

by @olegklimov
CarRacing-v0 defines "solving" as getting average reward of 900 over 100 consecutive trials.

User	Episodes before solve	100-Episode Average Score	Write-up	Video
irvpet	N/A	913 ± 26	writeup	video
lmclupr	N/A	N/A	writeup
IPAM-AMD	900	907 ± 24	writeup	Video
hardmaru	N/A	906 ± 21	writeup	Videos
Rafael1s	2760	901 (*)	writeup	Video
sebastianrisi	N/A	903 ± 72	writeup	video
ctallec	N/A	870 ± 120	writeup	video
agaier and hardmaru	N/A	893 ± 74	writeup	video
jperod	N/A	905 ± 24	writeup	Video
JinayJain	N/A	909 ± 10	writeup	video

(*) They used reward shaping (added some score back when the agent dies) during training to make training work better, but unfortunately kept the artificial shaped score for evaluation. When testing their agent using their model (and also trying to train it from scratch, which performed worse), we got a score of 820. We have filed an issue. We found a similar problem with another PPO repo here.

CarRacing-v1

v1: Changed track completion logic and added domain randomization (0.24.0)

User	Episodes before solve	100-Episode Average Score	Write-up	Video
Ray Coden Mercurius	925	917	writeup	video

MuJoCo

Inverted Pendulum

This environment involves a cart that can be moved linearly, with a pole fixed on it at one end and having another end free. The cart can be pushed left or right, and the goal is to balance the pole on the top of the cart by applying forces on the cart.

User	Episode	100-Episode Average Score	Write-up	Video
timurgepard	56	1000.0 (Symphony🎹 ver 2.0)	writeup

Walker2d-v1 and Walker2d-v2

Make a two-dimensional bipedal robot walk forward as fast as possible.

Walker2d-v1 is an unsolved environment, which means it does not have a specified reward threshold at which it's considered solved.
The robot model is based on work by Erez, Tassa, and Todorov [Erez11].

User	Episode	100-Episode Average Score	Write-up	Video
timurgepard	500	7920.0 (ep steps 2000) (Symphony🎹 ver 2.0)	writeup	video
timurgepard	450	7670.0 (ep steps 2000) (Symphony🎹 ver 2.0)	writeup	video
zlw21gxy	N/A	7197.15	writeup
pat-coady	N/A	7167.24	link	video
joschu	N/A	5594.75	link	video
Nick Kaparinos	N/A	5317.38 ± 15.86	Write-up	gif
songrotek	N/A	1222.12	link	video
BS Haney	N/A	1190	Write-up	YouTube

Ant-v1

Make a four-legged creature walk forward as fast as possible.

Ant-v1 defines "solving" as getting average reward of 6000.0 over 100 consecutive trials.
This task originally appeared in [Schulman15].

User	Episode	100-Episode Average Score	Write-up	Video
timurgepard	700	10700.0 (ep steps 2000) (Symphony🎹 ver 2.0)	writeup	video
zlw21gxy	1000	N/A	writeup
pat-coady	69154	N/A	writeup
joschu	N/A	N/A	writeup

HalfCheetah-v4

Make a 2-dimensional robot walk forward as fast as possible.

The HalfCheetah is a 2-dimensional robot consisting of 9 body parts and 8 joints connecting them (including two paws).
The goal is to apply a torque on the joints to make the cheetah run forward (right) as fast as possible.
This environment is based on the work by P. Wawrzyński.

User	Episodes before solve	Write-up	Video
timurgepard	25 (Symphony🎹 ver 2.0)	writeup	video
tareknaser	N/A	writeup	video

Humanoid-v4

Make a 3D humanoid robot walk forward as fast as possible.

Humanoid-v4 is an unsolved environment, which means it does not have a specified reward threshold at which it's considered solved.
The 3D bipedal robot is designed to simulate a human. Humanoid-v4 defines "solving" as acquiring human-like motions.
The robot model is based on work by Tassa, Erez, and Todorov [Tassa12].

User	Episodes before solve	100-Episode Average Score	Write-up	Video
timurgepard	1500	12,600.0 (ep steps 2000) (Symphony🎹 ver 2.0)	writeup	video

HumanoidStandup-v4

Make the humanoid stand up and then keep it standing by applying torques on the various hinges.

The environment starts with the humanoid lying on the ground, and the goal is to make the humanoid stand up and then keep it standing by applying torques on the various hinges.
The 3D bipedal robot is designed to simulate a human. It has a torso (abdomen) with a pair of legs and arms. The legs each consist of two links, and so do the arms (representing the knees and elbows respectively).
This environment is based on the environment introduced by Tassa, Erez and Todorov in “Synthesis and stabilization of complex behaviors through online trajectory optimization”.

User	Episodes before solve	100-Episode Average Score	Write-up	Video
timurgepard	3200 (step 960k, ep steps 300)	~320000.0 (Symphony🎹 ver 2.0)	writeup	video

Pusher-v4

“Pusher” is a multi-jointed robot arm which is very similar to that of a human. The goal is to move a target cylinder (called object) to a goal position using the robot’s end effector (called fingertip). The robot consists of shoulder, elbow, forearm, and wrist joints.

User	Episodes before solve	100-Episode Average Score	Write-up	Video
timurgepard	350	-45.0 (Symphony🎹 ver 2.0)	writeup	video

Swimmer-v4

The swimmer consists of three segments ('links') and two articulation joints ('rotors'), with each rotor joint connecting exactly two links to form a linear chain. The swimmer is suspended in a two-dimensional pool, and the goal is to move as fast as possible towards the right by applying torque on the rotors and using the fluid's friction.

User	Episodes before solve	100-Episode Average Score	Write-up	Video
timurgepard	55	205.0 (Symphony🎹 ver 2.0)	writeup

PyGame Learning Environment

FlappyBird-v0

This environment adapts a game from the PyGame Learning Environment (PLE). To run it, you will need to install gym-ple from https://github.com/lusob/gym-ple.

Flappybird is a side-scrolling game where the agent must successfully navigate through gaps between pipes. The up arrow causes the bird to accelerate upwards. If the bird makes contact with the ground or pipes, or goes above the top of the screen, the game is over. For each pipe it passes through it gains a positive reward of +1. Each time a terminal state is reached it receives a negative reward of -1.

FlappyBird-v0 is an unsolved environment, which means it does not have a specified reward threshold at which it's considered solved.
by @lusob

User	Best 100-episode performance	Write-up	Video
dguoy	264.0 ± 0.0	writeup	video
andreimuntean	261.12 ± 2.61	writeup
Kunal Arora	90.83	writeup
chuchro3	62.26 ± 7.81	writeup
warmar	11.28 ± 14.25	writeup	video1 video2

Snake-v0

Snake is a game where the agent must maneuver a line which grows in length each time food is touched by the head of the segment. The line follows the previous paths taken which eventually become obstacles for the agent to avoid.

The food is randomly spawned inside of the valid window while checking it does not make contact with the snake body.

User	Best 100-episode performance	Write-up	Video
carsonprindle	.44 ± .04	writeup

Atari Games

Atlantis-v0

User	Best 100-episode performance	Write-up
msemple1111	62,500 ± 0	writeup

Breakout-v0

User	Best 100-episode performance	Write-up
ppwwyyxx	760.07 ± 18.37	writeup

Pong-v5

User	Best 100-episode performance	Write-up
Nick Kaparinos	21.00 ± 0.00	Write-up
ppwwyyxx	20.81 ± 0.04	writeup

MsPacman-v0

User	Best 100-episode performance	Write-up
ppwwyyxx	5738.30 ± 171.99	writeup

SpaceInvaders-v0

User	Best 100-episode performance	Write-up
ppwwyyxx	3454.00 ± 0	writeup

Seaquest-v0

User	Best 100-episode performance	Write-up
ppwwyyxx	50209 ± 2440.07	writeup

Toy text

Simple text environments to get you started.

Taxi-v2

This task was introduced in [Dietterich2000] to illustrate some issues in hierarchical reinforcement learning. There are 4 locations (labeled by different letters) and your job is to pick up the passenger at one location and drop him off in another. You receive +20 points for a successful dropoff, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.

[Dietterich2000] T. G. Dietterich, "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition", Journal of Artificial Intelligence Research, 2000.

User	100 Episodes Best Average Reward	Write-up	Solved In Episode
Michael Schock	9.716	writeup	19790
giskmov	9.700	writeup
Hari Iyer	9.634	writeup
Jin.P	9.617	writeup
jo4x962k7JL	9.600	writeup
Delton Oliver	9.59	writeup
Eka Kurniawan	9.59	writeup
Daniel T. Plop	9.582	writeup
Roald Brønstad	9.574	writeup
andyharless	9.57	writeup
ksankar	9.530	writeup
Tom Roth	9.500	writeup
mostoo45	9.492	writeup
crazyleg	9.49	writeup
Akshay Sathe	9.471	writeup
Ridhwan Luthra	9.461	writeup	15000
newwaylw	9.459	writeup	20000
romOlivo	9.449	writeup
Herimiaina ANDRIA-NTOANINA	9.446	writeup
aleckretch	9.426	writeup
Cihan Soylu	9.423	writeup
Tristan Frizza	9.358	writeup
Jhon Muñoz	9.334	writeup
Mahaveer Jain	9.296	writeup
Mostafa Elhoushi	9.2926	writeup
Rajiv Krishnakumar	9.277	writeup	20000
Brungi Vishwa Sourab	9.23	writeup

Taxi-v3

This task was introduced in [Dietterich2000] to illustrate some issues in hierarchical reinforcement learning. There are 4 locations (labeled by different letters) and your job is to pick up the passenger at one location and drop him off in another. You receive +20 points for a successful dropoff, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions.

[Dietterich2000] T. G. Dietterich, "Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition", Journal of Artificial Intelligence Research, 2000.

User	100 Episodes Best Average Reward	Write-up	Video	Solved In Episode
andyharless	9.26	writeup
chillage	9.249	writeup
morakanhan	9.247	writeup		20000
yurkovak	9.19	writeup		20000
crazyleg	9.07	writeup
rahulkaplesh	8.97	writeup+Notebook		20000
Mattia-Scarpa	8.83	writeup		20000
Tiger37	8.8	writeup		20000
take2rohit	8.57	writeup+Notebook	video	5000

GuessingGame-v0

The goal of the game is to guess within 1% of the randomly chosen number within 200 time steps.

After each step, the agent is provided with one of four possible observations, which indicate where the guess is in relation to the randomly chosen number.

User	Average Episode Steps	Write-up	Video	Solved In Episode
Anandha Krishnan H	51 (use closed-form preset policy)	writeup
Britto Sabu	53 (use closed-form preset policy)	writeup

FrozenLake-v0

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

User	Episodes Before Solve	Write-up	Video	Solved In Episode
Nitish tom michael	100	writeup

FrozenLake8x8-v0

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

User	100 Episodes Best Average Reward	Write-up	Video	Solved In Episode
Sukesh Shenoy	85	writeup
