This is an implementation of the A3C (Asynchronous Advantage Actor-Critic) algorithm introduced by the DeepMind team in 2016, tested on the OpenAI Gym LunarLanderContinuous-v2 environment.
The implementation is in the script.py file.
- TensorFlow ver. 1.12.0
- Python ver. 3.x
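The `action_space_upper_bound` and `action_space_lower_bound` values referenced in the configuration below come from the environment's action space. A minimal sketch of obtaining them, assuming the standard Gym API (the variable names here simply mirror the config; this is not code taken from script.py):

```python
import gym

# Create the LunarLanderContinuous-v2 environment used for testing
env = gym.make("LunarLanderContinuous-v2")

# State is an 8-dimensional vector, action is a 2-dimensional continuous vector
state_space_size = env.observation_space.shape[0]   # 8
action_space_size = env.action_space.shape[0]        # 2

# Per-dimension action bounds fed into policy_net_args below
action_space_upper_bound = env.action_space.high
action_space_lower_bound = env.action_space.low
```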
```python
policy_net_args = {"num_Hlayers": 2,
                   "activations_Hlayers": ["relu", "relu"],
                   "Hlayer_sizes": [100, 100],
                   "n_output_units": 2,
                   "output_layer_activation": tf.nn.tanh,
                   "state_space_size": 8,
                   "action_space_size": 2,
                   "Entropy": 0.01,
                   "action_space_upper_bound": action_space_upper_bound,
                   "action_space_lower_bound": action_space_lower_bound,
                   "optimizer": tf.train.RMSPropOptimizer(0.0001),
                   "total_number_episodes": 5000,
                   "number_of_episodes_before_update": 1,
                   "frequency_of_printing_statistics": 100,
                   "frequency_of_rendering_episode": 1000,
                   "number_child_agents": 8,
                   "episodes_back": 20,
                   "gamma": 0.99,
                   "regularization_constant": 0.01,
                   "max_steps_per_episode": 2000
                   }
```
```python
valuefunction_net_args = {"num_Hlayers": 2,
                          "activations_Hlayers": ["relu", "relu"],
                          "Hlayer_sizes": [100, 64],
                          "n_output_units": 1,
                          "output_layer_activation": "linear",
                          "state_space_size": 8,
                          "action_space_size": 2,
                          "optimizer": tf.train.RMSPropOptimizer(0.01),
                          "regularization_constant": 0.01}
```
The performance is quite noisy, but it still meets the conditions for the environment to be considered solved. The noise may be partly due to early stopping, but it also seems that the reward structure does not translate into landings that would be considered safe. For example, landing with only one leg inside the landing zone appears to be rewarded about as well as landing with both legs inside, and landing in fewer time steps is rewarded more, encouraging high-acceleration maneuvers.
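For reference, LunarLander is conventionally considered solved when the average return over 100 consecutive episodes reaches 200. A minimal sketch of that check (the `episode_returns` list is a hypothetical name, not taken from script.py):

```python
import numpy as np

def is_solved(episode_returns, window=100, threshold=200.0):
    """Solved when the mean return over the last `window` episodes reaches `threshold`."""
    if len(episode_returns) < window:
        return False
    return np.mean(episode_returns[-window:]) >= threshold

# e.g. called once per finished episode:
# episode_returns.append(episode_return)
# if is_solved(episode_returns):
#     print("Environment solved!")
```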