Given a number of autonomous mobile entities, create a single neural network that can infer their next required positions in order to solve a reward-based environment.
The environment consists of a number of "Bots", "Packs" and "Places" located at different positions in a continuous 2D space.
Bots can pick up packs when they are in close proximity to them.
Bots can drop packs when they are in close proximity to the corresponding places.
"Heading" - coordinate vector that provides the bots their new destination. The bots can navigate to the heading coordinates autonomously.
The agent is a single policy neural network that generates a new heading vector for any specific state of the environment in order to complete the task.
During training the agent receives a reward of 50 points when a bot picks up a pack and 100 points when the pack is delivered to the corresponding place, e.g. for 2 bots, 2 packs and 1 place the maximum total reward is 300.
The problem is considered solved if the swarm agent can obtain an average reward above 95% of the maximum total reward over a span of 100 episodes (i.e. above 285 points for the 2-bot, 2-pack, 1-place example).
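To make the reward mechanics concrete, the sketch below shows one way the load/unload checks could be implemented; it assumes the pack and place scale values act as proximity radii and omits the pack-to-place correspondence, so it is an illustration rather than the actual environment code:
import numpy as np

def step_rewards(bot_pos, pack_pos, place_pos, bot_full,
                 pack_scale=0.025, place_scale=0.1,
                 load_reward=50, unload_reward=100):
    """bot_pos, pack_pos, place_pos: (N, 2) arrays of x,y coordinates; bot_full: boolean list."""
    reward = 0
    for b, bpos in enumerate(bot_pos):
        if not bot_full[b]:
            # Loading: an empty bot within pack_scale of a pack picks it up
            for ppos in pack_pos:
                if np.linalg.norm(bpos - ppos) < pack_scale:
                    bot_full[b] = True
                    reward += load_reward
                    break
        else:
            # Unloading: a full bot within place_scale of a place delivers its pack
            for plpos in place_pos:
                if np.linalg.norm(bpos - plpos) < place_scale:
                    bot_full[b] = False
                    reward += unload_reward
                    break
    return reward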
The input to the policy network consists of a stack of consecutive environment states.
The environment state consists of positional and logistic values:
- x,y position values for bots
- "bot full" flags, 0 or 1 if full
- x,y position values for packs
- "loaded in" pack logistic index (bot index)
- "unloaded in" pack logistic index (place index)
- x,y position values for places
- heading values from the previous state (logistic "memory")
The policy output is the next heading vector, consisting of x,y coordinates for all bots (a sketch of the resulting input/output sizes follows below).
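For concreteness, here is a minimal sketch of the resulting input/output sizes and a plain MLP actor; the stack depth of 4 frames, the layer width and the tanh output scaling are illustrative assumptions rather than values taken from the project.
import torch
import torch.nn as nn

bots_number, packs_number, places_number = 2, 2, 1
frame_stack = 4  # number of consecutive states stacked (assumed value)

# One environment frame: bot x,y + "bot full" flags + pack x,y + two logistic
# indices per pack + place x,y + previous heading x,y for every bot.
frame_dim = (bots_number * 2 + bots_number
             + packs_number * 2 + packs_number * 2
             + places_number * 2
             + bots_number * 2)
state_dim = frame_dim * frame_stack   # policy input size
action_dim = bots_number * 2          # one x,y heading per bot

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, width=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, action_dim), nn.Tanh())  # headings assumed normalised to [-1, 1]

    def forward(self, state):
        return self.net(state)

heading = Actor(state_dim, action_dim)(torch.zeros(1, state_dim))  # shape: (1, action_dim)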
The policy network is trained using "Twin Delayed Deep Deterministic Policy Gradients" (TD3), an actor-critic method
derived from "Deep Deterministic Policy Gradients" (DDPG) that uses a second critic network in order to prevent
value overestimation. An off-policy method was chosen for its sample efficiency.
Original TD3 paper: "Addressing Function Approximation Error in Actor-Critic Methods" (Fujimoto et al., 2018).
TD3 critic implementation details:
# Compute the target Q values without tracking gradients; next_action is produced
# earlier by the target actor with clipped Gaussian noise added
# (target policy smoothing, see policy_noise / noise_clip below):
with torch.no_grad():
    # Compute target Q values for both critic networks:
    target_q1, target_q2 = self.critic_target(next_state, next_action)
    # Choose the minimum target Q value to prevent overestimation error:
    target_q = torch.min(target_q1, target_q2)
    # Apply the Bellman equation:
    target_q = reward + not_done * self.discount * target_q
# Compute current Q values with both critic networks:
current_q1, current_q2 = self.critic(state, action)
# Compute the critic loss (sum of the MSE of both critics against the shared target):
critic_loss = functional.mse_loss(current_q1, target_q) + functional.mse_loss(current_q2, target_q)
# Optimize the critic:
self.critic_optimizer.zero_grad()
critic_loss.backward()
self.critic_optimizer.step()
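The snippet above (and the actor loss below) assumes a critic module whose forward pass returns both Q values and which exposes the first Q network separately; a minimal sketch of such a twin critic, with the layer width assumed to match policy_width:
import torch
import torch.nn as nn

class TwinCritic(nn.Module):
    """Two independent Q networks evaluated on the same (state, action) pair."""
    def __init__(self, state_dim, action_dim, width=512):
        super().__init__()
        self.q1_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1))
        self.q2_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1))

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=1)
        return self.q1_net(sa), self.q2_net(sa)

    def q1(self, state, action):
        # Only the first critic is used when computing the actor loss
        return self.q1_net(torch.cat([state, action], dim=1))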
TD3 actor implementation details:
# Delayed policy updates - the actor is updated less frequently:
if self.total_it % self.policy_freq == 0:
    # Compute the actor loss by propagating through one of the critic networks:
    actor_loss = -self.critic.q1(state, self.actor(state)).mean()
    # Optimize the actor:
    self.actor_optimizer.zero_grad()
    actor_loss.backward()
    self.actor_optimizer.step()
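In the reference TD3 implementation the same delayed branch also performs a soft (Polyak) update of the two target networks; a sketch of that step is shown below, assuming the attribute self.tau holds the update_tau value listed in the hyper-parameters:
    # Soft-update both target networks towards the live networks (assumed: self.tau == update_tau):
    for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
        target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)
    for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
        target_param.data.copy_(self.tau * param.data + (1 - self.tau) * target_param.data)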
# Environment Hyper-parameters:
env = "SB"                # File naming purposes
bots_number = 2           # Number of bots in environment
packs_number = 2          # Number of packs in environment
places_number = 1         # Number of places in environment
bot_scale = 0.03          # Size of bots, relative to total space
pack_scale = 0.025        # Size of packs, affects loading range
place_scale = 0.1         # Size of places, affects unloading range
load_reward = 50          # Reward if a bot loads a pack
unload_reward = 100       # Reward if a bot unloads at a place
episode_steps = 50        # Episode length in environment frames

# Policy Hyper-parameters:
policy = "TD3-1"          # File naming purposes
policy_width = 512        # Neural network layer size
batch_size = 1024         # Training batch size, GPU memory limited
learning_rate = 0.00001   # Optimizer learning rate, make smaller for larger networks
update_rate = 2           # Policy function optimisation frequency
update_tau = 0.005        # Target policy weight transfer factor
discount = 0.99           # Future reward discount for Bellman equation
policy_noise = 0.2        # Smoothing noise added to the target action during the optimisation pass
noise_clip = 0.5          # Policy noise clipping factor

# Markov Decision Process Hyper-parameters:
expl_noise = 0.1          # Action noise for exploration
max_steps = 4000000       # Maximum allowed experiment steps
start_step = 2000         # Pre-training replay buffer loading
eval_freq = 1000          # Evaluation pass interval
eval_length = 10          # Base evaluation length
min_performance = 0.95    # Minimum performance before experiment stops (see problem statement)
seed = 30                 # Universal seed (environment, noise, weight initialisation)
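For context, the sketch below shows how these hyper-parameters typically drive the outer off-policy training loop; the env, policy, replay_buffer and evaluate objects and their methods are assumptions for illustration, not the project's actual interfaces:
import numpy as np

state = env.reset()
for step in range(max_steps):
    if step < start_step:
        action = env.random_action()                    # random actions to pre-fill the replay buffer
    else:
        action = policy.select_action(state)
        action = action + np.random.normal(0, expl_noise, size=action.shape)  # exploration noise
    next_state, reward, done = env.step(action)
    replay_buffer.add(state, action, next_state, reward, float(done))
    state = env.reset() if done else next_state
    if step >= start_step:
        policy.train(replay_buffer, batch_size)         # one TD3 optimisation pass per environment step
    if step > 0 and step % eval_freq == 0:
        score = evaluate(policy, episodes=eval_length)  # periodic evaluation pass
        if score >= min_performance * 300:              # 300 = maximum total reward for this configuration
            break                                       # problem solved (see problem statement)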