This document details how to scale Circuit Training using Google Cloud as the example and a block from Ariane RISC-V as the target. These instructions are modified over time to illustrate how to scale. The numbers from our training run are in the history section.
In order to run at scale, a large number of collect jobs are needed. It is not unreasonable to have 500+ collect jobs to drive enough episodes to keep 8 NVIDIA V100s busy.
For the Ariane RISC-V example we used the configuration in the list below. The
intention is the information will allow you to run an experiment at scale
using your orchestration system of choice. Our assumption is that you would be
using your own orchestration, e.g. SLURM, Kubernetes, or similar product.
Although simple, we flushed out some issue using bash
and remotely executing
scripts via gcloud compute ssh
. If you are having issues, please
open an issue.
We want you to be successful.
For the training we utilized the following servers and jobs:
- 1 Replay Buffer(Reverb)/Eval server 32vCPUs (n1-standard-32)
- 1 Replay Buffer(Reverb) job
- 1 Eval job
- 20 Collect servers 96vCPUs (n1-standard-96)
- Each server running 25 collect jobs for a total of 500.
- 1 Training server: 8xV100s (n1-standard-96)
- 1 Training job
Each iteration of training clears the replay buffer due to the on-policy nature
of the PPO algorithm. This results in a large amount of training time spent
waiting for the replay buffer to refill based on the last policy. 500 collect
agents worked well for this example to feed 8xV100s. It is not necessary to have
that many collect jobs. Having less will slow down total training time
(walltime) but will not impact the quality of the result. We have noticed that
using a smaller per_replica_batch_size
or smaller num_episodes_per_iteration
can reduce quality.
The steps below are not intended to be step-by-step. Listed below are the commands we used to start each of the job types and the server type we used. We have left out how we orchestrated starting 500 collect jobs across 20 servers and other details that we felt would be noise given your environment and orchestration tools will be different.
$ export PROJECT=<Your Google Cloud Project>
$ export ZONE=<Zone you are using>
$ export TRAIN_INSTANCE_NAME=ct-train
$ export REVERB_INSTANCE_NAME=ct-reverb
Create 20 96vCPU servers to host the ~500 collect jobs. Training will work with less collect jobs and get the same result; but each iteration of training is gated on collect as the replay buffer is cleared after each iteration.
# This needs to be run 20 times with a different `$INSTANCE_NAME` each time.
$ export INSTANCE_NAME=ct-collect-00
$ gcloud compute instances create ${INSTANCE_NAME} \
--project=${PROJECT} \
--zone=${ZONE} \
--machine-type=n1-standard-96 \
--maintenance-policy=TERMINATE \
--image-family=tf-ent-latest-cpu \
--image-project=deeplearning-platform-release \
--boot-disk-size=100GB \
--boot-disk-type=pd-ssd \
--scopes="https://www.googleapis.com/auth/cloud-platform"
Create one 32vCPU server that will host the reverb and eval job.
$ gcloud compute instances create ${REVERB_INSTANCE_NAME} \
--project=${PROJECT} \
--zone=${ZONE} \
--machine-type=n1-standard-32 \
--maintenance-policy=TERMINATE \
--image-family=tf-ent-latest-cpu \
--image-project=deeplearning-platform-release \
--boot-disk-size=100GB \
--boot-disk-type=pd-ssd \
--scopes="https://www.googleapis.com/auth/cloud-platform"
The training server is configured with 8x NVIDIA V100s.
$ gcloud compute instances create ${TRAIN_INSTANCE_NAME} \
--project=${PROJECT} \
--zone=${ZONE} \
--machine-type=n1-standard-96 \
--accelerator=count=8,type=nvidia-tesla-v100 \
--maintenance-policy=TERMINATE \
--image-family=tf-ent-latest-gpu \
--image-project=deeplearning-platform-release \
--boot-disk-size=100GB \
--boot-disk-type=pd-ssd \
--scopes="https://www.googleapis.com/auth/cloud-platform"
While this can be run at HEAD with nightly builds from the upstream libraries, we suggest using a branched version. The instructions below setup the docker images with the latest branch. If you want to try head, these instructions provide the breadcrumbs needed to build the images with head/nightly.
Each host server is going to need:
Our approach was to copy the code to each server. The instructions below assume
the circuit training repo root is located at $HOME/run_ct/circuit_training
on
the host systems.
$ export CT_VERSION=0.0.4
$ git clone https://github.com/google-research/circuit_training.git
$ git -C $(pwd)/circuit_training checkout r${CT_VERSION}
A service account key with access to write to a Google Cloud Storage Bucket.
How this is done will be based on your orchestration system. These are the environment variable we made available on each of the host machines.
$ export REPO_ROOT=$HOME/run_ct/circuit_training
## These first variables are only needed for building the docker images.
# Version of Circuit Training to use, which is also used to build images.
$ export TF_AGENTS_PIP_VERSION=tf-agents[reverb]
$ export CT_VERSION=0.0.3
# The docker is python3.9 only.
$ export PYTHON_VERSION=python3.9
$ export DREAMPLACE_PATTERN=dreamplace_20231214_c5a83e5_${PYTHON_VERSION}.tar.gz
## These variables are used at runtime and passed to the docker container.
$ export REVERB_PORT=8008
$ export REVERB_SERVER="<IP Address of the Reverb Server>:${REVERB_PORT}"
$ export ROOT_DIR=<Path to network storage, e.g. gs://my-bucket/logs/run_00>
$ export NETLIST_FILE=./circuit_training/environment/test_data/ariane/netlist.pb.txt
$ export INIT_PLACEMENT=./circuit_training/environment/test_data/ariane/initial.plc
$ export GLOBAL_SEED=333
We have also uploaded a toy netlist that has a couple macros and can easily be trained on CPU in a few minutes. It is great for smoke testing.
# Use these vars to use the toy netlist, otherwise skip these commands.
$ export NETLIST_FILE=gs://rl-infra-public/circuit-training/netlist/toy/netlist.pb.txt
$ export INIT_PLACEMENT=gs://rl-infra-public/circuit-training/netlist/toy/initial.plc
# If using the toy example make sure on the training job to set
# --sequence_length=3, or it will hang looking for episodes.
We suggest using Docker as we use the same images for continuous integration testing and they include the required dependencies. You can either store a docker images in a central location or create the image on each of the servers.
# For Collect and Reverb/Eval Jobs
$ docker build --pull --no-cache --tag circuit_training:core \
--build-arg tf_agents_version="${TF_AGENTS_PIP_VERSION}" \
--build-arg dreamplace_version="${DREAMPLACE_PATTERN}" \
--build-arg placement_cost_binary="plc_wrapper_main_${CT_VERSION}" \
-f "${REPO_ROOT}"/tools/docker/ubuntu_circuit_training ${REPO_ROOT}/tools/docker/
# For training job with NVIDIA GPU support.
$ docker build --pull --no-cache --tag circuit_training:core \
--build-arg base_image=nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 \
--build-arg tf_agents_version="${TF_AGENTS_PIP_VERSION}" \
--build-arg dreamplace_version="${DREAMPLACE_PATTERN}" \
--build-arg placement_cost_binary="plc_wrapper_main_${CT_VERSION}" \
-f "${REPO_ROOT}"/tools/docker/ubuntu_circuit_training ${REPO_ROOT}/tools/docker/
This section outlines the jobs that will need to be started. The Docker
containers are started detached (-d
). To verify they are running attach to the
container with docker attach
and exit with Ctrl-P
followed by Ctrl-Q
.
All of the commands assume a service account key with access to write to the Google Cloud Storage Bucket has been copied into the root of the cloned github repo. Each command is to be executed from the root of the cloned repo.
The order that the jobs are started should not matter. Each job waits for what it needs from the other jobs before moving forward. If the collect jobs stop for some reason the train job will keep training on stale data.
Start the reverb job on the Replay Buffer(Reverb)/Eval server. Make sure to set
-p
to the port reverb will be running on. The reverb job will continuously
write to console
Waiting for `wait_predicate_fn`. Block execution. Sleeping for 1 seconds.
until the training job is started and writes out the initial collect policy.
$ docker run --rm -d -it -p 8008:8008 -e "GOOGLE_APPLICATION_CREDENTIALS=/workspace/cloud_key.json" \
--rm -it -v ${REPO_ROOT}:/workspace -w /workspace/ circuit_training:core \
python3.9 -m circuit_training.learning.ppo_reverb_server \
--global_seed=${GLOBAL_SEED} \
--root_dir=${ROOT_DIR} \
--port=${REVERB_PORT}
Start the training job on the training server. Remember to use the Docker image created with GPU support. 200 iterations will be about 100K steps.
$ docker run --network host -d -e "GOOGLE_APPLICATION_CREDENTIALS=/workspace/cloud_key.json" \
--gpus all --rm -it -v ${REPO_ROOT}:/workspace -w /workspace/ circuit_training:core \
python3.9 -m circuit_training.learning.train_ppo \
--root_dir=${ROOT_DIR} \
--std_cell_placer_mode=dreamplace \
--replay_buffer_server_address=${REVERB_SERVER} \
--variable_container_server_address=${REVERB_SERVER} \
--sequence_length=134 \
--gin_bindings='train.num_iterations=200'\
--netlist_file=${NETLIST_FILE} \
--init_placement=${INIT_PLACEMENT} \
--global_seed=${GLOBAL_SEED} \
--use_gpu
# If using the toy netlist, some args need changed. Use this command instead.
$ docker run --network host -d -e "GOOGLE_APPLICATION_CREDENTIALS=/workspace/cloud_key.json" \
--gpus all --rm -it -v ${REPO_ROOT}:/workspace -w /workspace/ circuit_training:core \
python3.9 -m circuit_training.learning.train_ppo \
--root_dir=${ROOT_DIR} \
--std_cell_placer_mode=dreamplace \
--replay_buffer_server_address=${REVERB_SERVER} \
--variable_container_server_address=${REVERB_SERVER} \
--sequence_length=3 \
--gin_bindings='train.num_iterations=200' \
--gin_bindings='train.num_episodes_per_iteration=32' \
--gin_bindings='train.per_replica_batch_size=64' \
--gin_bindings='CircuittrainingPPOLearner.summary_interval=12' \
--gin_bindings='CircuitPPOAgent.debug_summaries=True' \
--netlist_file=${NETLIST_FILE} \
--init_placement=${INIT_PLACEMENT} \
--global_seed=${GLOBAL_SEED} \
--use_gpu
Each of the 20 collect servers should run 25 collect jobs. While you would not orchestrate it this way, the bash script below illustrates the point. The following command would need to be run on each of the 20 collect servers. Watching the CPU usage is the best way to figure out the max number of collect jobs your server type can handle.
for i in $(seq 1 23); do
docker run --network host -d -e "GOOGLE_APPLICATION_CREDENTIALS=/workspace/cloud_key.json" \
--rm -it -v ${REPO_ROOT}/circuit_training:/workspace -w /workspace/ circuit_training:core \
python3.9 -m circuit_training.learning.ppo_collect \
--root_dir=${ROOT_DIR} \
--std_cell_placer_mode=dreamplace \
--replay_buffer_server_address=${REVERB_SERVER} \
--variable_container_server_address=${REVERB_SERVER} \
--task_id=${i} \
--netlist_file=${NETLIST_FILE} \
--init_placement=${INIT_PLACEMENT} \
--global_seed=${GLOBAL_SEED} \
--logtostderr
It is ok for the task_id
to repeat across or even on a given server. To get
logging of a single collect job in Tensorboard set a single collect job to have
a task_id=0
.
Start the eval job on the replay buffer(Reverb)/eval server. Eval could be run on a seperate server with the same command below if desired.
$ docker run --network host -d -e "GOOGLE_APPLICATION_CREDENTIALS=/workspace/cloud_key.json" \
--rm -it -v $(pwd):/workspace -w /workspace/ circuit_training:core \
python3.9 -m circuit_training.learning.eval \
--root_dir=${ROOT_DIR} \
--std_cell_placer_mode=dreamplace \
--variable_container_server_address=${REVERB_SERVER} \
--netlist_file=${NETLIST_FILE} \
--init_placement=${INIT_PLACEMENT} \
--global_seed=${GLOBAL_SEED} \
--output_placement_save_dir=./
From the replay buffer(Reverb)/eval server or a local workstation run Tensorboard to monitor the results.
$ tensorboard dev upload --logdir $ROOT_DIR
The numbers below are from a run we did in January of 2022 on Google Cloud. Much has changed and we are not currently taking the time to redo the experiment. The instructions on how to scale Circuit Training are evolving and this section exists to preserve the results from Jan 2022.
The results in the table below are reported for training from scratch. We trained with 3 different seeds run 3 times each. This is slightly different than what was used in the paper (8 runs each with a different seed), but better captures the different sources of variability. Our results training from scratch are comparable or better than the reported results in the paper (on page 22) which used fine-tuning from a pre-trained model. We are training from scratch because we cannot publish the pre-trained model at this time and the released code can provide comparable results. The metrics listed in the table correspond to the output optimal placement. Training was run for 200 iterations which results in ~107K steps (gradient updates). This tensorboard link contains the raw results of the runs for full transparency and usefulness.
Run id | Seed | Proxy Wirelength | Proxy Congestion | Proxy Density | Step |
---|---|---|---|---|---|
run_00 | 111 | 0.1051 | 0.8746 | 0.5154 | 77,184 |
run_01 | 111 | 0.0979 | 0.8749 | 0.5217 | 51,992 |
run_02 | 111 | 0.1052 | 0.9069 | 0.5289 | 94,872 |
run_03 | 222 | 0.1021 | 0.9667 | 0.5799 | 33,232 |
run_04 | 222 | 0.0963 | 0.9945 | 0.6795 | 103,984 |
run_05 | 222 | 0.1060 | 1.0352 | 0.5886 | 37,520 |
run_06 | 333 | 0.1012 | 0.8738 | 0.5110 | 46,096 |
run_07 | 333 | 0.0977 | 0.8684 | 0.5109 | 35,912 |
run_08 | 333 | 0.1004 | 0.8613 | 0.5160 | 48,776 |
_ | Proxy Wirelength | Proxy Congestion | Proxy Density |
---|---|---|---|
mean | 0.1013 | 0.9174 | 0.5502 |
std | 0.0036 | 0.0647 | 0.0568 |
Applying coordinated descent after training resulted in improved proxy numbers for complex blocks like those used in TPUs as referenced in the paper. However, for the simpler Ariane RISC-V there were modest (1-2%) improvements to proxy wirelength and congestion.
_ | Proxy Wirelength | Proxy Congestion | Proxy Density |
---|---|---|---|
mean | 0.0988 | 0.9077 | 0.5513 |
std | 0.0053 | 0.0621 | 0.0589 |
Below are the hyperparameters and the values that were used for our experiments. Some hyperparameters are changed from the paper to make the training more stable for the Ariane block. The hyperparameters listed in the paper were set for TPU blocks which have different characteristics. For training, we use the clipping version of proximal policy optimization (PPO) (Schulman et al., 2017) without the KL divergence penalty implemented by tf-agents. The default for the training hyperparameters, if not specified in the table, is the same as the defaults in the tf-agents.
Configuration | Default Value | Comments |
---|---|---|
**Proxy reward |
: calculation** : : : | wirelength_weight | 1.0 | | | density_weight | 1.0 | Changed from 0.1 in the | : : : paper, since it produces : : : : more stable training from : : : : scratch on Ariane blocks. : | congestion_weight | 0.5 | Changed from 0.1 in the | : : : paper, since it produces : : : : more stable training from : : : : scratch on Ariane blocks. : | Standard cell | | | : placement : : : | num_steps | [100, 100, 100] | | | io_factor | 1.0 | | | move_distance_factors | [1, 1, 1] | | | attract_factors | [100, 1e-3, 1e-5] | | | repel_factors | [0, 1e6, 1e7] | | | Environment | | | : observation : : : | max_num_nodes | 4700 | | | max_num_edges | 28400 | | | max_grid_size | 128 | | | default_location_x | 0.5 | | | default_location_y | 0.5 | | | Model architecture | | | | num_gcn_layers | 3 | | | edge_fc_layers | 1 | | | gcn_node_dim | 8 | | | dirichlet_alpha | 0.1 | | | policy_noise_weight | 0.0 | | | Training | | | | optimizer | Adam | | | learning_rate | 4e-4 | | | sequence_length | 134 | | | num_episodes_per_iteration | 1024 | | | per_replica_batch_size | 128 | | | num_epochs | 4 | | | value_pred_loss_coef | 0.5 | | | entropy_regularization | 0.01 | | | importance_ratio_clipping | 0.2 | | | discount_factor | 1.0 | | | entropy_regularization | 0.01 | | | value_pred_loss_coef | 0.5 | | | gradient_clipping | 1.0 | | | use_gae | False | | | use_td_lambda_return | False | | | log_prob_clipping | 0.0 | | | policy_l2_reg | 0.0 | | | value_function_l2_reg | 0.0 | | | shared_vars_l2_reg | 0.0 | |