[llvm_instcount] Leaderboard Submission: DQN trained on test set #292

Merged: 5 commits, Jun 22, 2021
1 change: 1 addition & 0 deletions README.md
@@ -129,6 +129,7 @@ environment on the 23 benchmarks in the `cbench-v1` dataset.
| Facebook | e-Greedy search (e=0.1) | [write-up](leaderboard/llvm_instcount/e_greedy/README.md), [results](leaderboard/llvm_instcount/e_greedy/results_e10.csv) | 2021-03 | 152.579s | 1.041× |
| Jiadong Guo | Tabular Q (N=5000, H=10) | [write-up](leaderboard/llvm_instcount/tabular_q/README.md), [results](leaderboard/llvm_instcount/tabular_q/results-H10-N5000.csv) | 2021-04 | 2534.305 | 1.036× |
| Facebook | Random search (t=10) | [write-up](leaderboard/llvm_instcount/random_search/README.md), [results](leaderboard/llvm_instcount/random_search/results_p125_t10.csv) | 2021-03 | **42.939s** | 1.031× |
| Patrick Hesse | DQN (N=4000, H=10) | [write-up](leaderboard/llvm_instcount/dqn/README.md), [results](leaderboard/llvm_instcount/dqn/results-instcountnorm-H10-N4000.csv) | 2021-06 | 91.018s | 1.029× |
| Jiadong Guo | Tabular Q (N=2000, H=5) | [write-up](leaderboard/llvm_instcount/tabular_q/README.md), [results](leaderboard/llvm_instcount/tabular_q/results-H5-N2000.csv) | 2021-04 | 694.105 | 0.988× |


81 changes: 81 additions & 0 deletions leaderboard/llvm_instcount/dqn/README.md
@@ -0,0 +1,81 @@
# DQN

**tldr;**
A deep Q-network (DQN) trained to learn sequences of transformation passes on programs from the test set.

**Authors:**
Patrick Hesse

**Results:**
1. [Episode length 10, 4000 training episodes](results-instcountnorm-H10-N4000.csv).

**Publication:**

**CompilerGym version:**
0.1.9

**Open source?**
Yes. [Source Code](https://github.com/phesse001/compiler-gym-dqn/blob/5a7dc2eec2f144bdabf640266b1667b3da470c79/eval.py).

**Did you modify the CompilerGym source code?**
No.

**What parameters does the approach have?**
* Episode length. *H*
* Learning rate. *λ*
* Discount factor. *γ*
* Actions that are considered by the algorithm. *a*
* Features used. *f*
* Number of episodes used during learning. *N*
* Ratio of random actions to greedy actions. *ε*
* Rate of decrease of ε. *d*
* Final value of ε. *E*
* Size of the replay memory buffer that stores (action, state, reward, new_state) tuples. *s*
* Frequency of target network updates. *t*
* The number of time-steps without reward before the episode is considered done (patience). *p*
* The minimum number of memories in the replay buffer before learning starts. *l*
* The size of a batch of observations fed through the network. *b*
* The number of nodes in a fully connected layer. *n*
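
The patience parameter *p* can be read as an early-stopping rule on the episode loop. The sketch below is one possible interpretation, assuming that "without reward" means consecutive steps with a non-positive reward. The function name `steps_until_done` is hypothetical and is not taken from the submission's source code.

```python
def steps_until_done(rewards, episode_length, patience):
    """Count the steps taken before an episode ends.

    Assumption: the episode ends after `patience` consecutive steps with a
    non-positive reward, or after `episode_length` steps, whichever is first.
    """
    steps = 0
    steps_without_reward = 0
    for reward in rewards:
        if steps >= episode_length:
            break  # hit the episode length limit H
        steps += 1
        if reward > 0:
            steps_without_reward = 0  # progress made, reset the counter
        else:
            steps_without_reward += 1
        if steps_without_reward >= patience:
            break  # ran out of patience p
    return steps
```

For example, with `patience=3` the reward sequence `[1, 0, 0, 0, 1]` stops after the fourth step, before the final reward is ever seen.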

**What range of values were considered for the above parameters?**
Originally, I tried a much larger configuration, something like:
* H=40, λ=0.001, γ=0.99, entire action space, f=InstCountNorm, N=100000, ε=1.0, d=5e-6, E=0.05, s=100000, t=10000, p=5, l=32, b=32, n=512.

That model was much less stable, oscillating between acceptable and poor policies. After some trial and error, I scaled down the problem by restricting the action space to a subset of actions known to help with code-size reduction, and settled on this set of hyperparameters:
* H=10, λ=0.001, γ=0.9, 15 selected actions, f=InstCountNorm, N=4000, ε=1.0, d=5e-5, E=0.05, s=100000, t=500, p=5, l=32, b=32, n=128.
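
To make the roles of ε, *d*, and *E* concrete, here is a minimal sketch of an exploration schedule. It assumes that ε is decreased linearly by *d* on every environment step until it reaches *E*; the write-up does not state whether the decay is linear or multiplicative, so treat this as an illustration only. The function name `epsilon_at` is hypothetical.

```python
def epsilon_at(step, start=1.0, decay=5e-5, final=0.05):
    """Exploration rate after `step` environment steps.

    Assumption: linear decay of `decay` per step, clipped at `final`.
    Defaults mirror the final hyperparameters (epsilon=1.0, d=5e-5, E=0.05).
    """
    return max(final, start - decay * step)
```

Under this assumption, ε would reach its floor of 0.05 after (1.0 - 0.05) / 5e-5 = 19,000 steps, which is roughly half of the at most N × H = 40,000 steps available during training.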

**Is the policy deterministic?**
The policy itself is deterministic once it is trained. However, the initialization of the network parameters is non-deterministic, so the behavior differs when the network is trained again.

## Description

Deep Q-learning is a standard reinforcement learning algorithm that uses a neural
network to approximate Q-value iteration. As the agent interacts with its environment,
transitions of state, action, reward, new state, and done are stored in a buffer and
sampled randomly to feed to the Q-network for learning. The Q-network predicts the
expected cumulative reward of taking an action in a state. Its parameters are updated
to minimize the Huber loss between the predicted Q-values and the more stable
Q-values predicted by the target network.
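
As a rough illustration of that update, the sketch below computes the textbook one-step DQN target and the Huber loss for a single transition in plain Python. It is not the submission's implementation: the function names are hypothetical, `next_q_values` is assumed to come from the target network, and `delta=1.0` is an assumed Huber threshold.

```python
def td_target(reward, next_q_values, done, gamma=0.9):
    """One-step bootstrapped target: r + gamma * max_a' Q_target(s', a')."""
    if done:
        return reward  # no future reward after a terminal transition
    return reward + gamma * max(next_q_values)


def huber_loss(predicted, target, delta=1.0):
    """Quadratic for small errors, linear for large ones."""
    error = abs(predicted - target)
    if error <= delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)
```

The linear tail of the Huber loss limits the size of the gradient for large errors, which is a common reason for preferring it to squared error in Q-learning.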

This algorithm learns from data collected online by the agent, but the data are stored
in a replay buffer and sampled randomly to remove sequential correlations.
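
A replay buffer of this kind can be sketched with the standard library alone. The class below is a hypothetical illustration, not the submission's code; its defaults mirror the hyperparameters *s* (capacity), *l* (minimum memories before learning), and *b* (batch size).

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions with uniform random sampling."""

    def __init__(self, capacity=100_000):
        # A bounded deque silently drops the oldest transition when full.
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, new_state, done):
        self.memory.append((state, action, reward, new_state, done))

    def ready(self, min_size=32):
        # Learning only starts once enough transitions are available.
        return len(self.memory) >= min_size

    def sample(self, batch_size=32):
        # Uniform sampling without replacement breaks up the temporal
        # ordering of consecutive transitions.
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```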

Learning is off-policy: the agent's actions are dictated not only by the learned
policy but also by some randomness (ε-greedy action selection), which encourages
exploration of the environment.
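
A minimal sketch of ε-greedy action selection is shown below, assuming the Q-values for the current state are available as a list with one entry per action. The function name `select_action` is hypothetical.

```python
import random


def select_action(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With `epsilon=0`, as in the evaluation command, the choice is always greedy, which matches the statement that the trained policy is deterministic.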

### Experimental Setup

| | Hardware Specification |
| ------ | --------------------------------------------- |
| OS | Ubuntu 20.04 |
| CPU | Ryzen 5 3600 CPU @ 3.60GHz (6× core) |
| Memory | 16 GiB |

### Experimental Methodology

```sh
# Train the network parameters, which are loaded later for evaluation.
# Since this submission does not target generalization, the training time is
# averaged over the 23 benchmarks and added to the geomean time.
$ time python train.py --episodes=4000 --episode_length=10 --fc_dim=128 --patience=4
$ python eval.py --epsilon=0 --episode_length=10 --fc_dim=128 --patience=4
```
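
The walltime accounting described in the comment can be sketched as follows. This is one reading of that comment, and the function name `reported_walltime` is hypothetical: the geometric mean of the per-benchmark evaluation times is taken first, then the training time divided by the number of benchmarks is added.

```python
import math


def reported_walltime(eval_times, train_time, num_benchmarks=23):
    """Geomean of per-benchmark eval times plus the amortized training time."""
    geomean = math.exp(sum(math.log(t) for t in eval_times) / len(eval_times))
    return geomean + train_time / num_benchmarks
```

As a purely illustrative case with made-up numbers, evaluation times of 1 s and 4 s have a geomean of 2 s, and a 46 s training run amortized over 23 benchmarks adds another 2 s, for a total of 4 s.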