Motivation
This architecture reaches much better returns for several reasons detailed below, and it is featured in a recent paper (the current SOTA). The architecture was first introduced in Scaling Laws for Imitation Learning in NetHack.
Credit: Jens Tuyls https://github.com/jens321/
Architecture Details (from the paper)
We use two main architectures for all our experiments, one for the BC experiments and another for
the RL experiments.
BC architecture. The NLD-AA dataset consists of ttyrec-formatted trajectories, which are
24 × 80 ASCII character and color grids (one for each) along with the cursor position. To encode
these, we modify the architecture used in Hambro et al., resulting in the following:
Screen encoder. This encoder takes the main part of the screen, a 21 × 80 grid per time step.
Note the top row and bottom two rows are cut off, as those are fed into the message and
bottom line statistics encoders, respectively. We embed each character and color in an
embedding lookup table, after which we concatenate them and place them in their respective
positions in the grid. We then feed this embedded grid into a ResNet, which consists of
2 identical modules, each using 1 convolutional layer followed by a max pooling layer and
2 residual blocks (of 2 convolutional layers each), for a total of 10 convolutional layers,
closely following the setup in Espeholt et al.
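As a concrete illustration, here is a minimal PyTorch sketch of such a screen encoder. The 21 × 80 input, the 2-module layout (1 convolution, a max pooling layer, then 2 residual blocks per module), and the 10-convolution total follow the description above; the embedding sizes, channel counts, vocabulary sizes, and class names are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection (IMPALA-style)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv1(torch.relu(x))
        out = self.conv2(torch.relu(out))
        return x + out

class ScreenEncoder(nn.Module):
    """Embeds chars/colors, then 2 identical modules of
    conv -> max pool -> 2 residual blocks (10 convolutions total)."""
    def __init__(self, char_dim: int = 64, color_dim: int = 16,
                 channels=(64, 128)):  # embedding/channel sizes are assumptions
        super().__init__()
        self.char_emb = nn.Embedding(256, char_dim)   # one entry per ASCII code
        self.color_emb = nn.Embedding(32, color_dim)  # assumed color vocabulary
        in_ch = char_dim + color_dim
        layers = []
        for out_ch in channels:
            layers += [
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
                ResidualBlock(out_ch),
                ResidualBlock(out_ch),
            ]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, chars: torch.Tensor, colors: torch.Tensor) -> torch.Tensor:
        # chars, colors: (B, 21, 80) integer grids from the main screen area
        x = torch.cat([self.char_emb(chars), self.color_emb(colors)], dim=-1)
        x = x.permute(0, 3, 1, 2)  # embeddings become the channel dimension
        return self.net(x).flatten(1)
```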
Message encoder. This encoder takes the top row of the screen (80 characters), turns the
characters into one-hot vectors, and concatenates these, resulting in an 80 × 256 = 20,480
dimensional vector representing the message. This vector is then fed into a 2-layer MLP,
resulting in the message representation.
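A minimal sketch of this encoder, again assuming PyTorch: the 80 × 256 = 20,480-dimensional one-hot input and the 2-layer MLP come from the description above, while the hidden width is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MessageEncoder(nn.Module):
    def __init__(self, hidden_dim: int = 128):  # hidden width is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(80 * 256, hidden_dim),  # 20,480-dim one-hot message
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, message_chars: torch.Tensor) -> torch.Tensor:
        # message_chars: (B, 80) ASCII codes from the top row of the screen
        one_hot = F.one_hot(message_chars.long(), num_classes=256).float()
        return self.mlp(one_hot.flatten(1))  # (B, 20480) -> (B, hidden_dim)
```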
Bottom line statistics encoder. This encoder takes the bottom two rows of the grid and
creates a "character-normalized" (subtract 32 and divide by 96) and a "digits-normalized"
(subtract 47 and divide by 10, mask out ASCII characters smaller than 45 or larger than 58)
input representation, which we then stack, resulting in a 160 × 2 dimensional input. This
closely follows the Sample Factory model used in Hambro et al.
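The two normalizations are simple elementwise transforms; here is a minimal sketch under the same PyTorch assumption, using only the constants stated above:

```python
import torch

def encode_bottom_lines(chars: torch.Tensor) -> torch.Tensor:
    """chars: (B, 2, 80) ASCII codes from the bottom two screen rows.
    Returns a (B, 160, 2) tensor stacking the two normalizations."""
    flat = chars.reshape(chars.shape[0], -1).float()       # (B, 160)
    char_norm = (flat - 32.0) / 96.0                       # "character-normalized"
    digit_mask = ((flat >= 45) & (flat <= 58)).float()     # zero out codes outside 45..58
    digit_norm = ((flat - 47.0) / 10.0) * digit_mask       # "digits-normalized"
    return torch.stack([char_norm, digit_norm], dim=-1)
```

Stacking the two views gives the encoder both a generic character signal and a numeric signal on a convenient scale for the statistics (HP, gold, etc.) shown on the bottom lines.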