Releases: kengz/SLM-Lab
Rework eval mode; major refactoring
Eval rework
This release adds an eval mode that follows OpenAI Baselines: spawn 2 environments, 1 for training and 1 for eval. In the same process (blocking), run training as usual, then at each ckpt run an episode on the eval env and update the stats.
The logic for the stats is the same as before, except the original `body.df` is now split into two: `body.train_df` and `body.eval_df`. The eval df uses the main env stats except for `t, reward`, which reflect progress on the eval env. Correspondingly, session analysis also produces both versions of the data.
Data from `body.eval_df` is used to generate `session_df, session_graph, session_fitness_df`, whereas data from `body.train_df` is used to generate a new set of `trainsession_df, trainsession_graph, trainsession_fitness_df` for debugging.
The previous process-based eval functionality is kept, but is now treated as `parallel_eval`. This can be useful for more robust checkpointing and eval.
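A minimal, self-contained sketch of the blocking train/eval scheme described above, using random actions on gym CartPole to stand in for the agent (the lab wires this through `Session`, `body.train_df`, and `body.eval_df`; the classic gym step API is assumed):

```python
import gym

def run_episode(env):
    # run one full episode and return its total reward (classic gym API)
    state, done, total_reward = env.reset(), False, 0.0
    while not done:
        state, reward, done, _info = env.step(env.action_space.sample())
        total_reward += reward
    return total_reward

train_env = gym.make('CartPole-v0')
eval_env = gym.make('CartPole-v0')  # separate env used only for eval
ckpt_interval = 10

for epi in range(50):
    run_episode(train_env)  # training episode (random actions stand in for the agent)
    if epi % ckpt_interval == 0:
        # at ckpt, block and run one episode on the eval env, then update eval stats
        eval_return = run_episode(eval_env)
        print(f'epi {epi} eval_return {eval_return}')
```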
Refactoring
- purge useless computations
- properly and efficiently gather and organize all update variable computations.
This also speeds up run time by 2x. For Atari Beamrider with DQN on a V100 GPU, manual benchmark measurement gives 110 FPS for training every 4 frames, while eval achieves 160 FPS. This translates to 10M frames in roughly 24 hours.
v3.1.1: Add Retro Eval, fix Boltzmann spec
Add Retro Eval
- #270 add retro eval mode to rerun failed online eval sessions. Use command `yarn retro_eval data/reinforce_cartpole_2018_01_22_211751`
- #272 #273 fix eval saving 0 index to `eval_session_df` causing trial analysis to break; add `reset_index` for safety
fix Boltzmann spec
- #271 change Boltzmann spec to use Categorical instead of the wrong Argmax
Misc
v3.1.0: L1 fitness norm, code and spec refactor, online eval
L1 fitness norm (breaking change)
- change fitness vector norm from L2 to L1 for intuitiveness and non-extreme values
code and spec refactor
- #254 PPO cleanup: remove hack and restore minimization scheme
- #255 remove `use_gae` and `use_nstep` params; infer them from `lam, num_step_returns`
- #260 fix decay `start_step` offset, add unit tests for rate decay methods
- #262 make epi start from 0 instead of 1 for code logic consistency
- #264 switch `max_total_t`, `max_epi` to `max_tick` and `max_tick_unit` for directness. retire `graph_x` for the unit above
- #266 add Atari fitness std, fix CUDA coredump issue
- #269 update gym, remove box2d hack
Online Eval mode
#252 #257 #261 #267
Evaluation sessions now run during training in subprocesses. They do not interfere with the training process: the lab spawns multiple subprocesses to do independent evaluation, each of which appends to an eval file, and at the end a final eval finishes, plots all the graphs, and saves all the eval data.
- enabled by meta spec `'training_eval'`
- configure `NUM_EVAL_EPI` in `analysis.py`
- update `enjoy` and `eval` mode syntax. see README
- change ckpt behavior to use e.g. tag `ckpt-epi10-totalt1000`
- add new `eval` mode to lab. runs on a checkpoint file. see below
Eval Session
- add a proper eval Session which loads from the ckpt like above, and does not interfere with existing files. It can be run from the terminal, and it is also used by the internal eval logic, e.g. command `python run_lab.py data/dqn_cartpole_2018_12_20_214412/dqn_cartpole_t0_spec.json dqn_cartpole eval@dqn_cartpole_t0_s2_ckpt-epi10-totalt1000`
- when an eval session is done, it will average all of the episodes it ran and append a row to an `eval_session_df.csv`
- after that, it will delete the ckpt files it had just used (to prevent large storage)
- then, it will run a trial analysis to update `eval_trial_graph.png` and an accompanying `trial_df` as the average of all `session_df`s
How eval mode works
- checkpoint will save the models using a scheme that records the `epi` and `total_t`. This allows one to eval using the ckpt model
- after creating ckpt files, if `spec.meta.training_eval` is set in `train` mode, a subprocess will launch using the ckpt prepath to run an eval Session, in the same way as above: `python run_lab.py data/dqn_cartpole_2018_12_20_214412/dqn_cartpole_t0_spec.json dqn_cartpole eval@dqn_cartpole_t0_s2_ckpt-epi10-totalt1000`
- the eval session runs as above. ckpt will now run at the starting timestep, at each ckpt timestep, and at the end
- the main Session will wait for the final eval session and its final eval trial to finish before closing, to ensure that other processes like zipping wait for them.
Example eval trial graph:
V3: PyTorch 1.0, faster NN, Variable Scheduler, working Atari
V3: PyTorch 1.0, faster Neural Network, Variable Scheduler
PRs included #240 #241 #239 #238 #244 #248
PyTorch 1.0 and parallel CUDA
- switch to PyTorch 1.0 with various improvements and parallel CUDA fix
new Neural Network API (breaking changes)
To accommodate more advanced features and improvements, all the networks have been reworked with better spec and code design, faster operations, and added features.
- single-tail networks now use a single tail instead of a list for fast output computation (the for loop is slow)
- use PyTorch `optim.lr_scheduler` for learning rate decay; retire the old methods (see the sketch after this list)
- more efficient spec format for network, `clip_grad`, `lr_scheduler_spec`
- fix and add proper generalization for ConvNet and RecurrentNet
- add full basic network unit tests
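For reference, this is the standard PyTorch `optim.lr_scheduler` usage pattern that replaces the old decay methods; the net, optimizer, and schedule values here are illustrative, and in the lab the scheduler class and parameters come from the network's `lr_scheduler_spec`:

```python
import torch
import torch.nn as nn
from torch import optim

net = nn.Linear(4, 2)
optimizer = optim.Adam(net.parameters(), lr=1e-3)
# decay the learning rate by 0.9 every 1000 scheduler steps
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.9)

for step in range(3000):
    loss = net(torch.randn(8, 4)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the learning-rate schedule
```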
DQN
- rewrite DQN loss for 2x speedup and code simplicity. extend to SARSA
- retire MultitaskDQN for HydraDQN
Memory
- add `OnpolicyConcatReplay`
- standardize `preprocess_state` logic in onpolicy memories
Variable Scheduler (breaking spec changes)
- implement variable decay class `VarScheduler`, similar to PyTorch's LR scheduler; it uses the clock with flexible scheduling units `epi` or `total_t` (sketched after this list)
- unify `VarScheduler` to use the standard `clock.max_tick_unit` specified from the env
- retire `action_policy_update`, update agent spec to `explore_var_spec`
- replace `entropy_coef` with `entropy_coef_spec`
- replace `clip_eps` with `clip_eps_spec` (PPO)
- update all specs
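A minimal sketch of the idea behind the variable scheduler, shown with linear decay (illustrative only; the actual `VarScheduler` supports the decay methods in `math_util.py` and reads its settings from specs such as `explore_var_spec`):

```python
class LinearVarScheduler:
    '''Linearly decay a variable from start_val to end_val over the clock ticks.'''

    def __init__(self, start_val, end_val, start_step, end_step):
        self.start_val, self.end_val = start_val, end_val
        self.start_step, self.end_step = start_step, end_step

    def update(self, tick):
        # tick is the clock value in the configured unit, e.g. epi or total_t
        if tick < self.start_step:
            return self.start_val
        if tick >= self.end_step:
            return self.end_val
        frac = (tick - self.start_step) / (self.end_step - self.start_step)
        return self.start_val + frac * (self.end_val - self.start_val)

# e.g. epsilon for epsilon-greedy exploration decaying from 1.0 to 0.1
explore_var = LinearVarScheduler(1.0, 0.1, start_step=0, end_step=10000)
print(explore_var.update(5000))  # 0.55
```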
Math util
- move decay methods to `math_util.py`
- move `math_util.py` from `algorithm/` to `lib/`
env max tick (breaking spec changes)
- spec/variable renamings: `max_episode` to `max_epi`, `max_timestep` to `max_t`, `save_epi_frequency` to `save_frequency`, `training_min_timestep` to `training_start_step`
- allow env to stop based on `max_epi` as well as `max_total_t`; propagate clock unit usage
- introduce `max_tick, max_tick_unit` properties to env and clock from the above
- allow `save_frequency` to use the same units accordingly
- update Pong and Beamrider to use `max_total_t` as the end condition
Update Ray to reenable CUDA in search
- update Ray from `0.3.1` to `0.5.3` to address broken GPU with PyTorch 1.0.0
- to fix CUDA not being discovered in the Ray worker, CUDA devices have to be set manually in the Ray remote function due to poor design
Improved logging and Enjoy mode
- Best-model checkpointing is measured using the `reward_ma`
- Early termination if the environment is solved
- the method for logging the learning rate to the session dataframe needed to be updated after the move to PyTorch `lr_scheduler`
- also removed `training_net` from the mean learning rate reported in the session dataframe, since its learning rate doesn't change
- update naming scheme to work with enjoy mode
- unify and simplify prepath methods
- info_space now uses a `ckpt` for loading the ckpt model. Example usage: `yarn start pong.json dqn_pong enjoy@data/dqn_cartpole_2018_12_02_124127/dqn_cartpole_t0_s0_ckptbest`
- update agent load and policy to properly set variables to `end_val` in enjoy mode
- random-seed the env as well
Working Atari
#242
The Atari benchmark had been failing, but the root cause has finally been discovered and fixed: wrong image preprocessing. Several factors can contribute, and we are doing ablation studies to check against the old code:
- image normalization scales the input values down by ~255, and the resultant loss is too small for the optimizer
- black frames in the stack at the beginning timesteps
- wrong image permutation
PR #242 introduces:
- a global environment preprocessor in the form of env wrappers borrowed from OpenAI Baselines, in `env/wrapper.py`
- a `TransformImage` to do the proper image transform: grayscale, downsize, and reshape from (w,h,c) to the PyTorch format (c,h,w) (see the sketch after this list)
- a `FrameStack` which uses `LazyFrames` for efficiency to replace the agent-specific Atari stack-frame preprocessing. This simplifies the Atari memories
- update convnet to use the honest shape (c,h,w) without extra transform, and remove its expensive image axis permutation since the input is now in the right shape
- update Vizdoom to produce (c,h,w) shape consistent with convnet input expectation
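A rough sketch of the image transform described above, in the style of a gym `ObservationWrapper` (the real wrapper lives in `env/wrapper.py` and is adapted from OpenAI Baselines; the class name and defaults here are illustrative):

```python
import cv2
import gym
import numpy as np

class TransformImageSketch(gym.ObservationWrapper):
    '''Grayscale, downsize, and move channels first so frames match PyTorch convnets.'''

    def __init__(self, env, width=84, height=84):
        super().__init__(env)
        self.width, self.height = width, height
        self.observation_space = gym.spaces.Box(
            low=0, high=255, shape=(1, height, width), dtype=np.uint8)

    def observation(self, frame):
        frame = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)  # grayscale
        frame = cv2.resize(frame, (self.width, self.height),
                           interpolation=cv2.INTER_AREA)  # downsize
        return frame[np.newaxis, :, :]  # channels-first (c, h, w) for PyTorch
```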
Tuned parameters will be obtained and released in the next version.
Attached is a quick training curve on Pong, DQN, where the solution avg is +18:
VizDoom, NN Weight Init, Plotly Update
Add VizDoom environment
- add new `OnPolicyImageReplay` and `ImageReplay` memories
- add VizDoom environment, thanks to @joelouismarino
Add NN Weight Initialization functionality
- allow specification of NN weight init function in spec, thanks to @mwcvitkovic
Update Plotly to v3
- move to v3 to allow Python based (instead of bash) image saving for stability
Fixes
Benchmarking; Reward Scaling; HydraDQN
Benchmark
- #177 #183 zip experiment data file for easy upload
- #178 #186 #188 #194 add benchmark spec files
- #193 add benchmark standard data to compute fitness
- #196 add benchmark mode
Reward scaling
- #175 add environment-specific reward scaling
HydraDQN
- #175 HydraDQN works on cartpole and 2dball using reward scaling. spec committed
Add code of conduct
- #199 add a code of conduct file for community
Misc
Distributed CUDA; DQN replace; AtariPrioritizedReplay
Enable Distributed CUDA
#170
Fix the long-standing issue with PyTorch + distributed training using `spawn` multiprocessing, caused by Lab classes not being pickleable. The class is now wrapped in an `mp_runner` passed as `mp.Process(target=mp_runner, args)`, so the classes don't get cloned from memory when spawning the process, since they are now passed in from outside.
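A hedged sketch of this pattern as I read it: the process target is a plain module-level `mp_runner` function, so only picklable arguments cross the `spawn` boundary (the `Session` below is a stand-in, not the lab's actual class):

```python
import torch.multiprocessing as mp

class Session:
    '''Stand-in for a lab class that would not survive naive pickling.'''
    def __init__(self, spec):
        self.spec = spec

    def run(self):
        print('running session for', self.spec['name'])

def mp_runner(spec):
    # build and run the class inside the child process from plain arguments
    Session(spec).run()

if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    proc = mp.Process(target=mp_runner, args=({'name': 'dqn_cartpole'},))
    proc.start()
    proc.join()
```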
DQN replace method fix
#169
DQN target network replacement was in the wrong direction. Fix that.
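For clarity, the corrected direction copies the online (training) network's parameters into the target network; a minimal PyTorch illustration:

```python
import torch.nn as nn

online_net = nn.Linear(4, 2)   # the network being trained
target_net = nn.Linear(4, 2)   # the lagged target network

# replace: target <- online (not the other way around)
target_net.load_state_dict(online_net.state_dict())
```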
AtariPrioritizedReplay
#170 #171
Add a quick `AtariPrioritizedReplay` via some multi-inheritance black magic with `PrioritizedReplay` and `AtariReplay`.
Remove Data Space History, Optimize Memory
This release optimizes RAM consumption and memory sampling speed after stress-testing with Atari. RAM growth is curbed, and replay memory RAM usage is now near the theoretical optimum.
Thanks to @mwcvitkovic for providing major help with this release.
Remove DataSpace history
- debug and fix memory growth (cause: data space saving history)
- remove history saving altogether, along with mdp data; remove aeb `add_single`. This changes the API.
- create `body.df` to track data efficiently as a replacement. This is the API replacement for the above.
Optimize Replay Memory RAM
#163 first optimization, halves replay RAM
- make memory state numpy storage `float16` to accommodate big memory sizes. With half a million `max_size`, virtual memory goes from 200GB to 50GB
- memory index sampling for training with a large size is very slow. Add a method `fast_uniform_sampling` to speed it up (sketched below)
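The gist of the sampling speedup, sketched with numpy (the actual `fast_uniform_sampling` method may differ in detail): draw batch indices directly with `randint` instead of building or shuffling an index list over the whole memory.

```python
import numpy as np

def fast_uniform_sample_idxs(current_size, batch_size):
    # uniform sampling with replacement; O(batch_size) regardless of memory size
    return np.random.randint(0, current_size, batch_size)

idxs = fast_uniform_sample_idxs(1_000_000, 32)
```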
#165 second optimization, halves replay RAM again to the theoretical minimum
- do not save `next_states` for replay memories due to redundancy
- replace with a sentinel `self.latest_next_states` during sampling (see the sketch after this list)
- 1 mil `max_size` for Atari replay now consumes 50GB instead of 100GB (was 200GB before float16 downcasting in #163)
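A sketch of why `next_states` is redundant and how a sentinel can stand in for the not-yet-stored latest next state (names are illustrative, not the memory class's exact API):

```python
import numpy as np

def sample_next_states(states, idxs, head, latest_next_state):
    # the next state of index i is just states[i + 1] in the circular buffer,
    # so a separate next_states array would double RAM for no new information
    next_idxs = (idxs + 1) % len(states)
    next_states = states[next_idxs].copy()
    # the most recent transition's next state is not in the buffer yet,
    # so a sentinel holds it until the next add
    next_states[idxs == head] = latest_next_state
    return next_states
```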
Add OnPolicyAtariReplay
- add OnPolicyAtariReplay memory so that policy based algorithms can be applied to the Atari suite.
Misc
- #157 allow usage as a python module via `pip install -e .` or `python setup.py install`
- #160 guard lab `default.json` creation on first install
- #161 fix agent save method, improve logging
- #162 split logger by session for easier debugging
- #164 fix N-Step-returns calculation
- #166 fix pandas weird casting breaking issue causing process to hang
- #167 uninstall unused tensorflow and tensorboard that come with Unity ML-Agents. rebuild Docker image.
- #168 rebuild Docker and CI images
v2.0.0 Singleton Mode, CUDA Support, Distributed Training
This major v2.0.0 release addresses user feedback on usability and feature requests:
- makes the singleton case (single-agent-env) default
- adds CUDA GPU support for all algorithms (except for distributed)
- adds distributed training to all algorithms (A3C-style)
- optimizes compute, fixes some computation bugs
Note that this release is backward-incompatible with v1.x. and earlier.
v2.0.0: make components independent of the framework so they can be used outside of SLM-Lab for development and production, and improve usability. Backward-incompatible with v1.x.
Singleton Mode as Default
- the singleton case (single-agent-env-body) is now the default. Any implementation need only worry about the singleton case. Uses the `Session` in lab.
- the space case (multi-agent-env-body) is now an extension of the singleton case. Simply add `space_{method}` to handle the space logic. Uses the `SpaceSession` in lab.
- make components more independent from the framework
- major logic simplification to improve usability. Simplify the AEB and init sequences. Remove `post_body_init()`
- make network update and grad norm check more robust
CUDA support
- add attribute `Net.cuda_id` for device assignment (per-network basis), and auto-calculate the `cuda_id` from the trial and session index to distribute jobs (see the sketch after this list)
- enable CUDA and add GPU support for all algorithms, except for distributed (A3C, DPPO, etc.)
- properly assign tensors to CUDA automatically depending on whether a GPU is available and desired
- run unit tests on machine with GTX 1070
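A minimal sketch of per-network device assignment from a `cuda_id` (the exact trial/session-index arithmetic in the lab may differ):

```python
import torch

def get_device(cuda_id=None):
    # spread jobs across visible GPUs when available, else fall back to CPU
    if cuda_id is not None and torch.cuda.is_available():
        return torch.device(f'cuda:{cuda_id % torch.cuda.device_count()}')
    return torch.device('cpu')

# e.g. derive a cuda_id from the trial and session indexes (illustrative formula)
trial_index, session_index, num_sessions = 0, 2, 4
device = get_device(trial_index * num_sessions + session_index)
```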
Distributed Training
- add `distributed` key to meta spec
- enable distributed training using PyTorch multiprocessing. Create a new `DistSession` class which acts as the worker.
- in distributed training, `Trial` creates the global networks for the agents, then passes them to and spawns `DistSession`s. Effectively, the semantics of a session change from being a disjoint copy to being a training worker (see the sketch after this list).
- make distributed usable for both the singleton (single agent) and space (multi-agent) cases.
- add distributed cases to unit tests
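A Hogwild-style sketch of the worker semantics (illustrative; `DistSession` and the lab's global-network plumbing are more involved): the trial-level code builds a global network, shares its memory, and each session process optimizes it directly.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def dist_session(global_net):
    # each worker acts as a training worker on the shared parameters
    optimizer = torch.optim.Adam(global_net.parameters(), lr=1e-3)
    for _ in range(100):
        x, y = torch.randn(8, 4), torch.randn(8, 2)
        loss = nn.functional.mse_loss(global_net(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

if __name__ == '__main__':
    global_net = nn.Linear(4, 2)
    global_net.share_memory()  # parameters live in shared memory across processes
    workers = [mp.Process(target=dist_session, args=(global_net,)) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```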
State Normalization
- add state normalization using running mean and std: `state = (state - mean) / std` (see the sketch after this list)
- apply to all algorithms
- TODO: conduct a large-scale systematic study of the effect of state normalization vs. without it
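A minimal sketch of running mean/std state normalization matching the formula above (a Welford-style online update; the lab's actual implementation may differ):

```python
import numpy as np

class RunningNorm:
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps = eps

    def update(self, state):
        # Welford-style online update of the running mean and variance
        self.count += 1
        delta = state - self.mean
        self.mean += delta / self.count
        self.var += (delta * (state - self.mean) - self.var) / self.count

    def normalize(self, state):
        return (state - self.mean) / (np.sqrt(self.var) + self.eps)

norm = RunningNorm(shape=(4,))
for _ in range(1000):
    norm.update(np.random.randn(4) * 3.0 + 1.0)
print(norm.normalize(np.random.randn(4) * 3.0 + 1.0))
```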
Bug Fixes and Improvements
- `save()` and `load()` now include network optimizers
- refactor `set_manual_seed` to util
- rename `StackReplay` to `ConcatReplay` for clarity
- improve network training check of weights and grad norms
- introduce `BaseEnv` as a base class to `OpenAIEnv` and `UnityEnv`
- optimize computations, major refactoring
- update Dockerfile and release
Misc
- #155 add state normalization using running mean and std
- #154 fix A2C advantage calculation for Nstep returns
- #152 refactor SIL implementation using multi-inheritance
- #151 refactor Memory module
- #150 refactor Net module
- #147 update grad clipping, norm check, multicategorical API
- #156 fix multiprocessing for device with cuda, without using cuda
- #156 fix multi policy arguments to be consistent, and add missing state append logic
PPOSIL, fix continuous actions and PPO
This release adds PPOSIL, fixes some small issues with continuous actions, and PPO ratio computation.
Implementations
#145 Implement PPOSIL. Improve debug logging
#143 add Arch installer thanks to @angel-ayala
Bug Fixes
#138 kill hanging processes of Electron for plotting
#145 fix PPO's wrong graph update sequence causing the ratio to be 1. Fix continuous action output construction. Add guards.
#146 fix continuous actions and add full tests