# Distillation Policy Optimization

PyTorch implementation of Distillation Policy Optimization (DPO) for discrete domains: a general learning framework for a family of on-policy algorithms that fully engages off-policy data.

## Changes Made

To accommodate discrete actions, we only change the network architecture: the observation is encoded first and then concatenated with the action to predict the action-value function and the residual function. No algorithmic component outlined in the paper needs to be modified.
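A minimal sketch of what such a head could look like, assuming an MLP observation encoder and a one-hot action encoding (the module and layer names here are illustrative, not the repository's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiscreteQResidualHead(nn.Module):
    """Hypothetical sketch: encode the observation, then concatenate the
    encoding with a one-hot action to predict Q(s, a) and the residual."""

    def __init__(self, obs_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.num_actions = num_actions
        # Observation encoder (an MLP here; an Atari agent would use a CNN).
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        # Separate heads over the concatenated (encoding, action) features.
        self.q_head = nn.Linear(hidden_dim + num_actions, 1)
        self.residual_head = nn.Linear(hidden_dim + num_actions, 1)

    def forward(self, obs, action):
        # obs: (batch, obs_dim) floats; action: (batch,) long action indices.
        h = self.encoder(obs)
        a = F.one_hot(action, self.num_actions).float()
        ha = torch.cat([h, a], dim=-1)
        return self.q_head(ha), self.residual_head(ha)
```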

## How To Use

We take the hyperparameters from the original PPO paper and the Pytorch-PPO repository:

### PPO

```bash
python main.py --env-name "Breakout-v4" --learner PPO --clip-param 0.2 --log-dir "logs" --seed 0 --log-interval 2 --eval-interval 2 --num-steps 128 --num-processes 8 --lr 3e-4 --dpo-epoch 4 --num-mini-batch 4 --gamma 0.99 --gae-lambda 0.95 --num-env-steps 100000
```

## Requirements

```bash
pip install -r requirements.txt
```
