Plot of CPE during training #381

Open
wall-ed-coder opened this issue Jan 28, 2021 · 0 comments

wall-ed-coder (Contributor) commented Jan 28, 2021

I want to compare two models using CPE, and also to compute CPE during training for my own task, so I decided to start with the CartPole toy problem as an example. After reading the tutorial and the paper, I assumed that CPE output would be written to files during training and could be visualized in TensorBoard, but nothing like this happens. Can you tell me why?
I ran the offline learning tutorial end to end. Here are the steps I executed:

1) export CONFIG=reagent/workflow/sample_configs/discrete_dqn_cartpole_offline.yaml
2) ./reagent/workflow/cli.py run reagent.workflow.gym_batch_rl.offline_gym $CONFIG
3) mvn -f preprocessing/pom.xml clean package
4) rm -Rf spark-warehouse derby.log metastore_db preprocessing/spark-warehouse preprocessing/metastore_db preprocessing/derby.log
5) ./reagent/workflow/cli.py run reagent.workflow.gym_batch_rl.timeline_operator $CONFIG
6) ./reagent/workflow/cli.py run reagent.workflow.training.identify_and_train_network $CONFIG
7) ./reagent/workflow/cli.py run reagent.workflow.gym_batch_rl.evaluate_gym $CONFIG
8) tensorboard --logdir outputs/

And I ran into a problem at the last, 8th step: the command `tensorboard --logdir outputs/` showed nothing in TensorBoard, as if it could not find any data. I assumed there would be an `outputs` folder containing the required logs/data, but I did not find such a folder. I changed the last step to `tensorboard --logdir .`, and that helped: I could see the losses from training, but there was still no CPE.
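For what it's worth, here is a minimal sketch of what I expected to happen (my own illustration, not ReAgent code; using `outputs/` as the log directory is my assumption, not something I found in the repo):

```python
# Minimal sketch: if the SummaryWriter were created with log_dir="outputs/",
# then `tensorboard --logdir outputs/` would pick up the data.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="outputs/")                  # "outputs/" is my assumption
writer.add_scalar("training/td_loss", 0.5, global_step=0)   # dummy value, just for illustration
writer.close()
```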
In order to get CPE output into TensorBoard, I added the line `cpe_details.log_to_tensorboard()` in `reagent/training/dqn_trainer_base.py` after line 284, ran the training as before, and looked at TensorBoard with `tensorboard --logdir .`. That gave me CPE output, but only a single point (as I understand it, CPE is computed only at the end of training on the test data, which is why there is one point). How can I get the full curve of CPE values over training?

However, this is not the only problem. Besides getting only one point, the CPE values themselves seem incorrect. The normalized values sometimes reach exorbitant magnitudes, such as 200 (in Sequential_Doubly_Robust, MAGIC, Weighted_Sequential_Doubly_Robust), although more often they stay within 10. And for the Direct_Method_Reward, Doubly_Robust_Reward, and IPS_Reward estimators, the values are always 0.90-0.98. As far as I understand, the normalized value shows how many times better the new agent is than the old one used to generate the data, and a value of, say, 10 cannot be right for CartPole when a random policy was used to generate the data.
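For reference, this is roughly the kind of per-evaluation logging I was hoping for (a sketch only: the dict of estimator names to normalized scores is my own illustration, and assembling it from `CpeDetails` is exactly the part I don't know how to do):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="outputs/")  # directory name is my assumption

def log_cpe_point(epoch: int, normalized_scores: dict) -> None:
    """Log one scalar per estimator per evaluation pass, so TensorBoard draws a
    curve over training instead of a single point."""
    for estimator_name, value in normalized_scores.items():
        writer.add_scalar(f"cpe/{estimator_name}", value, global_step=epoch)

# Hypothetical usage after each evaluation pass:
# log_cpe_point(epoch, {"MAGIC": 1.05, "Sequential_Doubly_Robust": 1.10})
```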
I ran the tutorial on macOS 10.15.6
commit bc11359

Could you please also elaborate on the following, since it may relate to the issues I've faced?

1) I found two Evaluator classes in ReAgent (in `reagent/evaluation/evaluator.py` and `reagent/ope/estimators/estimator.py`); could you explain the difference between them? And what is the difference between the `reagent/ope/estimators` module and the `reagent/evaluation` module in general?
2) In the tutorial, the command `./reagent/workflow/cli.py run reagent.workflow.gym_batch_rl.offline_gym $CONFIG` generates data from a random policy, but how can I generate data from a trained policy instead? (A rough sketch of what I imagine is below.)
3) What probabilities should we use when computing the CPE algorithms (e.g., importance sampling) for DQN: the log-prob of the greedy policy (i.e., essentially the probabilities after applying argmax), or should we take softmax(Q-values) and treat that as the action probability? (See the softmax sketch below.)
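Regarding question 2: what I imagine is a generic Gym rollout like the sketch below (my own illustration, not ReAgent API; `trained_model.act` is hypothetical, and I don't know how to convert the transitions into the schema the timeline operator expects):

```python
import gym

def collect_episodes(policy_fn, num_episodes: int = 100):
    """Roll out policy_fn(observation) -> action and collect raw transitions.
    These would still need to be converted into the table format that the
    timeline operator consumes, which is the part I'm missing."""
    env = gym.make("CartPole-v0")
    transitions = []
    for _ in range(num_episodes):
        obs, done = env.reset(), False
        while not done:
            action = policy_fn(obs)
            next_obs, reward, done, _ = env.step(action)
            transitions.append((obs, action, reward, next_obs, done))
            obs = next_obs
    return transitions

# Hypothetical usage with a trained policy:
# transitions = collect_episodes(lambda obs: trained_model.act(obs))
```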
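Regarding question 3: by the softmax variant I mean something like this (the temperature is arbitrary, and I don't know which convention the ReAgent estimators actually expect):

```python
import torch

def softmax_propensities(q_values: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Treat softmax over Q-values as action probabilities for importance sampling.
    q_values has shape (batch, num_actions); the result has the same shape."""
    return torch.softmax(q_values / temperature, dim=-1)

# The greedy alternative would put probability ~1 on argmax(Q) and ~0 elsewhere,
# which makes the importance weights degenerate for non-greedy logged actions.
q = torch.tensor([[1.0, 2.0]])
print(softmax_propensities(q))  # tensor([[0.2689, 0.7311]])
```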
