I’m going to implement a comparison of two models using CPE, as well as compute CPE during training for my own task, so I decided to start with the CartPole toy problem as an example. After reading the tutorial and the article, I assumed that CPE output would be stored in files during training and could be visualized in TensorBoard; however, nothing like this happens. Can you tell me why?
I ran the tutorial on offline learning. Here are the steps I executed, following the tutorial point by point:
1) export CONFIG=reagent/workflow/sample_configs/discrete_dqn_cartpole_offline.yaml
2) ./reagent/workflow/cli.py run reagent.workflow.gym_batch_rl.offline_gym $CONFIG
3) mvn -f preprocessing/pom.xml clean package
4) rm -Rf spark-warehouse derby.log metastore_db preprocessing/spark-warehouse preprocessing/metastore_db preprocessing/derby.log
5) ./reagent/workflow/cli.py run reagent.workflow.gym_batch_rl.timeline_operator $CONFIG
6) ./reagent/workflow/cli.py run reagent.workflow.training.identify_and_train_network $CONFIG
7) ./reagent/workflow/cli.py run reagent.workflow.gym_batch_rl.evaluate_gym $CONFIG
8) tensorboard --logdir outputs/
I ran into a problem at the last (8th) step: the command tensorboard --logdir outputs/ did not show anything in TensorBoard, as if it found no data. I assumed there was an outputs folder containing the required logs/data, but I could not find such a folder. I changed the command in the last step to tensorboard --logdir . and that helped: I was able to see the training losses, but there was still no CPE.
In order to get CPE output into TensorBoard, I added the line cpe_details.log_to_tensorboard() in the file reagent/training/dqn_trainer_base.py after line 284, started the training as before, and looked at TensorBoard again with tensorboard --logdir .. This way I did get CPE output, but only a single point (as I understand it, CPE is computed only at the end of training on the test data, which is why there is only one point). How can I get the full curve of CPE values over training?

However, this is not the only problem. Besides the fact that only one point is output, the CPE values themselves seem to be incorrect. The normalized values sometimes reach exorbitant levels, such as 200 (for Sequential_Doubly_Robust, MAGIC, Weighted_Sequential_Doubly_Robust), although more often they stay within 10. And for the Direct_Method_Reward, Doubly_Robust_Reward, and IPS_Reward estimators, the values are always 0.90-0.98. As far as I understand, the normalized value shows how many times better the new agent is compared with the old one used to generate the data, so a value of, say, 10 is not plausible for CartPole when a random policy was used to generate the data.
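For reference, here is a minimal sketch of what I mean by a "full curve": it is not ReAgent's actual logging code, and the per-epoch CPE numbers are dummy placeholders, but it shows how writing one scalar per estimator per epoch with torch.utils.tensorboard would give TensorBoard a curve instead of a single point.

```python
# A minimal sketch, assuming one CPE result per training epoch; the numbers below are
# dummy placeholders standing in for whatever the evaluator actually produces.
from torch.utils.tensorboard import SummaryWriter

cpe_results_per_epoch = [
    {"MAGIC": 1.1, "Sequential_Doubly_Robust": 1.3},
    {"MAGIC": 1.2, "Sequential_Doubly_Robust": 1.5},
]

writer = SummaryWriter(log_dir="outputs/cpe")
for epoch, scores in enumerate(cpe_results_per_epoch):
    for name, value in scores.items():
        # One scalar per estimator per epoch produces a full curve in TensorBoard.
        writer.add_scalar(f"CPE/{name}", value, global_step=epoch)
writer.close()
```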
I ran the tutorial on macOS 10.15.6, commit bc11359.
Could you please also elaborate on the following points, since they may also relate to the issues I've faced?
I found two Evaluator classes in ReAgent (in reagent/evaluation/evaluator.py and reagent/ope/estimators/estimator.py); could you explain the difference between them? And what is the difference between the reagent/ope/estimators module and the reagent/evaluation module?
In the tutorial, the command ./reagent/workflow/cli.py run reagent.workflow.gym_batch_rl.offline_gym $CONFIG creates data from a random policy, but how can I generate data from a trained one?
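To illustrate what I mean, here is a hypothetical sketch of rolling out a non-random policy in Gym and recording transitions together with their logging propensities; the trained_policy heuristic below is only a placeholder for an actual trained agent, not ReAgent code.

```python
# A hypothetical sketch of collecting offline data from a non-random policy.
import gym


def trained_policy(obs):
    # Placeholder for a real trained policy; this hand-coded CartPole heuristic
    # stands in for a learned Q-network's greedy action.
    return int(obs[2] + obs[3] > 0)


env = gym.make("CartPole-v0")
transitions = []
for _ in range(100):
    obs = env.reset()
    done = False
    while not done:
        action = trained_policy(obs)
        next_obs, reward, done, _ = env.step(action)
        # The offline workflow also needs the logging propensity of the chosen action;
        # for a deterministic greedy policy that would be (close to) 1.0.
        transitions.append((obs, action, 1.0, reward, next_obs, done))
        obs = next_obs
env.close()
```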
What probabilities should we use when computing CPE algorithms (e.g., importance sampling) for DQN? The log-prob of the greedy policy (i.e., essentially the probabilities obtained after applying the greedy policy), or should we just take Softmax(Q-values) and treat that as the probability?
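For concreteness, here is a small sketch of the two options I have in mind for turning one state's Q-values into action propensities; it is only an illustration, not a claim about what ReAgent does internally.

```python
# A sketch contrasting two ways of deriving action propensities from DQN Q-values
# for importance-sampling style CPE. The Q-values here are made up.
import torch
import torch.nn.functional as F

q_values = torch.tensor([[1.2, 0.4]])  # hypothetical Q-values: one state, two actions

# Option 1: softmax over Q-values (optionally with a temperature) as a stochastic policy.
temperature = 1.0
softmax_propensities = F.softmax(q_values / temperature, dim=1)

# Option 2: (epsilon-)greedy policy, i.e. probability mass concentrated on argmax(Q).
epsilon = 0.05
num_actions = q_values.shape[1]
greedy_propensities = torch.full_like(q_values, epsilon / num_actions)
greedy_propensities[0, q_values.argmax(dim=1)] += 1.0 - epsilon

print(softmax_propensities, greedy_propensities)
```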