Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PBT implementation #705

Merged
merged 28 commits into from
Jan 26, 2022
Merged

PBT implementation #705

merged 28 commits into from
Jan 26, 2022

Conversation

bouthilx
Copy link
Member

@bouthilx bouthilx commented Dec 1, 2021

The algorithm Population Based Training is implemented using trees with jump edges across them to keep track of jumps across trees during exploitation phases. These edges are particularly important to keep track of trial ancestry. If a trial is broken, we look at the true ancestor (from original tree instead of exploited one) to avoid being stuck on a tree that often leads to broken trials and also to increase likelihood of exploiting other trees. This also facilitates the backtracking strategy and visualisations of population evolution.

Set and create exp working dir inside workon

The directory was created inside the Consumer which is specific to the
cmdline interface. It should be done inside workon so that the python
and cmdline API uses the same logic. Also the generic WorkingDir class
was not handling directly cases where experiment has or not a defined
working dir. There is no apparent need for a generic class to create
temporary directory so this commit rework the class to handle the
experiment object directly. This also allows to handle in exit
resetting experiment.working_dir when it is set temporarily.

Infer trial working dir based on exp.working_dir

Why:

The trial working dir should be unique to the trial and depend on the
experiment's working dir. We can use the id of the trial (or variants
ignoring fidelity, experiment id or lies) to define a unique working
dir.

How:

Instead of setting the working dir directly, we set the experiment's
working dir.

Add parent attribute to Trial

Why:

Now that we can branch a trial we need to keep a trace to the trial's
ancestry. Also, when branching a trial but keeping the same
hyperparameters, it should not lead to the same ID since it is now a
separate trial that will be executed independently. The hash of a trial
will thus also depend on the value of the parent.

How:

For simplicity the parent attribute only reference to the ID of the
parent trial (full id), just like it is done for the reference to the
experiment. We should use a lazy tree node implementation like for the
EVC experiment tree node to reference trial parent object instead of
trial id. This should be done in the future TrialClient class.

Move tree.py to utils

Why:

The tree node will be used for PBT and probably for trial objects as
well in the future so it must be generalized.

Add Tree.node_depth and Tree.get_nodes_at_depth

We need these methods for fast retrievals in the tree in PBT algorithm.

@bouthilx bouthilx changed the title WIP: PBT implementation PBT implementation Dec 15, 2021
@bouthilx bouthilx marked this pull request as ready for review December 15, 2021 13:59
@bouthilx bouthilx requested a review from donglinjy December 15, 2021 13:59
logger = logging.getLogger(__name__)


class BaseExploit:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An overall comments about Exploit, for Deep Learning training Trial, how do we expect user to re-use the parent model weights?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The weights should be saved in a file (any name, it does not matter) in trial.working_dir. PBT takes care of copying the files over to child trials.

This is explained here, but maybe it should be explained in the docstring of BaseExploit too?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess so, user may need to know where they should save and load the weights in the their code, as we only do the copy, right? the actual save and load still depends on user code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed the save and load depends on user code, otherwise we would need to write specific versions for Pytorch, Tensorflow and other frameworks. Do you think I should improve the documentation of PBT to make this more clear?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, may be mention that with PBT in Orion, where user code should save and load weights, so an under-performing model will be replaced with better performing one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to documentation: 6c9b5ff

Do you think there is something missing?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good for me. just one more thing, do we have any place explain in user code, how do they refer to the trial.working_dir, as a env value or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is documented here: https://orion.readthedocs.io/en/stable/user/script.html#command-line-templating. I'll reference it in PBT's doc.

The directory was created inside the Consumer which is specific to the
cmdline interface. It should be done inside workon so that the python
and cmdline API uses the same logic. Also the generic WorkingDir class
was not handling directly cases where experiment has or not a defined
working dir. There is no apparent need for a generic class to create
temporary directory so this commit rework the class to handle the
experiment object directly. This also allows to handle in __exit__
resetting experiment.working_dir when it is set temporarily.
Why:

The trial working dir should be unique to the trial and depend on the
experiment's working dir. We can use the id of the trial (or variants
ignoring fidelity, experiment id or lies) to define a unique working
dir.

How:

Instead of setting the working dir directly, we set the experiment's
working dir.
Why:

Now that we can branch a trial we need to keep a trace to the trial's
ancestry. Also, when branching a trial but keeping the same
hyperparameters, it should not lead to the same ID since it is now a
separate trial that will be executed independently. The hash of a trial
will thus also depend on the value of the parent.

How:

For simplicity the parent attribute only reference to the ID of the
parent trial (full id), just like it is done for the reference to the
experiment. We should use a lazy tree node implementation like for the
EVC experiment tree node to reference trial parent object instead of
trial id. This should be done in the future TrialClient class.
Why:

The tree node will be used for PBT and probably for trial objects as
well in the future so it must be generalized.
We need these methods for fast retrievals in the tree in PBT algorithm.
bouthilx and others added 8 commits January 19, 2022 14:07
Why:

Database filled with previous version of Oríon will have trials with no
exp_working_dir. We must make sure that these trials when suggested will
have their exp_working_dir set properly before being passed to the
function to optimize.
Why:

We often need to compare trials and always relying on some specific
attributes is cumbersome. We can use trial.id to easily support the
__eq__ operator.
The name Lineage name was confusing. The class is not for a full
Lineage and rather for a single node of the lineage.
Copy link
Collaborator

@donglinjy donglinjy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks

@bouthilx bouthilx merged commit 845cfb7 into Epistimio:develop Jan 26, 2022
@bouthilx bouthilx mentioned this pull request Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Introduces a new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants