Add has_suggested/observed, new test-suite and re-implementation of ASHA #604

Merged: 38 commits into Epistimio:develop from hotfix/algo_observed, May 13, 2021

Conversation

bouthilx (Member)

The initial goal was to fix #551. To do so I added has_suggested and has_observed so that the algorithm can tell which trials were actually observed and which were ignored the last time the producer called algorithm.observe. This is needed because Hyperband/ASHA ignore trials that they did not sample themselves; otherwise they cannot know in which rung the trials should be placed.

Adapting all algorithms and their tests turned out to be very tedious. To solve this I implemented a generic test-suite that algorithms can inherit; it ensures large coverage of algorithm behaviors by default and makes it easier to introduce new algorithm methods in the future without re-implementing many tests. These generic tests uncovered additional bugs, and the work for this pull request escalated into a full re-implementation of ASHA on top of Hyperband plus many fixes across the different algorithms.

Addition of has_suggested/observed

To avoid passing the same points to the algorithms multiple times,
the producer needs to know whether an algorithm actually observed a
point. Until now the producer was keeping track of the points passed
to infer which ones to pass later on, but some algorithms do not observe
points until they have suggested them (e.g. Hyperband). This caused issues
with the resumption of these algorithms: they would ignore the points
and would never be presented with them again.

Addition of n_suggested/observed

The algorithms now have n_suggested and n_observed properties.
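Continuing the sketch above, these properties could simply count entries in `_trials_info` (again illustrative, added to the sketch class):

```python
    @property
    def n_suggested(self):
        """Number of points suggested so far."""
        return len(self._trials_info)

    @property
    def n_observed(self):
        """Number of suggested points that have received a result."""
        return sum(result is not None for _, result in self._trials_info.values())
```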

Add generic test-suite for algorithms

Why:

The use-cases to test are basically the same across all algorithms.
Adding new generic methods or modifying existing ones makes maintaining
the tests tedious, even though the tests are very consistent across
algorithms. With a test-suite we ensure that all basic use-cases are
covered for every algorithm, and maintenance becomes much easier.

How:

Define a base class with all tests and specialization methods that
subclasses can override to customize part of the tests.

Many methods are decorated as phases, which makes them parametrizable
with optimization phases. For instance, all these tests will run once in
the random search phase of TPE and again in its Bayesian optimization
phase. The same applies for Hyperband and its derivatives, to test the
first phase of random search, different phases of promotions, and
phases of bracket repetitions.
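A hedged sketch of how such phase parametrization can be wired with pytest; the `pytest_generate_tests` hook is real pytest, but the class and attribute names are illustrative skeletons rather than Orion's exact test-suite:

```python
def pytest_generate_tests(metafunc):
    # Run every test that requests the `phase` fixture once per phase
    # declared on its test class.
    if "phase" in metafunc.fixturenames:
        phases = getattr(metafunc.cls, "phases", [("default", 0)])
        metafunc.parametrize("phase", phases, ids=[name for name, _ in phases])


class BaseAlgoTests:
    """Generic tests that every algorithm's test class would inherit."""

    phases = [("default", 0)]

    def create_algo(self):
        raise NotImplementedError  # Subclasses build their algorithm here.

    def test_state_dict_roundtrip(self, phase):
        name, num_to_observe = phase
        algo = self.create_algo()
        # Drive `algo` into the phase by observing `num_to_observe` dummy
        # trials, then assert that set_state(state_dict()) restores the
        # algorithm's behavior exactly.


class TestTPE(BaseAlgoTests):
    # Every generic test runs once in TPE's random-search phase and once
    # after enough observations to reach its Bayesian optimization phase.
    phases = [("random", 0), ("bo", 10)]
```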

Change sampling scheme

I changed the sampling scheme to make the tests easier and the behavior of the algorithms more flexible. The argument to suggest
can be either num, in which case the algorithm samples up to num points if it can, or None, in which case the algorithm
samples as many points as possible for efficiency. For example, Hyperband can return all samples for the first rungs or all promotions at once, and a Bayesian optimization algorithm can return all its random initial points at once.
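In sketch form (the `_sample_batch` helper is hypothetical), the contract looks like this:

```python
def suggest(self, num=None):
    """Return up to `num` new points; `num=None` means as many as possible."""
    if num is None:
        # Return everything available at once, e.g. a whole first rung for
        # Hyperband, or all random initialization points for a BO algorithm.
        return self._sample_batch(limit=None)
    return self._sample_batch(limit=num)
```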

Fix bugs in TPE

The Bayesian optimization part of TPE was not reproducible.

Fix bugs in Hyperband/ASHA

Hyperband

The repetitions were not saved properly in the state_dict, which led to
unreproducible results. Also, with the new sampling scheme Hyperband can
either return all samples/promotions at once or return num at a
time. Finally, the rework makes it possible to reimplement ASHA on top
of Hyperband instead of maintaining two very similar implementations.
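A sketch of the state_dict fix; the field names are illustrative, the point being that the repetition bookkeeping must be part of the state, not only the rungs and RNG:

```python
def state_dict(self):
    """Persist everything needed for reproducible resumption (sketch)."""
    return {
        "rng_state": self.rng.get_state(),
        "brackets": [bracket.state_dict() for bracket in self.brackets],
        # Previously missing: without this, a resumed Hyperband restarted
        # bracket repetitions from scratch and produced different results.
        "repetition_id": self.repetition_id,
    }
```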

ASHA

Implementing ASHA on top of Hyperband makes it much leaner and easier to
maintain.

Also, the strategy for allocating samples to brackets is now
deterministic. This makes ASHA easier to test and its behavior more
predictable. Once the first rungs are full according to the budget, they
cannot accept more trials, but this new implementation of ASHA supports
repeating brackets when the first ones are filled.
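A minimal sketch of such a deterministic allocation, with hypothetical method and attribute names:

```python
def pick_bracket(self):
    """Assign the next trial to the first bracket with budget left (sketch)."""
    for bracket in self.brackets:
        if bracket.n_trials(rung=0) < bracket.budget(rung=0):
            return bracket
    # Every first rung is full: start a new repetition of the brackets.
    new_brackets = self.create_brackets()
    self.brackets.extend(new_brackets)
    return new_brackets[0]
```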

Fix bugs in EvolutionaryES

  • It was non-reproducible. It now passes eves.rng to the mutation
    functions (see the sketch after this list).
  • Categorical dimensions could only support numeric data.
  • The mutations would return duplicates (of samples from previous rungs).
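For the first fix, the key change is threading the algorithm's seeded RNG through the mutation functions instead of relying on global randomness. A sketch, with illustrative dimension handling that also covers the categorical fix:

```python
def mutate(rng, dimension, value):
    """Mutate one dimension using the algorithm's own seeded RNG (sketch)."""
    if dimension.type == "categorical":
        # Sample by index so non-numeric categories (e.g. strings) work.
        categories = list(dimension.categories)
        return categories[rng.randint(len(categories))]
    # Numeric dimension: jitter the value (the scale here is illustrative).
    return value + rng.normal(0.0, 0.1)
```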

Producer register trials in batch

It is faster to register trials all in a batch and continue suggesting if
some failed than to roll back and restart each time one trial fails
to register (because it is a duplicate). With the new sampling scheme there
should be no duplicates locally anyway. Duplicates can only appear when
there are multiple workers, in which case the first trial and all following
ones would be duplicates, and thus the whole batch must be discarded.
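A sketch of the batching logic; the DuplicateKeyError name and the storage interface are assumptions for illustration:

```python
class DuplicateKeyError(Exception):
    """Raised by the storage when an identical trial already exists."""


def register_batch(storage, trials):
    """Register trials one by one; stop at the first duplicate (sketch)."""
    registered = 0
    for trial in trials:
        try:
            storage.register_trial(trial)
        except DuplicateKeyError:
            # A concurrent worker registered the same points; every later
            # trial in this batch would also be a duplicate, so trash it.
            break
        registered += 1
    return registered
```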

@bouthilx added the bug, enhancement, and documentation labels on Apr 27, 2021
@bouthilx added this to the v0.1.15 milestone on Apr 27, 2021
@bouthilx requested a review from donglinjy on Apr 28, 2021 15:00
src/orion/algo/tpe.py: resolved review threads (outdated)
src/orion/algo/random.py: resolved review thread (outdated)
```diff
@@ -95,12 +95,12 @@ def produce(self):
         self._sample_guard(start)

         log.debug("### Algorithm suggests new points.")
-        new_points = self.naive_algorithm.suggest(self.pool_size)
+        new_points = self.naive_algorithm.suggest()
```
Collaborator

Should we limit the number of points suggested or persisted to be smaller than max_trials? Otherwise there will be a bunch of new trials in the DB that are never executed.

Member Author

I would be in favor of registering all returned trials. We never know if some running trials will fail, forcing the producer to generate new trials. At least they would already be present. Does that seem reasonable?

Collaborator

I was more wondering how this will be exposed to the user. If there are too many trials in new or pending status when they query the experiment status, it may be a bit confusing. Although I saw that algorithms like grid search actually limit the grid size to max_trials.

Member Author

Hum, but the algorithm should be able to sample more than max_trials if many of them fail. That's a problem with grid search by construction: it samples the grid and that's it. But the other algorithms can sample more. Maybe it should be something like max(max_trials - self.n_suggested, 1)? That way the algorithm can still return more than max_trials but overshoots as little as possible.

Collaborator

Yeah, this seems better.

Member Author

I opted for giving this responsibility to the producer since it knows how many broken trials there are and can adjust the request accordingly. The method suggest() would have no default, and algorithms would be expected to return as many points as possible, up to num. I would be happy to have your feedback on this.

Collaborator

Looks good to me. Do you think it is necessary to validate that the number of points suggested by the algorithm is actually smaller than the requested number?

Member Author

I would rely on the unit tests of the algorithms for this. If it ever happens that the algo returns more points, it will be better if the producer registers them anyway; otherwise there will be issues with the algorithm thinking the additional trials were registered when they weren't.

```
@@ -55,17 +55,15 @@ def suggest(self, num=1):
        .. note:: New parameters must be compliant with the problem's domain
           `orion.algo.space.Space`.
        """
        if num is None:
            num = 1
```
Collaborator

Should it just go with max_trials like grid search?

Member Author

It would depend on what we decide here: #604 (comment)

Member Author

So, based on f6aea11 there would be no default num anymore. Typically the producer would request max_trials from the random search at the beginning, which would return the whole set of trials to run.

bouthilx (Member Author) commented May 7, 2021

Waiting on #605 to be merged before rebasing this PR.

bouthilx added 18 commits May 12, 2021 09:49
Why:

The fidelity index may be used by algorithms other than Hyperband, and
get_id is used by all algorithms.
bouthilx added 19 commits May 12, 2021 09:51
Why:

The call was modifying the passed argument, eventually corrupting the
data passed by the algorithm.
Why:

All algorithms were duplicating the method to register trials inside
`_trials_info`. This should be hidden under a generic `register` method.
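In sketch form, matching the bookkeeping sketched earlier in this thread:

```python
def register(self, point, result=None):
    """Record a suggested (and possibly observed) point in `_trials_info`."""
    self._trials_info[self.get_id(point)] = (point, result)
```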
Why:

The algo does not know if some trials failed. It should not stop until
`max_trials` points have been observed; otherwise we would never reach
`max_trials` if some are broken.
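A sketch of the resulting stopping criterion, assuming the n_observed property from earlier:

```python
@property
def is_done(self):
    """The algorithm is done once enough trials were actually observed."""
    # Counting suggested trials instead would let broken trials, which are
    # never observed, prevent the experiment from reaching max_trials.
    return self.n_observed >= self.max_trials
```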
Why:

This is more efficient. :)
Why:

The producer should be responsible for requesting as many trials as
necessary to reach `max_trials`. It is then up to the algo to return as
many as possible.
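A sketch of the producer side, using the max(max_trials - n_suggested, 1) idea from the review thread above; attribute names are illustrative:

```python
def produce(self):
    """Request just enough new trials to reach max_trials (sketch)."""
    num = max(self.max_trials - self.algorithm.n_suggested, 1)
    # The algorithm returns as many points as it can, up to `num`; it may
    # overshoot max_trials only when broken trials force resampling.
    new_points = self.algorithm.suggest(num)
    self.register_batch(new_points)
```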
@bouthilx force-pushed the hotfix/algo_observed branch from 02dae48 to d074e85 on May 12, 2021 14:08
Why:

Now that random search samples `max_trials` we need to look at
`completed` trials to infer how many were executed.
donglinjy (Collaborator) left a comment

Looks fantastic. 👍

@bouthilx merged commit 0c5c74d into Epistimio:develop on May 13, 2021
@bouthilx deleted the hotfix/algo_observed branch on May 13, 2021 15:34
Labels
bug: Indicates an unexpected problem or unintended behavior
documentation: Regarding description, documentation, tutorials, etc.
enhancement: Improves a feature or non-functional aspects (e.g., optimization, prettify, technical debt)
Development

Successfully merging this pull request may close these issues.

Hyperband loses completed trials at restart
2 participants