Add has_suggested/observed, new test-suite and re-implementation of ASHA #604
Conversation
src/orion/core/worker/producer.py (outdated)

```diff
@@ -95,12 +95,12 @@ def produce(self):
         self._sample_guard(start)

         log.debug("### Algorithm suggests new points.")
-        new_points = self.naive_algorithm.suggest(self.pool_size)
+        new_points = self.naive_algorithm.suggest()
```
Should we limit the number of points being suggested or persisted to be smaller than `max_trials`? Otherwise there will be a bunch of `new` trials in the db that never get executed.
I would be in favor of registering all returned trials. We never know if some running trials will fail, forcing the producer to generate new trials. At least they would already be present. Does that seem reasonable?
I was more wondering how this will be exposed to the user. If there are too many `new` or `pending` status trials when they query the experiment status, it may be a little confusing. Although I saw that algorithms like `gridsearch` actually limit the grid size to `max_trials`.
Hum, but the algorithm should be able to sample more than `max_trials` if many of them fail. That's a problem with grid search by construction: it samples the grid and that's it. But the other algorithms can sample more. Maybe it should be something like `max(max_trials - self.n_suggested, 1)`? So that the algorithm can still return more than `max_trials` but overshoots as little as possible.
Yeah, this seems better.
I opted for giving this responsibility to the producer since it knows how many broken trials there are and can adjust the request accordingly. The method `suggest()` would have no default, and algorithms would be expected to return as many points as possible, up to `num`. I would be happy to have your feedback on this.
Looks good to me. Do you think it is necessary to validate that the number of points suggested by the algorithm is actually smaller than the requested number?
I would rely on the unit tests of the algorithms for this. If it ever happens that the algo returns more points, it will be better if the producer registers them anyway; otherwise there will be issues with the algorithm thinking the additional trials were registered when they weren't.
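For illustration only, a minimal sketch of the approach discussed in this thread: the producer computes how many points to request, following the `max(max_trials - self.n_suggested, 1)` idea and accounting for broken trials. The helper name and `n_broken` argument are hypothetical, not Orion's actual API.

```python
def compute_num_to_request(max_trials, n_suggested, n_broken):
    """Hypothetical helper: how many points the producer should ask for.

    Aim for ``max_trials`` usable trials, replacing broken ones, but always
    request at least one point so the algorithm only overshoots slightly.
    """
    return max(max_trials - n_suggested + n_broken, 1)


# Example: 95 trials suggested so far, 3 of them broken, budget of 100.
num = compute_num_to_request(max_trials=100, n_suggested=95, n_broken=3)
assert num == 8  # the producer would then call algorithm.suggest(num)
```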
src/orion/algo/random.py (outdated)

```diff
@@ -55,17 +55,15 @@ def suggest(self, num=1):
         .. note:: New parameters must be compliant with the problem's domain
            `orion.algo.space.Space`.
         """
-        if num is None:
-            num = 1
```
Should it just go with `max_trials` like `gridsearch`?
It would depend on what we decide here: #604 (comment)
So, based on f6aea11 there would be no default `num` anymore. Typically the producer would request `max_trials` points from the random search at the beginning, which would return the whole set of trials to run.
Waiting on #605 to be merged before rebasing this PR.
Why: To avoid passing the same points to the algorithms multiple times, the producer needs to know whether an algorithm actually observed a point. Until now the producer was keeping track of the points it passed to infer which ones to pass later on, but some algorithms would not observe points until they had suggested them (e.g., Hyperband). This caused issues with the resumption of these algorithms, as they would ignore the points and would never be presented with them again.
Why: The fidelity index may be used by algorithms other than Hyperband, and `get_id` is used by all algorithms.
Why: The use-cases to test are basically the same across all algorithms. Adding new generic methods or modifying them makes the maintenance of the tests tedious even though they are very consistent across algorithms. With a test-suite we ensure that all basic use-cases for algorithms are covered and make the maintenance much easier. How: Define a base class with all tests and specialization methods that subclasses can override to customize parts of the tests. Many methods are decorated as _phases_, which makes them parametrizable with optimization phases. For instance, all these tests will run once in the random search phase of TPE and again in the Bayesian Optimization phase of TPE. The same applies to Hyperband and its derivatives, to test the first phase of random search, the different phases of promotions and the phases of bracket repetitions.
Why: The repetitions were not saved properly in the state_dict, which led to unreproducible results. Also, with the new sampling scheme Hyperband can either return all samples/promotions at once or return `num` at a time. Finally, the rework makes it possible to reimplement ASHA on top of Hyperband instead of maintaining two very similar implementations.
Implementing ASHA on top of Hyperband makes it much leaner and easier to maintain. Also, the strategy for the allocation of samples to brackets is turned into a deterministic one. This makes ASHA easier to test and its behavior more predictable. Once the first rungs are full according to the budget they cannot accept more trials, but this new implementation of ASHA supports repetitions of brackets when the first ones are filled.
There were a few bugs. - It was non-reproducible; it now passes eves.rng to the mutation functions. - Categorical dimensions could only support numeric data. - The mutations would return duplicates (of samples from previous rungs).
Why: It is faster to register trials all in a batch and continue suggesting if some failed than to roll back and restart each time one trial fails to register (because it is a duplicate). With the new sampling scheme there should be no local duplicates anyway. Duplicates could only appear when there are multiple workers, in which case the first duplicated trial and all following ones would be duplicates too, and thus the whole batch must be trashed.
Why: The call was modifying the passed argument, eventually corrupting the data passed by the algorithm.
Why: All algorithms were duplicating the method to register trials inside `_trials_info`. This should be hidden under a generic `register` method.
Why: The algo does not know if some trials failed. It should not stop until `max_trials` points have been observed; otherwise we would never reach `max_trials` if some are broken.
Why: This is more efficient. :)
Why: The producer should be responsible for requesting as many trials as necessary to reach `max_trials`. It is then up to the algo to return as many as possible.
Force-pushed from 02dae48 to d074e85.
Why: Now that random search samples `max_trials` we need to look at `completed` trials to infer how many were executed.
Looks fantastic. 👍
The initial goal was to fix #551. To do so I added `has_suggested` and `has_observed` so that the algorithm can tell which trials were actually observed, or whether they were ignored the last time the producer called `algorithm.observe`. This was needed because Hyperband/ASHA ignore trials that were not sampled previously; otherwise they cannot know in which rung the trials should be placed.
The adaptation of all algorithms and of their tests turned out to be very tedious. To solve this I re-implemented a generic test-suite that algorithms can inherit, ensuring a large coverage of algorithm behaviors by default and also making it easier to introduce new algorithm methods in the future without re-implementing many tests. These generic tests uncovered additional bugs, and the work for this Pull Request escalated into a full re-implementation of ASHA on top of Hyperband and many fixes across the different algorithms.
Addition of has_suggested/observed
To avoid passing the same points to the algorithms multiple times,
the producer needs to know whether an algorithm actually observed a
point. Until now the producer was keeping track of the points it passed
to infer which ones to pass later on, but some algorithms would not observe
points until they had suggested them (e.g., Hyperband). This caused issues
with the resumption of these algorithms, as they would ignore the points
and would never be presented with them again.
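A minimal sketch, assuming a dict-based registry like the `_trials_info` and the generic `register`/`get_id` methods mentioned in the commit messages above, of how a base algorithm could expose `has_suggested` and `has_observed` (illustrative only, not the exact Orion implementation):

```python
class BaseAlgoSketch:
    """Illustrative base class; the real Orion base algorithm differs in details."""

    def __init__(self, space):
        self.space = space
        # Maps a point id to (point, result); result stays None until observed.
        self._trials_info = {}

    def get_id(self, point):
        # Stand-in for Orion's point hashing (which may e.g. ignore fidelity).
        return hash(tuple(point))

    def register(self, point, result=None):
        self._trials_info[self.get_id(point)] = (point, result)

    def has_suggested(self, point):
        """Whether this algorithm already suggested the point."""
        return self.get_id(point) in self._trials_info

    def has_observed(self, point):
        """Whether this algorithm already observed a result for the point."""
        entry = self._trials_info.get(self.get_id(point))
        return entry is not None and entry[1] is not None
```

With something like this, the producer can skip points for which `has_observed` is already true and re-send the ones an algorithm like Hyperband ignored on a previous call to `observe`.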
Addition of n_suggested/observed
The algorithms now have `n_suggested` and `n_observed` properties.
Add generic test-suite for algorithms
Why:
The use-cases to test are basically the same across all algorithms.
Adding new generic methods or modifying them makes the maintenance of
the tests tedious even though they are very consistent across algorithms.
With a test-suite we ensure that all basic use-cases for algorithms are
covered and make the maintenance much easier.
How:
Define a base class with all tests and specialization methods that
subclasses can override to customize parts of the tests.
Many methods are decorated as phases, which makes them parametrizable
with optimization phases. For instance, all these tests will run once in
the random search phase of TPE and again in the Bayesian Optimization
phase of TPE. The same applies to Hyperband and its derivatives,
to test the first phase of random search, the different phases of promotions
and the phases of bracket repetitions.
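As a rough sketch of the pattern (class and method names here are illustrative; the real test-suite decorates tests to parametrize them by phase, while this sketch simply loops over the declared phases):

```python
class BaseAlgoTests:
    """Generic test-suite sketch that algorithm test classes could inherit."""

    algo_name = None
    config = {}
    # (phase name, number of trials to observe before running the check)
    phases = [("default", 0)]

    def create_algo(self):
        raise NotImplementedError  # build the algorithm from algo_name/config

    def force_observe(self, algo, num):
        raise NotImplementedError  # suggest and observe ``num`` points

    def test_seed_rng(self):
        # The same check is repeated once per optimization phase.
        for name, num_observed in self.phases:
            algo = self.create_algo()
            self.force_observe(algo, num_observed)
            algo.seed_rng(1)
            first = algo.suggest(1)
            algo.seed_rng(1)
            assert algo.suggest(1) == first, f"not reproducible in phase {name}"


class TestTPE(BaseAlgoTests):
    # Subclass would also implement create_algo/force_observe for the real TPE.
    algo_name = "tpe"
    config = {"seed": 1}
    # Random-search phase first, then the Bayesian optimization phase.
    phases = [("random", 0), ("bo", 20)]
```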
Change sampling scheme
I changed the sampling scheme to make the tests easier and the behavior of the algorithms more flexible. The call to `suggest` can take either `num`, in which case the algorithm will sample up to `num` points if able to, or `None`, in which case the algorithm will sample as many points as possible for efficiency. For example, Hyperband could return all samples for the first rungs or all promotions at once, or a Bayesian Optimization algorithm could return all random initial points at once.
Fix bugs in TPE
The BO part of TPE was not reproducible.
Fix bugs in Hyperband/ASHA
Hyperband
The repetitions were not saved properly in the state_dict, which led to
unreproducible results. Also, with the new sampling scheme Hyperband can
either return all samples/promotions at once or return `num` at a
time. Finally, the rework makes it possible to reimplement ASHA on top
of Hyperband instead of maintaining two very similar implementations.
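To illustrate the state_dict fix, a hedged sketch of a Hyperband-like class whose state includes the bracket repetitions; attribute names such as `repetition_id` are assumptions for this sketch, not necessarily those used in the PR:

```python
import copy
import random


class HyperbandLikeAlgo:
    """Sketch only: repetitions are now part of the state dict."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.brackets = []        # per-bracket rungs and trials
        self.repetition_id = 0    # how many times the brackets were repeated

    def state_dict(self):
        # Persist everything needed to resume reproducibly, including the
        # repetition counter that was previously missing.
        return {
            "rng_state": self.rng.getstate(),
            "brackets": copy.deepcopy(self.brackets),
            "repetition_id": self.repetition_id,
        }

    def set_state(self, state_dict):
        self.rng.setstate(state_dict["rng_state"])
        self.brackets = copy.deepcopy(state_dict["brackets"])
        self.repetition_id = state_dict["repetition_id"]
```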
ASHA
Implementing ASHA on top of Hyperband makes it much leaner and easier to
maintain.
Also, the strategy for the allocation of samples to brackets is turned
into a deterministic one. This makes ASHA easier to test and its
behavior more predictable. Once the first rungs are full according to
the budget they cannot accept more trials, but this
new implementation of ASHA supports repetitions of brackets when the
first ones are filled.
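A rough sketch of what a deterministic allocation of samples to brackets could look like (purely illustrative; the actual ASHA rework may allocate differently):

```python
def pick_bracket(base_rung_counts, base_rung_budgets):
    """Assign a new sample to the first bracket whose base rung is not full.

    ``base_rung_counts`` holds the number of trials already in each bracket's
    base rung, ``base_rung_budgets`` the corresponding capacities. Returns the
    bracket index, or None if every base rung is full and a new repetition of
    the brackets is needed.
    """
    for index, (count, budget) in enumerate(zip(base_rung_counts, base_rung_budgets)):
        if count < budget:
            return index
    return None  # all first rungs are full; repeat the brackets


# Example: three brackets with base-rung budgets 9, 3 and 1.
budgets = [9, 3, 1]
counts = [9, 2, 1]
assert pick_bracket(counts, budgets) == 1  # deterministic, not random
```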
Fix bugs in EvolutionaryES
There were a few bugs:
- It was non-reproducible; it now passes eves.rng to the mutation functions.
- Categorical dimensions could only support numeric data.
- The mutations would return duplicates (of samples from previous rungs).
Producer registers trials in batch
It is faster to register trials all in a batch and continue suggesting if
some failed than to roll back and restart each time one trial fails
to register (because it is a duplicate). With the new sampling scheme there
should be no local duplicates anyway. Duplicates could only appear when
there are multiple workers, in which case the first duplicated trial and all
following ones would be duplicates too, and thus the whole batch must be trashed.
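A minimal sketch of that batch-registration strategy; `storage.register_trial` and `DuplicateKeyError` are placeholders here, not necessarily Orion's actual names:

```python
class DuplicateKeyError(Exception):
    """Stand-in for the storage error raised on duplicate registrations."""


def register_batch(storage, trials):
    """Register a whole batch of trials; return how many were accepted.

    If one trial turns out to be a duplicate (already registered by another
    worker), the rest of the batch is dropped rather than rolled back, since
    the following trials would be duplicates as well.
    """
    registered = 0
    for trial in trials:
        try:
            storage.register_trial(trial)
        except DuplicateKeyError:
            break  # trash the rest of the batch
        registered += 1
    return registered
```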