Add has_suggested/observed, new test-suite and re-implementation of ASHA #604

Merged: 38 commits into Epistimio:develop from hotfix/algo_observed, May 13, 2021

Conversation

bouthilx (Member)

The initial goal was to fix #551. To do so I added has_suggested and has_observed so that the algorithm can tell which trials were actually observed and which were ignored the last time the producer called algorithm.observe. This is needed because Hyperband/ASHA ignore trials that they did not sample themselves; otherwise they cannot know in which rung the trials should be placed.

Adapting all algorithms and their tests turned out to be very tedious. To solve this I implemented a generic test-suite that algorithms can inherit; it ensures large coverage of algorithm behaviors by default and makes it easier to introduce new algorithm methods in the future without re-implementing many tests. These generic tests uncovered additional bugs, and the work for this pull request escalated into a full re-implementation of ASHA on top of Hyperband plus many fixes across the different algorithms.

Addition of has_suggested/observed

To avoid passing the same points to the algorithms multiple times,
the producer needs to know whether an algorithm actually observed a
point. Until now the producer was keeping track of the points passed
to infer which ones to pass later on, but some algorithms do not observe
points until they have suggested them (e.g. Hyperband). This caused issues
with the resumption of these algorithms: they would ignore the points
and would never be presented with them again.

Addition of n_suggested/observed

The algorithms now have n_suggested and n_observed properties.
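Continuing the sketch above, these properties could simply count entries in `_trials_info` (again illustrative, added to the sketch class):

```python
    @property
    def n_suggested(self):
        """Number of points suggested so far."""
        return len(self._trials_info)

    @property
    def n_observed(self):
        """Number of suggested points that have received a result."""
        return sum(result is not None for _, result in self._trials_info.values())
```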

Add generic test-suite for algorithms

Why:

The use-cases to test are basically the same across all algorithms.
Adding new generic methods or modifying existing ones makes maintaining
the tests tedious, even though the tests are very consistent across
algorithms. With a test-suite we ensure that all basic use-cases are
covered for every algorithm, and maintenance becomes much easier.

How:

Define a base class with all tests and specialization methods that
subclasses can override to customize part of the tests.

Many methods are decorated as phases, which makes them parametrizable
with optimization phases. For instance, all these tests will run once in
the random search phase of TPE and again in its Bayesian optimization
phase. The same applies for Hyperband and its derivatives, to test the
first phase of random search, different phases of promotions, and
phases of bracket repetitions.
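A hedged sketch of how such phase parametrization can be wired with pytest; the `pytest_generate_tests` hook is real pytest, but the class and attribute names are illustrative skeletons rather than Orion's exact test-suite:

```python
def pytest_generate_tests(metafunc):
    # Run every test that requests the `phase` fixture once per phase
    # declared on its test class.
    if "phase" in metafunc.fixturenames:
        phases = getattr(metafunc.cls, "phases", [("default", 0)])
        metafunc.parametrize("phase", phases, ids=[name for name, _ in phases])


class BaseAlgoTests:
    """Generic tests that every algorithm's test class would inherit."""

    phases = [("default", 0)]

    def create_algo(self):
        raise NotImplementedError  # Subclasses build their algorithm here.

    def test_state_dict_roundtrip(self, phase):
        name, num_to_observe = phase
        algo = self.create_algo()
        # Drive `algo` into the phase by observing `num_to_observe` dummy
        # trials, then assert that set_state(state_dict()) restores the
        # algorithm's behavior exactly.


class TestTPE(BaseAlgoTests):
    # Every generic test runs once in TPE's random-search phase and once
    # after enough observations to reach its Bayesian optimization phase.
    phases = [("random", 0), ("bo", 10)]
```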

Change sampling scheme

I changed the sampling scheme to make the tests easier and the behavior of the algorithms more flexible. The argument to suggest
can be either num, in which case the algorithm samples up to num points if it can, or None, in which case the algorithm
samples as many points as possible for efficiency. For example, Hyperband can return all samples for the first rungs or all promotions at once, and a Bayesian optimization algorithm can return all its random initial points at once.
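In sketch form (the `_sample_batch` helper is hypothetical), the contract looks like this:

```python
def suggest(self, num=None):
    """Return up to `num` new points; `num=None` means as many as possible."""
    if num is None:
        # Return everything available at once, e.g. a whole first rung for
        # Hyperband, or all random initialization points for a BO algorithm.
        return self._sample_batch(limit=None)
    return self._sample_batch(limit=num)
```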

Fix bugs in TPE

The Bayesian optimization part of TPE was not reproducible.

Fix bugs in Hyperband/ASHA

Hyperband

The repetitions were not saved properly in the state_dict, which led to
unreproducible results. Also, with the new sampling scheme Hyperband can
either return all samples/promotions at once or return num at a
time. Finally, the rework makes it possible to reimplement ASHA on top
of Hyperband instead of maintaining two very similar implementations.
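A sketch of the state_dict fix; the field names are illustrative, the point being that the repetition bookkeeping must be part of the state, not only the rungs and RNG:

```python
def state_dict(self):
    """Persist everything needed for reproducible resumption (sketch)."""
    return {
        "rng_state": self.rng.get_state(),
        "brackets": [bracket.state_dict() for bracket in self.brackets],
        # Previously missing: without this, a resumed Hyperband restarted
        # bracket repetitions from scratch and produced different results.
        "repetition_id": self.repetition_id,
    }
```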

ASHA

Implementing ASHA on top of Hyperband makes it much leaner and easier to
maintain.

Also, the strategy for allocating samples to brackets is now
deterministic. This makes ASHA easier to test and its behavior more
predictable. Once the first rungs are full according to the budget, they
cannot accept more trials, but this new implementation of ASHA supports
repeating brackets when the first ones are filled.
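A minimal sketch of such a deterministic allocation, with hypothetical method and attribute names:

```python
def pick_bracket(self):
    """Assign the next trial to the first bracket with budget left (sketch)."""
    for bracket in self.brackets:
        if bracket.n_trials(rung=0) < bracket.budget(rung=0):
            return bracket
    # Every first rung is full: start a new repetition of the brackets.
    new_brackets = self.create_brackets()
    self.brackets.extend(new_brackets)
    return new_brackets[0]
```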

Fix bugs in EvolutionaryES

  • It was non-reproducible. It now passes eves.rng to the mutation
    functions (see the sketch after this list).
  • Categorical dimensions could only support numeric data.
  • The mutations would return duplicates (of samples from previous rungs).
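For the first fix, the key change is threading the algorithm's seeded RNG through the mutation functions instead of relying on global randomness. A sketch, with illustrative dimension handling that also covers the categorical fix:

```python
def mutate(rng, dimension, value):
    """Mutate one dimension using the algorithm's own seeded RNG (sketch)."""
    if dimension.type == "categorical":
        # Sample by index so non-numeric categories (e.g. strings) work.
        categories = list(dimension.categories)
        return categories[rng.randint(len(categories))]
    # Numeric dimension: jitter the value (the scale here is illustrative).
    return value + rng.normal(0.0, 0.1)
```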

Producer register trials in batch

It is faster to register trials all in a batch and continue suggesting if
some failed than to roll back and restart each time one trial fails
to register (because it is a duplicate). With the new sampling scheme there
should be no duplicates locally anyway. Duplicates can only appear when
there are multiple workers, in which case the first trial and all following
ones would be duplicates, and thus the whole batch must be discarded.
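A sketch of the batching logic; the DuplicateKeyError name and the storage interface are assumptions for illustration:

```python
class DuplicateKeyError(Exception):
    """Raised by the storage when an identical trial already exists."""


def register_batch(storage, trials):
    """Register trials one by one; stop at the first duplicate (sketch)."""
    registered = 0
    for trial in trials:
        try:
            storage.register_trial(trial)
        except DuplicateKeyError:
            # A concurrent worker registered the same points; every later
            # trial in this batch would also be a duplicate, so trash it.
            break
        registered += 1
    return registered
```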

@bouthilx added the bug, enhancement, and documentation labels on Apr 27, 2021
@bouthilx added this to the v0.1.15 milestone on Apr 27, 2021
@bouthilx requested a review from donglinjy on Apr 28, 2021 15:00
src/orion/algo/tpe.py: resolved review threads (outdated)
src/orion/algo/random.py: resolved review thread (outdated)
```diff
@@ -95,12 +95,12 @@ def produce(self):
         self._sample_guard(start)

         log.debug("### Algorithm suggests new points.")
-        new_points = self.naive_algorithm.suggest(self.pool_size)
+        new_points = self.naive_algorithm.suggest()
```
Collaborator

Should we limit the number of points suggested or persisted to be smaller than max_trials? Otherwise there will be a bunch of new trials in the DB that are never executed.

Member Author

I would be in favor of registering all returned trials. We never know if some running trials will fail, forcing the producer to generate new trials. At least they would already be present. Does that seem reasonable?

Collaborator

I was more wondering how this will be exposed to the user. If there are too many trials in new or pending status when they query the experiment status, it may be a bit confusing. Although I saw that algorithms like grid search actually limit the grid size to max_trials.

Member Author

Hum, but the algorithm should be able to sample more than max_trials if many of them fail. That's a problem with grid search by construction: it samples the grid and that's it. But the other algorithms can sample more. Maybe it should be something like max(max_trials - self.n_suggested, 1)? That way the algorithm can still return more than max_trials but overshoots as little as possible.

Collaborator

Yeah, this seems better.

Member Author

I opted for giving this responsibility to the producer since it knows how many broken trials there are and can adjust the request accordingly. The method suggest() would have no default, and algorithms would be expected to return as many points as possible, up to num. I would be happy to have your feedback on this.

Collaborator

Looks good to me. Do you think it is necessary to validate that the number of points suggested by the algorithm is actually smaller than the requested number?

Member Author

I would rely on the unit tests of the algorithms for this. If it ever happens that the algo returns more points, it will be better if the producer registers them anyway; otherwise there will be issues with the algorithm thinking the additional trials were registered when they weren't.

```
@@ -55,17 +55,15 @@ def suggest(self, num=1):
        .. note:: New parameters must be compliant with the problem's domain
           `orion.algo.space.Space`.
        """
        if num is None:
            num = 1
```
Collaborator

Should it just go with max_trials like grid search?

Member Author

It would depend on what we decide here: #604 (comment)

Member Author

So, based on f6aea11 there would be no default num anymore. Typically the producer would request max_trials from the random search at the beginning, which would return the whole set of trials to run.

bouthilx (Member Author) commented May 7, 2021

Waiting on #605 to be merged before rebasing this PR.

bouthilx added 18 commits May 12, 2021 09:49
Why:

The fidelity index may be used by algorithms other than Hyperband, and
get_id is used by all algorithms.
bouthilx added 19 commits May 12, 2021 09:51
Why:

The call was modifying the passed argument, eventually corrupting the
data passed by the algorithm.
Why:

All algorithms were duplicating the method to register trials inside
`_trials_info`. This should be hidden under a generic `register` method.
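In sketch form, matching the bookkeeping sketched earlier in this thread:

```python
def register(self, point, result=None):
    """Record a suggested (and possibly observed) point in `_trials_info`."""
    self._trials_info[self.get_id(point)] = (point, result)
```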
Why:

The algo does not know if some trials failed. It should not stop until
`max_trials` points have been observed; otherwise we would never reach
`max_trials` if some are broken.
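A sketch of the resulting stopping criterion, assuming the n_observed property from earlier:

```python
@property
def is_done(self):
    """The algorithm is done once enough trials were actually observed."""
    # Counting suggested trials instead would let broken trials, which are
    # never observed, prevent the experiment from reaching max_trials.
    return self.n_observed >= self.max_trials
```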
Why:

This is more efficient. :)
Why:

The producer should be responsible for requesting as many trials as
necessary to reach `max_trials`. It is then up to the algo to return as
many as possible.
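A sketch of the producer side, using the max(max_trials - n_suggested, 1) idea from the review thread above; attribute names are illustrative:

```python
def produce(self):
    """Request just enough new trials to reach max_trials (sketch)."""
    num = max(self.max_trials - self.algorithm.n_suggested, 1)
    # The algorithm returns as many points as it can, up to `num`; it may
    # overshoot max_trials only when broken trials force resampling.
    new_points = self.algorithm.suggest(num)
    self.register_batch(new_points)
```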
@bouthilx force-pushed the hotfix/algo_observed branch from 02dae48 to d074e85 on May 12, 2021 14:08
Why:

Now that random search samples `max_trials` we need to look at
`completed` trials to infer how many were executed.
donglinjy (Collaborator) left a comment

Looks fantastic. 👍

@bouthilx merged commit 0c5c74d into Epistimio:develop on May 13, 2021
@bouthilx deleted the hotfix/algo_observed branch on May 13, 2021 15:34
Labels
bug: Indicates an unexpected problem or unintended behavior
documentation: Regarding description, documentation, tutorials, etc.
enhancement: Improves a feature or non-functional aspects (e.g., optimization, prettify, technical debt)
Development

Successfully merging this pull request may close these issues.

Hyperband loses completed trials at restart
2 participants