Add TPE algorithm into Orion #381
Conversation
Codecov Report

@@             Coverage Diff             @@
##           develop     #381      +/-  ##
============================================
+ Coverage    45.30%   88.12%   +42.81%
============================================
  Files           66        2       -64
  Lines        12446      160    -12286
  Branches       309       15      -294
============================================
- Hits          5639      141     -5498
+ Misses        6782       12     -6770
+ Partials        25        7       -18

Continue to review the full report at Codecov.
Great! We'll have a concise and simple implementation for TPE. :)
src/orion/algo/tpe.py
Outdated
low, high = dimension.interval()
below_mus, below_sigmas, below_weights = \
    adaptive_parzen_estimator(below_points, low, high, self.prior_weight,
                              self.equal_weight, flat_num=25)
Shouldn't flat_num be a hyper-parameter of TPE? Maybe we could merge equal_weight with flat_num: if flat_num is None or equals 0, we use equal weights; otherwise we use the ramp-up with flat_num.
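The merge suggested above could look like the following hypothetical sketch. The function name `ramp_up_weights` and the `None`/`0` convention are assumptions for illustration, not Orion's actual API:

```python
import numpy

def ramp_up_weights(total_num, flat_num=25):
    """Hypothetical merged parameter: if ``flat_num`` is None or 0, use
    equal weights; otherwise ramp older trials up linearly and give the
    most recent ``flat_num`` trials full weight."""
    if not flat_num:
        # equal weights replace the separate ``equal_weight`` flag
        return numpy.ones(total_num)
    # linear ramp for the older trials, flat full weight for the recent ones
    ramp = numpy.linspace(1.0 / total_num, 1.0, num=max(total_num - flat_num, 0))
    flat = numpy.ones(min(flat_num, total_num))
    return numpy.concatenate([ramp, flat])
```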
I was thinking of keeping this parameter internal for now. Unlike equal_weight, this parameter has more impact on performance, so I wanted to hide it behind the best-practice value mentioned in the paper.
The value of 25 may very well be domain-specific, however. They did test on different datasets (LFW, PubFig83 and a subset of CIFAR10), but that's arguably thin evidence.
Okay, let's expose an additional parameter for TPE then.
This implementation is like in optuna, where hyper-parameters are assumed independent. Implementing a joint prior distribution is much trickier and isn't described in detail in the paper AFAIK. The implementation of hyperopt seems to rely heavily on pyll to handle this. I would propose we start with this independent version, then add Integer/Categorical support as you intended, and then tackle the joint version. Meanwhile we could add an argument independent=True and raise NotImplementedError when it is set to False.
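The independence assumption discussed here amounts to sampling each dimension on its own, ignoring interactions. A minimal sketch, where `suggest_independent` and `sample_one_dim` are illustrative names rather than the actual API:

```python
def suggest_independent(space, below_points, above_points, sample_one_dim):
    """Sample each dimension separately, ignoring interactions between
    hyper-parameters.  ``space`` maps names to dimensions; ``sample_one_dim``
    stands in for whatever 1-D TPE sampler the implementation provides."""
    point = {}
    for name, dimension in space.items():
        point[name] = sample_one_dim(
            dimension,
            [p[name] for p in below_points],  # "good" trials, this dimension only
            [p[name] for p in above_points],  # remaining trials, this dimension only
        )
    return point
```

A joint version would instead model the dimensions together, which is why it is the harder step.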
My understanding is that hyperopt depends on pyll to describe the relative/dependent parameters, and optuna actually also supports both sample_relative and sample_independent, except that optuna's TPE only implements sample_independent. What I am thinking is that the dependency could concern only part of the hyper-parameters to search, instead of all of them. So I think that, from the perspective of Orion's search-space definition, we first need support for defining relative or dependent search spaces.
ramp_weights = numpy.linspace(1.0 / total_num, 1.0, num=total_num - flat_num)
flat_weights = numpy.ones(flat_num)
return numpy.concatenate([ramp_weights, flat_weights])
From this paper: http://proceedings.mlr.press/v28/bergstra13.pdf
The first modification was to downweight trials as they age so that old results do not count for as much as more recent ones. We gave full weight to the most recent 25 trials and applied a linear ramp from 0 to 1.0 to older trials. This is a heuristic concession to TPE’s assumption of hyperparameter independence: as our search moves through space, we use temporal distance within the experiment as a surrogate for distance in the search space.
It seems what we are doing here is ramping up from 1/n to 1.
Yeah, I noticed this too, but it seems hyperopt also does it this way even though the paper claims otherwise.
Mismatch between paper and code in Hyperband, and now in TPE again! 🤦 I would be curious to compare both versions, but now isn't a good time; maybe when we start benchmarking the algos. Do you think it's worth adding a paper_rampup parameter so that we can test both versions later on? I think a pointer to the code would again be important to justify the discrepancy with the paper.
For the pointer: since we now put the weight-ramp function inside the Parzen estimator, which calls the ramp function, and we already have a pointer from the Parzen estimator to hyperopt, I guess we can omit this one.
For the paper_rampup, I guess we can; let's add it later together with the discrete support, as I assume there will be some other changes.
I believe it should be more explicit. We could re-use the same pointer but clearly state what we are referring to, that is, the use of an additional mid-space point and a ramp-up which differs from the paper: flat (and low) for trials before the 25 last ones, and then linear up to 1.
Okay, let's make it more explicit when we have the paper_rampup weight-update function; at that point we should move the current implementation outside the Parzen estimator so the user can choose.
tests/unittests/algo/test_tpe.py
Outdated
@@ -85,9 +85,12 @@ def test_adaptive_parzen_normal_estimator_weight():
    # prior weight
    mus, sigmas, weights = adaptive_parzen_estimator(obs_mus, low, high, prior_weight=0.5,
                                                     equal_weight=False, flat_num=25)
    print(weights)
Left-over print. :)
I think you are right. From the same paper: http://proceedings.mlr.press/v28/bergstra13.pdf
Let's keep it independent then. :)
The number of the most recent trials which get the full weight where the others will be
applied with a linear ramp from 0 to 1.0. It will only take effect if ``equal_weight``
is ``False``. Default is ``25``.
This is consistent with the paper but not with the implementation, right?
I suddenly realize that I seem to have misunderstood this ramp-up when you said from 1/n to 1. The paper mentions a ramp from 0 to 1.0, but the weight of course cannot be 0, so starting from 1/n is not that different from the paper.
But it's (1/n, 1/n, ... 1/n, growing to 1 through 25 steps), while paper says (0 growing to 1, 1, 1, ... 25 times).
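To make the difference concrete, here is an illustrative side-by-side of the two weighting schemes being discussed, following the linspace form from the snippet above. This is a sketch for comparison, not a drop-in for the actual code:

```python
import numpy

def code_style_weights(total_num, flat_num=25):
    # what the snippet above does: ramp from 1/n up to 1 for the older
    # trials, then full weight for the most recent ``flat_num`` trials
    ramp = numpy.linspace(1.0 / total_num, 1.0, num=total_num - flat_num)
    return numpy.concatenate([ramp, numpy.ones(flat_num)])

def paper_style_weights(total_num, flat_num=25):
    # what the paper describes: a linear ramp from 0 to 1.0 for the older
    # trials, then full weight for the most recent ones
    ramp = numpy.linspace(0.0, 1.0, num=total_num - flat_num)
    return numpy.concatenate([ramp, numpy.ones(flat_num)])
```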
Tree-structured Parzen Estimator (TPE)

The TPE algorithm is one of the Sequential Model-Based Global Optimization (SMBO) algorithms, which build models to propose new points based on the historically observed trials.
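As a rough illustration of that SMBO idea, a toy 1-D version of the TPE loop might look like the following. The fixed-bandwidth Parzen estimate and all names here are simplifications for exposition, not Orion's implementation:

```python
import math
import random

def _parzen_density(x, points, sigma=0.5):
    # simple fixed-bandwidth Parzen (kernel density) estimate
    return sum(math.exp(-0.5 * ((x - p) / sigma) ** 2) for p in points) / len(points)

def tpe_suggest(points, losses, gamma=0.2, n_candidates=24, seed=0):
    """Toy 1-D TPE step: split trials at the ``gamma`` quantile of the
    losses, sample candidates near the "good" trials, and keep the one
    maximizing the density ratio l(x) / g(x)."""
    rng = random.Random(seed)
    order = sorted(range(len(points)), key=lambda i: losses[i])
    n_below = max(1, int(gamma * len(points)))
    below = [points[i] for i in order[:n_below]]   # best-performing trials
    above = [points[i] for i in order[n_below:]]   # all the others
    candidates = [rng.choice(below) + rng.gauss(0, 0.5) for _ in range(n_candidates)]
    return max(candidates,
               key=lambda x: _parzen_density(x, below)
               / (_parzen_density(x, above) + 1e-12))
```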
@bouthilx