
Add TPE algorithm into Orion #381

Merged (9 commits) on Apr 30, 2020

Conversation

@donglinjy (Collaborator) commented Apr 11, 2020

The Tree-structured Parzen Estimator (TPE) algorithm is one of the Sequential Model-Based Global Optimization (SMBO) algorithms: it builds models from previously observed trials and uses them to propose new points.
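
For readers new to the method, here is a rough, self-contained sketch of the core TPE step for a single real dimension (illustrative names and constants only, not Orion's actual API): split the observed trials into "good" and "bad" groups by a loss quantile, fit a Parzen estimator to each group, and keep the candidate that maximizes the density ratio between the two.

# Illustrative sketch only (not Orion's API): one TPE suggestion for a single
# real dimension in [low, high], given observed points and their losses.
import numpy


def tpe_suggest(points, losses, low, high, gamma=0.15, n_candidates=24, seed=0):
    rng = numpy.random.RandomState(seed)
    points = numpy.asarray(points, dtype=float)
    assert len(points) > 0, "TPE needs at least one observed trial"
    order = numpy.argsort(losses)
    n_below = max(1, int(numpy.ceil(gamma * len(points))))
    below = points[order[:n_below]]   # best-performing ("good") observations
    above = points[order[n_below:]]   # remaining ("bad") observations

    def log_density(samples, centers):
        # Parzen estimator: a mixture of Gaussians centered on the observations.
        if len(centers) == 0:
            return numpy.full(len(samples), -numpy.log(high - low))
        sigma = (high - low) / len(centers)
        diffs = (samples[:, None] - centers[None, :]) / sigma
        comps = numpy.exp(-0.5 * diffs ** 2) / (sigma * numpy.sqrt(2 * numpy.pi))
        return numpy.log(comps.mean(axis=1) + 1e-12)

    # Draw candidates from the "good" model and keep the one with the best
    # likelihood ratio l(x) / g(x).
    sigma = (high - low) / len(below)
    candidates = below[rng.randint(len(below), size=n_candidates)]
    candidates = numpy.clip(candidates + sigma * rng.randn(n_candidates), low, high)
    scores = log_density(candidates, below) - log_density(candidates, above)
    return candidates[numpy.argmax(scores)]

# e.g. tpe_suggest([0.1, 0.4, 0.7, 0.9], [0.3, 0.1, 0.5, 0.2], low=0.0, high=1.0)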

@bouthilx

@codecov-io

codecov-io commented Apr 12, 2020

Codecov Report

Merging #381 into develop will increase coverage by 42.81%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##           develop     #381       +/-   ##
============================================
+ Coverage    45.30%   88.12%   +42.81%     
============================================
  Files           66        2       -64     
  Lines        12446      160    -12286     
  Branches       309       15      -294     
============================================
- Hits          5639      141     -5498     
+ Misses        6782       12     -6770     
+ Partials        25        7       -18     
Impacted Files Coverage Δ
src/orion/core/cli/__init__.py 93.75% <0.00%> (-6.25%) ⬇️
tests/unittests/core/io/test_evc_builder.py
tests/unittests/algo/test_base.py
tests/unittests/core/test_strategy.py
tests/unittests/core/io/database_test.py
tests/functional/parsing/test_parsing_insert.py
tests/unittests/core/evc/conftest.py
tests/unittests/core/worker/test_producer.py
tests/functional/commands/test_insert_command.py
tests/unittests/core/io/test_experiment_builder.py
... and 52 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Outdated review threads (resolved): src/orion/algo/space.py, src/orion/algo/tpe.py (×4)
@bouthilx
Member

Great! We'll have a concise and simple implementation for TPE. :)

low, high = dimension.interval()
below_mus, below_sigmas, below_weights = \
    adaptive_parzen_estimator(below_points, low, high, self.prior_weight,
                              self.equal_weight, flat_num=25)

Member

Shouldn't flat_num be a hyper-parameter of TPE? Maybe we could merge equal_weight with flat_num: if flat_num is None or equals 0, we use equal weights; otherwise we use the ramp-up with flat_num.
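
To make the suggestion concrete, a merged knob could look roughly like the hypothetical helper below; it only mirrors the ramp logic already in this diff, with None or 0 falling back to equal weights:

import numpy


def observation_weights(total_num, flat_num=25):
    # flat_num=None or 0 would mean equal weights (the old equal_weight=True).
    if not flat_num or total_num <= flat_num:
        return numpy.ones(total_num)
    # Otherwise: older trials ramp up from 1/total_num and the most recent
    # flat_num trials get full weight, as in the current diff.
    ramp = numpy.linspace(1.0 / total_num, 1.0, num=total_num - flat_num)
    return numpy.concatenate([ramp, numpy.ones(flat_num)])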

Collaborator Author

I was thinking of keeping this parameter internal for now. Unlike equal_weight, this parameter has more impact on performance, so I wanted to hide it behind the best-practice value mentioned in the paper.

Member

The value of 25 may very well be domain-specific, however. They did test it on different datasets (LFW, PubFig83 and a subset of CIFAR-10), but that's arguably small evidence.

Collaborator Author

Okay, let's expose an additional parameter for TPE then.

@bouthilx
Member

This implementation is like in optuna, where hyperparameters are assumed independent. Implementing a joint prior distribution is much trickier and isn't described in detail in the paper AFAIK. The implementation in hyperopt seems to rely heavily on pyll to handle this. I would propose we start with this independent version, then add Integer/Categorical support like you intended to, and then tackle the joint version. Meanwhile we could add an argument independent=True and raise NotImplementedError when it is set to False.

@donglinjy
Collaborator Author

> This implementation is like in optuna, where hyperparameters are assumed independent. Implementing a joint prior distribution is much trickier and isn't described in detail in the paper AFAIK. The implementation in hyperopt seems to rely heavily on pyll to handle this. I would propose we start with this independent version, then add Integer/Categorical support like you intended to, and then tackle the joint version. Meanwhile we could add an argument independent=True and raise NotImplementedError when it is set to False.

My understanding is that hyperopt depends on pyll to describe relative/dependent parameters, and optuna actually also supports sample_relative and sample_independent, except optuna's TPE only implements sample_independent. What I am thinking is that a dependency could involve only part of the hyperparameters to search instead of all of them. So I think, from the Orion search space definition perspective, we first need support for defining a relative or dependent search space.


ramp_weights = numpy.linspace(1.0 / total_num, 1.0, num=total_num - flat_num)
flat_weights = numpy.ones(flat_num)
return numpy.concatenate([ramp_weights, flat_weights])

Member

From this paper: http://proceedings.mlr.press/v28/bergstra13.pdf

> The first modification was to downweight trials as they age so that old results do not count for as much as more recent ones. We gave full weight to the most recent 25 trials and applied a linear ramp from 0 to 1.0 to older trials. This is a heuristic concession to TPE’s assumption of hyperparameter independence: as our search moves through space, we use temporal distance within the experiment as a surrogate for distance in the search space.

It seems what we are doing here is ramping up from 1/n to 1.
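
To see the difference concretely, here is an illustrative side-by-side of the two weighting schemes (hypothetical helpers only; code_weights mirrors the lines in this diff, paper_weights follows the quoted sentence, both assuming total_num > flat_num):

import numpy


def code_weights(total_num, flat_num=25):
    # As in the diff: older trials ramp from 1/total_num up to 1,
    # the most recent flat_num trials get full weight.
    ramp = numpy.linspace(1.0 / total_num, 1.0, num=total_num - flat_num)
    return numpy.concatenate([ramp, numpy.ones(flat_num)])


def paper_weights(total_num, flat_num=25):
    # As quoted from the paper: older trials ramp from 0 to 1,
    # the most recent flat_num trials get full weight.
    ramp = numpy.linspace(0.0, 1.0, num=total_num - flat_num)
    return numpy.concatenate([ramp, numpy.ones(flat_num)])


print(code_weights(30)[:3])   # starts at 1/30 rather than 0
print(paper_weights(30)[:3])  # starts at 0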

Collaborator Author

Yeah, I noticed this too, but it seems hyperopt does this as well, even though the paper claims otherwise.

Member

Mismatch between paper and code in Hyperband, and now in TPE again! 🤦 I would be curious to compare both versions, but now isn't a good time; maybe when we start benchmarking the algos. Do you think it's worth adding a paper_rampup parameter so that we can test both versions later on? I think a pointer to the code would again be important to justify the discrepancy with the paper.

Collaborator Author

For the pointer: since we now put the weight ramp function inside the Parzen estimator, which calls the ramp function, and we already have a pointer from the Parzen estimator to hyperopt, I guess we can omit this one.

For paper_rampup, I guess we can; let's add it later together with the discrete support, as I assume there will be some other changes.

Member

I believe it should be more explicit. We could re-use the same pointer but clearly state what we are referring to, that is, the use of an additional mid-space point and the ramp-up, which is different from the paper: flat (and low) for trials before the 25 last ones, and then linear up to 1.

Collaborator Author

Okay, let's make it more explicit when we have the paper_rampup weight-update function; at that point we should move the current implementation outside the Parzen estimator so the user can choose.

@@ -85,9 +85,12 @@ def test_adaptive_parzen_normal_estimator_weight():
    # prior weight
    mus, sigmas, weights = adaptive_parzen_estimator(obs_mus, low, high, prior_weight=0.5,
                                                     equal_weight=False, flat_num=25)
    print(weights)

Member

Left-over print. :)

@bouthilx
Member

> My understanding is that hyperopt depends on pyll to describe relative/dependent parameters, and optuna actually also supports sample_relative and sample_independent, except optuna's TPE only implements sample_independent. What I am thinking is that a dependency could involve only part of the hyperparameters to search instead of all of them. So I think, from the Orion search space definition perspective, we first need support for defining a relative or dependent search space.

I think you are right. From the same paper: http://proceedings.mlr.press/v28/bergstra13.pdf

> The TPE algorithm is conspicuously deficient in optimizing each hyperparameter independently of the others. It is almost certainly the case that the optimal values of some hyperparameters depend on settings of others.

Let's keep it independent then. :)


The number of the most recent trials which get the full weight where the others will be
applied with a linear ramp from 0 to 1.0. It will only take effect if ``equal_weight``
is ``False``. Default is ``25``.

Member

This is consistent with the paper but not with the implementation, right?

Collaborator Author (donglinjy, Apr 28, 2020)

I suddenly realize it seems I misunderstood this ramp-up when you said it goes from 1/n to 1. The paper mentions a ramp from 0 to 1.0, but the weight of course cannot be 0, so starting from 1/n is not that different from the paper.

Member

But it's (1/n, 1/n, ..., 1/n, growing to 1 through 25 steps), while the paper says (0 growing to 1, then 1, 1, ... 25 times).

@bouthilx bouthilx merged commit b322fad into Epistimio:develop Apr 30, 2020