Add TPE algorithm into Orion #381
Conversation
Codecov Report

@@             Coverage Diff             @@
##           develop     #381      +/-  ##
============================================
+ Coverage    45.30%   88.12%   +42.81%
============================================
  Files           66        2       -64
  Lines        12446      160    -12286
  Branches       309       15      -294
============================================
- Hits          5639      141     -5498
+ Misses        6782       12     -6770
+ Partials        25        7       -18

Continue to review the full report at Codecov.
Great! We'll have a concise and simple implementation for TPE. :)
src/orion/algo/tpe.py
Outdated
low, high = dimension.interval()
below_mus, below_sigmas, below_weights = \
    adaptive_parzen_estimator(below_points, low, high, self.prior_weight,
                              self.equal_weight, flat_num=25)
Shouldn't flat_num be a hyper-parameter of TPE? Maybe we could merge equal_weight with flat_num: if flat_num is None or equals 0, we use equal weights; otherwise we use the ramp-up with flat_num.
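The merge suggested above could look like the following hypothetical sketch. The function name `ramp_up_weights` and the `None`/`0` convention are assumptions for illustration, not Orion's actual API:

```python
import numpy

def ramp_up_weights(total_num, flat_num=25):
    """Hypothetical merged parameter: if ``flat_num`` is None or 0, use
    equal weights; otherwise ramp older trials up linearly and give the
    most recent ``flat_num`` trials full weight."""
    if not flat_num:
        # equal weights replace the separate ``equal_weight`` flag
        return numpy.ones(total_num)
    # linear ramp for the older trials, flat full weight for the recent ones
    ramp = numpy.linspace(1.0 / total_num, 1.0, num=max(total_num - flat_num, 0))
    flat = numpy.ones(min(flat_num, total_num))
    return numpy.concatenate([ramp, flat])
```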
I was thinking of keeping this parameter internal for now. Unlike equal_weight, this parameter has more impact on performance, so I wanted to hide it behind the best-practice value mentioned in the paper.
The value of 25 may very well be domain-specific, however. They did test on different datasets (LFW, PubFig83 and a subset of CIFAR10), but that's arguably thin evidence.
Okay, let's expose an additional parameter for TPE then.
This implementation is like in optuna, where hyper-parameters are assumed independent. Implementing a joint prior distribution is much trickier and isn't described in detail in the paper AFAIK. The implementation of hyperopt seems to rely heavily on pyll to handle this. I would propose we start with this independent version, then add Integer/Categorical support as you intended, and then tackle the joint version. Meanwhile we could add an argument independent=True and raise NotImplementedError when it is set to False.
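The independence assumption discussed here amounts to sampling each dimension on its own, ignoring interactions. A minimal sketch, where `suggest_independent` and `sample_one_dim` are illustrative names rather than the actual API:

```python
def suggest_independent(space, below_points, above_points, sample_one_dim):
    """Sample each dimension separately, ignoring interactions between
    hyper-parameters.  ``space`` maps names to dimensions; ``sample_one_dim``
    stands in for whatever 1-D TPE sampler the implementation provides."""
    point = {}
    for name, dimension in space.items():
        point[name] = sample_one_dim(
            dimension,
            [p[name] for p in below_points],  # "good" trials, this dimension only
            [p[name] for p in above_points],  # remaining trials, this dimension only
        )
    return point
```

A joint version would instead model the dimensions together, which is why it is the harder step.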
My understanding is that hyperopt depends on pyll to describe the relative/dependent parameters, and optuna actually also supports both sample_relative and sample_independent, except that optuna's TPE only implements sample_independent. What I am thinking is that the dependency could concern only part of the hyper-parameters to search, instead of all of them. So I think that, from the perspective of Orion's search-space definition, we first need support for defining relative or dependent search spaces.
ramp_weights = numpy.linspace(1.0 / total_num, 1.0, num=total_num - flat_num)
flat_weights = numpy.ones(flat_num)
return numpy.concatenate([ramp_weights, flat_weights])
From this paper: http://proceedings.mlr.press/v28/bergstra13.pdf
The first modification was to downweight trials as they age so that old results do not count for as much as more recent ones. We gave full weight to the most recent 25 trials and applied a linear ramp from 0 to 1.0 to older trials. This is a heuristic concession to TPE’s assumption of hyperparameter independence: as our search moves through space, we use temporal distance within the experiment as a surrogate for distance in the search space.
It seems what we are doing here is ramping up from 1/n to 1.
Yeah, I noticed this too, but it seems hyperopt also does it this way even though the paper claims otherwise.
Mismatch between paper and code in Hyperband, and now in TPE again! 🤦 I would be curious to compare both versions, but now isn't a good time; maybe when we start benchmarking the algos. Do you think it's worth adding a paper_rampup parameter so that we can test both versions later on? I think a pointer to the code would again be important to justify the discrepancy with the paper.
For the pointer: since we now put the weight-ramp function inside the Parzen estimator, which calls the ramp function, and we already have a pointer from the Parzen estimator to hyperopt, I guess we can omit this one.
For the paper_rampup, I guess we can; let's add it later together with the discrete support, as I assume there will be some other changes.
I believe it should be more explicit. We could re-use the same pointer but clearly state what we are referring to, that is, the use of an additional mid-space point and a ramp-up which differs from the paper: flat (and low) for trials before the 25 last ones, and then linear up to 1.
Okay, let's make it more explicit when we have the paper_rampup weight-update function; at that point we should move the current implementation outside the Parzen estimator so the user can choose.
tests/unittests/algo/test_tpe.py
Outdated
@@ -85,9 +85,12 @@ def test_adaptive_parzen_normal_estimator_weight():
    # prior weight
    mus, sigmas, weights = adaptive_parzen_estimator(obs_mus, low, high, prior_weight=0.5,
                                                     equal_weight=False, flat_num=25)
    print(weights)
Left-over print. :)
I think you are right. From the same paper: http://proceedings.mlr.press/v28/bergstra13.pdf
Let's keep it independent then. :)
The number of the most recent trials which get the full weight where the others will be
applied with a linear ramp from 0 to 1.0. It will only take effect if ``equal_weight``
is ``False``. Default is ``25``.
This is consistent with the paper but not with the implementation, right?
I suddenly realize that I seem to have misunderstood this ramp-up when you said from 1/n to 1. The paper mentions a ramp from 0 to 1.0, but the weight of course cannot be 0, so starting from 1/n is not that different from the paper.
But it's (1/n, 1/n, ... 1/n, growing to 1 through 25 steps), while paper says (0 growing to 1, 1, 1, ... 25 times).
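To make the difference concrete, here is an illustrative side-by-side of the two weighting schemes being discussed, following the linspace form from the snippet above. This is a sketch for comparison, not a drop-in for the actual code:

```python
import numpy

def code_style_weights(total_num, flat_num=25):
    # what the snippet above does: ramp from 1/n up to 1 for the older
    # trials, then full weight for the most recent ``flat_num`` trials
    ramp = numpy.linspace(1.0 / total_num, 1.0, num=total_num - flat_num)
    return numpy.concatenate([ramp, numpy.ones(flat_num)])

def paper_style_weights(total_num, flat_num=25):
    # what the paper describes: a linear ramp from 0 to 1.0 for the older
    # trials, then full weight for the most recent ones
    ramp = numpy.linspace(0.0, 1.0, num=total_num - flat_num)
    return numpy.concatenate([ramp, numpy.ones(flat_num)])
```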
Tree-structured Parzen Estimator (TPE)

The TPE algorithm is one of the Sequential Model-Based Global Optimization (SMBO) algorithms, which build models to propose new points based on the historically observed trials.
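As a rough illustration of that SMBO idea, a toy 1-D version of the TPE loop might look like the following. The fixed-bandwidth Parzen estimate and all names here are simplifications for exposition, not Orion's implementation:

```python
import math
import random

def _parzen_density(x, points, sigma=0.5):
    # simple fixed-bandwidth Parzen (kernel density) estimate
    return sum(math.exp(-0.5 * ((x - p) / sigma) ** 2) for p in points) / len(points)

def tpe_suggest(points, losses, gamma=0.2, n_candidates=24, seed=0):
    """Toy 1-D TPE step: split trials at the ``gamma`` quantile of the
    losses, sample candidates near the "good" trials, and keep the one
    maximizing the density ratio l(x) / g(x)."""
    rng = random.Random(seed)
    order = sorted(range(len(points)), key=lambda i: losses[i])
    n_below = max(1, int(gamma * len(points)))
    below = [points[i] for i in order[:n_below]]   # best-performing trials
    above = [points[i] for i in order[n_below:]]   # all the others
    candidates = [rng.choice(below) + rng.gauss(0, 0.5) for _ in range(n_candidates)]
    return max(candidates,
               key=lambda x: _parzen_density(x, below)
               / (_parzen_density(x, above) + 1e-12))
```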
@bouthilx