[WIP] Use dask.delayed within fit #730
Conversation
Hi @mrocklin, thank you for this PR. I rebased it onto the development branch and will check it later.
tpot/base.py
Outdated
@@ -46,6 +46,7 @@
 from tqdm import tqdm
 from copy import copy, deepcopy

+import dask
Please add installation of dask into ci/.travis_install.sh and .appveyor.yml for unit tests
Done. Though, just to reiterate, I'm not trying to get tests to work here at all. This is only up here for conversation.
OK, thanks. But it seems that it passed almost all the unit tests. Great!
I just ran a quick test in my Windows environment and got the two IndexErrors below:
Any idea?
tpot/gp_deap.py
Outdated
    warnings.simplefilter('ignore')
    # TODO: dive into and delay fit/transform calls on sklearn_pipeline.steps appropriately
    # This will help with shared intermediate results, profiling, etc.
    # It looks like the dask_ml.model_selection._search.do_fit_and_score might have good logic here
@TomAugspurger is this task easy for you by any chance?
Sure, can take a look today I think.
TPOT is a fun problem to play with :)
Alternatively @jcrist, if you're around and have time: you've probably done this before :)
I should add: currently, TPOT uses a generational evolutionary algorithm (EA) model. That means that the EA keeps some fixed population of pipelines (by default, 100) and has to evaluate all of them before proceeding to the next generation. From an engineering perspective, generational EAs can be problematic because there can be a single pipeline that takes forever to evaluate and that will hold up the whole process. The generational EA cannot proceed until every pipeline is evaluated.
For the longest time, we have wanted to re-engineer TPOT to follow a steady state EA. With a steady state EA, the EA still maintains a fixed population of pipelines (still 100 by default), but the pipelines are evaluated in parallel and the "parents" of the new pipelines are chosen based on the current population of pipelines that are already evaluated. That way, if one pipeline takes forever to evaluate, the EA can keep on optimizing based on the results from the faster pipelines.
The diagram below sort of communicates those differences, if you stare at it long enough.
[image: https://user-images.githubusercontent.com/1719223/42972661-084d073c-8b65-11e8-9692-c54c68ff2002.png]
Just something to keep in mind as this branch is worked on. I suspect that dask would seamlessly support this huge upgrade to TPOT.
Yup. That would be pretty easy to do with Dask. You would need to use
the distributed scheduler (which, despite its name, is quite lightweight
on a single machine, a single process, or even a single thread). This would
force a dependency on Tornado, though.
http://dask.pydata.org/en/latest/scheduling.html#dask-distributed-local
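To make the steady-state idea concrete, here is a minimal, illustrative sketch (not code from this PR) of how evaluations could be streamed through dask.distributed's as_completed: a new candidate is submitted as soon as any running evaluation finishes, so one slow pipeline never blocks the rest. The make_offspring and evaluate_pipeline helpers are hypothetical placeholders for TPOT's crossover/mutation and scoring logic.

# Illustrative steady-state evaluation loop on dask.distributed.
# make_offspring and evaluate_pipeline are hypothetical placeholders.
from dask.distributed import Client, as_completed

def steady_state_search(make_offspring, evaluate_pipeline,
                        initial_population, n_evaluations):
    client = Client()  # local "distributed" scheduler; fine on one machine
    evaluated = []     # (score, pipeline) results, filled as they arrive

    # Submit the whole initial population up front.
    queue = as_completed(
        [client.submit(evaluate_pipeline, p) for p in initial_population]
    )
    submitted = len(initial_population)

    for future in queue:
        evaluated.append(future.result())
        if submitted < n_evaluations:
            # Pick parents from whatever has already finished and submit a
            # new candidate immediately, without waiting for a generation.
            child = make_offspring(evaluated)
            queue.add(client.submit(evaluate_pipeline, child))
            submitted += 1

    client.close()
    return evaluated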
diff --git a/tpot/gp_deap.py b/tpot/gp_deap.py
index 3a5bfcf..e99d494 100644
--- a/tpot/gp_deap.py
+++ b/tpot/gp_deap.py
@@ -433,7 +433,7 @@ def _wrapped_cross_val_score(sklearn_pipeline, features, target,
     try:
         return _fit_and_score(*args, **kwargs)
     except Exception:
-        return -float('inf')
+        return [[- float('inf'), -float('inf')]]
     with warnings.catch_warnings():
         warnings.simplefilter('ignore')

@mrocklin are you able to give me push access to this branch?
@TomAugspurger you should now have push access. FWIW I'm less concerned about getting tests to pass here than I am about investigating how dask can be useful. I think that the two approaches to that are:
Pushed a rough attempt at 1 (this is still very much a work in progress). Having visibility into what's actually going on is nice. I'll test this out on an actual cluster later, but things should work nicely for CPU-bound problems like this. I'm still working a bit to understand the various types and computations as things flow through _wrapped_cross_val_score, _fit_and_score, etc.
Is it possible to specify parameters such that all the randomness is removed from a TPOT run?
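For reference, here is roughly what approach 1 can look like. This is a simplified sketch with my own naming (fit_transform_step, delayed_pipeline_fit), not the code pushed to this branch; the point is that each step's fit/transform becomes its own task in the graph, which is what enables shared intermediate results and per-step profiling.

import dask
from sklearn.base import clone

@dask.delayed
def fit_transform_step(step, X, y):
    # Fit a single transformer and return (fitted_step, transformed_X).
    step = clone(step)
    return step, step.fit_transform(X, y)

@dask.delayed
def fit_final_step(estimator, X, y):
    return clone(estimator).fit(X, y)

def delayed_pipeline_fit(sklearn_pipeline, X, y):
    # Build a task graph with one node per pipeline step rather than a
    # single opaque call to sklearn_pipeline.fit(X, y).
    Xt = X
    for name, step in sklearn_pipeline.steps[:-1]:
        result = fit_transform_step(step, Xt, y)
        Xt = result[1]  # delayed reference to the transformed data
    return fit_final_step(sklearn_pipeline.steps[-1][1], Xt, y)

Computing the returned Delayed (with .compute() locally, or on a distributed client) then executes the per-step tasks and shows up nicely in the diagnostics.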
Going forward, are the TPOT devs interested in an optional dependency on Dask-ML, which would add new dependencies on dask (pure Python), six, and multipledispatch? The diagnostics also require distributed (which depends on Tornado) and Bokeh. To use things on a cluster, you'll need distributed. The benefits would be:
If so, what API do you envision?
I suspect that we could put the imports inside the
That's a nice video @TomAugspurger :) Seeing this scale would be interesting.
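For what it's worth, a small sketch of that kind of deferred import; the use_dask flag and _check_dask_installed helper below are hypothetical names, not TPOT's actual API.

def _check_dask_installed():
    # Import lazily so dask stays an optional dependency.
    try:
        import dask         # noqa: F401
        import distributed  # noqa: F401
    except ImportError as err:
        raise ImportError(
            "Dask-based pipeline evaluation requires the optional "
            "dependencies 'dask' and 'distributed': "
            "pip install dask distributed"
        ) from err

# Inside a hypothetical fit(), before any dask code runs:
# if self.use_dask:
#     _check_dask_installed()  # fail fast with an actionable message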
I would hope that this would be pretty apparent by looking at the Graph diagnostic page (if you go that route, I recommend avoiding the newest Bokeh 0.13.0, which has layout issues).
Also, I'm curious, why
Indeed, this would be entirely optional. An ImportError could be raised early in
Leftover from a debugging session.
It's a bit messy, though there seems to be some sharing...
Yes, TPOT has a random_state parameter.
There should be shared information between pipelines, especially with larger population sizes. The optimization procedure within TPOT often creates slightly-modified copies of existing pipelines, so there is a possibility to share information between previously-evaluated pipelines and new pipelines (e.g., a "parent" and "child" with shared components), as well as between pairs of new pipelines (e.g., two "twin" pipelines).
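A tiny illustration (not TPOT code) of the sharing opportunity described above: when a "parent" and "child" pipeline reuse the same delayed prefix, that prefix appears only once in the combined graph and is computed only once.

import dask
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

fit_transform = dask.delayed(lambda est, X, y: est.fit_transform(X, y))
fit = dask.delayed(lambda est, X, y: est.fit(X, y))

X, y = load_digits(return_X_y=True)

shared = fit_transform(PCA(n_components=10), X, y)  # common prefix
parent = fit(LogisticRegression(), shared, y)
child = fit(DecisionTreeClassifier(), shared, y)    # reuses `shared`

# Both graphs contain the single `shared` task, so computing the two
# pipelines together runs PCA only once.
dask.compute(parent, child)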
wrt dependencies: Given that we strongly recommend installing TPOT on top of an Anaconda Python install, I suspect it wouldn't be too painful to add dask and related packages as dependencies. However, I would like for it to be an optional dependency for users who don't want/need parallelization at dask's scale.
wrt API, thinking out loud: We definitely want to maintain a
Thanks for the heads-up.
The API design here is a bit tricky, since Dask could conceivably help TPOT in two ways:
from sklearn.externals import joblib

with joblib.parallel_backend('dask', scatter=[X, y]):
    tpot.fit(X, y)

This would only potentially be useful for users with a cluster.
Without having thought about it too much, I'd propose we reserve the "backend" terminology for which joblib backend is used (multiprocessing / threaded / dask), and guide users to the
I'll do some benchmarking on the two approaches on a cluster tomorrow.
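If the keyword route wins out, usage might look roughly like the sketch below; the use_dask parameter name is only a placeholder at this point in the discussion, and the digits data and local Client are just for illustration.

from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

client = Client()  # local scheduler; also serves the diagnostics dashboard

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tpot = TPOTClassifier(generations=5, population_size=20,
                      random_state=0, verbosity=2,
                      use_dask=True)  # placeholder name for the proposed flag
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))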
Thank you @TomAugspurger for the good progress and the nice video. Regarding the API, I prefer the 2nd approach, but I agree with @rhiever that dask should be an optional dependency for users, though it should be highly recommended in the installation guide. I think we could add one parameter (I like
BTW, I think the crossover and mutation steps are also very time-consuming, but we can add parallel computing support for them later.
Ok, now I did
Looks good so far. No exceptions for several runs. Of course, I did not check the resulting pipelines. The only difference is some warnings that usually do not appear at verbose level 2:
As far as I can tell, TPOT uses 3 verbose levels. Only level 3 shows everything; even level 2 is quite silent, showing only deprecation warnings, errors, and essential scores from time to time. I think level 1 is for users who aren't that interested in what TPOT does ;-)
@saddy001 Not hijacking at all! Thank you for reporting issues here. TPOT will print everything (all caught warnings and exceptions) when evaluating pipelines or when generating them, whether via random creation in the initial generation or via crossover/mutation in later generations. Most of the warnings/exceptions come from scikit-learn operators or TPOT built-in operators, and they should be raised in 3 cases:
Those warnings/exceptions at verbose level 3 are for diagnosis.
This branch is still under dev/test now. I am not sure if
I've got another one, but this time a crash.
The last line means it core dumped. Do you have an idea just from "corrupted size vs. prev_size"? I couldn't get a minimal example to crash.
Which scikit-learn version? 0.19.x has a thread-safety bug with a similar error; it was fixed in scikit-learn/scikit-learn#9569 on master.
It's 0.19.1. The fix is from 2017. Isn't it pip-packaged already? |
No, it wasn't backported.
There seems to be an unrelated failure on master now. I didn't dig deeply, but IIUC, it comes from NumPy 1.15.0 (or maybe 1.15.1) raising a warning when an internal umath module is imported. Older versions of scikit-learn imported this, and there's a test ensuring stdout is quiet, which is now failing because of the warning. I've attempted to fix that in d279253, which I can split into a separate PR if desired.
My attempted fix didn't work. Just skipping for now on this branch, though that's probably not the best solution. |
Ok, this seems to be reliably passing now that I've ensured that Dask's
Some questions on how to proceed:
@TomAugspurger Thank you very much for debugging the CI. The answers to the questions above:
Yes, I think it is fine to skip that test for now.
Great, we could update the installation docs after the new release.
Thanks for the docs. I think we should strongly recommend that TPOT users use
I agree we can add parallelization of mutation and crossover as a follow-up.
@TomAugspurger Please let me know if you are OK with merging this PR. I will go ahead and merge it into the dev branch and then merge #740. After updating the docs and some final checks, I think we can have a minor release of TPOT.
Thanks @weixuanfu, I just pushed a commit removing some print statements I added and fixing a doc issue. Once that passes I think this will be good to go.
Alright, all green other than the Coveralls failure, which I think can be ignored.
Yep, the 0.1% coverage decrease can be ignored for now. Thanks!
@TomAugspurger
I thought I submitted this already, but couldn't find it. My apologies if this is a double-post
This should not be merged; it likely breaks existing behavior.
This addresses #304. It sprinkles dask.delayed in a couple of places in the current codebase. To improve things we should do the following:

1. Rework the _fit_and_score function to use dask.delayed on every step of a pipeline. This would help to improve the sharing of intermediate results and would also improve diagnostics. dask_ml/model_selection/_search.py::do_fit_and_score does some of this, but it was heavily optimized for efficiency. It would be good to do the same thing, but with dask.delayed here, which would probably be nicer for external devs even if it adds a millisecond or two of overhead. cc @jcrist

If anyone wants to take a shot at task 1, I suspect that this would be interesting work.