Is creating an ensemble out of the TPOT population useful? #105

rhiever · 2016-03-06T14:59:47Z

One of the common arguments against population-based optimization methods is that they are significantly slower than methods that work with one (or a few) solutions at a time. I think one smart way to turn that argument on its head would be to see if creating an ensemble out of the TPOT population would be useful.

An initial exploration could be to run TPOT as normal, and collect additional statistics about the performance of the population as an ensemble. This could be done with a very "hacky" version of TPOT; no need to engineer it before we prove this idea's efficacy.

Basically, for every generation:

Store the classifications of every individual
Use various ensemble methods to combine their classifications into a single classification (min, max, threshold, majority, weighted based on performance on training set)
Plot the effectiveness of all of these population ensemble methods over time

What to look for:

Does the population ensemble perform better than the absolute best individual (early on, later on, always)?
Does the population ensemble perform better as more generations pass?
What ensemble method(s) perform best?
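
The per-generation comparison described above can be sketched in a few lines. This is only an illustration, not TPOT code: `majority_vote`, `accuracy`, and the toy predictions are hypothetical names and data, and each individual is assumed to have already produced one predicted label per test instance.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-individual label lists into one label per instance.

    predictions: list of equal-length label lists, one per individual.
    Ties go to whichever label Counter.most_common returns first.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def accuracy(predicted, actual):
    """Fraction of instances where the predicted label matches the true one."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy data (made up for illustration): three individuals, five test instances.
y_true = [0, 1, 1, 0, 1]
population_preds = [
    [0, 1, 0, 0, 1],  # individual A
    [0, 0, 1, 0, 1],  # individual B
    [1, 1, 1, 0, 0],  # individual C
]

best_individual = max(accuracy(p, y_true) for p in population_preds)
ensemble = accuracy(majority_vote(population_preds), y_true)
print(best_individual, ensemble)
```

In this toy case the individuals make their mistakes on different instances, so the vote beats the best single individual. Whether a real TPOT population is diverse enough for that to happen is exactly what the experiment is meant to find out.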

bartleyn · 2016-03-07T23:06:30Z

Should we choose a couple/few data sets to test on, to try and create a more robust analysis? Which ones might be most appropriate? MNIST, wine, breast cancer?

rhiever · 2016-03-07T23:18:08Z

If you run this script, you'll have access to a whole bunch of 'em. Take your pick. :-)

I think just one data set is fine to start with though, as a proof of concept.

rhiever · 2016-03-12T16:46:25Z

Ping. Run into any issues with this?

bartleyn · 2016-03-14T18:08:19Z

I've created my own version of the eaSimple algorithm where we can dig into the individual/ensemble statistics, but I've had difficulty in exposing and aggregating each individual pipeline's guesses. But in the last day or so I've broken through that, and have started getting numbers. I think I'm gonna spin up a cheap AWS instance and just run a ton of tests.

rhiever · 2016-03-14T18:31:26Z

I don't think you'll need to roll your own version of eaSimple. Here's some code from another project I ran where you can store the population in the log and then do post-analysis on the population in the log.

import copy

import numpy as np
from deap import algorithms, tools

# Assumes pop, toolbox, and hof are already set up as in a normal run.
stats = tools.Statistics(lambda ind: (int(ind.fitness.values[0]), round(ind.fitness.values[1], 2)))
stats.register("Minimum", np.min, axis=0)
stats.register("Maximum", np.max, axis=0)
# This should store a copy of pop every generation. The lambda ignores its
# argument (the fitness tuples) and copies pop from the enclosing scope.
stats.register("Population", lambda x: copy.deepcopy(pop))

# Use normal TPOT settings, of course -- not these settings
pop, log = algorithms.eaSimple(pop, toolbox, cxpb=0., mutpb=0.5, ngen=1000,
                               stats=stats, halloffame=hof, verbose=False)

Let me know if that works. Alternatively, you can modify the HOF to store the 100 best pipelines discovered so far, and change

stats.register("Population", lambda x: copy.deepcopy(pop))

to

stats.register("HOF", lambda x: copy.deepcopy(hof))

and that will only change the analysis slightly -- using the best 100 pipelines ever as the ensemble instead of the pipelines currently in the population.
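
For the post-analysis step, here is a rough sketch of walking such a log generation by generation. It assumes each log entry behaves like a dict with a `"gen"` key and the registered `"Population"` (or `"HOF"`) key, which is how I understand DEAP's Logbook to store plain Statistics records. `ensemble_accuracy_by_generation` and the `predict` callable are hypothetical; in TPOT, `predict` would have to compile and run each pipeline on the test data.

```python
from collections import Counter


def ensemble_accuracy_by_generation(log, predict, y_true, key="Population"):
    """Return [(generation, majority-vote accuracy), ...] for a DEAP-style log.

    log: iterable of per-generation dict-like records.
    predict: hypothetical callable mapping an individual to a list of labels.
    key: record field holding the stored individuals ("Population" or "HOF").
    """
    results = []
    for record in log:
        preds = [predict(ind) for ind in record[key]]
        # Majority vote across individuals for each test instance.
        voted = [Counter(col).most_common(1)[0][0] for col in zip(*preds)]
        acc = sum(v == t for v, t in zip(voted, y_true)) / len(y_true)
        results.append((record["gen"], acc))
    return results


# Stand-in log (made up): each "individual" is simply its own prediction list.
fake_log = [
    {"gen": 0, "Population": [[0, 0, 1], [1, 0, 0], [0, 1, 1]]},
    {"gen": 1, "Population": [[0, 1, 1], [0, 1, 0], [0, 1, 1]]},
]
curve = ensemble_accuracy_by_generation(
    fake_log, predict=lambda ind: ind, y_true=[0, 1, 1]
)
print(curve)
```

Passing `key="HOF"` would run the same analysis over the stored hall of fame instead of the current population.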

bartleyn · 2016-03-14T18:36:39Z

Interesting, I had convinced myself that the Statistics object wouldn't be able to give us access to the population directly. I'll test this out; it shouldn't change my analysis after the fact that much.

bartleyn · 2016-03-14T21:15:10Z

Got it working with the statistics object, thanks for the tips. I'm gonna spin up these tests.

rhiever · 2016-03-14T22:59:37Z

Great! 👍

bartleyn · 2016-03-20T03:06:15Z

Some of the shorter tests are wrapping up and I think I have enough for some preliminary results -- I'll try to clean things up and link them here in the next couple of days.

rhiever · 2016-03-20T03:42:10Z

Let me know if you want to schedule another video chat. I'm excited to hear how this turned out!

bartleyn · 2016-03-23T03:26:51Z

Hey so I've cleaned up some of my data and made it available here. I've been trying to come up with useful visualizations, and figured it'd be more productive to share it in the meantime.

rhiever · 2016-03-23T11:10:15Z

What does each of the new columns represent? I'm looking at the data this morning.

bartleyn · 2016-03-23T16:10:52Z

Alright, so I took the same ideas from the consensus operators that we tried and applied them here. Each individual / population is evaluated on the test dataset.

Weights

acc_* – each individual's guess is weighted according to its own accuracy.
uni_* – each individual's guess has the same weight.

Selection

*_max_class – the class that has the highest weight (or in the uni case, the highest frequency) is the ensemble's guess for that test instance.
*_mean_class – the class with the mean weight / frequency is the ensemble's guess.
*_median_class – the class with the median weight / frequency is the ensemble's guess.
*_min_class – the class with the minimum weight / frequency is the ensemble's guess.
*_threshold_class – the first class that passes a certain threshold in percentage of weight is the ensemble's guess.
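
To make the weighting and selection schemes concrete, here is a minimal sketch for a single test instance. All function names, the example votes, and the 0.5 default threshold are assumptions for illustration, not the code behind the shared data. The mean and median variants are left out because the descriptions above don't pin down how a single class is picked for them.

```python
from collections import defaultdict


def class_weights(votes, accuracies=None):
    """Total weight per class for one test instance.

    votes: one predicted label per individual.
    accuracies: per-individual weights (the acc_* case); None gives every
        individual the same weight (the uni_* case).
    """
    if accuracies is None:
        accuracies = [1.0] * len(votes)
    totals = defaultdict(float)
    for label, weight in zip(votes, accuracies):
        totals[label] += weight
    return dict(totals)


def max_class(weights):
    """Class with the highest total weight."""
    return max(weights, key=weights.get)


def min_class(weights):
    """Class with the lowest total weight."""
    return min(weights, key=weights.get)


def threshold_class(weights, threshold=0.5):
    """First class whose share of the total weight reaches the threshold."""
    total = sum(weights.values())
    for label, weight in weights.items():
        if weight / total >= threshold:
            return label
    return None  # no class reached the threshold


# Made-up example: four individuals vote on one instance.
votes = ["a", "b", "b", "c"]
accs = [0.9, 0.3, 0.3, 0.2]
uni = class_weights(votes)        # uniform weights
acc = class_weights(votes, accs)  # accuracy weights
print(max_class(uni), max_class(acc))
```

With uniform weights the two "b" votes win, while accuracy weighting lets the single high-accuracy individual override them. That difference is what separates the `uni_*` columns from the `acc_*` columns.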

bartleyn · 2016-03-24T03:46:42Z

I'm getting a lot of variance in a few of the columns, so I may run some more trials.

rhiever · 2016-03-24T19:50:22Z

What benchmarks are you running it on? It looks like the classification accuracy for many of the runs is fairly high, so there probably isn't much room for ensembles to improve. What about a harder data set, e.g., GAMETES-hard? Maybe we should just run a large benchmark on the HPCC?

bartleyn · 2016-03-24T22:54:43Z

You're right that the data I was using was perhaps too easy: my testing code ran on the sklearn digits dataset rather than MNIST! This is embarrassing, to say the least. On the bright side, at least these tests suggest that the operators are somewhat robust in the smaller-data, slightly-longer, slightly-bigger-population 'regime'.

In the interest of time, how about I run the same tests on random samples from GAMETES-hard and MNIST proper to see if there's promise, and in the meantime we can prep for a larger HPCC benchmark? I can run my tests in a more parallel manner so it's not a week turnaround.

rhiever · 2016-03-24T23:59:31Z

Sounds good to me. Want to send in a PR on this branch?

I'm currently finishing up some other TPOT benchmarks -- shouldn't take more than the weekend -- but I can slate this benchmark for the next batch.

rhiever added enhancement question labels Mar 6, 2016

rhiever mentioned this issue Mar 7, 2016

Add Consensus Operators #96

Closed

rhiever added the being worked on label Apr 25, 2016

rhiever removed the being worked on label Aug 13, 2016

AIAdventures mentioned this issue Jun 6, 2017

Titanic example -problem with 2nd last cell. #492

Closed

saddy001 mentioned this issue Mar 20, 2018

Segfault on optimization process #676

Closed

perib mentioned this issue Sep 21, 2023

TPOT2 and the future of TPOT development -- From the Devs #1322

Open
