MODL Optimal Binning Does not work with large arrays? #28

Closed
Sinansi opened this issue May 15, 2020 · 2 comments

Sinansi commented May 15, 2020

With 100 elements per class, it works. With 1000 elements per class, it never completes.

This works:

using Discretizers

# 100 samples per class (200 total)
data = [randn(100); randn(100)]
labels = [fill(:cat, 100); fill(:dog, 100)]
integer_labels = encode(CategoricalDiscretizer([:cat, :dog]), labels)
edges = binedges(DiscretizeMODL_Optimal(), data, integer_labels)

This does not finish:

data = [randn(1000); randn(1000)]
labels = [fill(:cat, 1000); fill(:dog, 1000)]
integer_labels = encode(CategoricalDiscretizer([:cat, :dog]), labels)
edges = binedges(DiscretizeMODL_Optimal(), data, integer_labels)

Is there a maximum array size for MODL optimal supervised binning?

Thank you!

tawheeler (Contributor) commented May 15, 2020

MODL is quadratic in the sample count, so increasing the sample count by a factor of 10 theoretically increases the runtime by a factor of 100. In the paper we used MODL on a dataset with 1372 samples, so it is possible; it just takes a while. On top of this, if the larger sample count pushes you into swap memory, the program is simply going to be super slow.
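
The scaling argument above can be turned into a rough estimate. This is only a sketch: `projected_runtime` is a hypothetical helper, not part of Discretizers.jl, and the 0.5 s reference time is a placeholder rather than a measurement. It assumes runtime grows exactly with the square of the sample count and ignores memory effects such as swapping.

```julia
# Rough projection under the quadratic-scaling assumption.
# `t_ref` is a runtime measured at `n_ref` samples; `n` is the target sample count.
projected_runtime(t_ref, n_ref, n; exponent = 2) = t_ref * (n / n_ref)^exponent

t_small = 0.5                                    # placeholder: seconds for 200 samples
t_large = projected_runtime(t_small, 200, 2000)  # 10x the samples, so 100x the time
println("projected runtime: ", t_large, " s")
```

Timing the 200-sample case with `@elapsed` and substituting the result for `t_small` gives an idea of how long the 2000-sample case might take before committing to it.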

Optimal binning is inherently expensive. Perhaps suboptimal binning is sufficient for your application? You could base your bin edges on a subsampling of the data.
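
One way the subsampling idea could look is sketched below. `stratified_subsample` is a hypothetical helper written for illustration, not a Discretizers.jl function. Drawing the same number of samples from each class is an assumption made here so that both labels stay represented, and the per-class size of 100 is arbitrary.

```julia
using Random

# Hypothetical helper (not part of Discretizers.jl): draw up to `per_class`
# samples from each class, without replacement, keeping the original order.
function stratified_subsample(data::AbstractVector, labels::AbstractVector{<:Integer},
                              per_class::Integer; rng::AbstractRNG = Random.default_rng())
    length(data) == length(labels) ||
        throw(ArgumentError("data and labels must have equal length"))
    keep = Int[]
    for c in unique(labels)
        idx = findall(==(c), labels)           # positions belonging to class c
        k = min(per_class, length(idx))        # never request more than exist
        append!(keep, shuffle(rng, idx)[1:k])  # k random positions from class c
    end
    sort!(keep)
    return data[keep], labels[keep]
end

rng = MersenneTwister(0)
data = [randn(rng, 1000); randn(rng, 1000)]
integer_labels = [fill(1, 1000); fill(2, 1000)]  # stands in for the encoded :cat/:dog labels
sub_data, sub_labels = stratified_subsample(data, integer_labels, 100; rng = rng)
```

The subsample can then be passed to the same call used in the original post, `binedges(DiscretizeMODL_Optimal(), sub_data, sub_labels)`, and the resulting edges applied to the full dataset. The edges are optimal only for the subsample, so it is worth checking that they stay stable across a few different random draws.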

It is entirely possible that the algorithm could be more effectively implemented in Julia. Feel free to contribute a PR!

Sinansi (Author) commented May 17, 2020

@tawheeler Thanks for your reply. I applied optimal binning by subsampling and it worked well for my case. All good :)

Sinansi closed this as completed May 17, 2020