[Model serialization] Exporting TabularLearner via learn.export() leads to huge file size #2945

Closed
rsayn opened this issue Nov 6, 2020 · 9 comments

@rsayn

rsayn commented Nov 6, 2020

Please confirm you have the latest versions of fastai, fastcore, fastscript, and nbdev prior to reporting a bug (delete one): YES

Describe the bug
Exporting a TabularLearner via learn.export() leads to a huge pickle file (>80 MB).

To Reproduce
Steps to reproduce the behavior:

  1. Create a TabularLearner
  2. Train it
  3. Export it to a pickle file via learn.export(filepath)

Expected behavior
The pickle file should be much smaller, since learn.export() is not supposed to include the training data.

Error with full stack trace
N/A

Additional context
By creating learners from DataFrames of varying sizes, I noticed that the pickled file grows with the size of the dataset, even though learn.dls is empty (as expected) after re-loading the serialized file.
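
A minimal sketch of this reproduction, assuming the adult sample that ships with fastai; the column choices and file name are illustrative:

```python
import os
import pandas as pd
from fastai.tabular.all import *

# Build a TabularLearner from the adult sample shipped with fastai
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names=['workclass', 'education', 'marital-status'],
                   cont_names=['age', 'fnlwgt', 'education-num'],
                   y_names='salary',
                   splits=RandomSplitter()(range_of(df)))
dls = to.dataloaders(bs=64)
learn = tabular_learner(dls, metrics=accuracy)
learn.fit_one_cycle(1)

# Export and check the file size; it grows with len(df)
learn.export('export.pkl')
print(f"exported size: {os.path.getsize(learn.path/'export.pkl')/1e6:.1f} MB")
```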

@muellerzr
Contributor

There's a temporary solution here: https://walkwithfastai.com/tab.export#Exporting-our-TabularPandas

@muellerzr
Contributor

I've narrowed down the issue to the ReadTabBatch transform; we're always storing a copy of the dataframe in memory through it.
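
A rough way to sanity-check that diagnosis, reusing the df and learn objects from the sketch above, is to compare the pickled size of the raw dataframe with the exported file:

```python
import os
import pickle

# If the export is carrying a copy of the data, the two numbers
# should be of the same order of magnitude.
df_bytes = len(pickle.dumps(df))
export_bytes = os.path.getsize(learn.path/'export.pkl')
print(f"dataframe: {df_bytes/1e6:.1f} MB, export: {export_bytes/1e6:.1f} MB")
```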

@jph00
Member

jph00 commented Nov 7, 2020

Should be fixed now.

jph00 closed this as completed Nov 7, 2020
jph00 added the bug label Nov 7, 2020
@rsayn
Author

rsayn commented Nov 12, 2020

Hi, I upgraded fastai to the latest version (2.1.5) and I'm still experiencing the same issue.

I created a notebook on Colab to reproduce the problem:
https://colab.research.google.com/drive/1yvQwIrC9zfI0jq5xZq_h4JVkoXWszcHC?usp=sharing

@claudiobottari

I've done the same (after upgrading to 2.1.5) and I see the same issue:
A model trained on adult.csv exports to a 161 KB file.
A model trained on the same dataset expanded to 500K rows exports to an 8 MB file.

https://colab.research.google.com/drive/1zhSKeJCB5CvTiQKgYWubey9w1VzbNiG2?usp=sharing

@muellerzr
Contributor

@claudiobottari I've found the issue. It happens after we fit the model.

@muellerzr
Contributor

muellerzr commented Nov 12, 2020

Here's a minimal reproducer showing that a duplicate validation dataframe is added after fit:

https://gist.github.com/muellerzr/df3fc4a12b021be85639afddab3c5d32
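
A condensed sketch of that kind of check (not the gist itself): export the same learner before and after fitting and compare file sizes, reusing the dls built in the reproduction sketch above.

```python
import os

# Assumes `dls` and the fastai imports from the reproduction sketch above
learn = tabular_learner(dls, metrics=accuracy)

learn.export('before_fit.pkl')   # export an untrained learner
learn.fit_one_cycle(1)
learn.export('after_fit.pkl')    # export again after fitting

for fname in ('before_fit.pkl', 'after_fit.pkl'):
    print(fname, f"{os.path.getsize(learn.path/fname)/1e6:.2f} MB")
```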

@jph00 we should reopen this issue

jph00 reopened this Nov 12, 2020
@ababino
Contributor

ababino commented Nov 13, 2020

The problem is that ProgressCallback is not deleting its pbar attribute after fit. I modified @muellerzr's gist to show how to solve the problem:
https://gist.github.com/ababino/2a2c67ac264e2ed8c95144377b9be2b4
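
A minimal sketch of that kind of cleanup before exporting, based on the diagnosis above; only pbar is named in this thread, and mbar is included as an extra guess:

```python
# Drop any leftover progress-bar state from the callbacks before exporting.
for cb in learn.cbs:
    for attr in ('pbar', 'mbar'):
        if attr in cb.__dict__:
            delattr(cb, attr)

learn.export('export.pkl')
```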

@rsayn
Author

rsayn commented Dec 2, 2020

@muellerzr can you provide a snippet showing how to deserialize a bugged model and re-serialize it at the correct size?
I have a lot of models affected by this issue that I can't retrain.
Thanks in advance.
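
One possible approach, not confirmed in this thread, is to combine load_learner with the pbar cleanup sketched above; the file names here are placeholders:

```python
from fastai.tabular.all import load_learner

# Load an oversized export, strip leftover progress-bar state, and re-export.
learn = load_learner('bugged_model.pkl', cpu=True)
for cb in learn.cbs:
    for attr in ('pbar', 'mbar'):
        if attr in cb.__dict__:
            delattr(cb, attr)
learn.export('fixed_model.pkl')  # written under learn.path
```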
