[Model serialization] Exporting TabularLearner via learn.export() leads to huge file size #2945
Comments
There's a temporary solution here: https://walkwithfastai.com/tab.export#Exporting-our-TabularPandas
I've narrowed down the issue to the …
Should be fixed now.
Hi, I upgraded fastai to the latest version (2.1.5) and I'm still experiencing the same issue. I created a notebook on Colab to reproduce the problem: …
I've done the same (after upgrading to 2.1.5) and I also experienced the same issue: https://colab.research.google.com/drive/1zhSKeJCB5CvTiQKgYWubey9w1VzbNiG2?usp=sharing
@claudiobottari I've found the issue. It's after we fit the model …
Here's a minimal reproducer showing that there is a duplicate validation dataframe being added after fit: https://gist.github.com/muellerzr/df3fc4a12b021be85639afddab3c5d32 @jph00 we should reopen this issue
The problem is that …
@muellerzr can you provide a snippet on how to deserialize a bugged model and re-serialize it with the correct size?
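In the meantime, the general pattern for shrinking an already-exported model is: unpickle it, drop the stray reference to the large dataframe, and serialize again. The sketch below illustrates this with plain `pickle` on a stand-in class — `FakeLearner` and its `cached_valid_df` attribute are hypothetical names for illustration, not fastai's actual API:

```python
import pickle

class FakeLearner:
    """Stand-in for a learner whose dataloaders accidentally kept a
    reference to the validation dataframe after fit (hypothetical class,
    not fastai's real Learner)."""
    def __init__(self, rows):
        self.weights = [0.0] * 100   # small model state
        self.cached_valid_df = rows  # large accidental reference

def strip_and_repickle(learner):
    # Drop the heavyweight cached data before serializing again.
    learner.cached_valid_df = None
    return pickle.dumps(learner)

bugged = FakeLearner(rows=list(range(1_000_000)))
before = len(pickle.dumps(bugged))
after = len(strip_and_repickle(bugged))
print(before, after)  # the stripped pickle is orders of magnitude smaller
```

For a real fastai learner the same idea applies: find the attribute holding the duplicate dataframe and clear it before calling export again.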
Please confirm you have the latest versions of fastai, fastcore, fastscript, and nbdev prior to reporting a bug (delete one): YES
Describe the bug
Exporting TabularLearner via learn.export() leads to huge Pickle file size (>80MB).
To Reproduce
Steps to reproduce the behavior:
1. Create a TabularLearner
2. Call learn.export(filepath)
Expected behavior
The pickle file should be much smaller.
Error with full stack trace
N/A
Additional context
By creating different learners from DataFrames of varying size, I noticed that the size of the pickled file grows with the dataset dimension, even though after re-loading the serialized file
learn.dls
is empty, as expected.
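The scaling observed above can be reproduced generically: any object that keeps a private reference to its training data will serialize in proportion to that data, even if the attribute that users normally inspect looks empty. A stand-alone illustration (no fastai involved; `Holder` and its `dls` property are hypothetical names):

```python
import pickle

class Holder:
    """Object whose public accessor hides a private reference to raw data."""
    def __init__(self, data):
        self._data = data  # rides along in every pickle of this object

    @property
    def dls(self):
        return []  # looks empty, but _data is still serialized

# Pickle size grows with the hidden payload, not with what `dls` shows.
sizes = [len(pickle.dumps(Holder(list(range(n))))) for n in (10, 1_000, 100_000)]
print(sizes)
```

This is why checking `learn.dls` after reloading is not enough to rule out retained data: the duplicate dataframe can live on another attribute entirely.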