TabularPandas data transform reproducibility #2826
@jph00 this'll be something I want to pick up and implement, as I could easily see its value when folks want to use fastai tabular for preprocessing, then use it with other libraries and move into production. In my head I view it as something like …
Think this one was completed in #2857
Yup! This can be closed :)
I think this feature may have been deprecated or never merged. The …
@HenryDashwood I've done so on Walk with fastai, see here: https://walkwithfastai.com/tab.export (sadly the PR got lost to time, so it didn't quite get merged 😢)
Yeah that's what I've gone with. Seems to work!
Feature requests should first be proposed on the forum.
Link to forum discussion.
Zach said it's useful enough to put in a PR without waiting for a forum discussion. He OK'd me calling him out in this way in the PR :)
I did post this in the fastai v2 tabular thread here: https://forums.fast.ai/t/fastai-v2-tabular/53530/235
Is your feature request related to a problem? Please describe.
I have trained a model using XGBoost, but I did the data processing for the training and validation sets using TabularPandas (similar to the approach in the fastai book). I did not use dataloaders or a learner object. Now I need to use it for monthly inference, but the only way I can get the data to process properly is to have both a training and a validation set. I just want to apply the transforms to the validation set the same way each month. For example, I believe whether a _na column is created depends on whether the data given to the processor contains any null values, and the order in which categories appear in the training set also matters for Categorify.
For inference, I just want to process the "validation set" and make my predictions.
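To make the dependency on training data concrete, here is a minimal, hypothetical pandas sketch of two fastai-style procs. It does not reproduce fastai's actual implementation; the function names and the unseen-category sentinel (-1) are illustrative assumptions. It shows how both the category ordering and the creation of a `_na` column are decided entirely by the training data:

```python
import pandas as pd

def fit_categorify(train_df, cat_cols):
    # Category codes follow the order of appearance in the TRAINING data only.
    return {c: {v: i for i, v in enumerate(pd.unique(train_df[c].dropna()))}
            for c in cat_cols}

def fit_fill_missing(train_df, cont_cols):
    # A `<col>_na` indicator exists only if the TRAINING column had nulls;
    # the fill value is the training median.
    return {c: {"add_na": bool(train_df[c].isna().any()),
                "fill": float(train_df[c].median())}
            for c in cont_cols}

def apply_procs(df, cat_maps, fill_stats):
    out = df.copy()
    for c, mapping in cat_maps.items():
        # Categories unseen at training time map to a sentinel (-1 here).
        out[c] = out[c].map(mapping).fillna(-1).astype(int)
    for c, st in fill_stats.items():
        if st["add_na"]:
            out[c + "_na"] = out[c].isna()
        out[c] = out[c].fillna(st["fill"])
    return out

train = pd.DataFrame({"color": ["b", "a", "b"], "x": [1.0, None, 3.0]})
new   = pd.DataFrame({"color": ["a", "c"],      "x": [None, 5.0]})

cat_maps   = fit_categorify(train, ["color"])
fill_stats = fit_fill_missing(train, ["x"])
processed  = apply_procs(new, cat_maps, fill_stats)
```

Run against a different training frame (no nulls, categories in another order) and `processed` changes shape and codes, which is exactly the reproducibility problem described above.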
Describe the solution you'd like
I would like a way to export the transform logic of a TabularPandas object, then import it whenever I want to process a dataframe into a TabularPandas object that can be used for inference.
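A hedged sketch of what that export/import could look like, assuming (this is not fastai's API) that the fitted state can be captured in plain dicts and persisted with the standard-library pickle module. All function names here are hypothetical:

```python
import pickle
import pandas as pd

def fit(train_df, cat_cols, cont_cols):
    # Capture everything the transforms need in a plain, picklable dict.
    return {
        "cats": {c: {v: i for i, v in enumerate(pd.unique(train_df[c].dropna()))}
                 for c in cat_cols},
        "fills": {c: float(train_df[c].median()) for c in cont_cols},
    }

def transform(df, state):
    out = df.copy()
    for c, mapping in state["cats"].items():
        out[c] = out[c].map(mapping).fillna(-1).astype(int)
    for c, fill in state["fills"].items():
        out[c] = out[c].fillna(fill)
    return out

train = pd.DataFrame({"color": ["b", "a"], "x": [1.0, 3.0]})
state = fit(train, ["color"], ["x"])

blob = pickle.dumps(state)      # "export" once, at training time
restored = pickle.loads(blob)   # "import" months later; no training data needed

new = pd.DataFrame({"color": ["a"], "x": [None]})
result = transform(new, restored)
```

The point is that inference only ever touches `restored`, never the original training dataframe.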
Describe alternatives you've considered
I have not been able to get my alternative to work without a pretty large training dataset being processed each time I want to do inference. Go-live for my project is coming up pretty quickly, so I am moving the project off of fastai because of this.
The workaround I was attempting was to keep a static dummy training set that gets processed together with any new data, so that I have both a 'training' and a validation set. I was attempting to doctor the training set to ensure the right columns have null values, categories appear in the right order, and so on. Then I create the TabularPandas object and do inference on the validation set (the new data). I have spent a good chunk of time trying to get this to work without a massive training set being reprocessed repeatedly, but I have to abandon those efforts for this project as I am on a time constraint.
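The dummy-training-set workaround described above can be sketched like this with pandas; the column names and values are purely illustrative, and the actual shared processing step is elided:

```python
import pandas as pd

# A tiny static "training" frame, doctored so every continuous column has at
# least one null (forcing the _na indicator to exist) and categories appear
# in a fixed order (pinning the category codes).
dummy_train = pd.DataFrame({
    "color": ["red", "green", "blue"],   # fixes category order
    "amount": [10.0, None, 30.0],        # guarantees a null -> amount_na exists
})

new_data = pd.DataFrame({"color": ["green"], "amount": [25.0]})

# Concatenate so a shared pipeline sees consistent "training" state each run.
combined = pd.concat([dummy_train, new_data], ignore_index=True)
split_idx = len(dummy_train)  # rows >= split_idx are the "validation" set

# ...process `combined` with the shared pipeline, then keep only the new rows:
inference_rows = combined.iloc[split_idx:]
```

This works, but as the issue notes, it forces the dummy frame to be reprocessed on every inference run, which is exactly the cost exporting the fitted transforms would avoid.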
Additional context
The fastai book's tabular chapter, where a RandomForest is trained, illustrates this well. If you need to load the model up a month later to do inference using only the random forest, how would you process the new data?
Ideally, I would like to avoid dataloaders (as they don't give me anything for this problem). I would also like to avoid processing extra 'training' data, since I only want to do inference and it really shouldn't be necessary.