TabularPandas data transform reproducibility #2826
@jph00 this'll be something I want to pick up and implement, as I could easily see its value when folks want to use fastai tabular for preprocessing, then use it with other libraries and move into production. In my head I view it as something like …
Think this one was completed in #2857
Yup! This can be closed :)
I think this feature may have been deprecated or never merged. The …
@HenryDashwood I've done so on Walk with fastai, see here: https://walkwithfastai.com/tab.export (sadly the PR got lost to time, so it didn't quite get merged 😢)
Yeah that's what I've gone with. Seems to work!
Feature requests should first be proposed on the forum.
Link to forum discussion.
Zach said it's useful enough to put in a PR without waiting for a forum discussion. He OK'd me calling him out in this way in the PR :)
I did post this in the fastai v2 tabular thread here: https://forums.fast.ai/t/fastai-v2-tabular/53530/235
Is your feature request related to a problem? Please describe.
I have trained a model using XGBoost, but I did the data processing for the training and validation sets using TabularPandas (similar to the approach in the fastai book). I did not use dataloaders or a learner object. Now I need to use it for monthly inference, but the only way I can get the data to process properly is to have both a training and a validation set. I just want to apply the transforms to the validation set the same way each month. For example, I believe whether a _na column is created depends on whether the data given to the processor contains any null values, and the order in which categories appear in the training set also matters for Categorify.
For inference, I just want to process the "validation set" and make my predictions.
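To make the dependency on training data concrete, here is a minimal, hypothetical pandas sketch of two fastai-style procs. It does not reproduce fastai's actual implementation; the function names and the unseen-category sentinel (-1) are illustrative assumptions. It shows how both the category ordering and the creation of a `_na` column are decided entirely by the training data:

```python
import pandas as pd

def fit_categorify(train_df, cat_cols):
    # Category codes follow the order of appearance in the TRAINING data only.
    return {c: {v: i for i, v in enumerate(pd.unique(train_df[c].dropna()))}
            for c in cat_cols}

def fit_fill_missing(train_df, cont_cols):
    # A `<col>_na` indicator exists only if the TRAINING column had nulls;
    # the fill value is the training median.
    return {c: {"add_na": bool(train_df[c].isna().any()),
                "fill": float(train_df[c].median())}
            for c in cont_cols}

def apply_procs(df, cat_maps, fill_stats):
    out = df.copy()
    for c, mapping in cat_maps.items():
        # Categories unseen at training time map to a sentinel (-1 here).
        out[c] = out[c].map(mapping).fillna(-1).astype(int)
    for c, st in fill_stats.items():
        if st["add_na"]:
            out[c + "_na"] = out[c].isna()
        out[c] = out[c].fillna(st["fill"])
    return out

train = pd.DataFrame({"color": ["b", "a", "b"], "x": [1.0, None, 3.0]})
new   = pd.DataFrame({"color": ["a", "c"],      "x": [None, 5.0]})

cat_maps   = fit_categorify(train, ["color"])
fill_stats = fit_fill_missing(train, ["x"])
processed  = apply_procs(new, cat_maps, fill_stats)
```

Run against a different training frame (no nulls, categories in another order) and `processed` changes shape and codes, which is exactly the reproducibility problem described above.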
Describe the solution you'd like
I would like a way to export the transform logic of a TabularPandas object, then import it whenever I want to process a dataframe into a TabularPandas object that can be used for inference.
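A hedged sketch of what that export/import could look like, assuming (this is not fastai's API) that the fitted state can be captured in plain dicts and persisted with the standard-library pickle module. All function names here are hypothetical:

```python
import pickle
import pandas as pd

def fit(train_df, cat_cols, cont_cols):
    # Capture everything the transforms need in a plain, picklable dict.
    return {
        "cats": {c: {v: i for i, v in enumerate(pd.unique(train_df[c].dropna()))}
                 for c in cat_cols},
        "fills": {c: float(train_df[c].median()) for c in cont_cols},
    }

def transform(df, state):
    out = df.copy()
    for c, mapping in state["cats"].items():
        out[c] = out[c].map(mapping).fillna(-1).astype(int)
    for c, fill in state["fills"].items():
        out[c] = out[c].fillna(fill)
    return out

train = pd.DataFrame({"color": ["b", "a"], "x": [1.0, 3.0]})
state = fit(train, ["color"], ["x"])

blob = pickle.dumps(state)      # "export" once, at training time
restored = pickle.loads(blob)   # "import" months later; no training data needed

new = pd.DataFrame({"color": ["a"], "x": [None]})
result = transform(new, restored)
```

The point is that inference only ever touches `restored`, never the original training dataframe.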
Describe alternatives you've considered
I have not been able to get my alternative to work without a pretty large training dataset being processed each time I want to do inference. Go-live for my project is coming up pretty quickly, so I am moving the project off of fastai because of this.
The workaround I was attempting was to keep a static dummy training set that gets processed together with any new data, so that I have both a 'training' and a validation set. I was attempting to doctor the training set to ensure the right columns have null values, categories appear in the right order, and so on. Then I create the TabularPandas object and do inference on the validation set (the new data). I have spent a good chunk of time trying to get this to work without a massive training set being reprocessed repeatedly, but I have to abandon those efforts for this project as I am on a time constraint.
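The dummy-training-set workaround described above can be sketched like this with pandas; the column names and values are purely illustrative, and the actual shared processing step is elided:

```python
import pandas as pd

# A tiny static "training" frame, doctored so every continuous column has at
# least one null (forcing the _na indicator to exist) and categories appear
# in a fixed order (pinning the category codes).
dummy_train = pd.DataFrame({
    "color": ["red", "green", "blue"],   # fixes category order
    "amount": [10.0, None, 30.0],        # guarantees a null -> amount_na exists
})

new_data = pd.DataFrame({"color": ["green"], "amount": [25.0]})

# Concatenate so a shared pipeline sees consistent "training" state each run.
combined = pd.concat([dummy_train, new_data], ignore_index=True)
split_idx = len(dummy_train)  # rows >= split_idx are the "validation" set

# ...process `combined` with the shared pipeline, then keep only the new rows:
inference_rows = combined.iloc[split_idx:]
```

This works, but as the issue notes, it forces the dummy frame to be reprocessed on every inference run, which is exactly the cost exporting the fitted transforms would avoid.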
Additional context
The fastai book's tabular chapter, where a RandomForest is trained, illustrates this well. If you need to load the model up a month later to do inference using only the random forest, how would you process the new data?
Ideally, I would like to avoid dataloaders (as they don't give me anything for this problem). I would also like to avoid processing extra 'training' data, since I only want to do inference and it really shouldn't be necessary.