Working with huge datasets #456

Hello,
more than a feature request, this is a request for advice.
I have to train a tabular model on a huge dataset (more than 10 million rows), and I am not able to fit it entirely into memory as a DataFrame.
I would like to use the entire dataset for train/test/val rather than only a subset, and I wanted to know how you would suggest operating in this case.
An alternative I have considered is a custom dataloader that loads into memory only the requested batch, given a list of ids, but I don't know where to start or what I would actually need to modify or implement.
Some help would really be appreciated, thank you!

In the current way of doing things, this isn't supported; there are a few things that block this use case.
For this to be done, we would first need to move data processing to something like Polars or Spark, which support out-of-core processing, and then change the dataloaders to load only the batch we need into memory. Different options available are: […]
I personally don't have enough time on my hands to take on such a large undertaking, but I would gladly guide anyone who wants to pick it up.
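
To make that concrete, here is a minimal sketch of what the dataloader side could look like, assuming the table has already been exported to Parquet and that Polars is used for the lazy scan. The file name, column names, and batch size are placeholders, none of this exists in the library today, and it also happens to implement the "load only the requested batch" idea from the question above.

```python
# Sketch only: an out-of-core, batch-wise dataset backed by a Parquet file.
# Assumes all feature columns are numeric; "train.parquet", the column names,
# and the batch size are placeholders, not anything from this repo.
import polars as pl
import torch
from torch.utils.data import DataLoader, Dataset


class ParquetBatchDataset(Dataset):
    """Map-style dataset where each item is one full batch, read lazily from disk."""

    def __init__(self, path, feature_cols, target_col, batch_size=1024):
        self.path = path
        self.feature_cols = feature_cols
        self.target_col = target_col
        self.batch_size = batch_size
        # Only the row count is materialised up front, never the full table.
        self.n_rows = pl.scan_parquet(path).select(pl.len()).collect().item()

    def __len__(self):
        # Number of batches, not number of rows.
        return (self.n_rows + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):
        offset = idx * self.batch_size
        # slice() on a LazyFrame pushes the row selection down to the Parquet
        # reader, so only this one batch is ever loaded into memory.
        batch = pl.scan_parquet(self.path).slice(offset, self.batch_size).collect()
        x = torch.from_numpy(batch.select(self.feature_cols).to_numpy()).float()
        y = torch.from_numpy(batch.select(self.target_col).to_numpy()).float()
        return x, y


# batch_size=None because the dataset already yields whole batches;
# shuffle=True then shuffles the order of the batches, not rows within them.
loader = DataLoader(
    ParquetBatchDataset("train.parquet", ["num_feat_1", "num_feat_2"], "target"),
    batch_size=None,
    shuffle=True,
)
```

Shuffling at the batch level rather than the row level is the usual trade-off for out-of-core training; row-level shuffling would require either shuffling the Parquet file offline or an id-based lookup per row.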

Hello, sorry to bother you again […]

Of course. There is a […]

I read the documentation, but I am having some trouble loading only the model and predicting on a new row: I get different results because I don't know how to correctly process the data before passing it to the model.
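
If this is the standard PyTorch Tabular workflow, the key point is that predict() expects the raw, untransformed columns that were used at fit time: the saved model carries its own preprocessing, so no manual scaling or encoding should be applied to the new row. A minimal sketch, assuming the TabularModel.save_model / load_model / predict methods (check the docs for your installed version) and with placeholder paths and column names:

```python
# Sketch only: reload a trained model and score a single raw row.
# "saved_model" and the column names below are placeholders.
import pandas as pd
from pytorch_tabular import TabularModel

# After training, persist the model together with its config and the fitted
# preprocessing state:
#   tabular_model.save_model("saved_model")

# Later, in a fresh process:
loaded_model = TabularModel.load_model("saved_model")

# One new row, with the same column names (and dtypes) as the training
# DataFrame -- predict() re-applies the transformations fitted during training.
new_row = pd.DataFrame([{"num_feat_1": 0.42, "cat_feat_1": "A"}])

predictions = loaded_model.predict(new_row)
print(predictions)
```

If the results still differ from the training-time predictions, the usual culprits are mismatched dtypes (e.g. a categorical column passed as an integer) or rows that were normalised by hand before being passed to predict().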