Working with huge datasets #456

Open

santurini opened this issue Jun 7, 2024 · 4 comments

@santurini

Hello,
this is more of an advice request than a feature request.

I have to train a tabular model on a huge dataset (more than 10 million rows), and I am not able to fit it entirely into memory as a DataFrame.
I would like to use the entire dataset for train/test/val rather than a subset, and I wanted to know how you would suggest operating in this case.

An alternative I've considered is a custom dataloader that loads into memory only the requested batch, given a list of ids, but I don't know where to start or what I should actually modify or implement.
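
To make the idea concrete, this is roughly what I have in mind (just a minimal sketch, assuming the data sits in a single Parquet file; the file name, the column names and the LazyParquetDataset class are all illustrative):

```python
import pyarrow.parquet as pq
import torch
from torch.utils.data import DataLoader, Dataset


class LazyParquetDataset(Dataset):
    """Map-style dataset that keeps at most one Parquet row group in memory."""

    def __init__(self, path, feature_cols, target_col):
        self.file = pq.ParquetFile(path)
        self.feature_cols = feature_cols
        self.target_col = target_col
        # Only metadata is read here, not the data itself
        self.group_sizes = [
            self.file.metadata.row_group(i).num_rows
            for i in range(self.file.num_row_groups)
        ]
        self._cached_group = None
        self._cached_table = None

    def __len__(self):
        return sum(self.group_sizes)

    def _locate(self, idx):
        # Map a global row index to (row_group, offset within the group)
        for group, size in enumerate(self.group_sizes):
            if idx < size:
                return group, idx
            idx -= size
        raise IndexError(idx)

    def __getitem__(self, idx):
        group, offset = self._locate(idx)
        if group != self._cached_group:
            # Load only the row group that contains the requested row
            self._cached_table = self.file.read_row_group(
                group, columns=self.feature_cols + [self.target_col]
            )
            self._cached_group = group
        row = self._cached_table.slice(offset, 1).to_pydict()
        x = torch.tensor([row[c][0] for c in self.feature_cols], dtype=torch.float32)
        y = torch.tensor(row[self.target_col][0], dtype=torch.float32)
        return x, y


# Without shuffling, consecutive indices hit the same cached row group;
# a shuffled sampler would need to shuffle row groups rather than single
# rows to keep I/O reasonable.
dataset = LazyParquetDataset("train.parquet", ["feat_1", "feat_2"], "target")
loader = DataLoader(dataset, batch_size=1024)
```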

Some help would really be appreciated, thank you!

@manujosephv
Owner

In the current way of doing things, this isn't supported.

There are a few things which stop this use-case:

  1. Currently the data processing happens with pandas and numpy, which means the categorical encoding, normalization, etc. all take place in memory.
  2. Even if we managed to do all that processing in a lazy manner and turned off PyTorch Tabular's own handling of these items, the dataset and dataloader would still need to change.

For this to be done, we would first need to move data processing to something like polars or spark, which support out-of-memory processing, and then change the dataloaders to load only the batch we need into memory.

Different options available are:

  1. Adopt nvtabular from Merlin (NVIDIA) as the core data processing unit. (Pros: easy-to-use framework, GPU data processing built in, ready-to-use dataloaders, etc. Cons: the documentation isn't that great and the library is not very mature.)
  2. Adopt polars as the data processing library; a rough sketch of what this could look like follows the list. (Pros: supports out-of-memory processing, uses all cores, blazing fast, etc. Cons: will be difficult to work with on truly huge data, 100M rows and upwards.)
  3. Adopt spark. (Pros: truly distributed. Cons: so much overhead that it is overkill for smaller use cases.)
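
For option 2, the preprocessing side could look roughly like this (just a sketch with made-up file and column names; the real work would be wiring something like it into the datamodule):

```python
import polars as pl

# Lazily scan the file: nothing is loaded into memory at this point
lazy_df = pl.scan_parquet("huge_dataset.parquet")

# "Fit" the preprocessing with a small aggregation query; only the
# aggregated scalars are materialised in memory
stats = lazy_df.select(
    pl.col("amount").mean().alias("mean"),
    pl.col("amount").std().alias("std"),
).collect()
mean, std = stats["mean"][0], stats["std"][0]

# Apply the transformation lazily and stream the result to disk, so the
# full table never has to fit in memory at once
(
    lazy_df
    .with_columns(((pl.col("amount") - mean) / std).alias("amount_scaled"))
    .sink_parquet("processed.parquet")
)
```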

Once the TabularDataModule is re-written, making a dataloader which loads lazily is a simple task.

I personally don't have enough time on my hands to take on such a large undertaking, but I would gladly guide anyone who wants to pick it up.

@santurini
Author

Hello,
I have thought about it these past few days. Is there a way to resume training from a checkpoint? That way I could split the dataset into smaller chunks and resume the training each time I load a different chunk.

Sorry to bother you again!

@manujosephv
Owner

Of course. There are save_model and load_model if you want to save the entire model, including the datamodule etc., and there are save_weights and load_weights if you just want to save checkpoints. The documentation has more details.
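
A minimal sketch of both flavours (the paths are placeholders and `tabular_model` is assumed to be an already-fitted TabularModel; check the docs for exact signatures):

```python
from pytorch_tabular import TabularModel

# `tabular_model` is assumed to be an already-fitted TabularModel instance

# 1. Save / restore everything, including the datamodule that holds the
#    fitted preprocessing (encoders, scalers, ...)
tabular_model.save_model("saved_model")
restored_model = TabularModel.load_model("saved_model")

# 2. Save / restore only the network weights; the configs and the
#    preprocessing have to be recreated separately
tabular_model.save_weights("model_weights.pt")
tabular_model.load_weights("model_weights.pt")
```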

@santurini
Author

I read the documentation, but I am having some trouble loading only the model and predicting on a new row: I get different results because I do not know how to correctly process the data that is passed to the model.
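
For reference, this is roughly what I am doing (the path and the columns are placeholders), and I am not sure this is even the right way to get the training-time preprocessing applied to the new row:

```python
import pandas as pd
from pytorch_tabular import TabularModel

# Load the saved model back from disk (placeholder path)
model = TabularModel.load_model("saved_model")

# A single raw, unprocessed row with the same feature columns used in training
new_row = pd.DataFrame([{"feat_1": 0.5, "feat_2": "A"}])

# My understanding is that predict() should apply the stored preprocessing
# before the forward pass, but this is where my results look off
predictions = model.predict(new_row)
print(predictions)
```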
