
Loading a Big Dataset Needs Too Much RAM #230

Open

fahadh4ilyas opened this issue Nov 21, 2024 · 0 comments

@fahadh4ilyas
I have a server with 8 GPUs and 1 TB of RAM. My parquet dataset is 25 GB (5 million rows). When I use deepspeed to load the data, every process rank loads the dataset, and each rank needs more than 150 GB of RAM just to load the parquet table and convert it to a dictionary of numpy arrays. So the total memory usage just for loading the dataset is 150 * 8 = 1200 GB of RAM.

Is there a way to load the dataset only on the first rank and then share it with the other ranks?
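There are no replies in this thread, but one common workaround for this situation is to let rank 0 do the expensive parquet read once, write the result to a memory-mappable cache file, and have every rank open that file with mmap so the pages are shared by the OS instead of being duplicated per process. The sketch below is only an illustration under those assumptions; the helper name, the cache path, and the column name `input_ids` are placeholders, not part of deepspeed or this repository.

```python
import os

import numpy as np
import pyarrow.parquet as pq
import torch.distributed as dist


def load_ids_rank0_only(parquet_path: str, cache_path: str) -> np.ndarray:
    """Read the parquet table on rank 0 only, save one numeric column to a
    .npy cache, then let every rank open that cache memory-mapped.

    Assumptions for this sketch: cache_path ends with ".npy", is on storage
    visible to all ranks, and "input_ids" is a placeholder column name.
    """
    rank = dist.get_rank() if dist.is_initialized() else 0

    if rank == 0 and not os.path.exists(cache_path):
        table = pq.read_table(parquet_path)  # only rank 0 pays the peak-RAM cost
        np.save(cache_path, table.column("input_ids").to_numpy())

    if dist.is_initialized():
        dist.barrier()  # other ranks wait until the cache file exists

    # memory-mapped read: data stays on disk / in the shared page cache,
    # so 8 processes do not hold 8 private copies in RAM
    return np.load(cache_path, mmap_mode="r")


# hypothetical usage inside each rank's data-loading code:
# ids = load_ids_rank0_only("train.parquet", "/shared/train_ids.npy")
```

The same idea applies if the data has several columns: rank 0 materializes whatever per-rank preprocessing is needed once, and the other ranks read the shared, memory-mapped result instead of re-loading the 25 GB parquet file themselves.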
