How to control GPU RAM usage #9
Comments
At a guess, the likely issue is the vocabulary size of your dataset. What's the vocabulary size for your 3.3GB dataset? The dataset isn't actually kept in the GPU's memory, so it shouldn't impact the model size. The solutions would include an adaptive softmax, which this codebase used to have but which I removed, or reducing the vocabulary size through wordpieces or similar. If you have a large vocabulary then GPU memory will balloon quite rapidly, as it's required for the softmax output at each and every timestep.
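To make that scaling concrete, here is a rough sketch (the sizes and cutoffs below are illustrative, not values from this codebase) of how the logit tensor grows with the vocabulary, and of PyTorch's built-in adaptive softmax:

```python
import torch
import torch.nn as nn

# The softmax output alone is a [seq_len, batch, vocab] float tensor,
# so its memory grows linearly with the vocabulary size.
seq_len, batch, hidden = 140, 16, 1024
for vocab in (100, 50_000):
    logits_bytes = seq_len * batch * vocab * 4  # fp32
    print(f"vocab={vocab:>6}: logits tensor ~ {logits_bytes / 2**20:.1f} MiB")

# PyTorch ships an adaptive softmax; the cutoffs here are arbitrary examples.
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden, n_classes=50_000, cutoffs=[2_000, 10_000], div_value=4.0
)
hidden_states = torch.randn(seq_len * batch, hidden)
targets = torch.randint(0, 50_000, (seq_len * batch,))
print(adaptive(hidden_states, targets).loss.item())  # mean NLL over the batch
```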
It's a char-based LM and the data is lowercased, so aside from the 26 letters, some apostrophes and dashes, plus some monetary symbols, there is nothing else. The vocab size is less than 100.
How can I diagnose this issue?
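One way to check (a sketch using standard PyTorch calls, not something taken from this repo) is to print the allocator statistics right after the model is built and again right after the data is loaded; a big jump at the second point would mean the dataset itself is landing on the GPU:

```python
import torch

def report_gpu_memory(tag: str, device: int = 0) -> None:
    """Print a quick snapshot of what PyTorch has allocated on this GPU."""
    alloc = torch.cuda.memory_allocated(device) / 2**30
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    print(f"[{tag}] allocated={alloc:.2f} GiB, peak={peak:.2f} GiB")

# e.g. report_gpu_memory("after model"); report_gpu_memory("after data loading")
```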
That's quite odd. Are you able to replicate the initial results on enwik8? I would try doing that first. My GPU only had ~12GB of RAM, so there's no reason you shouldn't be able to do this as far as I'm aware, assuming your data is character level. If you can replicate, then try a 100MB chunk of your dataset; if that still works then potentially I do have a line of code that unexpectedly puts the dataset in GPU memory. If that's the case it's an easy fix of finding that line (like a .cuda()), removing it from the massive dataset, and putting a .cuda() where the snippets of data are loaded for training.
I was able to reproduce the enwik8 results without problems (not the exact BPC published, but very close).
I will try with a smaller sample of my dataset and see. If need be, I'll go and see if there is a .cuda() put in the wrong place.
I had added some print statements in the data loading method; here are the numbers I'm getting (for the entire dataset):
train.txt, 1555434404 tokens
valid.txt, 1978645700 tokens
test.txt, 2375699684 tokens
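The counting itself is trivial; a sketch of the kind of check meant here (names and paths are placeholders, the repo's actual loader may differ):

```python
import os

def count_char_tokens(path: str) -> int:
    """Count characters (one token each for a char-level LM, newlines included)."""
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += len(line)
    return total

data_dir = "."  # placeholder for wherever the splits live
for split in ("train.txt", "valid.txt", "test.txt"):
    print(f"{split}, {count_char_tokens(os.path.join(data_dir, split))} tokens")
```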
Ah, I was wrong. The dataset is loaded into memory; it was a previous version of the codebase that I'd optimized that for, sorry. The fix is to take the dataset's transfer to the GPU out of batchify and instead do it where the individual batches of data are loaded for training. This may slow the training down a little, I'm not certain, as small batches of data will be shuffled back and forth between CPU and GPU, but it will allow you to train without holding the dataset in GPU RAM. You'll obviously need to store it in CPU RAM, however.
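Concretely, the change looks roughly like this (a sketch following the usual batchify / get_batch layout of AWD-LSTM-style code; the exact names and shapes in this repo may differ):

```python
import torch

def batchify(data: torch.Tensor, bsz: int) -> torch.Tensor:
    """Reshape the flat token stream into bsz columns, keeping it on the CPU."""
    nbatch = data.size(0) // bsz
    data = data.narrow(0, 0, nbatch * bsz)
    data = data.view(bsz, -1).t().contiguous()
    return data  # previously the whole tensor was moved to the GPU here

def get_batch(source: torch.Tensor, i: int, bptt: int, device: str = "cuda"):
    """Slice out one BPTT window and move only that slice to the GPU."""
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len].to(device)
    target = source[i + 1:i + 1 + seq_len].reshape(-1).to(device)
    return data, target
```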
Wonderful, thanks, that seems to do the trick! With a smaller dataset and without the fix I was getting roughly 1.6 batches per second; with the larger dataset and the fix you suggested, about 1.5 batches per second. Not bad. Both experiments use the same settings; the large dataset runs on GPU 0 and the "small dataset" on GPU 1, with nvidia-smi showing the memory use of each.
I'm so glad! Sorry about the wild goose / bug chase =] It appears the overhead isn't all that substantial, which is reassuring. The technique of loading individual batches to GPU memory was the approach I used for WikiText-103 as RAM was scarce. Various optimizations could be made, such as loading a number of batches at the same time, but that's likely a little over the top. There are big gains to come from all directions as the model really deserves some optimization love.
For your experiment I would note that the embedding size of 512 will likely limit your model, as that's the size of the LSTM hidden state as well. LSTMs are not as efficient when working with smaller hidden states due to the forget mask recurrence limiting their expressiveness. You should still get reasonable results but it may require some tweaking.
If you're interested in telling me more about what dataset / task you're exploring I'd love to hear it, online or offline :)
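If the per-batch copies ever did become a bottleneck, "loading a number of batches at the same time" could look something like the sketch below (illustrative names, not code from this repo): one host-to-device transfer covers several BPTT windows, which are then sliced on the GPU.

```python
import torch

def get_batches(source: torch.Tensor, i: int, bptt: int, n: int, device: str = "cuda"):
    """Move n consecutive BPTT windows to the GPU in one transfer, then slice
    the individual (data, target) pairs out of that on-device chunk."""
    span = min(n * bptt + 1, len(source) - i)  # +1 for the shifted targets
    chunk = source[i:i + span].to(device)      # a single host-to-device copy
    batches = []
    for j in range(0, span - 1, bptt):
        seq_len = min(bptt, span - 1 - j)
        data = chunk[j:j + seq_len]
        target = chunk[j + 1:j + 1 + seq_len].reshape(-1)
        batches.append((data, target))
    return batches
```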
Thanks for sharing this code! I'd like to try it on my own training dataset, but I keep getting GPU OOM errors.
I've cut the batch size down to 8, the embedding size to 512, nhid to 2048, and nlayers to 2, and I still get the exact same message.
My training dataset is 3.3GB (that's 1/10 of the data I would like to throw at it), so I'm already way over the enwik8 dataset (173MB), and I wonder where I should tweak the model/code...
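In hindsight, a back-of-the-envelope estimate explains why no hyperparameter change helped: if the loader stores each character as a 64-bit token id (as typical PyTorch language-model loaders do, an assumption here) and moves the whole tensor to the device, the dataset alone dwarfs a ~12GB card.

```python
# Rough estimate, not a measurement.
dataset_chars = 3.3e9      # ~3.3 GB of raw text, one token per character
bytes_per_token = 8        # int64 token ids (torch.LongTensor)
gpu_ram = 12e9             # a typical ~12 GB card

dataset_bytes = dataset_chars * bytes_per_token
print(f"dataset as int64 ids: {dataset_bytes / 1e9:.1f} GB "
      f"({dataset_bytes / gpu_ram:.1f}x a 12 GB GPU)")
# -> ~26.4 GB, which no reduction in batch size, nhid, or nlayers can rescue;
#    keeping the dataset in CPU RAM (as above) is the fix.
```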