ASR Task Failing due to CUDA Memory Issue - How to introduce Lightning Fabric support for Lightning Flash Tasks? #1657
greeshmasmenon asked this question in Q&A
Hi,
I am trying to fine-tune the Wav2vec2 model ("facebook/wav2vec2-large-960h-lv60-self") with custom data that I have.

GPU: Tesla V100-SXM2-16GB
Number of GPUs: 8 (2 nodes of 4 each)
Shape of the audio dataset (each audio segment is roughly 3 seconds long):
{ "training": [ 133328, 4 ], "validation": [ 33332, 3 ] }

The training arguments are below:
I am getting a CUDA out-of-memory error:
I would like to use a distributed training approach, since I have already gone down to a BATCH_SIZE of 1 with ACCUMULATE_GRAD_BATCHES = 1 and don't know how to reduce the data loaded onto each GPU any further. Any advice here would be appreciated.
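For reference, this is roughly the configuration I have in mind. It is only a sketch: the backbone string is the real one I use, but the CSV paths and column names are placeholders, and I am assuming that the Flash `Trainer` forwards `strategy`/`precision` to the underlying Lightning `Trainer` and that a DeepSpeed/sharded strategy is available in this setup.

```python
import flash
from flash.audio import SpeechRecognition, SpeechRecognitionData

# Placeholder file names/columns -- my real data module is built the same way.
datamodule = SpeechRecognitionData.from_csv(
    "file",
    "text",
    train_file="train.csv",
    val_file="valid.csv",
    batch_size=1,
)

model = SpeechRecognition(backbone="facebook/wav2vec2-large-960h-lv60-self")

# Assumption: Flash's Trainer passes these arguments through to pytorch_lightning.Trainer,
# so sharded training and mixed precision can be enabled to cut per-GPU memory.
trainer = flash.Trainer(
    max_epochs=10,
    accelerator="gpu",
    devices=4,
    num_nodes=2,
    strategy="deepspeed_stage_2",  # shard optimizer state across the 8 GPUs
    precision=16,                  # mixed precision roughly halves activation memory
    accumulate_grad_batches=1,
)

trainer.finetune(model, datamodule=datamodule, strategy="no_freeze")
```

Is passing a sharded strategy through the Flash `Trainer` like this supported, or does it need to be wired up differently for the speech recognition task?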
Also, I want to start looking at Lightning Fabric and introduce its parallel-training features to see whether they solve my problem, but with Flash's high-level interfaces I am not sure where to start. Can someone guide me on how to do this?
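From what I understand of the Fabric docs, a plain Fabric loop would look roughly like the sketch below. The checkpoint is the real one I use; `make_train_dataloader()` is just a placeholder for my existing DataLoader of padded audio batches, and the learning rate, strategy, and precision values are guesses rather than settings I have validated.

```python
import torch
from lightning.fabric import Fabric
from transformers import Wav2Vec2ForCTC

fabric = Fabric(
    accelerator="cuda",
    devices=4,
    num_nodes=2,
    strategy="ddp",        # or "deepspeed"/"fsdp" to also shard model/optimizer state
    precision="16-mixed",
)
fabric.launch()

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Placeholder: stands in for my existing DataLoader of padded (input_values, labels) batches.
train_dataloader = make_train_dataloader()

# Fabric wraps the model/optimizer/dataloader for the chosen devices and strategy.
model, optimizer = fabric.setup(model, optimizer)
train_dataloader = fabric.setup_dataloaders(train_dataloader)

model.train()
for batch in train_dataloader:
    optimizer.zero_grad()
    outputs = model(input_values=batch["input_values"], labels=batch["labels"])
    fabric.backward(outputs.loss)  # replaces loss.backward()
    optimizer.step()
```

What I cannot tell from the docs is whether I can hand a Flash SpeechRecognition task to Fabric directly, or whether I would have to drop down to the raw Hugging Face model as above.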
Logs: