QLoRA Worse Memory When linear_nf4 Used on Output #1433

Open
pbontrager opened this issue Dec 18, 2024 · 2 comments

@pbontrager

When I replace the output layer for Llama 3.1 70B

nn.Linear(8192, 128_256, bias=False) with FrozenNF4Linear(8192, 128_256, bias=False)

in torchtune, I surprisingly end up using a lot more memory. Leaving the output layer in bf16 results in the training run using ~43 GB of peak active memory, while quantizing the output results in ~52 GB active. I wonder if this is due to the large size of the output layer.

Steps to reproduce:

  • Replace nn.Linear with FrozenNF4Linear in the model here (FrozenNF4Linear is just a linear_nf4 wrapper)
  • tune config here
  • command: tune run lora_finetune_single_device --config ./70B_qlora_long_context.yaml tokenizer.max_seq_len=8192
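A minimal standalone sketch of the comparison, outside the full recipe, would isolate the output projection itself: one forward/backward through the bf16 layer vs. the NF4 layer at the shapes and sequence length above. The FrozenNF4Linear import path and its device/dtype keywords are assumptions here and may need adjusting to match torchtune; the layer sizes (8192 → 128_256) and seq_len=8192 come from the report and the config override.

```python
# Sketch only: compare peak CUDA memory for a single forward + backward
# through the 70B output projection in bf16 vs. NF4.
import torch
import torch.nn as nn

# Assumed import path for torchtune's linear_nf4 wrapper; adjust if it differs.
from torchtune.modules.low_precision import FrozenNF4Linear


def peak_mem_gib(layer: nn.Module, seq_len: int = 8192, dim: int = 8192) -> float:
    """One forward/backward with a bf16 activation; returns peak allocated memory in GiB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    # requires_grad on the input so backward runs even though the weight is frozen
    x = torch.randn(1, seq_len, dim, dtype=torch.bfloat16, device="cuda", requires_grad=True)
    layer(x).sum().backward()
    return torch.cuda.max_memory_allocated() / (1024 ** 3)


bf16_out = nn.Linear(8192, 128_256, bias=False, dtype=torch.bfloat16, device="cuda")
bf16_out.weight.requires_grad_(False)  # mimic the frozen base weight in (Q)LoRA

# device/dtype kwargs assumed to be accepted by the constructor
nf4_out = FrozenNF4Linear(8192, 128_256, bias=False, device="cuda", dtype=torch.bfloat16)

print(f"bf16 output layer peak: {peak_mem_gib(bf16_out):.2f} GiB")
print(f"nf4  output layer peak: {peak_mem_gib(nf4_out):.2f} GiB")
```

If the gap shows up even in this isolated setting, it would point at the NF4 forward/backward for this very wide layer (e.g. what gets saved for backward) rather than anything specific to the recipe.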
pbontrager changed the title from "QLoRA Worse Memory When FrozenNF4Linear used on output" to "QLoRA Worse Memory When linear_nf4 Used on Output" on Dec 18, 2024
@supriyar
Contributor

cc @andrewor14 @drisspg

@drisspg
Contributor

drisspg commented Jan 2, 2025

I can take a look
