QLoRA Worse Memory When linear_nf4 Used on Output #1433

Open
pbontrager opened this issue Dec 18, 2024 · 2 comments

@pbontrager

When I replace the output layer for Llama 3.1 70B

nn.Linear(8192, 128_256, bias=False) with FrozenNF4Linear(8192, 128_256, bias=False)

in torchtune, I surprisingly end up using a lot more memory. Leaving the output layer in bf16 results in the training run using ~43 GB of peak active memory, while quantizing the output results in ~52 GB active. I wonder if this is due to the large size of the output layer.

Steps to reproduce:

  • Replace nn.Linear with FrozenNF4Linear in the model here (FrozenNF4Linear is just a linear_nf4 wrapper)
  • tune config here
  • command: tune run lora_finetune_single_device --config ./70B_qlora_long_context.yaml tokenizer.max_seq_len=8192
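A minimal standalone sketch of the comparison, outside the full recipe, would isolate the output projection itself: one forward/backward through the bf16 layer vs. the NF4 layer at the shapes and sequence length above. The FrozenNF4Linear import path and its device/dtype keywords are assumptions here and may need adjusting to match torchtune; the layer sizes (8192 → 128_256) and seq_len=8192 come from the report and the config override.

```python
# Sketch only: compare peak CUDA memory for a single forward + backward
# through the 70B output projection in bf16 vs. NF4.
import torch
import torch.nn as nn

# Assumed import path for torchtune's linear_nf4 wrapper; adjust if it differs.
from torchtune.modules.low_precision import FrozenNF4Linear


def peak_mem_gib(layer: nn.Module, seq_len: int = 8192, dim: int = 8192) -> float:
    """One forward/backward with a bf16 activation; returns peak allocated memory in GiB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    # requires_grad on the input so backward runs even though the weight is frozen
    x = torch.randn(1, seq_len, dim, dtype=torch.bfloat16, device="cuda", requires_grad=True)
    layer(x).sum().backward()
    return torch.cuda.max_memory_allocated() / (1024 ** 3)


bf16_out = nn.Linear(8192, 128_256, bias=False, dtype=torch.bfloat16, device="cuda")
bf16_out.weight.requires_grad_(False)  # mimic the frozen base weight in (Q)LoRA

# device/dtype kwargs assumed to be accepted by the constructor
nf4_out = FrozenNF4Linear(8192, 128_256, bias=False, device="cuda", dtype=torch.bfloat16)

print(f"bf16 output layer peak: {peak_mem_gib(bf16_out):.2f} GiB")
print(f"nf4  output layer peak: {peak_mem_gib(nf4_out):.2f} GiB")
```

If the gap shows up even in this isolated setting, it would point at the NF4 forward/backward for this very wide layer (e.g. what gets saved for backward) rather than anything specific to the recipe.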
pbontrager changed the title from "QLoRA Worse Memory When FrozenNF4Linear used on output" to "QLoRA Worse Memory When linear_nf4 Used on Output" on Dec 18, 2024
@supriyar
Contributor

cc @andrewor14 @drisspg

@drisspg
Contributor

drisspg commented Jan 2, 2025

I can take a look
