Expected is_sm80 to be true, but got false #101
Comments
@t-vi does this ring any bells?
Unfortunately not, but I'll be sure to dig into it.
If we could get rid of that complex op in the RoPE implementation and still match the results, it would unblock a ton (see test_rope.py).
There is a known and fixed upstream bug about this check; maybe try a nightly?
But I can expand the RoPE to use reals if that helps. It gets rid of the stupid warning, too.
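For reference, expanding RoPE to reals amounts to replacing the complex multiply `(x0 + i·x1) · e^{i·pos·θ}` with an explicit 2D rotation applied to consecutive channel pairs. A minimal, framework-free sketch of the idea (the function name, pairing convention, and `base` default are illustrative assumptions, not this repo's code):

```python
import math

def rope_real(x, pos, base=10000.0):
    """Apply rotary position embedding to one vector using only real
    arithmetic (cos/sin rotation on consecutive pairs) instead of a
    complex multiply. `x` is a flat list of even length."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # per-pair rotation frequency
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x0, x1 = x[i], x[i + 1]
        # (x0 + i*x1) * e^{i*pos*theta}, expanded into real components
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out
```

Since each pair is only rotated, the vector norm is preserved and position 0 is the identity, which gives an easy equivalence check against the complex implementation.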
Thanks @t-vi
Closing as the nightly has solved it, and we reference the workaround in the README.
Hi all, I was still encountering this with PyTorch nightly (as of 2023-04-13) on an A10 while running LoRA finetuning. As a temporary fix, I found that disabling the flash attention backend for the scaled dot-product attention calculation around the loss computation resolved the issue:

```python
with torch.cuda.amp.autocast(enabled=True, dtype=torch.bfloat16), \
        torch.backends.cuda.sdp_kernel(enable_flash=False):
    input_ids, targets = get_batch(fabric, train_data)
    logits = model(input_ids)
    loss = loss_fn(logits, targets)
    fabric.backward(loss)
```
Oh interesting, thanks for bringing this up @AurelienSaussay
I imagine the same issue comes up with LLaMA-Adapter on the A10, can you confirm?
Also, the autocast part should already be taken care of by
Do you confirm, @awaelchli?
Yes, I downgraded to torch 2.0 and was able to prevent the issue with
So let's add this line (commented out) to the scripts, and mention in the README to uncomment it if that error comes up.
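An alternative to shipping a commented-out line is a small opt-in helper, so users flip a flag instead of editing the script. A sketch, assuming PyTorch >= 2.0 for `torch.backends.cuda.sdp_kernel` (the helper name is hypothetical, not this repo's API):

```python
from contextlib import nullcontext

def maybe_disable_flash_sdp(disable: bool = False):
    """Hypothetical helper: when `disable` is True, return a context
    manager that turns off the flash attention SDP backend (the
    workaround for the is_sm80 assertion); otherwise return a no-op
    context so the default backend selection is untouched."""
    if disable:
        import torch  # assumes PyTorch >= 2.0
        return torch.backends.cuda.sdp_kernel(enable_flash=False)
    return nullcontext()

# Usage in the training step (sketch):
# with maybe_disable_flash_sdp(disable=True):
#     logits = model(input_ids)
```

This keeps the default behavior unchanged and makes the workaround discoverable from the script's arguments rather than from a README note alone.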
Can anyone help me with this?

```python
{'eval_interval': 600, 'save_interval': 1000, 'eval_iters': 100,
 'log_interval': 1, 'devices': 1, 'learning_rate': 0.003,
 'batch_size': 128.0, 'micro_batch_size': 2,
 'gradient_accumulation_iters': 64.0, 'epoch_size': 50000,
 'num_epochs': 5, 'max_iters': 125000, 'weight_decay': 0.02,
 'warmup_steps': 781.0}
```
I tried running the finetuning scripts on a 3090 GPU and got this error:
This was on the branch of #100 where I added the
`EmptyInitOnDevice()`
context manager. It looks like the conversion to complex_dtype caused problems in the backward. Both
and
fail with this error.