Conversation

@tymat commented Jun 28, 2025

  • Fix LayerNorm.forward() to use tensor operations instead of scalar operations
  • Replace sum_keepdim()/size with mean_keepdim() to preserve gradients
  • Use broadcast_add() with epsilon tensor instead of scalar addition
  • Fix ops::layer_norm_slow() with same gradient-preserving changes
  • Update ops::layer_norm() to use slow implementation for proper gradients
  • Add comprehensive gradient flow test (now passes with 100% gradient flow)
  • Add numerical equivalence test to ensure accuracy is maintained
  • Fix training issues where LayerNorm parameters weren't being updated

Resolves the gradient propagation bug where only 33% of parameters received gradients during backpropagation, preventing proper model training (#3011). A rough sketch of the tensor-op forward pass and of the gradient-flow check is included below.
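A minimal sketch of the gradient-preserving forward pass described above, assuming candle's broadcasting tensor API; the function name and signature are illustrative, not the exact code in this PR:

```rust
use candle_core::{Result, Tensor, D};

/// Illustrative gradient-preserving LayerNorm forward over the last dimension.
fn layer_norm_forward(x: &Tensor, weight: &Tensor, bias: &Tensor, eps: f64) -> Result<Tensor> {
    // Mean as a tensor op (mean_keepdim) so autograd keeps tracking it,
    // instead of sum_keepdim() divided by a plain scalar size.
    let mean = x.mean_keepdim(D::Minus1)?;
    let centered = x.broadcast_sub(&mean)?;
    // Variance over the last dimension, again kept as a tensor.
    let var = centered.sqr()?.mean_keepdim(D::Minus1)?;
    // Epsilon added as a broadcastable tensor rather than a detached scalar.
    let eps_t = Tensor::new(eps, x.device())?.to_dtype(x.dtype())?;
    let denom = var.broadcast_add(&eps_t)?.sqrt()?;
    let normalized = centered.broadcast_div(&denom)?;
    // Scale and shift with the learnable parameters.
    normalized.broadcast_mul(weight)?.broadcast_add(bias)
}
```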

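And a hedged sketch of the kind of gradient-flow check the new test adds, reusing the illustrative `layer_norm_forward` above; the shapes and loss here are assumptions, not the PR's exact test:

```rust
use candle_core::{DType, Device, Result, Tensor, Var};

/// Checks that both LayerNorm parameters receive gradients after backprop.
fn check_layer_norm_gradients() -> Result<()> {
    let dev = Device::Cpu;
    let weight = Var::ones(8, DType::F32, &dev)?;
    let bias = Var::zeros(8, DType::F32, &dev)?;
    let x = Tensor::randn(0f32, 1f32, (4, 8), &dev)?;

    let y = layer_norm_forward(&x, weight.as_tensor(), bias.as_tensor(), 1e-5)?;
    // Reduce to a scalar loss and backpropagate.
    let loss = y.sqr()?.mean_all()?;
    let grads = loss.backward()?;

    // With the tensor-op forward pass, both parameters should get gradients.
    assert!(grads.get(&weight).is_some(), "no gradient flowed to weight");
    assert!(grads.get(&bias).is_some(), "no gradient flowed to bias");
    Ok(())
}
```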
@AlpineVibrations

Great. It would be awesome to have more training code examples and workflows with candle.

@ivarflakstad (Member)

Hey! Thanks for this :)

I think we'll have to implement this in the optimized kernels as well before we can merge.
I assume all the variants (CPU, CUDA, Metal) suffer from the same issue?
