Status of prototype features #1807

Open · 2 of 12 tasks
msaroufim opened this issue Mar 1, 2025 · 4 comments
Labels: rfc, topic: deprecation, tracker


msaroufim (Member) commented Mar 1, 2025

I was parsing through our prototype folder and wanted to give my take on what should be promoted, what should be deleted, and what requires further discussion.

  • spinquant, awq, autoround, hqq: These are all algorithm implementations, and if we are convinced they are correct and pass reference checks against the original repos, we should promote them out of prototype. In particular, the benefits we should lean into are accelerated performance with torch.compile and serialization support with the HF hub (see the sketch after this list) @jerryzh168
  • DORA: This technique didn't pick up as much traction as the low-bit optimizers. The idea of low rank as a form of compression is interesting, but I would lean toward deleting this, although I could be convinced that keeping some latent-space optimizers around is relevant considering MLA is so hot right now. Delete DORA #1815
  • Profiler: This never worked with torch.compile, so it has limited utility for us unless @jeromeku can fix that; if not, we can delete it
  • float8nocompile: This doesn't feel like it should be a prototype feature, and I'd like to hear some detail on the promotion plan @danielvegamyhre
  • common: This folder probably shouldn't exist
  • quantized_training: This should remain in prototype because it primarily targets older or consumer GPUs; concretely, as our focus moves to Blackwell there won't be much of a difference between dtypes for inference vs training. Granted, stochastic rounding should be promoted as a utility (see the sketch after this list)
  • low_bit_optim: This is great work, we should promote it out of prototype
  • Split_k: This solves a very narrow problem, so it should be deleted in favor of using inductor's matmul templates. It was meant more as an example of how to ship Triton kernels with ao, which isn't hard anyway because it's all JIT. Remove split_k kernel #1816
  • mx_formats: With Blackwell out, this should be promoted out of prototype
  • Sparsity: 2:4 sparsity for inference should be moved out of prototype; it's likely going to continue being relevant, especially for future Flash Attention implementations
  • Kernel: This is a kernel autotuner; it should be deleted since we can just rely on inductor's max-autotune mechanism
  • dtypes: This is mostly for BitNet support; it should either be deleted or refactored into quantized_training
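
For the spinquant/awq/autoround/hqq point, here is a rough sketch of what the promoted flow could look like: quantize with one of these algorithms, save the weights for hub serialization, and compile for speed. The names below (e.g. `int4_weight_only(use_hqq=True)` as the HQQ entry point) reflect my understanding of the current torchao API and may not match the eventual promoted entry points for the other algorithms:

```python
# Hedged sketch only: exact configs for spinquant/awq/autoround may differ;
# HQQ via int4_weight_only(use_hqq=True) is used as the example here.
import torch
from torchao.quantization import quantize_, int4_weight_only

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096)).cuda().to(torch.bfloat16)

# quantize in place with HQQ-initialized int4 weight-only quantization
quantize_(model, int4_weight_only(group_size=64, use_hqq=True))

# serialization: the quantized weights round-trip through a plain state dict,
# which is what the HF hub integration would build on
torch.save(model.state_dict(), "quantized_checkpoint.pt")

# accelerated inference via torch.compile
model = torch.compile(model, mode="max-autotune")
with torch.no_grad():
    out = model(torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16))
```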
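
And for the stochastic rounding utility mentioned under quantized_training, a minimal sketch of the standard fp32 → bf16 trick (inject random bits into the 16 mantissa bits that truncation discards); torchao's actual implementation may differ:

```python
import torch

def stochastic_round_bf16(x: torch.Tensor) -> torch.Tensor:
    """fp32 -> bf16 with stochastic rounding (ignores inf/nan edge cases)."""
    assert x.dtype == torch.float32
    bits = x.view(torch.int32)
    # adding uniform noise to the 16 bits that bf16 truncation drops makes values
    # round away from zero with probability proportional to their remainder
    noise = torch.randint(0, 1 << 16, x.shape, dtype=torch.int32, device=x.device)
    rounded = (bits + noise) & ~0xFFFF  # keep sign, exponent, and top 7 mantissa bits
    return rounded.view(torch.float32).to(torch.bfloat16)

# unbiased on average, unlike round-to-nearest-even which always gives 1.0 here
x = torch.full((10000,), 1.0 + 2**-10)
print(stochastic_round_bf16(x).float().mean())  # ~1.00098, matching the input
```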

I'd love to hear more from folks, especially if you disagree with anything!

cc @supriyar @jerryzh168 @drisspg @vkuzo @gau-nernst

vkuzo (Contributor) commented Mar 2, 2025

float8nocompile: This doesn't feel like it should be a prototype feature

I don't think this should be a separate feature. If we decide to polish this, IMO it should be a setting in the regular float8/int8/mx flows to use fused eager mode kernels. I'd want to make sure overall UX does not regress with respect to activation checkpointing and that performance is actually compelling on real models important today (large enough gemms) before shipping this.

danielvegamyhre (Contributor) commented Mar 2, 2025

float8nocompile: This doesn't feel like it should be a prototype feature and I'd like to hear some detail on the promotion plan @danielvegamyhre

IMO the main blocker to promoting this from prototype is better composability with AC, for which we need to implement the feature request in pytorch/pytorch#144928. From my conversations with Jeffrey, my understanding is he agrees it would be a useful feature, has some ideas in mind about how to implement it, and is planning to do it some time this half (cc @soulitzer, please correct me if I'm mistaken about this).

If we decide to polish this, IMO it should be a setting in the regular float8/int8/mx flows to use fused eager mode kernels.

This is an interesting idea as well, I'd be interested in exploring that once the AC API described in the feature request has landed.

I'd want to make sure overall UX does not regress with respect to activation checkpointing and that performance is actually compelling on real models important today (large enough gemms) before shipping this.

+1
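
For concreteness, here is a rough sketch of the composition being discussed: convert a model for float8 training and wrap blocks in activation checkpointing. The API names follow torchao's float8 training flow as I understand it; the nocompile variant would need to behave well when AC recomputes the wrapped region:

```python
# Sketch under assumptions: convert_to_float8_training is used as the torchao
# float8 entry point; hardware/scaling details are omitted.
import torch
from torch.utils.checkpoint import checkpoint
from torchao.float8 import convert_to_float8_training

class Block(torch.nn.Module):
    def __init__(self, dim: int = 4096):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        # AC recomputes this region in backward; how the fused eager float8
        # kernels compose with that recomputation is the open question above
        return checkpoint(self.ff, x, use_reentrant=False)

model = Block().cuda().to(torch.bfloat16)
convert_to_float8_training(model)  # swaps nn.Linear modules for float8 linears
loss = model(torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)).sum()
loss.backward()
```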

supriyar (Contributor) commented Mar 3, 2025

Agree with you on the things we should deprecate/delete. We can perhaps do them during our next BE day/week? cc @andrewor14

For the rest like quantization algorithms or sparsity I'll defer to @jerryzh168 and @jcaip to share their thoughts.

msaroufim added the tracker, rfc, and topic: deprecation labels on Mar 3, 2025
jcaip (Contributor) commented Mar 4, 2025

cc @msaroufim

For sparsity: 2:4, marlin, and BSR have all been promoted out of prototype; the only things that remain are:

  • the old structured pruner / sparsifier used for masking - I am in favor of deleting this, as my general sense is that people are doing their own masking (one way to do that with stock PyTorch is sketched below). But we'll need to update this tutorial, which currently uses this API.
  • some superblock eval / train code (the actual implementation is in torchao.sparsity) - I would like to delete this half, since 90% of it is shared with the reference torchvision implementation.
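
On the "people are doing their own masking" point, a small sketch of how that can be done with stock PyTorch utilities (the mask policy here is made up for illustration, and this doesn't replace the pruner's schedule/config machinery):

```python
import torch
from torch.nn.utils import prune

linear = torch.nn.Linear(128, 128)

# example policy: zero out the 50% smallest-magnitude weights
threshold = linear.weight.detach().abs().median()
mask = (linear.weight.detach().abs() > threshold).to(linear.weight.dtype)

prune.custom_from_mask(linear, name="weight", mask=mask)  # registers weight_orig + weight_mask
prune.remove(linear, "weight")  # bake the mask in, leaving a dense weight with zeros
```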
