
[float8] improve eager numerics for dynamic scales and gets on par with torch.compile #904

Merged (43 commits, Oct 1, 2024)

Commits on Sep 19, 2024

  1. Commit 6bf0f5c
  2. leave torch.linalg.vector_norm for another PR (553687f, weifengpy)
  3. cuda (19a592d, weifengpy)
  4. remove _data and investigate (218290e, weifengpy)
  5. remove _data comment (24ec914, weifengpy)

Commits on Sep 21, 2024

  1. upcast to float32 is enough (c099486, weifengpy)
  2. explain why float32 (b93ffc8, weifengpy)
  3. _data parity (ebff416, weifengpy)
  4. handle sm8.9 (8978ab2, weifengpy)
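Commits 1 and 2 above concern computing the float8 scale from amax at float32 precision so eager numerics match torch.compile. A framework-free sketch of why the intermediate precision matters (bfloat16 rounding is emulated here by truncating the float32 mantissa; 448.0 as the e4m3 max is the only float8 detail assumed, and the function names are illustrative, not torchao's API):

```python
import struct

E4M3_MAX = 448.0  # largest representable magnitude in float8 e4m3

def to_bfloat16(x: float) -> float:
    """Round x to bfloat16 precision: keep the top 16 bits of its float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000  # round-to-nearest on the dropped bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def scale_in_bf16(amax: float) -> float:
    # scale computed entirely at bfloat16 precision: two roundings compound
    return to_bfloat16(E4M3_MAX / to_bfloat16(amax))

def scale_in_fp32(amax: float) -> float:
    # upcast amax first and divide at full precision
    # (Python floats are float64, i.e. at least float32-accurate here)
    return E4M3_MAX / amax

amax = 3.1415927
print(scale_in_bf16(amax), scale_in_fp32(amax))  # 143.0 vs ~142.60
```

The gap between the two results is exactly the kind of eager-vs-compile mismatch the PR title refers to: whichever precision the fused compiled kernel uses, the eager path must use the same one to be bitwise comparable.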

Commits on Sep 22, 2024

  1. fix transformer unit test (f17dc12, weifengpy)

Commits on Sep 26, 2024

  1. print if error (511c751, weifengpy)
  2. Add tutorial for trainable tensor subclass (pytorch#908)

    Summary: The new tutorial provides an example of how to implement
    a trainable tensor subclass that wraps quantized data. This extends
    the existing `MyDTypeTensor` with a few necessary steps to ensure
    proper gradient updates, namely:
    
    1. Define a differentiable constructor
    2. Define backward pass for ops of interest (e.g. torch.nn.functional.linear)
    3. Handle special ops used by the optimizer (e.g. aten.add, aten.add_)
    
    Test Plan:
    python tutorials/developer_api_guide/my_trainable_tensor_subclass.py
    andrewor14 authored and weifengpy committed Sep 26, 2024 (9becda1)
  3. Introducing 1-bit quantization for Llama in torchchat (pytorch#910)

    Differential Revision: D63052325
    Pull Request resolved: pytorch#911
    vaishnavi17 authored and weifengpy committed (e4fdca9)
  4. Commit 0cd4d37
  5. [float8] fix typo in bitwise_identical unit test (pytorch#918) (014558d, weifengpy)
  6. Adding example for quantized tensor + tensor parallelism (pytorch#785)

    Summary:
    This PR adds an example of how a quantized tensor subclass can work with DTensor: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/README.md

    The end goal is to rewrite https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama2.py with a normal llama2 implementation, and to show that with DTensor + AffineQuantizedTensor + torch.compile we can get on-par performance with the custom tensor parallel implementation.

    Test Plan:
    torchrun --standalone --nnodes=1 --nproc-per-node=4 tutorials/developer_api_guide/tensor_parallel.py

    Squashed sub-commits: tensor parallel file; use DTensor.from instead of distribute_tensor; implement aten.slice.Tensor; shape fixes and more quant primitive ops; add rowwise test; make rowwise sharding work; compile not working at first (fake tensor didn't pick up shape changes from transpose), so run with backend='eager' and change transpose to a non-inplace op; add error message; works now with torch nightly; remove print; ruff; clean up; fix device id.

    Co-authored-by: Ke Wen <[email protected]>
    2 people authored and weifengpy committed (3267402)
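The rowwise-sharding work in the commit above can be pictured with a toy, single-process sketch: plain Python lists stand in for DTensor shards, and the function names are illustrative, not the tutorial's API. Each "rank" holds a slice of the weight's rows (output features) and computes its slice of the output independently:

```python
def rowwise_shard(weight, world_size):
    """Split a (out_features x in_features) weight row-wise across ranks."""
    rows_per_rank = len(weight) // world_size
    return [weight[r * rows_per_rank:(r + 1) * rows_per_rank]
            for r in range(world_size)]

def sharded_linear(x, shards):
    """Each 'rank' computes its slice of y = W @ x; concatenation recovers the full result."""
    out = []
    for shard in shards:
        out.extend(sum(w * xi for w, xi in zip(row, x)) for row in shard)
    return out

weight = [[1, 0], [0, 1], [1, 1], [2, 0]]  # 4 output features, 2 input features
x = [3.0, 4.0]
full = [sum(w * xi for w, xi in zip(row, x)) for row in weight]
assert sharded_linear(x, rowwise_shard(weight, world_size=2)) == full
```

The real example does the same partitioning through DTensor placements, so that the quantized tensor subclass and torch.compile see ordinary tensor ops on each shard.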
  7. Commit 1e07eff
  8. Add workaround to recover the perf for quantized vit in torch.compile (pytorch#926)

    Summary:
    Recently we found a perf drop in quantized vit due to pytorch#898 (comment). This PR adds a temporary fix until we figure out the longer-term fix. Ideally we should figure out why the tensor subclass check fails in torch.compile (https://github.com/pytorch/pytorch/blob/e4d294221b140fdbb49a64f297bc60c9fcc2f80e/torch/nn/modules/activation.py#L1286) and fix that.

    Test Plan:
    python tutorials/quantize_vit/run_vit_b_quant.py
    jerryzh168 authored and weifengpy committed (ebdeed0)
  9. clean up device checks in float8 unit test files (pytorch#923)

    Summary:
    While working on rowwise scaling I noticed that some of the CUDA device capability checks we had in the test files did not make sense; cleaning this up.

    Test Plan:
    Tests pass on my H100. CI should skip fewer tests now, since CI only has CUDA capability 8.9.
    vkuzo authored and weifengpy committed (09ffa22)
  10. [low-bit optim] Change 8-bit and FP8 optim block size from 2048 to 256 to match new bnb v0.44 (pytorch#927) (0b8dd85, gau-nernst)
  11. Commit 87faf04
  12. Commit 3a9fdb0
  13. Remove two if statements in fp8 padding (pytorch#935)

    Reviewed By: vkuzo
    Differential Revision: D63051205
    Pull Request resolved: pytorch#935
    Approved by: https://github.com/vkuzo
    y-sq authored and weifengpy committed (fc6c393)
  14. [Distributed] Improve sharding example, add comment (pytorch#937) (0043ace, kwen2501)
  15. Add composable QAT quantizer (pytorch#938)

    Summary: This is a utility for users who wish to apply multiple QAT quantizers to their models. In the near future, we expect to add an embedding QAT quantizer that composes with the existing linear QAT quantizers.

    Test Plan:
    python test/quantization/test_qat.py -k test_composable_qat_quantizer
    andrewor14 authored and weifengpy committed (ab3435c)
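A composable quantizer of the kind described above is typically a thin wrapper that threads a model through each child quantizer's prepare/convert steps in order. A hypothetical sketch (class and method names are illustrative, not torchao's actual API; `TagQuantizer` is a toy stand-in for a real QAT quantizer):

```python
class ComposableQATQuantizer:
    """Apply several QAT quantizers to one model, in order (illustrative sketch)."""

    def __init__(self, quantizers):
        self.quantizers = list(quantizers)

    def prepare(self, model):
        # each child quantizer swaps in its own fake-quantized modules
        for q in self.quantizers:
            model = q.prepare(model)
        return model

    def convert(self, model):
        # conversion runs in the same order the quantizers were applied
        for q in self.quantizers:
            model = q.convert(model)
        return model

class TagQuantizer:
    """Toy quantizer that just records its steps on a list-shaped 'model'."""
    def __init__(self, tag):
        self.tag = tag
    def prepare(self, model):
        return model + ["prep:" + self.tag]
    def convert(self, model):
        return model + ["conv:" + self.tag]

c = ComposableQATQuantizer([TagQuantizer("linear"), TagQuantizer("embedding")])
m = c.prepare([])
print(c.convert(m))
```

The design keeps each quantizer oblivious to the others; composition only requires that every quantizer's prepare/convert return the (possibly rewritten) model.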
  16. resolve conflict with latest main

    Differential Revision: D63048850
    Pull Request resolved: pytorch#912
    weifengpy committed (a05a40f)
  17. Add torchchat quantizer

    Differential Revision: D62394341
    Pull Request resolved: pytorch#897
    metascroy authored and weifengpy committed (334891b)
  18. Add compile tests to test suite (pytorch#906)

    Summary:
    This is a follow-up PR addressing pytorch#839 (comment). We can add more compiler-related tests in the future. Next: refactor a bit to use the quantize_ API directly; use the test suite in existing API tests.

    Test Plan:
    python torchao/testing/utils.py
    jerryzh168 authored and weifengpy committed (c706139)
  19. Fix up CMakeLists and reorganize some code locations

    Differential Revision: D62711903
    Pull Request resolved: pytorch#948
    metascroy authored and weifengpy committed (93554c0)
  20. [float8] all-reduce amax on dp mesh instead of global pg (pytorch#933)

    Squashed sub-commits: all-reduce amax on dp mesh instead of global pg; improve comments; move hp tensor inside if; linter fix-ups.
    weifengpy committed (efd9bb9)
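The change above reduces amax over the data-parallel sub-mesh rather than the whole world, so tensor-parallel ranks holding different weight shards keep their own scales. A single-process toy of the grouping logic (a plain dict stands in for torch.distributed, and the row-major (dp, tp) rank layout is an assumption for illustration):

```python
def amax_allreduce_dp(local_amax, dp_size, tp_size):
    """All-reduce (max) over the dp dimension only; ranks laid out row-major as (dp, tp)."""
    reduced = {}
    for tp in range(tp_size):
        group = [dp * tp_size + tp for dp in range(dp_size)]  # one dp group per tp index
        group_max = max(local_amax[r] for r in group)
        for r in group:
            reduced[r] = group_max
    return reduced

# 2x2 mesh of 4 ranks: dp groups are {0, 2} and {1, 3}
local = {0: 1.0, 1: 4.0, 2: 2.0, 3: 3.0}
print(amax_allreduce_dp(local, dp_size=2, tp_size=2))
# ranks 0,2 -> 2.0; ranks 1,3 -> 4.0 (an all-reduce on the global pg would give 4.0 everywhere)
```

In the real code the same effect comes from issuing the all-reduce on the dp mesh's process group instead of the default (global) one.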
  21. int8 dynamic quant + bsr support (pytorch#821)

    This PR adds int8 dynamic quant + BSR (block sparse row) support.

    Changes:
    * Use i8i8 -> bf16 matmul to maintain accuracy
    * Added a block sparse layout type to AffineQuantizedTensor + check/impl
    * Cleaned up benchmark.py script and added a single-line `benchmark.sh` file for acceleration numbers
    * Updated eval.py and added a single-line `evaluate.sh` file for accuracy numbers
    * Lots of lint formatting and README updates
    * torch.compile now working and correct
    jcaip authored and weifengpy committed (85126cc)
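Dynamic int8 quantization, as used in the commit above, picks the scale from each tensor's runtime amax rather than from calibration. A minimal per-tensor symmetric sketch in plain Python (torchao's actual kernels are per-row and fused; this only shows the arithmetic):

```python
def int8_dynamic_quant(x):
    """Symmetric per-tensor int8 quantization: scale chosen from the runtime amax."""
    amax = max(abs(v) for v in x)
    scale = amax / 127.0 if amax else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in x]
    return q, scale

def dequant(q, scale):
    """Recover approximate real values from int8 codes."""
    return [v * scale for v in q]

q, s = int8_dynamic_quant([-1.0, 0.5, 2.0])
print(q, s)  # [-64, 32, 127], scale = 2.0 / 127
```

Because the scale tracks the live activation range, no calibration pass is needed; the i8i8 -> bf16 matmul in the PR then applies the activation and weight scales to the int32 accumulator.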
  22. fixing some issues with our support for 70/405B models (pytorch#941)

    Summary: download and convert scripts needed to be updated alongside model.py config files.

    Test Plan: python generate.py --checkpoint_path ../../../checkpoints/meta-llama/Meta-Llama-3.1-70B/model.pth
    HDCharles authored and weifengpy committed (a5a426e)
  23. Commit e7270f1
  24. Add executorch parallel

    Differential Revision: D62711909
    Pull Request resolved: pytorch#953
    metascroy authored and weifengpy committed (352685c)
  25. Commit 168cfe9
  26. Commit 5900c3e
  27. test CI (37e1479, weifengpy)
  28. better comment on why upcasting (2efde49, weifengpy)
  29. control seed (8c04f4f, weifengpy)
  30. move unit test to test_compile (04b229b, weifengpy)
  31. fix typo (8b7c2ef, weifengpy)

Commits on Sep 27, 2024

  1. float64 upcasting after allreduce (9346afd, weifengpy)

Commits on Sep 30, 2024

  1. use LinearMMConfig (3d0da20, weifengpy)