FSDP2 supports all-gathering parameters in FP8: https://discuss.pytorch.org/t/distributed-w-torchtitan-enabling-float8-all-gather-in-fsdp2/209323. That post goes through torchao's Float8Linear — could we get the same FP8 all-gather directly with TransformerEngine instead of torchao? Thanks!
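For context, here is a minimal sketch of the torchao recipe the linked post describes, as I understand it. The toy model is illustrative, and the `fully_shard` import path assumes a recent PyTorch (older versions expose it under `torch.distributed._composable.fsdp`); this is not a tested snippet, just the shape of the existing path I'd like to replicate with TE:

```python
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 ("per-parameter" FSDP)
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Illustrative toy model; assumes an initialized process group (e.g. torchrun).
model = nn.Sequential(*[nn.Linear(4096, 4096, bias=False) for _ in range(4)])

# Swap nn.Linear -> Float8Linear; with this flag, FSDP2 all-gathers the
# weights in float8 rather than bf16, roughly halving all-gather bytes.
convert_to_float8_training(
    model,
    config=Float8LinearConfig(enable_fsdp_float8_all_gather=True),
)

# Standard FSDP2 wrapping: shard each layer, then the root module.
for layer in model:
    fully_shard(layer)
fully_shard(model)
```

On the TE side, the closest primitive I'm aware of is `transformer_engine.pytorch.fp8_model_init`, which keeps module weights in FP8; the open question is whether FSDP2's all-gather can consume those FP8 parameters the way it consumes torchao's float8 weight subclass.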