2 changes: 1 addition & 1 deletion .github/workflows/docs.yml
Expand Up @@ -25,7 +25,7 @@ jobs:
run: |
doxygen docs/Doxyfile
cd docs
make html
make html SPHINXOPTS="-W"
- name: 'Upload docs'
uses: actions/upload-artifact@v4
with:
30 changes: 24 additions & 6 deletions docs/api/pytorch.rst
Expand Up @@ -37,16 +37,22 @@ pyTorch
.. autoapiclass:: transformer_engine.pytorch.CudaRNGStatesTracker()
:members: reset, get_states, set_states, add, fork

.. autoapifunction:: transformer_engine.pytorch.fp8_autocast

.. autoapifunction:: transformer_engine.pytorch.fp8_model_init

Comment on lines 39 to 40
Member

We do still want them in the documentation, just marked as deprecated.

Member

OK, I see that you just moved them.

.. autoapifunction:: transformer_engine.pytorch.autocast

.. autoapifunction:: transformer_engine.pytorch.quantized_model_init

.. autoapifunction:: transformer_engine.pytorch.checkpoint


.. autoapifunction:: transformer_engine.pytorch.make_graphed_callables

.. autoapifunction:: transformer_engine.pytorch.get_cpu_offload_context


Recipe availability
------------------------
Member

In the past I had some trouble with titles where the underline was not exactly the same length as the text.

Collaborator Author

If the underline is shorter than the title, a warning is generated. In this case the underline is longer, so there is no warning and it renders properly, so I guess it's OK now.
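
The rule described in this thread (a warning only when the underline is shorter than the title) can be checked mechanically. A minimal sketch, assuming a hypothetical helper `find_short_underlines` that is not part of this PR; it only looks at title/underline pairs and a common subset of adornment characters, and it measures length with `len()`, which may differ from how docutils measures wide characters:

```python
# Hypothetical helper, not part of the PR: flag RST section underlines
# that are shorter than the title line directly above them.
ADORNMENT_CHARS = set("=-~^\"'`#*+_.:")


def is_adornment(line):
    """True if the line is a run of one repeated adornment character."""
    s = line.rstrip()
    return len(s) >= 2 and s[0] in ADORNMENT_CHARS and s == s[0] * len(s)


def find_short_underlines(text):
    """Return (underline_line_number, title, underline_length) tuples."""
    lines = text.splitlines()
    problems = []
    for i in range(1, len(lines)):
        title = lines[i - 1].rstrip()
        underline = lines[i].rstrip()
        # Skip blank lines and lines that are themselves adornments.
        if not title or is_adornment(title):
            continue
        if is_adornment(underline) and len(underline) < len(title):
            problems.append((i + 1, title, len(underline)))
    return problems


if __name__ == "__main__":
    sample = "Getting started\n" + "=" * 14 + "\n\nConfig file\n" + "-" * 30 + "\n"
    for line_no, title, length in find_short_underlines(sample):
        print(f"line {line_no}: underline ({length}) shorter than title {title!r}")
```

An underline longer than its title, like the ones introduced in this PR, is accepted by this check, which matches the behaviour described in the reply above.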


.. autoapifunction:: transformer_engine.pytorch.is_fp8_available

.. autoapifunction:: transformer_engine.pytorch.is_mxfp8_available
Expand All @@ -63,9 +69,8 @@ pyTorch

.. autoapifunction:: transformer_engine.pytorch.get_default_recipe

.. autoapifunction:: transformer_engine.pytorch.make_graphed_callables

.. autoapifunction:: transformer_engine.pytorch.get_cpu_offload_context
Mixture of Experts (MoE) functions
------------------------------------------

.. autoapifunction:: transformer_engine.pytorch.moe_permute

Expand All @@ -79,9 +84,22 @@ pyTorch

.. autoapifunction:: transformer_engine.pytorch.moe_sort_chunks_by_index_with_probs


GEMM Comm overlap
---------------------

.. autoapifunction:: transformer_engine.pytorch.initialize_ub

.. autoapifunction:: transformer_engine.pytorch.destroy_ub

.. autoapiclass:: transformer_engine.pytorch.UserBufferQuantizationMode
:members: FP8, NONE


Deprecated functions
---------------------


.. autoapifunction:: transformer_engine.pytorch.fp8_autocast

.. autoapifunction:: transformer_engine.pytorch.fp8_model_init
syntax: missing final newline

Suggested change
.. autoapifunction:: transformer_engine.pytorch.fp8_model_init
.. autoapifunction:: transformer_engine.pytorch.fp8_model_init

24 changes: 23 additions & 1 deletion docs/conf.py
Expand Up @@ -61,7 +61,10 @@
]

templates_path = ["_templates"]
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
exclude_patterns = [
"_build",
"sphinx_rtd_theme",
]

source_suffix = ".rst"

Expand Down Expand Up @@ -101,3 +104,22 @@

autoapi_generate_api_docs = False
autoapi_dirs = [root_path / "transformer_engine"]


# There are 2 warnings about the same namespace (transformer_engine) in two different c++ api
# docs pages. This seems to be the only way to suppress these warnings.
syntax: trailing whitespace on line 109

def setup(app):
    """Custom Sphinx setup to filter warnings."""
    import logging

    # Filter out duplicate C++ declaration warnings
    class DuplicateDeclarationFilter(logging.Filter):
        def filter(self, record):
            message = record.getMessage()
            if "Duplicate C++ declaration" in message and "transformer_engine" in message:
                return False
            return True

    # Apply filter to Sphinx logger
    logger = logging.getLogger("sphinx")
    logger.addFilter(DuplicateDeclarationFilter())
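
The filtering technique above can be tried outside Sphinx with the standard library alone. A minimal sketch, assuming hypothetical logger names (`demo_logger_filter`, `demo_handler_filter`), a hypothetical `collect` helper and a made-up warning text; it contrasts attaching the filter to a logger with attaching it to a handler. In the standard `logging` module, a filter on a logger applies only to records logged directly on that logger, while a filter on a handler also sees records propagated from child loggers. Whether that difference matters for the Sphinx warnings in question depends on which logger emits them, which is not verified here.

```python
import logging

# Made-up warning text for illustration only.
NOISY = "Duplicate C++ declaration, also defined elsewhere (transformer_engine)"


class DuplicateDeclarationFilter(logging.Filter):
    """Drop duplicate C++ declaration warnings that mention transformer_engine."""

    def filter(self, record):
        message = record.getMessage()
        return not ("Duplicate C++ declaration" in message and "transformer_engine" in message)


class ListHandler(logging.Handler):
    """Collect emitted messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def collect(logger_name, attach_to_handler):
    """Log the same records on a parent and a child logger and return what got through."""
    parent = logging.getLogger(logger_name)
    parent.setLevel(logging.WARNING)
    parent.propagate = False  # keep the demo output away from the root logger
    handler = ListHandler()
    parent.addHandler(handler)

    flt = DuplicateDeclarationFilter()
    if attach_to_handler:
        handler.addFilter(flt)
    else:
        parent.addFilter(flt)

    parent.warning(NOISY)                                    # logged directly on the parent
    logging.getLogger(logger_name + ".child").warning(NOISY)  # propagated from a child
    parent.warning("unrelated warning")
    return handler.messages


if __name__ == "__main__":
    print(collect("demo_logger_filter", attach_to_handler=False))
    print(collect("demo_handler_filter", attach_to_handler=True))
```

With the filter on the logger, the record coming from the child logger still reaches the handler; with the filter on the handler, only the unrelated warning gets through.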
1 change: 1 addition & 0 deletions docs/debug.rst
Expand Up @@ -2,6 +2,7 @@
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

See LICENSE for license information.

Precision debug tools
==============================================

15 changes: 9 additions & 6 deletions docs/debug/1_getting_started.rst
Expand Up @@ -4,7 +4,7 @@
See LICENSE for license information.

Getting started
==============
===============================
syntax: underline mismatch - title has 15 chars ('Getting started') but underline has 31 '=' chars


.. note::

Expand Down Expand Up @@ -38,7 +38,7 @@ To start debugging, one needs to create a configuration YAML file. This file lis
one - ``UserProvidedPrecision`` - is a custom feature implemented by the user. Nvidia-DL-Framework-Inspect inserts features into the layers according to the config.

Example training script
----------------------
------------------------------
syntax: underline mismatch - title has 24 chars ('Example training script') but underline has 30 '-' chars


Let's look at a simple example of training a Transformer layer using Transformer Engine with FP8 precision. This example demonstrates how to set up the layer, define an optimizer, and perform a few training iterations using synthetic data.

Expand Down Expand Up @@ -81,7 +81,7 @@ We will demonstrate two debug features on the code above:
2. Logging statistics for other GEMM operations, such as gradient statistics for data gradient GEMM within the LayerNormLinear sub-layer of the TransformerLayer.

Config file
----------
------------------------------
syntax: underline mismatch - title has 11 chars ('Config file') but underline has 30 '-' chars


We need to prepare the configuration YAML file, as below

Expand Down Expand Up @@ -114,7 +114,8 @@ We need to prepare the configuration YAML file, as below
Further explanation on how to create config files is in the :doc:`next part of the documentation <2_config_file_structure>`.

Adjusting Python file
--------------------
----------------------------
syntax: underline mismatch - title has 22 chars ('Adjusting Python file') but underline has 28 '-' chars



.. code-block:: python

Expand Down Expand Up @@ -145,7 +146,8 @@ In the modified code above, the following changes were made:
3. Added ``debug_api.step()`` after each of the forward-backward pass.

Inspecting the logs
------------------
----------------------------
syntax: underline mismatch - title has 20 chars ('Inspecting the logs') but underline has 28 '-' chars



Let's look at the files with the logs. Two files will be created:

Expand Down Expand Up @@ -213,7 +215,8 @@ The second log file (``nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-
INFO - transformer_layer.self_attention.layernorm_qkv_activation_l1_norm iteration=000004 value=130776.7969

Logging using TensorBoard
------------------------
----------------------------
syntax: underline mismatch - title has 27 chars ('Logging using TensorBoard') but underline has 28 '-' chars



Precision debug tools support logging using `TensorBoard <https://www.tensorflow.org/tensorboard>`_. To enable it, one needs to pass the argument ``tb_writer`` to the ``debug_api.initialize()``. Let's modify ``train.py`` file.

15 changes: 9 additions & 6 deletions docs/debug/2_config_file_structure.rst
Expand Up @@ -4,13 +4,14 @@
See LICENSE for license information.

Config File Structure
====================
===========================

To enable debug features, create a configuration YAML file to specify the desired behavior, such as determining which GEMMs (General Matrix Multiply operations) should run in higher precision rather than FP8 and defining which statistics to log.
Below, we outline how to structure the configuration YAML file.

General Format
-------------
----------------------------


A config file can have one or more sections, each containing settings for specific layers and features:

Expand Down Expand Up @@ -55,7 +56,8 @@ Sections may have any name and must contain:
3. Additional fields describing features for those layers.

Layer Specification
------------------
----------------------------


Debug layers can be identified by a ``name`` parameter:

Expand Down Expand Up @@ -89,7 +91,8 @@ Examples:
(...)

Names in Transformer Layers
--------------------------
--------------------------------


There are three ways to assign a name to a layer in the Transformer Engine:

Expand Down Expand Up @@ -154,7 +157,7 @@ Below is an example ``TransformerLayer`` with four linear layers that can be inf


Structured Configuration for GEMMs and Tensors
---------------------------------------------
-----------------------------------------------------

Sometimes a feature is parameterized by a list of tensors or by a list of GEMMs.
There are multiple ways of describing this parameterization.
Expand Down Expand Up @@ -216,7 +219,7 @@ We can use both structs for tensors and GEMMs. The tensors_struct should be nest
gemm_feature_param1: value

Enabling or Disabling Sections and Features
------------------------------------------
-------------------------------------------------

Debug features can be enabled or disabled with the ``enabled`` keyword:

7 changes: 4 additions & 3 deletions docs/debug/3_api_debug_setup.rst
Expand Up @@ -11,7 +11,8 @@ Please refer to the Nvidia-DL-Framework-Inspect `documentation <https://github.c
Below, we outline the steps for debug initialization.

initialize()
-----------
----------------------------


Must be called once on every rank in the global context to initialize Nvidia-DL-Framework-Inspect.

Expand All @@ -34,7 +35,7 @@ Must be called once on every rank in the global context to initialize Nvidia-DL-
log_dir="./log_dir")

set_tensor_reduction_group()
--------------------------
-----------------------------------------

Needed only for logging tensor stats. In multi-GPU training, activation and gradient tensors are distributed across multiple nodes. This method lets you specify the group for the reduction of stats; see the `reduction group section <./4_distributed.rst#reduction-groups>`_ for more details.

Expand All @@ -61,7 +62,7 @@ If the tensor reduction group is not specified, then statistics are reduced acro
# activation/gradient tensor statistics are reduced along pipeline_parallel_group

set_weight_tensor_tp_group_reduce()
---------------------------------
-----------------------------------------

By default, weight tensor statistics are reduced within the tensor parallel group. This function allows you to disable that behavior; for more details, see `reduction group section <./4_distributed.rst#reduction-groups>`_.

2 changes: 1 addition & 1 deletion docs/debug/3_api_features.rst
Expand Up @@ -4,7 +4,7 @@
See LICENSE for license information.

Debug features
==========
===========================

.. autoapiclass:: transformer_engine.debug.features.log_tensor_stats.LogTensorStats
.. autoapiclass:: transformer_engine.debug.features.log_fp8_tensor_stats.LogFp8TensorStats
13 changes: 8 additions & 5 deletions docs/debug/4_distributed.rst
Expand Up @@ -4,7 +4,7 @@
See LICENSE for license information.

Distributed training
===================
====================================

Nvidia-Pytorch-Inspect with Transformer Engine supports multi-GPU training. This guide describes how to run it and how the supported features work in the distributed setting.

Expand All @@ -14,7 +14,8 @@ To use precision debug tools in multi-GPU training, one needs to:
2. If one wants to log stats, one may want to invoke ``debug_api.set_tensor_reduction_group`` with a proper reduction group.

Behavior of the features
-----------------------
----------------------------


In a distributed setting, **DisableFP8GEMM** and **DisableFP8Layer** function similarly to the single-GPU case, with no notable differences.

Expand All @@ -28,7 +29,8 @@ In a distributed setting, **DisableFP8GEMM** and **DisableFP8Layer** function si
Logging-related features are more complex and will be discussed further in the next sections.

Reduction groups
--------------
----------------------------


In setups with tensor, data, or pipeline parallelism, some tensors are distributed across multiple GPUs, requiring a reduction operation to compute statistics for these tensors.

Expand Down Expand Up @@ -65,15 +67,16 @@ Below, we illustrate configurations for a 4-node setup with tensor parallelism s


Microbatching
-----------
----------------------------


Let's dive into how statistics collection works with microbatching. By microbatching, we mean invoking multiple ``forward()`` calls for each ``debug_api.step()``. The behavior is as follows:

- For weight tensors, the stats remain the same for each microbatch because the weight does not change.
- For other tensors, the stats are accumulated.

Logging to files and TensorBoard
------------------------------
-------------------------------------------

In a single-node setup with ``default_logging_enabled=True``, all logs are saved by default to ``log_dir/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-0.log``. In multi-GPU training, each node writes its reduced statistics to its unique file, named ``log_dir/nvdlfw_inspect_statistics_logs/nvdlfw_inspect_globalrank-i.log`` for rank i. Because these logs contain reduced statistics, the logged values are identical for all nodes within a reduction group.

1 change: 1 addition & 0 deletions docs/debug/api.rst
Expand Up @@ -2,6 +2,7 @@
Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

See LICENSE for license information.

API
============

6 changes: 3 additions & 3 deletions docs/examples/attention/attention.ipynb
Expand Up @@ -16,8 +16,8 @@
"\n",
"[Transformer Engine](https://github.com/NVIDIA/TransformerEngine.git) supports the calculation of dot product attention in two frameworks, [PyTorch](https://github.com/pytorch/pytorch) and [JAX](https://github.com/google/jax). The API for each framework is\n",
"\n",
"- [transformer_engine.pytorch.DotProductAttention](../../api/pytorch.rst#transformer_engine.pytorch.DotProductAttention)\n",
"- [transformer_engine.jax.flax.DotProductAttention](../../api/jax.rst#transformer_engine.jax.flax.DotProductAttention)"
"- [transformer_engine.pytorch.DotProductAttention](../../api/pytorch.rst#transformer_engine.pytorch.dotproductattention)\n",
"- [transformer_engine.jax.flax.DotProductAttention](../../api/jax.rst#transformer_engine.jax.flax.dotproductattention)"
Member

Are you sure about that? I just looked at the current docs page and the link works properly there.

Collaborator Author

Oh, that is a debugging leftover. The changed link also works, but I will remove this change.

]
},
{
Expand Down Expand Up @@ -606,7 +606,7 @@
"\n",
"A unique feature of Transformer Engine is its FP8 support, not only for the `Linear` layers but also for dot product attention. Transformer Engine's FP8 attention support is through its cuDNN attention sub-backend 2. Recall Figure 1: the two `MatMul` operations are performed in FP8 for computational efficiency, and the `SoftMax` operation is performed in FP32 for numerical accuracy.\n",
"\n",
"Transformer Engine supports FP8 attention through its [C APIs](../../api/c/fused_attn.rst), and [PyTorch API](../../api/pytorch.rst#transformer_engine.pytorch.DotProductAttention), as of v2.0. Its PyTorch API offers two options, both controlled through the FP8 recipe definition, `transformer_engine.common.recipe.DelayedScaling`.\n",
"Transformer Engine supports FP8 attention through its [C APIs](../../api/c/fused_attn.rst), and [PyTorch API](../../api/pytorch.rst#transformer_engine.pytorch.dotproductattention), as of v2.0. Its PyTorch API offers two options, both controlled through the FP8 recipe definition, `transformer_engine.common.recipe.DelayedScaling`.\n",
"\n",
"- `DelayedScaling.fp8_dpa=True (default=False)`: This enables the use of cuDNN attention sub-backend 2, when it does support the provided user inputs. The `FusedAttention` module for cuDNN attention takes FP16 or BF16 tensors as inputs, performs dot product attention in FP8, and returns attention logits in FP16 or BF16 (same as the input type). Casting operations are required to cast tensors to FP8 at the beginning, and back to FP16/BF16 at the end of the module.\n",
"\n",
Original file line number Diff line number Diff line change
Expand Up @@ -38,7 +38,7 @@
"\n",
"For those seeking a deeper understanding of text generation mechanisms in Transformers, it is recommended to check out the [HuggingFace generation tutorial](https://huggingface.co/docs/transformers/llm_tutorial).\n",
"\n",
"In a previous tutorial on [Llama](../te_llama/tutorial_accelerate_hf_llama_finetuning_with_te.ipynb), it was demonstrated how finetuning of an open-source Llama model can be accelerated using Transformer Engine's `TransformerLayer`. Building on that foundation, this tutorial showcases how to accelerate the token generation from the open-source Hugging Face Gemma 7B model.\n",
"In a previous tutorial on [Llama](../te_llama/tutorial_accelerate_hf_llama_with_te.ipynb), it was demonstrated how finetuning of an open-source Llama model can be accelerated using Transformer Engine's `TransformerLayer`. Building on that foundation, this tutorial showcases how to accelerate the token generation from the open-source Hugging Face Gemma 7B model.\n",
"\n",
"This tutorial introduces several features of the Transformer Engine library that contribute towards this goal. A brief explanation is as follows:\n",
"\n",
3 changes: 1 addition & 2 deletions transformer_engine/jax/cpp_extensions/activation.py
Expand Up @@ -15,7 +15,7 @@

import numpy as np
import transformer_engine_jax
from transformer_engine_jax import NVTE_Activation_Type
from transformer_engine_jax import NVTE_Activation_Type, QuantizeLayout
from .base import BasePrimitive, register_primitive
from .misc import (
jax_dtype_to_te_dtype,
Expand All @@ -32,7 +32,6 @@
from ..quantize import ScaledTensor, ScaledTensorFactory, NoScaleTensor
from ..quantize import (
Quantizer,
QuantizeLayout,
DelayedScaleQuantizer,
ScalingMode,
)
2 changes: 1 addition & 1 deletion transformer_engine/jax/cpp_extensions/gemm.py
Expand Up @@ -24,6 +24,7 @@
get_device_compute_capability,
initialize_cgemm_communicator,
get_cgemm_num_max_streams,
QuantizeLayout,
)

from .base import BasePrimitive, register_primitive
Expand All @@ -40,7 +41,6 @@
GroupedQuantizer,
get_quantize_config,
QuantizerSet,
QuantizeLayout,
noop_quantizer_set,
is_fp8_gemm_with_all_layouts_supported,
apply_padding_to_scale_inv,
3 changes: 2 additions & 1 deletion transformer_engine/jax/cpp_extensions/misc.py
Expand Up @@ -15,9 +15,10 @@
from jax.interpreters.mlir import dtype_to_ir_type

import transformer_engine_jax
from transformer_engine_jax import QuantizeLayout

from ..sharding import get_padded_spec as te_get_padded_spec
from ..quantize import ScaledTensorFactory, QuantizeLayout
from ..quantize import ScaledTensorFactory

TEDType = transformer_engine_jax.DType
