ModuleNotFoundError(str(_TRANSFORMER_ENGINE_AVAILABLE)) #809
Comments
Given that TE is installed, I think it is a bug in the requirement check.
@wprazuch Can you show the command you used to install your version of TE so I can replicate?
I sent a fix here Lightning-AI/utilities#292 such that the utility can parse dev/pre-release versions like the one from transformer engine (e.g., `1.10.0.dev0+931b44f`).

How I installed TE to verify this:

Then I checked this works:

```python
from lightning_utilities.core.imports import RequirementCache

assert RequirementCache("transformer-engine>=0.11.0")
```

The installed version of TE was: `1.10.0.dev0+931b44f`
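For context (this example is not from the original thread), dev/local builds like TE's are valid PEP 440 versions, but `packaging`'s specifier matching excludes pre-releases unless they are explicitly allowed, which is the kind of behavior a requirement-check utility has to handle:

```python
from packaging.version import Version
from packaging.specifiers import SpecifierSet

v = Version("1.10.0.dev0+931b44f")  # TE's dev build parses as a valid PEP 440 version
print(v.is_prerelease)              # True: .dev releases count as pre-releases

spec = SpecifierSet(">=0.11.0")
print(spec.contains(v))                    # False: pre-releases are excluded by default
print(spec.contains(v, prereleases=True))  # True once pre-releases are allowed
```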
This should solve the requirement check. Whether the TE plugin in Lightning is compatible with recent versions of TE remains to be seen. It hasn't been touched/used in some time.
Is this resolved? @wprazuch
@tfogal resolved 👍 |
🐛 Bug
With the newer transformer-engine version in our container, fp8 TE benchmarking with the `benchmark_litgpt.py` script is not possible.

Version where the script works:

Version where the script does not work:
With this in mind, I would like to suggest some workarounds, or even changes in the `benchmark_litgpt.py` script. We could get rid of the `from lightning.fabric.plugins.precision.transformer_engine import TransformerEnginePrecision` dependency entirely, and instead use a simple function (we used it in our internal benchmarks and it did just fine):
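(The snippet itself is not included above; what follows is a minimal sketch of what such a helper might look like, assuming `transformer_engine.pytorch.Linear` as the drop-in replacement. The function name `swap_linear_layers` is hypothetical, not necessarily the one used internally.)

```python
import torch.nn as nn
import transformer_engine.pytorch as te


def swap_linear_layers(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.Linear with a te.Linear of the same shape.

    Note: the TE layers are freshly initialized (weights are not copied),
    which is usually acceptable for throughput benchmarking.
    """
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            te_linear = te.Linear(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
            )
            setattr(module, name, te_linear)
        else:
            swap_linear_layers(child)
    return module
```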
We could have a similar function for `nn.LayerNorm`, or even add it to the above snippet as a flag to swap `LayerNorm` as well, or even swap both layers by default.

And while we are on the topic of layer swapping, I think it would make sense to revisit the scope of swapping layers, which we discussed with @tfogal. Actually, when swapping only `Linear` without `LayerNorm`, we get many more successful benchmarks overall because we hit fewer OOM errors. I would say it is worth considering something like a fallback mechanism (cc: @tfogal): try the full swap first and, if it runs out of memory, fall back to swapping `Linear` only and see if that works.
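A rough sketch of what that fallback could look like (all names here are hypothetical; the swap callables stand in for helpers like the one sketched above, and `torch.cuda.OutOfMemoryError` is the exception PyTorch raises on CUDA OOM):

```python
import torch


def run_with_te_fallback(make_model, run_benchmark, swap_full, swap_linear_only):
    """Benchmark with the aggressive swap first; on CUDA OOM, retry with Linear only.

    make_model()        -> builds a fresh model
    run_benchmark(m)    -> runs the measured iterations on model m
    swap_full(m)        -> swaps both Linear and LayerNorm to TE layers
    swap_linear_only(m) -> swaps only Linear layers to TE
    """
    try:
        return run_benchmark(swap_full(make_model()))
    except torch.cuda.OutOfMemoryError:
        # Release memory held by the failed attempt before retrying.
        torch.cuda.empty_cache()
        return run_benchmark(swap_linear_only(make_model()))
```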
I understand this might not be ideal, but at this point fp8 functionality is nothing like a switch; it is still a matter of different configurations and approaches to running fp8. I am happy to learn what others think about this.

If the decision is to stay with the current wrapper class `TransformerEnginePrecision`, please let me know, and I will try to fix it and release a PR in the respective `pytorch-lightning` repository.

To Reproduce
Steps to reproduce the behavior:
Code sample
As in: `thunder/benchmarks/benchmark_litgpt.py`
Expected behavior
The code should run because the requirement is met.
Environment
As in the `20240719` container.

Additional context
None