ModuleNotFoundError(str(_TRANSFORMER_ENGINE_AVAILABLE)) #809

Closed
wprazuch opened this issue Jul 19, 2024 · 7 comments
Labels
mixology Issues that the mixology team has surfaced

Comments

@wprazuch
Contributor

🐛 Bug

With the newer transformer-engine version in our container, fp8 TE benchmarking with the benchmark_litgpt.py script is not possible.

Version where the script works:

transformer_engine @ file:///dist/transformer_engine-1.9.0.dev0%2B7326af9-cp310-cp310-linux_x86_64.whl#sha256=74ed8c251b8304bc9dbc77f2746b0df8652204001edaf57706c720acfa944ced

Version where the script does not work:

transformer_engine @ file:///dist/transformer_engine-1.10.0.dev0%2Bc57a81f-cp310-cp310-linux_x86_64.whl#sha256=b53e4719a09a6726668dd658b74705fa3d8e62a10b87dfc6c5132be96b72dac1

With this in mind, I would like to suggest some workarounds, or even changes in the benchmark_litgpt.py script. We could get rid of the from lightning.fabric.plugins.precision.transformer_engine import TransformerEnginePrecision dependency entirely, and instead use a simple function (we used it in our internal benchmarks and it did just fine):

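import logging

import torch.nn as nn
import transformer_engine.pytorch as te


def parameters_cnt(model: nn.Module) -> int:
    # Assumed helper (not shown in the original snippet): total parameter count.
    return sum(p.numel() for p in model.parameters())


def swap_linear_layers_for_te(model: nn.Module) -> None:
    def _recursively_swap_linear_layers_for_te(module: nn.Module) -> None:
        for n, m in module.named_children():
            # Descend into containers first so nested linear layers are reached.
            if len(list(m.children())) > 0:
                _recursively_swap_linear_layers_for_te(m)

            if isinstance(m, nn.Linear):
                bias_flag = m.bias is not None
                # The new layer is freshly initialized; weights are not copied.
                new_linear = te.Linear(
                    m.in_features, m.out_features, bias=bias_flag
                )
                setattr(module, n, new_linear)

    initial_params_cnt = parameters_cnt(model)
    _recursively_swap_linear_layers_for_te(model)
    # The swap must preserve the parameter count and leave no nn.Linear behind.
    assert initial_params_cnt == parameters_cnt(model)
    for m in model.modules():
        assert not isinstance(m, nn.Linear)
    logging.info("--> Now model has linear layers from transformer_engine!")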

We could have a similar function for nn.LayerNorm, add a flag to the above snippet that also swaps LayerNorm, or even swap both layer types by default.
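A rough sketch of what the flag variant could look like. The names swap_layers and te_factories are hypothetical, and passing the transformer_engine.pytorch module in as the te argument is my own choice so the swapping logic does not import TE itself. It also assumes nn.LayerNorm has a 1-D normalized_shape and keeps its affine parameters, otherwise the parameter count would change.

```python
from typing import Callable, Dict, Type

import torch.nn as nn

# A factory builds the replacement layer from the layer it replaces.
Factory = Callable[[nn.Module], nn.Module]


def swap_layers(module: nn.Module, factories: Dict[Type[nn.Module], Factory]) -> int:
    """Recursively replace children whose exact type is a key in `factories`.

    Returns the number of layers that were swapped.
    """
    swapped = 0
    for name, child in module.named_children():
        factory = factories.get(type(child))
        if factory is not None:
            setattr(module, name, factory(child))
            swapped += 1
        else:
            swapped += swap_layers(child, factories)
    return swapped


def te_factories(te, swap_layernorm: bool = True) -> Dict[Type[nn.Module], Factory]:
    """Build the factories from `te`, expected to be `transformer_engine.pytorch`."""
    factories: Dict[Type[nn.Module], Factory] = {
        nn.Linear: lambda m: te.Linear(
            m.in_features, m.out_features, bias=m.bias is not None
        ),
    }
    if swap_layernorm:
        # Assumption: normalized_shape is 1-D, so its only entry is the hidden size.
        factories[nn.LayerNorm] = lambda m: te.LayerNorm(m.normalized_shape[0], eps=m.eps)
    return factories
```

With TE installed this would be called as swap_layers(model, te_factories(te, swap_layernorm=False)) after import transformer_engine.pytorch as te. Like the snippet above, it creates freshly initialized layers and does not copy weights.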

And while we are on the topic of layer swapping, I think it would make sense to revisit the scope of swapping layers, which we discussed with @tfogal. When swapping only Linear and leaving LayerNorm untouched, many more benchmarks complete successfully because we get fewer OOM errors. I would say it is worth considering a fallback mechanism (cc: @tfogal) that works like this:

  1. Try to swap both Linear and LayerNorm.
  2. If we get an OOM, try swapping Linear only and see if that works.
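The two steps above could be sketched roughly as follows. This is not taken from benchmark_litgpt.py: run_with_fallback, looks_like_oom and the configuration labels are hypothetical names, and the OOM check assumes the error is either a MemoryError or carries "out of memory" in its message, as CUDA OOMs from PyTorch typically do.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def looks_like_oom(err: BaseException) -> bool:
    # Assumption: OOMs are MemoryError or mention "out of memory" in the message.
    return isinstance(err, MemoryError) or "out of memory" in str(err).lower()


def run_with_fallback(
    configs: Sequence[str],
    run: Callable[[str], T],
    is_oom: Callable[[BaseException], bool] = looks_like_oom,
) -> Tuple[Optional[str], Optional[T]]:
    """Try each configuration in order, moving on only after an OOM failure.

    Returns (config, result) for the first configuration that succeeds,
    or (None, None) if every configuration ran out of memory.
    """
    for config in configs:
        try:
            return config, run(config)
        except Exception as err:
            if not is_oom(err):
                raise  # unrelated failures must not be hidden by the fallback
    return None, None
```

A benchmark driver would pass something like ["linear+layernorm", "linear"] as configs and a run callable that rebuilds the model for the given label before benchmarking. Recording which configuration finally ran matters, since results from the two modes are not directly comparable.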

I understand this might not be ideal, but at this point fp8 functionality is nothing like a switch. It is still a matter of different configurations and approaches to run fp8. I am happy to learn what others think about this.

If the decision is to stay with the current wrapper class TransformerEnginePrecision, please let me know, and I will try to fix it and release a PR in the respective pytorch-lightning repository.

To Reproduce

Steps to reproduce the behavior:

  1. Run the command:
python thunder/benchmarks/benchmark_litgpt.py \
    --model_name phi-2 \
    --compile eager \
    --low_precision_mode fp8-delayed-te
  2. Observe the error:
Time to instantiate model: 0.13 seconds.
Traceback (most recent call last):
  File "/mnt/nvdl/usr/wprazuch/Projects/lightning_ai/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 628, in <module>
    CLI(benchmark_main)
  File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 96, in CLI
    return _run_component(components, init)
  File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 204, in _run_component
    return component(**cfg)
  File "/mnt/nvdl/usr/wprazuch/Projects/lightning_ai/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 582, in benchmark_main
    benchmark = Benchmark_litGPT(**kwargs)
  File "/mnt/nvdl/usr/wprazuch/Projects/lightning_ai/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 240, in __init__
    te_precision = TransformerEnginePrecision(weights_dtype=torch.bfloat16, replace_layers=True)
  File "/usr/local/lib/python3.10/dist-packages/lightning/fabric/plugins/precision/transformer_engine.py", line 75, in __init__
    raise ModuleNotFoundError(str(_TRANSFORMER_ENGINE_AVAILABLE))
ModuleNotFoundError: Requirement 'transformer_engine>=0.11.0' not met. HINT: Try running `pip install -U 'transformer_engine>=0.11.0'`

Code sample

As in:
thunder/benchmarks/benchmark_litgpt.py

Expected behavior

The code should run because the requirement is met.

Environment

As in the 20240719 container

Additional context

None

@tfogal
Collaborator

tfogal commented Jul 19, 2024

@t-vi @lantiga can you get the appropriate pair[s] of eyes to take a look (based on Kshiteej's comment that this is in lightning fabric)?

@tfogal tfogal added the mixology Issues that the mixology team has surfaced label Jul 19, 2024
@awaelchli
Contributor

@wprazuch Can you show the command you used to install your version of TE so I can replicate?

@awaelchli
Contributor

awaelchli commented Jul 23, 2024

I sent a fix in Lightning-AI/utilities#292 so that the utility can parse dev/pre-release versions like the one from transformer engine (e.g., 1.10.0.dev0+931b44f).

How I installed TE to verify this:

git clone https://github.com/NVIDIA/TransformerEngine.git
cd TransformerEngine
git submodule update --init --recursive
pip install .

Then I checked this works:

from lightning_utilities.core.imports import RequirementCache

assert RequirementCache("transformer-engine>=0.11.0")

The installed version of TE was: 1.10.0.dev0+931b44f
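For anyone who wants to cross-check their own container without going through RequirementCache, here is a small sketch using importlib.metadata and the packaging library. The helper names installed_version and satisfies are hypothetical, and this is not how lightning_utilities implements its check; it only shows that a dev build such as 1.10.0.dev0+931b44f does satisfy >=0.11.0 once pre-releases are explicitly accepted.

```python
from importlib.metadata import PackageNotFoundError, version as dist_version
from typing import Optional

from packaging.specifiers import SpecifierSet
from packaging.version import Version


def installed_version(dist_name: str) -> Optional[Version]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return Version(dist_version(dist_name))
    except PackageNotFoundError:
        return None


def satisfies(found: Optional[Version], spec: str) -> bool:
    """Check a version against a PEP 440 specifier, accepting dev/pre-releases."""
    if found is None:
        return False
    # prereleases=True lets dev builds such as 1.10.0.dev0+931b44f match.
    return SpecifierSet(spec).contains(found, prereleases=True)


if __name__ == "__main__":
    found = installed_version("transformer_engine")
    print(found, satisfies(found, ">=0.11.0"))
```

If this reports the requirement as met while the Lightning plugin still raises ModuleNotFoundError, the requirement-checking utility is the more likely culprit than the TE install.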

@awaelchli
Contributor

awaelchli commented Jul 23, 2024

This should solve the requirement check. Whether the TE plugin in Lightning is compatible with recent versions of TE remains to be seen. It hasn't been touched/used in some time.

@tfogal
Collaborator

tfogal commented Jul 26, 2024

Is this resolved? @wprazuch

@wprazuch
Contributor Author

@tfogal resolved 👍
