Litgpt benchmark #4320

Merged: jjsjann123 merged 13 commits into main from litgpt_benchmark on Apr 30, 2025

Conversation

@jjsjann123 (Collaborator)

Fixes #4253

github-actions bot commented Apr 25, 2025

Review updated until commit 183b4f9

Description

  • Added LitGPT benchmark configurations

  • Implemented LitGPT model setup in rope_ops.py

  • Updated conftest.py with new resize marker

  • Fixed import in cross_entropy_loss.py


Changes walkthrough 📝

Relevant files

Enhancement

  • conftest.py: Add resize marker (benchmarks/python/conftest.py)
      Added resize marker to pytest configuration (+4/-0)
  • model_configs.py: Add LitGPT configuration (benchmarks/python/model_configs.py)
      Added litgpt_cfg function to load LitGPT configurations; moved AutoConfig import inside functions (+16/-2)
  • rope_ops.py: Add LitGPT setup in rope_ops (benchmarks/python/rope_ops.py)
      Implemented Litgpt function for LitGPT model setup; added LitGPT configurations to rope_setup (+128/-0)
  • test_rope.py: Update test_rope with LitGPT (benchmarks/python/test_rope.py)
      Added LitGPT variations to test parameters; marked LitGPT tests with resize marker (+10/-0)

Bug fix

  • cross_entropy_loss.py: Fix import in cross_entropy_loss (benchmarks/python/cross_entropy_loss.py)
      Corrected import to transformers.models.mistral instead of phi3 (+1/-1)

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
⚡ Recommended focus areas for review

Import Error

The import statement for MistralPreTrainedModel is incorrect. It should be imported from the correct module.

from transformers.models.mistral import MistralPreTrainedModel

Code Complexity

The Litgpt function is quite long and complex. Consider breaking it down into smaller, more manageable functions.

    def Litgpt(seq_length, model_name):
        class LitgptRope(torch.nn.Module):
            def __init__(self, config) -> None:
                from litgpt.model import apply_rope
    
                self.fused_apply_rotary_pos_emb_cached = None
    
                super().__init__()
                self.config = config
                self.apply_rope = apply_rope
    
            def forward(
                self,
                qkv: torch.Tensor,
                cos: torch.Tensor,
                sin: torch.Tensor,
            ) -> torch.Tensor:
                B, T, _ = qkv.shape  # batch size, sequence length
    
                # assemble into a number of query groups to support MHA, MQA and GQA together (see `config.n_query_groups`)
                q_per_kv = self.config.n_head // self.config.n_query_groups
                total_qkv = q_per_kv + 2  # each group has 1+ queries, 1 key, and 1 value
                qkv = qkv.view(
                    B, T, self.config.n_query_groups, total_qkv, self.config.head_size
                )
                qkv = qkv.permute(0, 2, 3, 1, 4)  # (B, n_query_groups, total_qkv, T, hs)
    
                # split batched computation into three
                q, k, v = qkv.split((q_per_kv, 1, 1), dim=2)
    
                # maybe repeat k and v if for the non multi-head attention cases
                # training: flash attention requires it
                # inference: multi-query would require a full kv cache so avoid it to limit its memory usage
                if (
                    self.config.n_query_groups != self.config.n_head
                    and self.config.n_query_groups != 1
                ):
                    k = k.expand(
                        B, self.config.n_query_groups, q_per_kv, T, self.config.head_size
                    )
                    v = v.expand(
                        B, self.config.n_query_groups, q_per_kv, T, self.config.head_size
                    )
    
                q = q.reshape(B, -1, T, self.config.head_size)  # (B, nh_q, T, hs)
                k = k.reshape(B, -1, T, self.config.head_size)  # (B, nh_k, T, hs)
                v = v.reshape(B, -1, T, self.config.head_size)  # (B, nh_v, T, hs)
    
                q_roped = self.apply_rope(q[..., : self.config.rope_n_elem], cos, sin)
                k_roped = self.apply_rope(k[..., : self.config.rope_n_elem], cos, sin)
                q = torch.cat((q_roped, q[..., self.config.rope_n_elem :]), dim=-1)
                k = torch.cat((k_roped, k[..., self.config.rope_n_elem :]), dim=-1)
                return q, k, v
    
        cfg = configs["litgpt"](model_name)
        # overwrite seq_length
        cfg.seq_len = seq_length
    
        def inputs():
            qkv = torch.randn(
                cfg.batch_size,
                cfg.seq_len,
                (cfg.n_head + 2 * cfg.n_query_groups) * cfg.head_size,
                device="cuda",
                dtype=torch.bfloat16,
                requires_grad=True,
            )
            cos = torch.randn(
                1,
                cfg.seq_len,
                cfg.rope_n_elem,
                device="cuda",
                dtype=torch.bfloat16,
                requires_grad=False,
            )
            sin = torch.randn(
                1,
                cfg.seq_len,
                cfg.rope_n_elem,
                device="cuda",
                dtype=torch.bfloat16,
                requires_grad=False,
            )
            return qkv, cos, sin
    
        def grads():
            grad = torch.randn(
                cfg.batch_size,
                cfg.n_head,
                cfg.seq_len,
                cfg.head_size,
                device="cuda",
                dtype=torch.bfloat16,
                requires_grad=False,
            )
            return grad
    
        # Manual IOBytes computes the total bandwidth for thunder backward trace.
        def iobytes():
            n_elements = 0
            # adding size of qkv.grad
            n_elements += (
                cfg.batch_size
                * cfg.seq_len
                * (cfg.n_head + 2 * cfg.n_query_groups)
                * cfg.head_size
            )
            # adding size of sin, cos (saved from forward)
            n_elements += 2 * cfg.seq_len * cfg.rope_n_elem
            # adding size of q, k, v (saved from forward)
            n_elements += 3 * cfg.batch_size * cfg.seq_len * cfg.n_head * cfg.head_size
        # total io sizes
            return n_elements * torch.bfloat16.itemsize
    
        return LitgptRope(cfg).cuda().bfloat16(), inputs, grads, iobytes
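As a sanity check on the IOBytes arithmetic in `iobytes()` above, the same formula can be exercised standalone with a toy config. The numbers below are hypothetical illustration values, not a real LitGPT configuration:

```python
from dataclasses import dataclass


@dataclass
class ToyCfg:
    """Hypothetical config mirroring the fields iobytes() reads."""
    batch_size: int = 1
    seq_len: int = 4096
    n_head: int = 32
    n_query_groups: int = 8
    head_size: int = 128
    rope_n_elem: int = 128


def bwd_iobytes(cfg, itemsize=2):  # bfloat16 is 2 bytes per element
    # size of qkv.grad
    n = cfg.batch_size * cfg.seq_len * (cfg.n_head + 2 * cfg.n_query_groups) * cfg.head_size
    # size of sin, cos (saved from forward)
    n += 2 * cfg.seq_len * cfg.rope_n_elem
    # size of q, k, v (saved from forward)
    n += 3 * cfg.batch_size * cfg.seq_len * cfg.n_head * cfg.head_size
    return n * itemsize


# For this toy config: 25_165_824 + 1_048_576 + 50_331_648 elements, times 2 bytes
print(bwd_iobytes(ToyCfg()))  # 153092096
```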
Test Coverage

Ensure that the new test cases cover a variety of scenarios and edge cases for the litgpt models.

    "litgpt-gemma-2-9b",
    "litgpt-mistral-7b",
    "litgpt-meta-llama-3-8B",
    "litgpt-phi3.5-mini",
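The walkthrough notes that the LitGPT tests are marked with the resize marker. One common way to attach a marker per parametrization is `pytest.param` with `marks=`; the sketch below assumes that style (the variation names come from the snippet above, the marker name from the conftest change):

```python
import pytest

# Hypothetical sketch: each LitGPT variation carries the resize marker,
# so `pytest -m resize` (or `-m "not resize"`) can select or skip them.
litgpt_variations = [
    pytest.param(name, marks=pytest.mark.resize)
    for name in (
        "litgpt-gemma-2-9b",
        "litgpt-mistral-7b",
        "litgpt-meta-llama-3-8B",
        "litgpt-phi3.5-mini",
    )
]
```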

@jjsjann123 (Collaborator Author)

    hmmm. the number again doesn't match the original benchmark. I need to take another look at that.

    The added benchmark

    Name (time in us)                                                                           Mean
    --------------------------------------------------------------------------------------------------------
    test_rope_fwd_benchmark[seq_length=4096-executor='torchcompile'-variation='litgpt']      90.1059 (1.0)
    test_rope_fwd_benchmark[seq_length=4096-executor='thunder'-variation='litgpt']          112.1219 (1.24)
    test_rope_bwd_benchmark[seq_length=4096-executor='torchcompile'-variation='litgpt']     140.5902 (1.56)
    test_rope_bwd_benchmark[seq_length=4096-executor='thunder'-variation='litgpt']          329.3374 (3.66)
    --------------------------------------------------------------------------------------------------------
    

    vs reference benchmark

                    Executor                     Model     DType  Batch  Seq-Len  Fwd-Krnls  Fwd-K-Time(ms)  Bwd-Krnls  Bwd-K-Time(ms)
    1          torch.compile  Meta-Llama-3-8B-Instruct  bfloat16      1     4096          3           0.090          3           0.129
    3        Thunder-nvFuser  Meta-Llama-3-8B-Instruct  bfloat16      1     4096          3           0.098          6           0.289
    
    

@jjsjann123 (Collaborator Author)

    kernel looks the same. The difference in measured time is coming from:

    1. not clearing L2 cache
    2. not clearing grad on inputs

Since those come from the reference implementation, I'm not going to update that.
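The two measurement effects mentioned above can be sketched as a helper that resets state between timed iterations. This is a hedged illustration assuming PyTorch, not the reference implementation's code; the function name and the 256 MiB flush size are hypothetical:

```python
import torch


def clear_measurement_state(inputs, l2_flush_bytes=256 * 1024 * 1024):
    """Reset per-iteration state so timings are not skewed (illustrative sketch)."""
    # 1. Clear accumulated gradients: .backward() accumulates into .grad,
    #    so stale gradients change the work done by subsequent iterations.
    for t in inputs:
        if isinstance(t, torch.Tensor) and t.grad is not None:
            t.grad = None
    # 2. Flush the GPU L2 cache by overwriting a buffer larger than the
    #    cache, so the next timed kernel cannot hit still-cached data.
    if torch.cuda.is_available():
        torch.empty(l2_flush_bytes, dtype=torch.int8, device="cuda").zero_()
```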

    @jjsjann123 jjsjann123 requested review from Priya2698 and naoyam April 28, 2025 17:27
    @jjsjann123 jjsjann123 marked this pull request as ready for review April 28, 2025 17:27
@jjsjann123 (Collaborator Author)

    !test

    @@ -0,0 +1 @@
    litgpt[all]
Collaborator

    Can we make this requirement local?
    We have several benchmark files that can run okay without this module. So this can be a hassle when trying to run unrelated benchmarks.

    One way could be to use @pytest.mark.skipif to check for the presence of this module in relevant benchmarks.
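The `skipif` suggestion above can be sketched without importing the optional module. The helper below is hypothetical (not code from this PR); the pytest usage in the comment assumes the standard `pytest.mark.skipif` API:

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` is importable, without actually importing it."""
    return importlib.util.find_spec(name) is not None


# Hypothetical usage inside a benchmark file:
#
#   import pytest
#
#   requires_litgpt = pytest.mark.skipif(
#       not module_available("litgpt"), reason="litgpt is not installed"
#   )
#
#   @requires_litgpt
#   def test_rope_fwd_benchmark(...):
#       ...
```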

Collaborator Author

Sounds fair. I guess I should have stuck to what the other benchmarks do and relied on the module installed in the container. At least we do have litgpt in our CI containers. I'll remove this file.

naoyam (Collaborator) commented Apr 28, 2025

    Can you add a marker? #4290
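The marker ends up registered in `benchmarks/python/conftest.py` per the walkthrough. A minimal sketch of registering a custom marker (the description string here is hypothetical, assuming the standard `pytest_configure` hook):

```python
# conftest.py sketch: register the "resize" marker so pytest does not warn
# about an unknown marker and `pytest -m resize` works for selection.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "resize: benchmarks exercising resize ops (description is illustrative)",
    )
```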

    @jjsjann123 jjsjann123 requested a review from Priya2698 April 29, 2025 00:16
naoyam (Collaborator) commented Apr 29, 2025

I don't have any more comments. Thanks @jjsjann123 for adding these benchmarks. I'll let @Priya2698 give a final stamp.

@Priya2698 (Collaborator) left a comment

    Please remove requirements.txt or make it a local check. LGTM otherwise.

@jjsjann123 (Collaborator Author)

    !test --pybench

    "litgpt-gemma-2-9b",
    "litgpt-mistral-7b",
    "litgpt-meta-llama-3-8B",
    "litgpt-phi3.5-mini",
Collaborator Author

    @xwang233 do we need to manually add new entries in dashboard?

@xwang233 (Collaborator) commented Apr 29, 2025

It might not show up in the PR benchmark results (perhaps it will work automatically, let's see), but it will show up in nightly benchmark results once merged.

@jjsjann123 (Collaborator Author)

    hmmm... seeing an import error.

    00:19:01 FAILED benchmarks/python/test_cross_entropy_loss.py::test_cross_entropy_fwd_benchmark[executor='thunder-torchcompile'-variation='hf_mistral_nemo'] - ImportError: cannot import name 'MistralPreTrainedModel' from 'transformers.models.phi3' (/usr/local/lib/python3.12/dist-packages/transformers/models/phi3/__init__.py)
    

    Let me investigate.

@jjsjann123 (Collaborator Author)

errr. that's coming from the cross_entropy benchmark. I'll just patch that.
It's a bit unfortunate that, given the numerical mismatch in the python benchmark, real errors are buried in false-negative signals.

    cc'ing @protonu

@jjsjann123 (Collaborator Author)

    !build

    super().__init__("hf_mistral_nemo", dtype)

    def model(self):
    from transformers.models.phi3 import MistralPreTrainedModel
Collaborator Author

    @protonu Just so that you are aware.

    @jjsjann123 jjsjann123 merged commit 8c12206 into main Apr 30, 2025
    16 checks passed
    @jjsjann123 jjsjann123 deleted the litgpt_benchmark branch April 30, 2025 20:43
Development

Successfully merging this pull request may close these issues:

Add Litgpt RoPE into nvFuser python benchmark suite