Add an int64 path for mlp kernels by mmathew23 · Pull Request #3614 · unslothai/unsloth

mmathew23 · 2025-11-18T22:22:20Z

The Llama MLP kernels produce NaNs at extremely long context lengths. This happens when num_elements is greater than 2**31. In these cases it's best to calculate offsets with tl.int64 instead of int32. This PR routes to the int64 kernels when num_elements is large enough.
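To make the failure mode concrete, here is a minimal pure-Python sketch of what happens to a block's starting offset when it is computed in signed 32-bit arithmetic. `wrap_int32` and `block_start` are hypothetical helpers written for this illustration; they only model two's-complement wraparound and are not part of Triton or unsloth.

```python
INT32_MAX = 2**31 - 1
BLOCK_SIZE = 1024


def wrap_int32(x: int) -> int:
    """Return the value a signed 32-bit integer would hold for x (two's complement)."""
    return (x + 2**31) % 2**32 - 2**31


def block_start(pid: int, block_size: int, use_int64: bool) -> int:
    """First element offset handled by program `pid` (illustrative model only)."""
    start = pid * block_size
    # Python ints are unbounded, so the int64 path is modelled by not wrapping.
    return start if use_int64 else wrap_int32(start)


# Program 2**21 starts at element 2**21 * 1024 == 2**31, one past INT32_MAX.
pid = 2**21
print(block_start(pid, BLOCK_SIZE, use_int64=False))  # wrapped, negative offset
print(block_start(pid, BLOCK_SIZE, use_int64=True))   # intended offset
```

In this model the int32 path yields a negative offset for the first block past the limit, so the kernel would address the wrong memory, which is consistent with the bad outputs described above.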

danielhanchen · 2025-11-19T13:00:19Z

    device = gate.device
    out = torch.empty((batch, seq_len, hd), dtype = gate.dtype, device = device)
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
+    if n_elements <= (2**31) - 1024:


Why -1024? Is it maybe hd?

Yes, I forgot to account for hd. The idea is that I wanted to add a buffer just to be safe.

Wait, actually it is 1024, i.e. the BLOCK_SIZE.

danielhanchen · 2025-11-19T13:00:51Z

    batch_seq_len, hd = e.shape
    n_elements = e.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
+    if n_elements <= (2**31) - 1024:


Maybe move (2**31) to a global var
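A sketch of what this suggestion could look like: the limit and the safety margin become module-level constants, and the comparison lives in one helper that both call sites share. The names `INT32_SAFETY_BUFFER`, `MAX_INT32_ELEMENTS`, and `needs_int64_offsets` are assumptions made for illustration and are not necessarily the names used in the merged code.

```python
# Module-level constants (hypothetical names).
INT32_SAFETY_BUFFER = 1024                      # equal to the BLOCK_SIZE discussed above
MAX_INT32_ELEMENTS = 2**31 - INT32_SAFETY_BUFFER


def needs_int64_offsets(n_elements: int) -> bool:
    """True when the launch should be routed to the int64-offset kernel."""
    return n_elements > MAX_INT32_ELEMENTS


# Typical sizes stay on the int32 path; very long contexts switch to int64.
print(needs_int64_offsets(8 * 4096 * 14336))    # about 4.7e8 elements
print(needs_int64_offsets(2**31))
```

With the constant defined once, the forward and backward wrappers can test `needs_int64_offsets(n_elements)` instead of repeating `(2**31) - 1024` inline.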

danielhanchen · 2025-11-19T13:46:16Z

+    e,
+    g,
+    n_elements,
+    BLOCK_SIZE: tl.constexpr,


there is actually a way to use 1 kernel only and dispatch, but for now this is fine - we can refactor later

mmathew23 · 2025-11-19T22:35:35Z

Why -1024? Is it maybe hd?

The idea is that offsets cannot exceed 2**31 - 1, which means n_elements <= 2**31. I wanted to add a buffer before that point, and since we process in BLOCK_SIZE blocks rather than hidden_dim blocks, I figured BLOCK_SIZE was the better choice for the buffer. We also get the added benefit that the behavior stays consistent across models.

I've updated the PR to reflect your comments and finalized it. Let me know if there's anything else to address.
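The buffer reasoning above can be checked with a little arithmetic. Because the grid is `triton.cdiv(n_elements, BLOCK_SIZE)` and each program covers a full block, the last program also computes offsets for lanes past the end of the tensor before they are masked out. The sketch below is a simplified model, and `max_computed_offset` is a hypothetical helper; it assumes every lane's offset is formed as `pid * BLOCK_SIZE + lane`.

```python
INT32_MAX = 2**31 - 1
BLOCK_SIZE = 1024


def cdiv(a: int, b: int) -> int:
    """Ceiling division, the same quantity triton.cdiv computes for the grid."""
    return (a + b - 1) // b


def max_computed_offset(n_elements: int, block_size: int = BLOCK_SIZE) -> int:
    """Largest offset any lane computes, masked-out lanes included."""
    return cdiv(n_elements, block_size) * block_size - 1


# A tensor whose size is not a multiple of BLOCK_SIZE: lanes overshoot the end.
print(max_computed_offset(5000))  # 5119, although the last valid index is 4999

# At the PR's int32 limit the largest offset leaves a full block of headroom.
limit = 2**31 - BLOCK_SIZE
print(INT32_MAX - max_computed_offset(limit))  # 1024
```

Under this model the overshoot is always less than one block, so a margin of BLOCK_SIZE keeps every int32 offset in range regardless of the model's hidden dimension.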

* Add an int64 path for mlp kernels
* move constant expressions to globals
* fix name

danielhanchen reviewed Nov 19, 2025

mmathew23 force-pushed the tiled/contextlen branch 2 times, most recently from c008eca to 262ada3 Compare November 19, 2025 17:24

Add an int64 path for mlp kernels

833d91f

mmathew23 force-pushed the tiled/contextlen branch from 262ada3 to 833d91f Compare November 19, 2025 19:22

mmathew23 marked this pull request as ready for review November 19, 2025 22:16

mmathew23 added 2 commits November 19, 2025 20:12

move constant expressions to globals

c428266

fix name

265ef5e

danielhanchen merged commit ac82560 into unslothai:main Nov 20, 2025
1 check passed

abiswas-realadvice pushed a commit to abiswas-realadvice/unsloth that referenced this pull request May 14, 2026

Add an int64 path for mlp kernels (unslothai#3614)

e8e0f6a

* Add an int64 path for mlp kernels
* move constant expressions to globals
* fix name

danielhanchen merged 3 commits into unslothai:main from mmathew23:tiled/contextlen
