Perf tuning and expansion of cases covered for wvSplitKrc #33493

vllm-bot merged 7 commits into vllm-project:main
Conversation
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Code Review
This pull request introduces performance tuning for the wvSplitKrc kernel and expands the cases it covers. The changes are mainly in csrc/rocm/skinny_gemms.cu, with corresponding updates in the dispatch logic in vllm/model_executor/layers/utils.py and test cases in tests/kernels/quantization/test_rocm_skinny_gemms.py. While the performance optimizations seem promising, I've identified a few critical issues. There's a logic mismatch between the Python dispatch code and the C++ kernel implementation that could lead to incorrect kernel dispatching. Additionally, a crucial out-of-bounds check appears to have been incorrectly removed in the kernel, which could lead to incorrect computations. I've provided detailed comments on these issues.
Is this PR related to #33527?
No, they are not related. They change different skinny GEMMs. This PR targets the test scenarios where cross-wave atomic reduction is used to fill the machine (cases seen in gpt-oss). FYI, there'll be another similar PR soon targeting padded activations in the non-quantized skinny GEMM solution.
I'll launch a CI cycle with this one tomorrow, then, to see whether any tests regress.
When you post it, CC me if possible so we can prevent any possible regressions :) There has been a huge effort to keep AMD CI green.
@AndreasKaratzas This is the 3rd PR. It adds padding support to the fp16/bf16 version of the skinny GEMM solutions.
@amd-hhashemi Thanks for the kernel improvements in this PR. We've been investigating a persistent test failure.

Estimated root cause: The old code had a conditional direct-write path that was deterministic:

```cpp
bool doRdc = (kfitsPerRdc * kFit < K);
```

When `doRdc` was false, results were written out directly without a cross-wave reduction. This PR changed it to:

```cpp
bool doRdc = true; // Assuming (kfitsPerRdc * kFit < K) is always true
```

and removed the direct-write path entirely.

Reproduction: After reverting this PR, both tests passed.
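For context on why the atomic-reduction path can be non-deterministic while the direct-write path is not: floating-point addition is order-sensitive, and cross-wave atomics give no ordering guarantee. A minimal Python illustration of the effect (illustrative only, not the kernel code):

```python
# Floating-point addition is not associative, so the order in which
# per-wave partial sums are combined by atomic adds can change the
# rounded result from run to run.
partials = [1e16, 1.0, -1e16]  # partial sums produced by three "waves"

# One arrival order: the small term is absorbed by the large one.
order_a = (partials[0] + partials[1]) + partials[2]  # -> 0.0

# Another arrival order: the large terms cancel first.
order_b = (partials[0] + partials[2]) + partials[1]  # -> 1.0

print(order_a, order_b)  # same inputs, different results
```

The direct-write path avoids this entirely because each output element is produced by exactly one summation order.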
Is it possible to make the kernels deterministic by default, and only use the non-deterministic path when an option is passed?
@AndreasKaratzas Hi, can you please try with #34410? It's a one-liner. I root-caused an issue that shows up on some vLLM dockers on N<=16 GEMMs (as seen in single-prompt gpt-oss). It was occurring only in non-eager-mode prompts for me, and I was never able to get an out-of-threshold test on any of the GEMM sizes.
@amd-hhashemi Back at it again 😅 So there is another inaccuracy observed. Reproduction script, `test_cudagraph_divergence.py`:

```python
import math
from dataclasses import dataclass

import torch

from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig
from vllm.distributed import cleanup_dist_env_and_memory

MODEL = "meta-llama/Llama-3.2-1B-Instruct"
MAX_MODEL_LEN = 256
SEED = 42
GPU_MEM_UTIL = 0.4
MAX_LOGPROBS = 5
TOP_LOGPROBS = 3
MAX_TOKENS = 10
PROMPT = "Hello world " * 50


@dataclass
class RunConfig:
    name: str
    enforce_eager: bool
    compile_ranges_split_points: list[int] | None  # None = use default


def make_sampling_params():
    normal = SamplingParams(
        temperature=0,
        logprobs=TOP_LOGPROBS,
        max_tokens=MAX_TOKENS,
        ignore_eos=False,
    )
    penalty = SamplingParams(
        temperature=0,
        logprobs=TOP_LOGPROBS,
        max_tokens=MAX_TOKENS,
        ignore_eos=False,
        presence_penalty=-1.0,
    )
    return normal, penalty


def run_config(config: RunConfig):
    print(f"\n{'='*60}")
    print(f"Running: {config.name}")
    print(f"  enforce_eager={config.enforce_eager}")
    print(f"  compile_ranges_split_points={config.compile_ranges_split_points}")
    print(f"{'='*60}")
    kwargs = dict(
        model=MODEL,
        max_logprobs=MAX_LOGPROBS,
        max_model_len=MAX_MODEL_LEN,
        seed=SEED,
        gpu_memory_utilization=GPU_MEM_UTIL,
        enable_prefix_caching=False,
        enable_chunked_prefill=True,
        max_num_batched_tokens=32,
        enforce_eager=config.enforce_eager,
    )
    if config.compile_ranges_split_points is not None and not config.enforce_eager:
        kwargs["compilation_config"] = CompilationConfig(
            compile_ranges_split_points=config.compile_ranges_split_points,
        )
    llm = LLM(**kwargs)
    normal_params, penalty_params = make_sampling_params()
    results = llm.generate([PROMPT, PROMPT], [normal_params, penalty_params])
    del llm
    torch.cuda.empty_cache()
    cleanup_dist_env_and_memory()
    return results


def extract_logprobs(results):
    per_request = []
    for result in results:
        positions = []
        for lp_dict in result.outputs[0].logprobs:
            positions.append(lp_dict)
        per_request.append(positions)
    return per_request


def compare(name_a, lps_a, name_b, lps_b):
    labels = ["no_penalty", "with_penalty"]
    print(f"\n{'#'*70}")
    print(f"COMPARISON: {name_a} vs {name_b}")
    print(f"{'#'*70}")
    max_diff = 0.0
    total = 0
    fail_5 = 0
    fail_10 = 0
    for req_idx in range(len(lps_a)):
        label = labels[req_idx]
        a = lps_a[req_idx]
        b = lps_b[req_idx]
        if len(a) != len(b):
            print(f"  [{label}] LENGTH MISMATCH: {len(a)} vs {len(b)}")
            continue
        print(f"\n  [{label}] {len(a)} positions")
        print(f"  {'pos':>4} {'token':>15} {'rank':>5} "
              f"{'lp_A':>12} {'lp_B':>12} {'diff':>10} {'rel%':>8}")
        print(f"  {'-'*72}")
        for pos in range(len(a)):
            common = set(a[pos].keys()) & set(b[pos].keys())
            for tid in sorted(common):
                la = a[pos][tid]
                lb = b[pos][tid]
                diff = abs(la.logprob - lb.logprob)
                denom = max(abs(la.logprob), abs(lb.logprob), 1e-10)
                rel = (diff / denom) * 100
                max_diff = max(max_diff, diff)
                total += 1
                c5 = math.isclose(la.logprob, lb.logprob,
                                  rel_tol=5e-2, abs_tol=1e-1)
                c10 = math.isclose(la.logprob, lb.logprob,
                                   rel_tol=1e-1, abs_tol=1e-1)
                if not c5:
                    fail_5 += 1
                if not c10:
                    fail_10 += 1
                flag = ""
                if not c5:
                    flag = " <-- FAIL@5%"
                if not c10:
                    flag = " <-- FAIL@10%"
                print(f"  {pos:>4} {la.decoded_token!r:>15} "
                      f"{la.rank:>3} "
                      f"{la.logprob:>12.6f} {lb.logprob:>12.6f} "
                      f"{diff:>10.6f} {rel:>7.2f}%{flag}")
    print(f"\n  SUMMARY: {total} comparisons, max_diff={max_diff:.6f}, "
          f"fail@5%={fail_5}, fail@10%={fail_10}")
    return max_diff, fail_5, fail_10


def main():
    configs = [
        RunConfig("eager", enforce_eager=True, compile_ranges_split_points=None),
        RunConfig("eager2", enforce_eager=True, compile_ranges_split_points=None),
        RunConfig("graph_sp32", enforce_eager=False, compile_ranges_split_points=[32]),
        RunConfig("graph_sp64", enforce_eager=False, compile_ranges_split_points=[64]),
    ]
    all_results = {}
    all_lps = {}
    for cfg in configs:
        all_results[cfg.name] = run_config(cfg)
        all_lps[cfg.name] = extract_logprobs(all_results[cfg.name])
    comparisons = [
        ("eager", "eager2", "Eager vs Eager (sanity: expect zero diff)"),
        ("eager", "graph_sp32", "Eager vs CUDA graph split=[32]"),
        ("eager", "graph_sp64", "Eager vs CUDA graph split=[64]"),
        ("graph_sp32", "graph_sp64", "CUDA graph split=[32] vs split=[64] (the confound)"),
    ]
    print(f"\n\n{'='*70}")
    print("ALL COMPARISONS")
    print(f"{'='*70}")
    summary = []
    for a, b, desc in comparisons:
        print(f"\n--- {desc} ---")
        md, f5, f10 = compare(a, all_lps[a], b, all_lps[b])
        summary.append((desc, md, f5, f10))
    print(f"\n\n{'='*70}")
    print("FINAL SUMMARY")
    print(f"{'='*70}")
    print(f"\n{'Description':<55} {'MaxDiff':>10} {'F@5%':>6} {'F@10%':>6}")
    print(f"{'-'*80}")
    for desc, md, f5, f10 in summary:
        print(f"{desc:<55} {md:>10.6f} {f5:>6} {f10:>6}")


if __name__ == "__main__":
    main()
```

If you run this with default settings, i.e. with `VLLM_ROCM_USE_SKINNY_GEMM=1`:

**Running with `VLLM_ROCM_USE_SKINNY_GEMM=1`** (output collapsed)

But if you run it with `VLLM_ROCM_USE_SKINNY_GEMM=0`:

**Running with `VLLM_ROCM_USE_SKINNY_GEMM=0`** (output collapsed)

These experiments were conducted on an MI355 machine. Btw, I would like some help with revamping the skinny GEMMs test, since these failures should be caught there. Can you help with those tasks?

EDIT: While the above is a custom script, the motivation for this was the
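As a side note on the script's pass/fail criterion: `compare()` flags a logprob pair as `FAIL@5%` or `FAIL@10%` using `math.isclose` with a combined relative/absolute bound, so a pair can fail the 5% check yet pass the 10% one. A standalone illustration:

```python
import math

def isclose_at(rel_tol, la, lb, abs_tol=1e-1):
    # Same criterion as compare(): close if |la - lb| is within rel_tol
    # of the larger magnitude OR within abs_tol.
    return math.isclose(la, lb, rel_tol=rel_tol, abs_tol=abs_tol)

la, lb = -2.00, -2.16  # |diff| = 0.16, roughly 7.4% of the larger magnitude
print(isclose_at(5e-2, la, lb))  # False: fails the 5% check
print(isclose_at(1e-1, la, lb))  # True: passes the 10% check
```

Because `math.isclose` takes the max of the relative and absolute bounds, the `abs_tol=1e-1` floor dominates for logprobs near zero, while the relative bound dominates for large-magnitude ones.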
MI355 measurements before and after changes:

| M | N | K | before (us) | after (us) |
|-----|-----|------|------|------|
| 128 | 16  | 2880 | 4.55 | 4.56 |
| 640 | 16  | 2880 | 4.80 | 4.83 |
| 128 | 32  | 2880 | 3.91 | 3.21 |
| 640 | 32  | 2880 | 4.13 | 4.05 |
| 128 | 64  | 2880 | 4.42 | 3.23 |
| 640 | 64  | 2880 | 4.88 | 4.43 |
| 128 | 128 | 2880 | 4.51 | 3.98 |
| 640 | 128 | 2880 | 5.89 | 5.92 |
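To quantify these numbers: the largest gains are at M=128 with N=32–64 (up to roughly 1.37x), while the N=16 shapes are essentially flat. A quick script over the measurements above:

```python
# Speedups computed from the MI355 measurements: before_us / after_us.
rows = [  # (m, n, k, before_us, after_us)
    (128, 16, 2880, 4.55, 4.56),
    (640, 16, 2880, 4.80, 4.83),
    (128, 32, 2880, 3.91, 3.21),
    (640, 32, 2880, 4.13, 4.05),
    (128, 64, 2880, 4.42, 3.23),
    (640, 64, 2880, 4.88, 4.43),
    (128, 128, 2880, 4.51, 3.98),
    (640, 128, 2880, 5.89, 5.92),
]
for m, n, k, before, after in rows:
    print(f"M={m:<4} N={n:<4} K={k}: {before / after:.2f}x")
```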