
Perf tuning and expansion of cases covered for wvSplitKrc #33493

Merged
vllm-bot merged 7 commits into vllm-project:main from amd-hhashemi:wvSplitKrc3
Feb 7, 2026

@amd-hhashemi
Contributor

@amd-hhashemi amd-hhashemi commented Feb 1, 2026

MI355 measurements before and after the changes:
M, N, K, before (us), after (us)
128, 16, 2880, 4.55, 4.56
640, 16, 2880, 4.80, 4.83
128, 32, 2880, 3.91, 3.21
640, 32, 2880, 4.13, 4.05
128, 64, 2880, 4.42, 3.23
640, 64, 2880, 4.88, 4.43
128, 128, 2880, 4.51, 3.98
640, 128, 2880, 5.89, 5.92
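
To make the table easier to compare row by row, the sketch below derives a before/after ratio for each shape. The numbers are copied from the measurements above; the `speedup` helper and the variable names are only illustrative and are not part of the PR.

```python
# Timings copied from the table above: (M, N, K, before_us, after_us).
rows = [
    (128, 16, 2880, 4.55, 4.56),
    (640, 16, 2880, 4.80, 4.83),
    (128, 32, 2880, 3.91, 3.21),
    (640, 32, 2880, 4.13, 4.05),
    (128, 64, 2880, 4.42, 3.23),
    (640, 64, 2880, 4.88, 4.43),
    (128, 128, 2880, 4.51, 3.98),
    (640, 128, 2880, 5.89, 5.92),
]


def speedup(before_us: float, after_us: float) -> float:
    """Ratio > 1.0 means the new kernel is faster for that shape."""
    return before_us / after_us


# Sort shapes from largest to smallest improvement.
ranked = sorted(rows, key=lambda r: speedup(r[3], r[4]), reverse=True)
for m, n, k, before_us, after_us in ranked:
    print(f"M={m:>4} N={n:>4} K={k}: {speedup(before_us, after_us):.3f}x")

best = ranked[0]
slower = [r for r in rows if r[4] > r[3]]  # shapes that got slightly slower
```

Ranking by ratio rather than by absolute microseconds keeps the small and large shapes comparable.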


Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
@mergify mergify bot added the rocm (Related to AMD ROCm) label Feb 1, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Feb 1, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces performance tuning for the wvSplitKrc kernel and expands the cases it covers. The changes are mainly in csrc/rocm/skinny_gemms.cu, with corresponding updates in the dispatch logic in vllm/model_executor/layers/utils.py and test cases in tests/kernels/quantization/test_rocm_skinny_gemms.py. While the performance optimizations seem promising, I've identified a few critical issues. There's a logic mismatch between the Python dispatch code and the C++ kernel implementation that could lead to incorrect kernel dispatching. Additionally, a crucial out-of-bounds check appears to have been incorrectly removed in the kernel, which could lead to incorrect computations. I've provided detailed comments on these issues.

Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
@AndreasKaratzas
Collaborator

AndreasKaratzas commented Feb 3, 2026

Is this PR related to #33527?
I am asking to know if I've got to wait before testing it.

@amd-hhashemi
Contributor Author

amd-hhashemi commented Feb 3, 2026

> Is this PR related to #33527? I am asking to know if I've got to wait before testing it.

No, they are not related. They make changes to different skinny GEMMs.

This PR targets these test scenarios (where cross-wave atomic reduction is used to fill the machine; these cases are seen in gpt-oss).
#33527 targets these test scenarios (where raw memory bandwidth is the main bottleneck). It adds padded activation support.

FYI, there'll be another similar PR soon targeting padded activation in the non-quantized skinny GEMM solution.

@AndreasKaratzas
Collaborator

AndreasKaratzas commented Feb 3, 2026

> > Is this PR related to #33527? I am asking to know if I've got to wait before testing it.
>
> No they are not related. They make changes to different skinny GEMMs.
>
> This PR targets these test scenarios (where cross-wave atomic reduction is used to fill machine).
> The #33527 targets these test scenarios (where raw mem bandwidth is main bottleneck). It adds padded activation support.

I'll launch a CI cycle tomorrow then with this one to see if there is any test regressing.

> FYI there'll be another similar PR soon targeting padded activation in the non-quantized skinny GEMM solution.

When you post it, CC me if possible so we can prevent any possible regressions :) There has been a huge effort to keep AMD CI green.

Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
@amd-hhashemi
Contributor Author

> > > Is this PR related to #33527? I am asking to know if I've got to wait before testing it.
>
> > No they are not related. They make changes to different skinny GEMMs.
> > This PR targets these test scenarios (where cross-wave atomic reduction is used to fill machine).
> > The #33527 targets these test scenarios (where raw mem bandwidth is main bottleneck). It adds padded activation support.
>
> I'll launch a CI cycle tomorrow then with this one to see if there is any test regressing.
>
> > FYI there'll be another similar PR soon targeting padded activation in the non-quantized skinny GEMM solution.
>
> When you post it, CC me if possible so we can prevent any possible regressions :) There has been a huge effort to keep AMD CI green.

@AndreasKaratzas This is the 3rd PR. It adds padding support to the fp16/bf16 version of skinny gemm solutions.

Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>
@gshtras gshtras enabled auto-merge (squash) February 5, 2026 16:41
@github-actions github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Feb 5, 2026
@vllm-bot vllm-bot merged commit ed17f54 into vllm-project:main Feb 7, 2026
107 of 110 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Feb 7, 2026
@AndreasKaratzas
Collaborator

@amd-hhashemi Thanks for the kernel improvements in this PR.

We've been investigating a persistent test failure in test_serving_tokens.py::test_same_response_as_chat_completions on gfx950 (MI355X) and traced it back to this PR. The root cause is that the wvSplitKrc_ kernel now produces non-deterministic results across sequential invocations with identical inputs.

Estimated root cause:

The old code had a conditional direct-write path that was deterministic:

bool doRdc = (kfitsPerRdc * kFit < K);

When doRdc was false, results were written directly to C[] without atomics, so that path was fully deterministic.

This PR changed it to:

bool doRdc = true;  // Assuming (kfitsPerRdc * kFit < K) is always true

and removed the entire if (!doRdc) direct-write branch. Now all results go through atomicAdd(&glbl[...], sum4[...]). Floating-point addition is not associative, so when the order in which waves reach the atomicAdd differs between kernel launches, the partial sums are rounded differently and identical inputs can produce different outputs.
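
The ordering effect can be reproduced on the CPU. The following is a minimal sketch, not the kernel code: it assumes each wave contributes one float32 partial sum to the same output element and models the atomic adds as sequential float32 additions in arrival order. The helper names `f32` and `accumulate_f32` and the three partial values are made up for illustration.

```python
import struct
from itertools import permutations


def f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate_f32(partials):
    """Add partial sums one at a time, rounding to float32 after every add,
    the way a chain of float32 atomic adds on one output element would."""
    acc = 0.0
    for p in partials:
        acc = f32(acc + f32(p))
    return acc


# Hypothetical per-wave partial sums for a single output element.
partials = [1e8, 1.0, -1e8]

# Try every arrival order and collect the distinct final values.
results = {order: accumulate_f32(order) for order in permutations(partials)}
distinct = sorted(set(results.values()))
print(distinct)
```

The values are deliberately extreme so the effect is visible with only three terms: near 1e8 the spacing between float32 values is 8, so adding 1.0 is lost whenever it is added while the accumulator still holds 1e8. Real partial sums differ far less, but the mechanism is the same.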

Reproduction:

After reverting this PR, both tests passed:

Is it possible to keep the kernels deterministic by default, and enable the non-deterministic path only when an option is passed, something like --fast-skinny-gemms? It is critical that we stay deterministic.

@amd-hhashemi
Contributor Author

amd-hhashemi commented Feb 12, 2026

@AndreasKaratzas Hi, can you please try with #34410? It's a one-liner. I root-caused an issue that shows up on some vLLM dockers on N<=16 GEMMs (as seen in single-prompt gpt-oss). It was occurring only in non-eager mode prompts for me, and I was never able to get an out-of-threshold test on any of the GEMM sizes.
That !doRdc path is never taken by the GEMM sizes we limit this solution to. That path was meant for K<512 sizes. But the Aiter GEMMs seem to be doing well enough on those, so I let them take it 😄.
Also, the whole point of this solution is to do a faster ksplit via atomic reduction. The atomic reduction is in full float32, and it should not produce run-to-run noise that shows up in tokens.
For validation purposes, I do have a deterministic path for the reduce (it does a traditional store-readback-and-sum). I can add it as a mode.
I'll look closer at the test you mentioned today.
But thanks for bringing it up, btw. Now I'm curious whether the deterministic reduce (still doing fused reduction with atomic counting, just not doing the reduction itself with float atomics) actually performs better, as it avoids the atomic bottleneck. 🤔 Will investigate.
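
As a rough illustration of the two reduce strategies contrasted above, here is a CPU-side sketch. It is an assumption-laden model, not the kernel: the function names, the slot buffer, and the sample partial sums are hypothetical, and the integer counter only stands in for the atomic counting mentioned in the comment.

```python
import struct
from itertools import permutations


def f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def atomic_style_reduce(partials, arrival_order):
    """Model of float atomics: accumulate in whatever order splits arrive."""
    acc = 0.0
    for split in arrival_order:
        acc = f32(acc + partials[split])
    return acc


def store_then_sum_reduce(partials, arrival_order):
    """Model of store-readback-and-sum: every split stores its partial in
    its own slot, and the last split to arrive sums the slots by index."""
    slots = [0.0] * len(partials)
    arrived = 0  # stands in for an integer atomic counter
    result = None
    for split in arrival_order:
        slots[split] = partials[split]
        arrived += 1
        if arrived == len(partials):  # last arriver performs the reduction
            acc = 0.0
            for value in slots:  # fixed index order, independent of arrival
                acc = f32(acc + value)
            result = acc
    return result


# Hypothetical K-split partial sums for one output element.
partials = [f32(v) for v in (1e8, 1.0, -1e8, 0.5)]
orders = list(permutations(range(len(partials))))

atomic_results = {atomic_style_reduce(partials, o) for o in orders}
fixed_results = {store_then_sum_reduce(partials, o) for o in orders}
print(sorted(atomic_results), sorted(fixed_results))
```

In this model the slot-based variant returns the same value for every arrival order, because only the integer counter depends on timing while the floating-point additions always run in index order. Whether it is also faster on the GPU is the open question raised in the comment.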

@AndreasKaratzas
Collaborator

AndreasKaratzas commented Feb 16, 2026

@amd-hhashemi Back at it again 😅 So there is another inaccuracy observed

test_cudagraph_divergence.py
import math
from dataclasses import dataclass

import torch

from vllm import LLM, SamplingParams
from vllm.config import CompilationConfig
from vllm.distributed import cleanup_dist_env_and_memory

MODEL = "meta-llama/Llama-3.2-1B-Instruct"
MAX_MODEL_LEN = 256
SEED = 42
GPU_MEM_UTIL = 0.4
MAX_LOGPROBS = 5
TOP_LOGPROBS = 3
MAX_TOKENS = 10
PROMPT = "Hello world " * 50


@dataclass
class RunConfig:
    name: str
    enforce_eager: bool
    compile_ranges_split_points: list[int] | None  # None = use default


def make_sampling_params():
    normal = SamplingParams(
        temperature=0,
        logprobs=TOP_LOGPROBS,
        max_tokens=MAX_TOKENS,
        ignore_eos=False,
    )
    penalty = SamplingParams(
        temperature=0,
        logprobs=TOP_LOGPROBS,
        max_tokens=MAX_TOKENS,
        ignore_eos=False,
        presence_penalty=-1.0,
    )
    return normal, penalty


def run_config(config: RunConfig):
    print(f"\n{'='*60}")
    print(f"Running: {config.name}")
    print(f"  enforce_eager={config.enforce_eager}")
    print(f"  compile_ranges_split_points={config.compile_ranges_split_points}")
    print(f"{'='*60}")

    kwargs = dict(
        model=MODEL,
        max_logprobs=MAX_LOGPROBS,
        max_model_len=MAX_MODEL_LEN,
        seed=SEED,
        gpu_memory_utilization=GPU_MEM_UTIL,
        enable_prefix_caching=False,
        enable_chunked_prefill=True,
        max_num_batched_tokens=32,
        enforce_eager=config.enforce_eager,
    )

    if config.compile_ranges_split_points is not None and not config.enforce_eager:
        kwargs["compilation_config"] = CompilationConfig(
            compile_ranges_split_points=config.compile_ranges_split_points,
        )

    llm = LLM(**kwargs)
    normal_params, penalty_params = make_sampling_params()
    results = llm.generate(
        [PROMPT, PROMPT], [normal_params, penalty_params]
    )

    del llm
    torch.cuda.empty_cache()
    cleanup_dist_env_and_memory()

    return results


def extract_logprobs(results):
    per_request = []
    for result in results:
        positions = []
        for lp_dict in result.outputs[0].logprobs:
            positions.append(lp_dict)
        per_request.append(positions)
    return per_request


def compare(name_a, lps_a, name_b, lps_b):
    labels = ["no_penalty", "with_penalty"]
    print(f"\n{'#'*70}")
    print(f"COMPARISON: {name_a}  vs  {name_b}")
    print(f"{'#'*70}")

    max_diff = 0.0
    total = 0
    fail_5 = 0
    fail_10 = 0

    for req_idx in range(len(lps_a)):
        label = labels[req_idx]
        a = lps_a[req_idx]
        b = lps_b[req_idx]

        if len(a) != len(b):
            print(f"  [{label}] LENGTH MISMATCH: {len(a)} vs {len(b)}")
            continue

        print(f"\n  [{label}] {len(a)} positions")
        print(f"  {'pos':>4} {'token':>15} {'rank':>5} "
              f"{'lp_A':>12} {'lp_B':>12} {'diff':>10} {'rel%':>8}")
        print(f"  {'-'*72}")

        for pos in range(len(a)):
            common = set(a[pos].keys()) & set(b[pos].keys())
            for tid in sorted(common):
                la = a[pos][tid]
                lb = b[pos][tid]
                diff = abs(la.logprob - lb.logprob)
                denom = max(abs(la.logprob), abs(lb.logprob), 1e-10)
                rel = (diff / denom) * 100
                max_diff = max(max_diff, diff)
                total += 1

                c5 = math.isclose(la.logprob, lb.logprob,
                                  rel_tol=5e-2, abs_tol=1e-1)
                c10 = math.isclose(la.logprob, lb.logprob,
                                   rel_tol=1e-1, abs_tol=1e-1)
                if not c5:
                    fail_5 += 1
                if not c10:
                    fail_10 += 1

                flag = ""
                if not c5:
                    flag = " <-- FAIL@5%"
                if not c10:
                    flag = " <-- FAIL@10%"

                print(f"  {pos:>4} {la.decoded_token!r:>15} "
                      f"{la.rank:>3}      "
                      f"{la.logprob:>12.6f} {lb.logprob:>12.6f} "
                      f"{diff:>10.6f} {rel:>7.2f}%{flag}")

    print(f"\n  SUMMARY: {total} comparisons, max_diff={max_diff:.6f}, "
          f"fail@5%={fail_5}, fail@10%={fail_10}")
    return max_diff, fail_5, fail_10


def main():
    configs = [
        RunConfig("eager",       enforce_eager=True,  compile_ranges_split_points=None),
        RunConfig("eager2",      enforce_eager=True,  compile_ranges_split_points=None),
        RunConfig("graph_sp32",  enforce_eager=False, compile_ranges_split_points=[32]),
        RunConfig("graph_sp64",  enforce_eager=False, compile_ranges_split_points=[64]),
    ]

    all_results = {}
    all_lps = {}
    for cfg in configs:
        all_results[cfg.name] = run_config(cfg)
        all_lps[cfg.name] = extract_logprobs(all_results[cfg.name])

    comparisons = [
        ("eager",      "eager2",     "Eager vs Eager (sanity: expect zero diff)"),
        ("eager",      "graph_sp32", "Eager vs CUDA graph split=[32]"),
        ("eager",      "graph_sp64", "Eager vs CUDA graph split=[64]"),
        ("graph_sp32", "graph_sp64", "CUDA graph split=[32] vs split=[64] (the confound)"),
    ]

    print(f"\n\n{'='*70}")
    print("ALL COMPARISONS")
    print(f"{'='*70}")

    summary = []
    for a, b, desc in comparisons:
        print(f"\n--- {desc} ---")
        md, f5, f10 = compare(a, all_lps[a], b, all_lps[b])
        summary.append((desc, md, f5, f10))

    print(f"\n\n{'='*70}")
    print("FINAL SUMMARY")
    print(f"{'='*70}")
    print(f"\n{'Description':<55} {'MaxDiff':>10} {'F@5%':>6} {'F@10%':>6}")
    print(f"{'-'*80}")
    for desc, md, f5, f10 in summary:
        print(f"{desc:<55} {md:>10.6f} {f5:>6} {f10:>6}")

if __name__ == "__main__":
    main()
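
A note on how the FAIL flags in the script above are decided: `math.isclose` accepts a pair when the absolute difference is within the larger of the relative bound and `abs_tol`, so with `abs_tol=1e-1` a large relative difference on a small logprob still passes. The sketch below spells that rule out. The helper name `within_tolerance` is illustrative, and the two sample pairs are taken from the with_penalty rows of the eager vs eager2 output in this comment.

```python
import math


def within_tolerance(a: float, b: float, rel_tol: float, abs_tol: float = 1e-1) -> bool:
    """The documented math.isclose rule, written out explicitly."""
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)


samples = {
    # About 16% relative difference, but only ~0.017 absolute.
    "small logprob": (-0.105620, -0.088585),
    # About 0.21 absolute difference on a logprob near -2.7.
    "larger logprob": (-2.509679, -2.723243),
}

verdicts = {}
for label, (a, b) in samples.items():
    ok_5 = within_tolerance(a, b, rel_tol=5e-2)
    ok_10 = within_tolerance(a, b, rel_tol=1e-1)
    # The explicit formula agrees with the standard-library call.
    assert ok_5 == math.isclose(a, b, rel_tol=5e-2, abs_tol=1e-1)
    assert ok_10 == math.isclose(a, b, rel_tol=1e-1, abs_tol=1e-1)
    verdicts[label] = (ok_5, ok_10)
    print(f"{label}: pass@5%={ok_5} pass@10%={ok_10}")
```

This is why rows showing a double-digit rel% on rank-1 tokens are not flagged, while a row with a smaller rel% but an absolute difference above 0.1 can fail the 5% check.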

If you run this script with the default settings, i.e. VLLM_ROCM_USE_SKINNY_GEMM=1, you get:

Running with `VLLM_ROCM_USE_SKINNY_GEMM=1`
======================================================================
ALL COMPARISONS
======================================================================

--- Eager vs Eager (sanity: expect zero diff) ---

######################################################################
COMPARISON: eager  vs  eager2
######################################################################

  [no_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.963510    -2.948747   0.014764    0.50%
     0             '!'   3         -3.338510    -3.323747   0.014764    0.44%
     0        ' Hello'   1         -1.526010    -1.573747   0.047736    3.03%
     1             ' '   2         -4.101175    -4.104499   0.003324    0.08%
     1        ' world'   1         -0.101175    -0.104499   0.003324    3.18%
     1        ' Hello'   3         -4.476175    -4.354499   0.121676    2.72%
     2             '!'   3         -2.633208    -2.632675   0.000532    0.02%
     2             ' '   2         -2.258208    -2.257675   0.000532    0.02%
     2        ' Hello'   1         -0.883208    -0.882675   0.000532    0.06%
     3             ' '   3         -4.667778    -4.668096   0.000318    0.01%
     3        ' world'   1         -0.042778    -0.043096   0.000318    0.74%
     3        ' Hello'   2         -4.417778    -4.418096   0.000318    0.01%
     4             ' '   2         -2.624230    -2.607321   0.016910    0.64%
     4          '\n\n'   3         -3.749230    -3.857321   0.108090    2.80%
     4        ' Hello'   1         -0.249230    -0.232321   0.016909    6.78%
     5             ' '   3         -5.520310    -5.520316   0.000005    0.00%
     5        ' world'   1         -0.020310    -0.020316   0.000005    0.03%
     5        ' Hello'   2         -4.895310    -4.895316   0.000005    0.00%
     6             ' '   2         -3.008644    -3.008459   0.000185    0.01%
     6          '\n\n'   3         -4.383644    -4.383459   0.000185    0.00%
     6        ' Hello'   1         -0.133644    -0.133459   0.000185    0.14%
     7             ' '   3         -5.887640    -5.764194   0.123446    2.10%
     7        ' world'   1         -0.012640    -0.014195   0.001555   10.95%
     7        ' Hello'   2         -5.637640    -5.514194   0.123446    2.19%
     8             ' '   2         -3.337700    -3.218463   0.119236    3.57%
     8          '\n\n'   3         -4.837699    -4.718463   0.119236    2.46%
     8        ' Hello'   1         -0.087700    -0.093463   0.005764    6.17%
     9             ' '   3         -6.010201    -6.010258   0.000057    0.00%
     9        ' world'   1         -0.010200    -0.010258   0.000057    0.56%
     9        ' Hello'   2         -5.885201    -5.885258   0.000057    0.00%

  [with_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.944876    -2.934591   0.010285    0.35%
     0             '!'   3         -3.319876    -3.309591   0.010285    0.31%
     0        ' Hello'   1         -1.569876    -1.559591   0.010285    0.66%
     1             ' '   2         -4.102213    -3.987195   0.115018    2.80%
     1        ' world'   1         -0.102213    -0.112195   0.009982    8.90%
     1        ' Hello'   3         -4.477213    -4.362195   0.115018    2.57%
     2             '!'   3         -2.635979    -2.626584   0.009395    0.36%
     2             ' '   2         -2.260979    -2.251584   0.009395    0.42%
     2        ' Hello'   1         -0.885979    -0.876584   0.009395    1.06%
     3             ' '   3         -4.668216    -4.548712   0.119503    2.56%
     3        ' world'   1         -0.043216    -0.048712   0.005497   11.28%
     3        ' Hello'   2         -4.418216    -4.298712   0.119503    2.70%
     4             ' '   2         -2.509679    -2.723243   0.213564    7.84% <-- FAIL@5%
     4          '\n\n'   3         -3.759679    -3.848243   0.088564    2.30%
     4        ' Hello'   1         -0.259679    -0.223243   0.036436   14.03%
     5             ' '   3         -5.519681    -5.397882   0.121799    2.21%
     5        ' world'   1         -0.019681    -0.022882   0.003201   13.99%
     5        ' Hello'   2         -5.019681    -4.772882   0.246799    4.92%
     6             ' '   2         -3.008634    -3.120678   0.112044    3.59%
     6          '\n\n'   3         -4.383634    -4.370678   0.012956    0.30%
     6        ' Hello'   1         -0.133634    -0.120678   0.012956    9.70%
     7             ' '   3         -5.888097    -5.762966   0.125131    2.13%
     7        ' world'   1         -0.013097    -0.012966   0.000131    1.00%
     7        ' Hello'   2         -5.513097    -5.637966   0.124869    2.21%
     8             ' '   2         -3.105620    -3.338585   0.232965    6.98% <-- FAIL@5%
     8          '\n\n'   3         -4.605620    -4.713585   0.107965    2.29%
     8        ' Hello'   1         -0.105620    -0.088585   0.017035   16.13%
     9             ' '   3         -6.010182    -6.010158   0.000024    0.00%
     9        ' world'   1         -0.010182    -0.010158   0.000025    0.24%
     9        ' Hello'   2         -5.885182    -5.885158   0.000024    0.00%

  SUMMARY: 60 comparisons, max_diff=0.246799, fail@5%=2, fail@10%=0

--- Eager vs CUDA graph split=[32] ---

######################################################################
COMPARISON: eager  vs  graph_sp32
######################################################################

  [no_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.963510    -2.951316   0.012194    0.41%
     0             '!'   3         -3.338510    -3.326316   0.012194    0.37%
     0        ' Hello'   1         -1.526010    -1.513816   0.012194    0.80%
     1             ' '   2         -4.101175    -3.979824   0.121351    2.96%
     1        ' world'   1         -0.101175    -0.104824   0.003650    3.48%
     1        ' Hello'   3         -4.476175    -4.479825   0.003650    0.08%
     2             '!'   3         -2.633208    -2.617305   0.015903    0.60%
     2             ' '   2         -2.258208    -2.242305   0.015903    0.70%
     2        ' Hello'   1         -0.883208    -0.867305   0.015903    1.80%
     3             ' '   3         -4.667778    -4.667899   0.000121    0.00%
     3        ' world'   1         -0.042778    -0.042899   0.000121    0.28%
     3        ' Hello'   2         -4.417778    -4.417899   0.000121    0.00%
     4             ' '   2         -2.624230    -2.723199   0.098969    3.63%
     4          '\n\n'   3         -3.749230    -3.848199   0.098969    2.57%
     4        ' Hello'   1         -0.249230    -0.223199   0.026031   10.44%
     5             ' '   3         -5.520310    -5.520334   0.000024    0.00%
     5        ' world'   1         -0.020310    -0.020334   0.000024    0.12%
     5        ' Hello'   2         -4.895310    -4.895334   0.000024    0.00%
     6             ' '   2         -3.008644    -3.008428   0.000216    0.01%
     6          '\n\n'   3         -4.383644    -4.383428   0.000216    0.00%
     6        ' Hello'   1         -0.133644    -0.133428   0.000216    0.16%
     7             ' '   3         -5.887640    -5.762982   0.124658    2.12%
     7        ' world'   1         -0.012640    -0.012983   0.000343    2.64%
     7        ' Hello'   2         -5.637640    -5.637982   0.000342    0.01%
     8             ' '   2         -3.337700    -3.218273   0.119427    3.58%
     8          '\n\n'   3         -4.837699    -4.718273   0.119426    2.47%
     8        ' Hello'   1         -0.087700    -0.093273   0.005573    5.98%
     9             ' '   3         -6.010201    -6.010166   0.000034    0.00%
     9        ' world'   1         -0.010200    -0.010166   0.000034    0.33%
     9        ' Hello'   2         -5.885201    -5.885166   0.000034    0.00%

  [with_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.944876    -2.954682   0.009805    0.33%
     0             '!'   3         -3.319876    -3.329682   0.009805    0.29%
     0        ' Hello'   1         -1.569876    -1.517182   0.052694    3.36%
     1             ' '   2         -4.102213    -3.979517   0.122695    2.99%
     1        ' world'   1         -0.102213    -0.104517   0.002305    2.21%
     1        ' Hello'   3         -4.477213    -4.479517   0.002305    0.05%
     2             '!'   3         -2.635979    -2.692531   0.056552    2.10%
     2             ' '   2         -2.260979    -2.317531   0.056552    2.44%
     2        ' Hello'   1         -0.885979    -0.817531   0.068448    7.73%
     3             ' '   3         -4.668216    -4.548652   0.119564    2.56%
     3        ' world'   1         -0.043216    -0.048652   0.005437   11.17%
     3        ' Hello'   2         -4.418216    -4.298652   0.119564    2.71%
     4             ' '   2         -2.509679    -2.625171   0.115492    4.40%
     4          '\n\n'   3         -3.759679    -3.750171   0.009508    0.25%
     4        ' Hello'   1         -0.259679    -0.250171   0.009508    3.66%
     5             ' '   3         -5.519681    -5.398014   0.121667    2.20%
     5        ' world'   1         -0.019681    -0.023014   0.003333   14.48%
     5        ' Hello'   2         -5.019681    -4.773014   0.246667    4.91%
     6             ' '   2         -3.008634    -3.119569   0.110935    3.56%
     6          '\n\n'   3         -4.383634    -4.494569   0.110935    2.47%
     6        ' Hello'   1         -0.133634    -0.119569   0.014065   10.53%
     7             ' '   3         -5.888097    -5.763004   0.125093    2.12%
     7        ' world'   1         -0.013097    -0.013004   0.000093    0.71%
     7        ' Hello'   2         -5.513097    -5.638004   0.124907    2.22%
     8             ' '   2         -3.105620    -3.218772   0.113152    3.52%
     8          '\n\n'   3         -4.605620    -4.718772   0.113152    2.40%
     8        ' Hello'   1         -0.105620    -0.093772   0.011848   11.22%
     9             ' '   3         -6.010182    -6.134009   0.123827    2.02%
     9        ' world'   1         -0.010182    -0.009009   0.001173   11.52%
     9        ' Hello'   2         -5.885182    -6.009009   0.123827    2.06%

  SUMMARY: 60 comparisons, max_diff=0.246667, fail@5%=0, fail@10%=0

--- Eager vs CUDA graph split=[64] ---

######################################################################
COMPARISON: eager  vs  graph_sp64
######################################################################

  [no_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.963510    -2.933463   0.030048    1.01%
     0             '!'   3         -3.338510    -3.308463   0.030048    0.90%
     0        ' Hello'   1         -1.526010    -1.558463   0.032452    2.08%
     1             ' '   2         -4.101175    -3.978042   0.123132    3.00%
     1        ' world'   1         -0.101175    -0.103042   0.001868    1.81%
     1        ' Hello'   3         -4.476175    -4.478043   0.001868    0.04%
     2             '!'   3         -2.633208    -2.692862   0.059654    2.22%
     2             ' '   2         -2.258208    -2.317862   0.059654    2.57%
     2        ' Hello'   1         -0.883208    -0.817862   0.065346    7.40%
     3             ' '   3         -4.667778    -4.667801   0.000023    0.00%
     3        ' world'   1         -0.042778    -0.042801   0.000023    0.05%
     3        ' Hello'   2         -4.417778    -4.417801   0.000023    0.00%
     4             ' '   2         -2.624230    -2.608324   0.015907    0.61%
     4          '\n\n'   3         -3.749230    -3.858324   0.109093    2.83%
     4        ' Hello'   1         -0.249230    -0.233324   0.015906    6.38%
     5             ' '   3         -5.520310    -5.520282   0.000029    0.00%
     5        ' world'   1         -0.020310    -0.020282   0.000029    0.14%
     5        ' Hello'   2         -4.895310    -4.895282   0.000029    0.00%
     6             ' '   2         -3.008644    -3.120791   0.112147    3.59%
     6          '\n\n'   3         -4.383644    -4.370791   0.012853    0.29%
     6        ' Hello'   1         -0.133644    -0.120791   0.012853    9.62%
     7             ' '   3         -5.887640    -5.762964   0.124676    2.12%
     7        ' world'   1         -0.012640    -0.012964   0.000325    2.50%
     7        ' Hello'   2         -5.637640    -5.637964   0.000324    0.01%
     8             ' '   2         -3.337700    -3.218692   0.119007    3.57%
     8          '\n\n'   3         -4.837699    -4.718692   0.119007    2.46%
     8        ' Hello'   1         -0.087700    -0.093692   0.005993    6.40%
     9             ' '   3         -6.010201    -6.010149   0.000051    0.00%
     9        ' world'   1         -0.010200    -0.010149   0.000051    0.50%
     9        ' Hello'   2         -5.885201    -5.885149   0.000051    0.00%

  [with_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.944876    -2.955063   0.010186    0.34%
     0             '!'   3         -3.319876    -3.330063   0.010186    0.31%
     0        ' Hello'   1         -1.569876    -1.517563   0.052314    3.33%
     1             ' '   2         -4.102213    -4.101257   0.000956    0.02%
     1        ' world'   1         -0.102213    -0.101257   0.000955    0.93%
     1        ' Hello'   3         -4.477213    -4.476257   0.000956    0.02%
     2             '!'   3         -2.635979    -2.620937   0.015042    0.57%
     2             ' '   2         -2.260979    -2.245937   0.015042    0.67%
     2        ' Hello'   1         -0.885979    -0.870937   0.015042    1.70%
     3             ' '   3         -4.668216    -4.667939   0.000277    0.01%
     3        ' world'   1         -0.043216    -0.042939   0.000277    0.64%
     3        ' Hello'   2         -4.418216    -4.417939   0.000277    0.01%
     4             ' '   2         -2.509679    -2.608213   0.098534    3.78%
     4          '\n\n'   3         -3.759679    -3.858213   0.098534    2.55%
     4        ' Hello'   1         -0.259679    -0.233213   0.026466   10.19%
     5             ' '   3         -5.519681    -5.520357   0.000676    0.01%
     5        ' world'   1         -0.019681    -0.020357   0.000676    3.32%
     5        ' Hello'   2         -5.019681    -4.895357   0.124324    2.48%
     6             ' '   2         -3.008634    -3.119641   0.111007    3.56%
     6          '\n\n'   3         -4.383634    -4.494641   0.111007    2.47%
     6        ' Hello'   1         -0.133634    -0.119641   0.013993   10.47%
     7             ' '   3         -5.888097    -5.762983   0.125113    2.12%
     7        ' world'   1         -0.013097    -0.012983   0.000114    0.87%
     7        ' Hello'   2         -5.513097    -5.637983   0.124887    2.22%
     8             ' '   2         -3.105620    -3.217211   0.111590    3.47%
     8          '\n\n'   3         -4.605620    -4.842211   0.236590    4.89%
     8        ' Hello'   1         -0.105620    -0.092211   0.013410   12.70%
     9             ' '   3         -6.010182    -6.010227   0.000045    0.00%
     9        ' world'   1         -0.010182    -0.010227   0.000045    0.44%
     9        ' Hello'   2         -5.885182    -5.885227   0.000045    0.00%

  SUMMARY: 60 comparisons, max_diff=0.236590, fail@5%=0, fail@10%=0

--- CUDA graph split=[32] vs split=[64] (the confound) ---

######################################################################
COMPARISON: graph_sp32  vs  graph_sp64
######################################################################

  [no_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.951316    -2.933463   0.017853    0.60%
     0             '!'   3         -3.326316    -3.308463   0.017853    0.54%
     0        ' Hello'   1         -1.513816    -1.558463   0.044647    2.86%
     1             ' '   2         -3.979824    -3.978042   0.001782    0.04%
     1        ' world'   1         -0.104824    -0.103042   0.001782    1.70%
     1        ' Hello'   3         -4.479825    -4.478043   0.001782    0.04%
     2             '!'   3         -2.617305    -2.692862   0.075557    2.81%
     2             ' '   2         -2.242305    -2.317862   0.075557    3.26%
     2        ' Hello'   1         -0.867305    -0.817862   0.049443    5.70%
     3             ' '   3         -4.667899    -4.667801   0.000098    0.00%
     3        ' world'   1         -0.042899    -0.042801   0.000098    0.23%
     3        ' Hello'   2         -4.417899    -4.417801   0.000098    0.00%
     4             ' '   2         -2.723199    -2.608324   0.114875    4.22%
     4          '\n\n'   3         -3.848199    -3.858324   0.010125    0.26%
     4        ' Hello'   1         -0.223199    -0.233324   0.010125    4.34%
     5             ' '   3         -5.520334    -5.520282   0.000052    0.00%
     5        ' world'   1         -0.020334    -0.020282   0.000052    0.26%
     5        ' Hello'   2         -4.895334    -4.895282   0.000052    0.00%
     6             ' '   2         -3.008428    -3.120791   0.112364    3.60%
     6          '\n\n'   3         -4.383428    -4.370791   0.012637    0.29%
     6        ' Hello'   1         -0.133428    -0.120791   0.012636    9.47%
     7             ' '   3         -5.762982    -5.762964   0.000018    0.00%
     7        ' world'   1         -0.012983    -0.012964   0.000018    0.14%
     7        ' Hello'   2         -5.637982    -5.637964   0.000018    0.00%
     8             ' '   2         -3.218273    -3.218692   0.000419    0.01%
     8          '\n\n'   3         -4.718273    -4.718692   0.000419    0.01%
     8        ' Hello'   1         -0.093273    -0.093692   0.000419    0.45%
     9             ' '   3         -6.010166    -6.010149   0.000017    0.00%
     9        ' world'   1         -0.010166    -0.010149   0.000017    0.17%
     9        ' Hello'   2         -5.885166    -5.885149   0.000017    0.00%

  [with_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.954682    -2.955063   0.000381    0.01%
     0             '!'   3         -3.329682    -3.330063   0.000381    0.01%
     0        ' Hello'   1         -1.517182    -1.517563   0.000381    0.03%
     1             ' '   2         -3.979517    -4.101257   0.121740    2.97%
     1        ' world'   1         -0.104517    -0.101257   0.003260    3.12%
     1        ' Hello'   3         -4.479517    -4.476257   0.003260    0.07%
     2             '!'   3         -2.692531    -2.620937   0.071594    2.66%
     2             ' '   2         -2.317531    -2.245937   0.071594    3.09%
     2        ' Hello'   1         -0.817531    -0.870937   0.053406    6.13%
     3             ' '   3         -4.548652    -4.667939   0.119287    2.56%
     3        ' world'   1         -0.048652    -0.042939   0.005713   11.74%
     3        ' Hello'   2         -4.298652    -4.417939   0.119287    2.70%
     4             ' '   2         -2.625171    -2.608213   0.016958    0.65%
     4          '\n\n'   3         -3.750171    -3.858213   0.108042    2.80%
     4        ' Hello'   1         -0.250171    -0.233213   0.016958    6.78%
     5             ' '   3         -5.398014    -5.520357   0.122343    2.22%
     5        ' world'   1         -0.023014    -0.020357   0.002657   11.54%
     5        ' Hello'   2         -4.773014    -4.895357   0.122343    2.50%
     6             ' '   2         -3.119569    -3.119641   0.000072    0.00%
     6          '\n\n'   3         -4.494569    -4.494641   0.000072    0.00%
     6        ' Hello'   1         -0.119569    -0.119641   0.000072    0.06%
     7             ' '   3         -5.763004    -5.762983   0.000021    0.00%
     7        ' world'   1         -0.013004    -0.012983   0.000021    0.16%
     7        ' Hello'   2         -5.638004    -5.637983   0.000021    0.00%
     8             ' '   2         -3.218772    -3.217211   0.001562    0.05%
     8          '\n\n'   3         -4.718772    -4.842211   0.123439    2.55%
     8        ' Hello'   1         -0.093772    -0.092211   0.001561    1.67%
     9             ' '   3         -6.134009    -6.010227   0.123782    2.02%
     9        ' world'   1         -0.009009    -0.010227   0.001218   11.91%
     9        ' Hello'   2         -6.009009    -5.885227   0.123782    2.06%

  SUMMARY: 60 comparisons, max_diff=0.123782, fail@5%=0, fail@10%=0


======================================================================
FINAL SUMMARY
======================================================================

Description                                                MaxDiff   F@5%  F@10%
--------------------------------------------------------------------------------
Eager vs Eager (sanity: expect zero diff)                 0.246799      2      0
Eager vs CUDA graph split=[32]                            0.246667      0      0
Eager vs CUDA graph split=[64]                            0.236590      0      0
CUDA graph split=[32] vs split=[64] (the confound)        0.123782      0      0

But if you run it with VLLM_ROCM_USE_SKINNY_GEMM=0, you get:

Running with `VLLM_ROCM_USE_SKINNY_GEMM=0`
======================================================================
ALL COMPARISONS
======================================================================

--- Eager vs Eager (sanity: expect zero diff) ---

######################################################################
COMPARISON: eager  vs  eager2
######################################################################

  [no_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.908362    -2.908362   0.000000    0.00%
     0             '!'   3         -3.283362    -3.283362   0.000000    0.00%
     0        ' Hello'   1         -1.533363    -1.533363   0.000000    0.00%
     1             ' '   2         -4.092611    -4.092611   0.000000    0.00%
     1        ' world'   1         -0.092611    -0.092611   0.000000    0.00%
     1        ' Hello'   3         -4.592611    -4.592611   0.000000    0.00%
     2             '!'   3         -2.636380    -2.636380   0.000000    0.00%
     2             ' '   2         -2.261380    -2.261380   0.000000    0.00%
     2        ' Hello'   1         -0.886380    -0.886380   0.000000    0.00%
     3             ' '   3         -4.668087    -4.668087   0.000000    0.00%
     3        ' world'   1         -0.043087    -0.043087   0.000000    0.00%
     3        ' Hello'   2         -4.418087    -4.418087   0.000000    0.00%
     4             ' '   2         -2.607630    -2.607630   0.000000    0.00%
     4          '\n\n'   3         -3.857630    -3.857630   0.000000    0.00%
     4        ' Hello'   1         -0.232630    -0.232630   0.000000    0.00%
     5             ' '   3         -5.520524    -5.520524   0.000000    0.00%
     5        ' world'   1         -0.020524    -0.020524   0.000000    0.00%
     5        ' Hello'   2         -4.895524    -4.895524   0.000000    0.00%
     6             ' '   2         -3.121110    -3.121110   0.000000    0.00%
     6          '\n\n'   3         -4.371110    -4.371110   0.000000    0.00%
     6        ' Hello'   1         -0.121110    -0.121110   0.000000    0.00%
     7             ' '   3         -5.887629    -5.887629   0.000000    0.00%
     7        ' world'   1         -0.012629    -0.012629   0.000000    0.00%
     7        ' Hello'   2         -5.637629    -5.637629   0.000000    0.00%
     8             ' '   2         -3.218803    -3.218803   0.000000    0.00%
     8          '\n\n'   3         -4.718802    -4.718802   0.000000    0.00%
     8        ' Hello'   1         -0.093803    -0.093803   0.000000    0.00%
     9             ' '   3         -6.010202    -6.010202   0.000000    0.00%
     9        ' world'   1         -0.010202    -0.010202   0.000000    0.00%
     9        ' Hello'   2         -5.885202    -5.885202   0.000000    0.00%

  [with_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.888613    -2.888613   0.000000    0.00%
     0             '!'   3         -3.326113    -3.326113   0.000000    0.00%
     0        ' Hello'   1         -1.576113    -1.576113   0.000000    0.00%
     1             ' '   2         -3.978908    -3.978908   0.000000    0.00%
     1        ' world'   1         -0.103908    -0.103908   0.000000    0.00%
     1        ' Hello'   3         -4.478908    -4.478908   0.000000    0.00%
     2             '!'   3         -2.636929    -2.636929   0.000000    0.00%
     2             ' '   2         -2.261929    -2.261929   0.000000    0.00%
     2        ' Hello'   1         -0.886929    -0.886929   0.000000    0.00%
     3             ' '   3         -4.548309    -4.548309   0.000000    0.00%
     3        ' world'   1         -0.048309    -0.048309   0.000000    0.00%
     3        ' Hello'   2         -4.298309    -4.298309   0.000000    0.00%
     4             ' '   2         -2.607722    -2.607722   0.000000    0.00%
     4          '\n\n'   3         -3.857722    -3.857722   0.000000    0.00%
     4        ' Hello'   1         -0.232722    -0.232722   0.000000    0.00%
     5             ' '   3         -5.520329    -5.520329   0.000000    0.00%
     5        ' world'   1         -0.020330    -0.020330   0.000000    0.00%
     5        ' Hello'   2         -4.895329    -4.895329   0.000000    0.00%
     6             ' '   2         -3.120406    -3.120406   0.000000    0.00%
     6          '\n\n'   3         -4.370406    -4.370406   0.000000    0.00%
     6        ' Hello'   1         -0.120406    -0.120406   0.000000    0.00%
     7             ' '   3         -5.763486    -5.763486   0.000000    0.00%
     7        ' world'   1         -0.013486    -0.013486   0.000000    0.00%
     7        ' Hello'   2         -5.513486    -5.513486   0.000000    0.00%
     8             ' '   2         -3.219019    -3.219019   0.000000    0.00%
     8          '\n\n'   3         -4.719019    -4.719019   0.000000    0.00%
     8        ' Hello'   1         -0.094020    -0.094020   0.000000    0.00%
     9             ' '   3         -6.010196    -6.010196   0.000000    0.00%
     9        ' world'   1         -0.010196    -0.010196   0.000000    0.00%
     9        ' Hello'   2         -5.885196    -5.885196   0.000000    0.00%

  SUMMARY: 60 comparisons, max_diff=0.000000, fail@5%=0, fail@10%=0

--- Eager vs CUDA graph split=[32] ---

######################################################################
COMPARISON: eager  vs  graph_sp32
######################################################################

  [no_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.908362    -2.927903   0.019541    0.67%
     0             '!'   3         -3.283362    -3.302903   0.019541    0.59%
     0        ' Hello'   1         -1.533363    -1.552903   0.019541    1.26%
     1             ' '   2         -4.092611    -4.100432   0.007822    0.19%
     1        ' world'   1         -0.092611    -0.100432   0.007821    7.79%
     1        ' Hello'   3         -4.592611    -4.475432   0.117178    2.55%
     2             '!'   3         -2.636380    -2.683910   0.047530    1.77%
     2             ' '   2         -2.261380    -2.308910   0.047530    2.06%
     2        ' Hello'   1         -0.886380    -0.808910   0.077470    8.74%
     3             ' '   3         -4.668087    -4.667976   0.000112    0.00%
     3        ' world'   1         -0.043087    -0.042976   0.000111    0.26%
     3        ' Hello'   2         -4.418087    -4.417976   0.000112    0.00%
     4             ' '   2         -2.607630    -2.724631   0.117002    4.29%
     4          '\n\n'   3         -3.857630    -3.849631   0.007998    0.21%
     4        ' Hello'   1         -0.232630    -0.224631   0.007998    3.44%
     5             ' '   3         -5.520524    -5.520240   0.000284    0.01%
     5        ' world'   1         -0.020524    -0.020240   0.000285    1.39%
     5        ' Hello'   2         -4.895524    -4.895240   0.000284    0.01%
     6             ' '   2         -3.121110    -3.118993   0.002117    0.07%
     6          '\n\n'   3         -4.371110    -4.493993   0.122883    2.73%
     6        ' Hello'   1         -0.121110    -0.118993   0.002117    1.75%
     7             ' '   3         -5.887629    -5.762938   0.124691    2.12%
     7        ' world'   1         -0.012629    -0.012937   0.000308    2.38%
     7        ' Hello'   2         -5.637629    -5.637938   0.000309    0.01%
     8             ' '   2         -3.218803    -3.217041   0.001761    0.05%
     8          '\n\n'   3         -4.718802    -4.842041   0.123239    2.55%
     8        ' Hello'   1         -0.093803    -0.092042   0.001761    1.88%
     9             ' '   3         -6.010202    -6.010182   0.000020    0.00%
     9        ' world'   1         -0.010202    -0.010182   0.000020    0.20%
     9        ' Hello'   2         -5.885202    -5.885182   0.000020    0.00%

  [with_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.888613    -2.931652   0.043040    1.47%
     0             '!'   3         -3.326113    -3.369152   0.043040    1.28%
     0        ' Hello'   1         -1.576113    -1.556652   0.019461    1.23%
     1             ' '   2         -3.978908    -4.101131   0.122224    2.98%
     1        ' world'   1         -0.103908    -0.101131   0.002776    2.67%
     1        ' Hello'   3         -4.478908    -4.476131   0.002777    0.06%
     2             '!'   3         -2.636929    -2.630305   0.006624    0.25%
     2             ' '   2         -2.261929    -2.255305   0.006624    0.29%
     2        ' Hello'   1         -0.886929    -0.880305   0.006624    0.75%
     3             ' '   3         -4.548309    -4.547089   0.001220    0.03%
     3        ' world'   1         -0.048309    -0.047089   0.001220    2.53%
     3        ' Hello'   2         -4.298309    -4.422089   0.123780    2.80%
     4             ' '   2         -2.607722    -2.608346   0.000624    0.02%
     4          '\n\n'   3         -3.857722    -3.858346   0.000624    0.02%
     4        ' Hello'   1         -0.232722    -0.233346   0.000624    0.27%
     5             ' '   3         -5.520329    -5.397932   0.122397    2.22%
     5        ' world'   1         -0.020330    -0.022932   0.002602   11.35%
     5        ' Hello'   2         -4.895329    -4.772932   0.122397    2.50%
     6             ' '   2         -3.120406    -3.118514   0.001892    0.06%
     6          '\n\n'   3         -4.370406    -4.493514   0.123108    2.74%
     6        ' Hello'   1         -0.120406    -0.118514   0.001892    1.57%
     7             ' '   3         -5.763486    -5.762949   0.000536    0.01%
     7        ' world'   1         -0.013486    -0.012950   0.000536    3.98%
     7        ' Hello'   2         -5.513486    -5.637949   0.124464    2.21%
     8             ' '   2         -3.219019    -3.218347   0.000673    0.02%
     8          '\n\n'   3         -4.719019    -4.718347   0.000673    0.01%
     8        ' Hello'   1         -0.094020    -0.093347   0.000673    0.72%
     9             ' '   3         -6.010196    -6.010227   0.000031    0.00%
     9        ' world'   1         -0.010196    -0.010227   0.000030    0.30%
     9        ' Hello'   2         -5.885196    -5.885227   0.000031    0.00%

  SUMMARY: 60 comparisons, max_diff=0.124691, fail@5%=0, fail@10%=0

--- Eager vs CUDA graph split=[64] ---

######################################################################
COMPARISON: eager  vs  graph_sp64
######################################################################

  [no_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.908362    -2.927903   0.019541    0.67%
     0             '!'   3         -3.283362    -3.302903   0.019541    0.59%
     0        ' Hello'   1         -1.533363    -1.552903   0.019541    1.26%
     1             ' '   2         -4.092611    -4.100432   0.007822    0.19%
     1        ' world'   1         -0.092611    -0.100432   0.007821    7.79%
     1        ' Hello'   3         -4.592611    -4.475432   0.117178    2.55%
     2             '!'   3         -2.636380    -2.683910   0.047530    1.77%
     2             ' '   2         -2.261380    -2.308910   0.047530    2.06%
     2        ' Hello'   1         -0.886380    -0.808910   0.077470    8.74%
     3             ' '   3         -4.668087    -4.667976   0.000112    0.00%
     3        ' world'   1         -0.043087    -0.042976   0.000111    0.26%
     3        ' Hello'   2         -4.418087    -4.417976   0.000112    0.00%
     4             ' '   2         -2.607630    -2.724631   0.117002    4.29%
     4          '\n\n'   3         -3.857630    -3.849631   0.007998    0.21%
     4        ' Hello'   1         -0.232630    -0.224631   0.007998    3.44%
     5             ' '   3         -5.520524    -5.520240   0.000284    0.01%
     5        ' world'   1         -0.020524    -0.020240   0.000285    1.39%
     5        ' Hello'   2         -4.895524    -4.895240   0.000284    0.01%
     6             ' '   2         -3.121110    -3.118993   0.002117    0.07%
     6          '\n\n'   3         -4.371110    -4.493993   0.122883    2.73%
     6        ' Hello'   1         -0.121110    -0.118993   0.002117    1.75%
     7             ' '   3         -5.887629    -5.762938   0.124691    2.12%
     7        ' world'   1         -0.012629    -0.012937   0.000308    2.38%
     7        ' Hello'   2         -5.637629    -5.637938   0.000309    0.01%
     8             ' '   2         -3.218803    -3.217041   0.001761    0.05%
     8          '\n\n'   3         -4.718802    -4.842041   0.123239    2.55%
     8        ' Hello'   1         -0.093803    -0.092042   0.001761    1.88%
     9             ' '   3         -6.010202    -6.010182   0.000020    0.00%
     9        ' world'   1         -0.010202    -0.010182   0.000020    0.20%
     9        ' Hello'   2         -5.885202    -5.885182   0.000020    0.00%

  [with_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.888613    -2.931652   0.043040    1.47%
     0             '!'   3         -3.326113    -3.369152   0.043040    1.28%
     0        ' Hello'   1         -1.576113    -1.556652   0.019461    1.23%
     1             ' '   2         -3.978908    -4.101131   0.122224    2.98%
     1        ' world'   1         -0.103908    -0.101131   0.002776    2.67%
     1        ' Hello'   3         -4.478908    -4.476131   0.002777    0.06%
     2             '!'   3         -2.636929    -2.630305   0.006624    0.25%
     2             ' '   2         -2.261929    -2.255305   0.006624    0.29%
     2        ' Hello'   1         -0.886929    -0.880305   0.006624    0.75%
     3             ' '   3         -4.548309    -4.547089   0.001220    0.03%
     3        ' world'   1         -0.048309    -0.047089   0.001220    2.53%
     3        ' Hello'   2         -4.298309    -4.422089   0.123780    2.80%
     4             ' '   2         -2.607722    -2.608346   0.000624    0.02%
     4          '\n\n'   3         -3.857722    -3.858346   0.000624    0.02%
     4        ' Hello'   1         -0.232722    -0.233346   0.000624    0.27%
     5             ' '   3         -5.520329    -5.397932   0.122397    2.22%
     5        ' world'   1         -0.020330    -0.022932   0.002602   11.35%
     5        ' Hello'   2         -4.895329    -4.772932   0.122397    2.50%
     6             ' '   2         -3.120406    -3.118514   0.001892    0.06%
     6          '\n\n'   3         -4.370406    -4.493514   0.123108    2.74%
     6        ' Hello'   1         -0.120406    -0.118514   0.001892    1.57%
     7             ' '   3         -5.763486    -5.762949   0.000536    0.01%
     7        ' world'   1         -0.013486    -0.012950   0.000536    3.98%
     7        ' Hello'   2         -5.513486    -5.637949   0.124464    2.21%
     8             ' '   2         -3.219019    -3.218347   0.000673    0.02%
     8          '\n\n'   3         -4.719019    -4.718347   0.000673    0.01%
     8        ' Hello'   1         -0.094020    -0.093347   0.000673    0.72%
     9             ' '   3         -6.010196    -6.010227   0.000031    0.00%
     9        ' world'   1         -0.010196    -0.010227   0.000030    0.30%
     9        ' Hello'   2         -5.885196    -5.885227   0.000031    0.00%

  SUMMARY: 60 comparisons, max_diff=0.124691, fail@5%=0, fail@10%=0

--- CUDA graph split=[32] vs split=[64] (the confound) ---

######################################################################
COMPARISON: graph_sp32  vs  graph_sp64
######################################################################

  [no_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.927903    -2.927903   0.000000    0.00%
     0             '!'   3         -3.302903    -3.302903   0.000000    0.00%
     0        ' Hello'   1         -1.552903    -1.552903   0.000000    0.00%
     1             ' '   2         -4.100432    -4.100432   0.000000    0.00%
     1        ' world'   1         -0.100432    -0.100432   0.000000    0.00%
     1        ' Hello'   3         -4.475432    -4.475432   0.000000    0.00%
     2             '!'   3         -2.683910    -2.683910   0.000000    0.00%
     2             ' '   2         -2.308910    -2.308910   0.000000    0.00%
     2        ' Hello'   1         -0.808910    -0.808910   0.000000    0.00%
     3             ' '   3         -4.667976    -4.667976   0.000000    0.00%
     3        ' world'   1         -0.042976    -0.042976   0.000000    0.00%
     3        ' Hello'   2         -4.417976    -4.417976   0.000000    0.00%
     4             ' '   2         -2.724631    -2.724631   0.000000    0.00%
     4          '\n\n'   3         -3.849631    -3.849631   0.000000    0.00%
     4        ' Hello'   1         -0.224631    -0.224631   0.000000    0.00%
     5             ' '   3         -5.520240    -5.520240   0.000000    0.00%
     5        ' world'   1         -0.020240    -0.020240   0.000000    0.00%
     5        ' Hello'   2         -4.895240    -4.895240   0.000000    0.00%
     6             ' '   2         -3.118993    -3.118993   0.000000    0.00%
     6          '\n\n'   3         -4.493993    -4.493993   0.000000    0.00%
     6        ' Hello'   1         -0.118993    -0.118993   0.000000    0.00%
     7             ' '   3         -5.762938    -5.762938   0.000000    0.00%
     7        ' world'   1         -0.012937    -0.012937   0.000000    0.00%
     7        ' Hello'   2         -5.637938    -5.637938   0.000000    0.00%
     8             ' '   2         -3.217041    -3.217041   0.000000    0.00%
     8          '\n\n'   3         -4.842041    -4.842041   0.000000    0.00%
     8        ' Hello'   1         -0.092042    -0.092042   0.000000    0.00%
     9             ' '   3         -6.010182    -6.010182   0.000000    0.00%
     9        ' world'   1         -0.010182    -0.010182   0.000000    0.00%
     9        ' Hello'   2         -5.885182    -5.885182   0.000000    0.00%

  [with_penalty] 10 positions
   pos           token  rank         lp_A         lp_B       diff     rel%
  ------------------------------------------------------------------------
     0             '1'   2         -2.931652    -2.931652   0.000000    0.00%
     0             '!'   3         -3.369152    -3.369152   0.000000    0.00%
     0        ' Hello'   1         -1.556652    -1.556652   0.000000    0.00%
     1             ' '   2         -4.101131    -4.101131   0.000000    0.00%
     1        ' world'   1         -0.101131    -0.101131   0.000000    0.00%
     1        ' Hello'   3         -4.476131    -4.476131   0.000000    0.00%
     2             '!'   3         -2.630305    -2.630305   0.000000    0.00%
     2             ' '   2         -2.255305    -2.255305   0.000000    0.00%
     2        ' Hello'   1         -0.880305    -0.880305   0.000000    0.00%
     3             ' '   3         -4.547089    -4.547089   0.000000    0.00%
     3        ' world'   1         -0.047089    -0.047089   0.000000    0.00%
     3        ' Hello'   2         -4.422089    -4.422089   0.000000    0.00%
     4             ' '   2         -2.608346    -2.608346   0.000000    0.00%
     4          '\n\n'   3         -3.858346    -3.858346   0.000000    0.00%
     4        ' Hello'   1         -0.233346    -0.233346   0.000000    0.00%
     5             ' '   3         -5.397932    -5.397932   0.000000    0.00%
     5        ' world'   1         -0.022932    -0.022932   0.000000    0.00%
     5        ' Hello'   2         -4.772932    -4.772932   0.000000    0.00%
     6             ' '   2         -3.118514    -3.118514   0.000000    0.00%
     6          '\n\n'   3         -4.493514    -4.493514   0.000000    0.00%
     6        ' Hello'   1         -0.118514    -0.118514   0.000000    0.00%
     7             ' '   3         -5.762949    -5.762949   0.000000    0.00%
     7        ' world'   1         -0.012950    -0.012950   0.000000    0.00%
     7        ' Hello'   2         -5.637949    -5.637949   0.000000    0.00%
     8             ' '   2         -3.218347    -3.218347   0.000000    0.00%
     8          '\n\n'   3         -4.718347    -4.718347   0.000000    0.00%
     8        ' Hello'   1         -0.093347    -0.093347   0.000000    0.00%
     9             ' '   3         -6.010227    -6.010227   0.000000    0.00%
     9        ' world'   1         -0.010227    -0.010227   0.000000    0.00%
     9        ' Hello'   2         -5.885227    -5.885227   0.000000    0.00%

  SUMMARY: 60 comparisons, max_diff=0.000000, fail@5%=0, fail@10%=0


======================================================================
FINAL SUMMARY
======================================================================

Description                                                MaxDiff   F@5%  F@10%
--------------------------------------------------------------------------------
Eager vs Eager (sanity: expect zero diff)                 0.000000      0      0
Eager vs CUDA graph split=[32]                            0.124691      0      0
Eager vs CUDA graph split=[64]                            0.124691      0      0
CUDA graph split=[32] vs split=[64] (the confound)        0.000000      0      0
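
For readers checking the tables by hand: the `diff` and `rel%` columns appear consistent with an absolute difference and a relative difference normalised by the larger magnitude of the two logprobs. This is inferred from the printed rows, not taken from the script itself, so the sketch below is an assumption. The function names are hypothetical. The `fail@5%` and `fail@10%` counters are left out because their exact criterion is not shown in the output.

```python
def compare_logprobs(lp_a, lp_b):
    """Return (abs_diff, rel_percent) for one pair of logprobs.

    Assumption: rel% is the absolute difference divided by the larger
    magnitude of the two values, which matches the printed rows.
    """
    diff = abs(lp_a - lp_b)
    denom = max(abs(lp_a), abs(lp_b))
    rel = 100.0 * diff / denom if denom > 0.0 else 0.0
    return diff, rel


def summarize(pairs):
    """Compute per-row (diff, rel%) and the maximum absolute difference."""
    rows = [compare_logprobs(a, b) for a, b in pairs]
    max_diff = max((d for d, _ in rows), default=0.0)
    return rows, max_diff


# Position-2 rows copied from a graph_sp32 vs graph_sp64 [no_penalty] table
# in the skinny-GEMM-enabled run.
pairs = [
    (-2.617305, -2.692862),
    (-2.242305, -2.317862),
    (-0.867305, -0.817862),
]
rows, max_diff = summarize(pairs)
for (a, b), (d, r) in zip(pairs, rows):
    print(f"{a:12.6f} {b:12.6f} {d:10.6f} {r:7.2f}%")
print(f"max_diff={max_diff:.6f}")
```

Because the inputs here are the six-digit rounded values from the log, the last digit of a recomputed `diff` can differ slightly from the printed one.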

These experiments were conducted on an MI355 machine. Btw, I would like some help revamping the skinny GEMMs test, since these failures should be caught there. Can you help with that?

EDIT: While the above is a custom script, the motivation for it was the `V1 Test others` test group on our CI, and in particular `pytest -v -s tests/v1/sample/test_logprobs.py`, which is currently failing.
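
One check a revamped skinny GEMMs test could include is run-to-run repeatability, mirroring the "Eager vs Eager" sanity row: call the same operation several times on identical inputs and require bitwise-identical outputs. The sketch below is a minimal illustration in plain Python, not the existing test. Every name in it is hypothetical, and the order-flipping accumulator only stands in for a nondeterministic kernel; it is not a claim about the actual root cause. In a real test, `fn` would wrap the kernel under test and the comparison would be done on tensors (for example with `torch.equal`).

```python
import struct


def bits(x):
    """Return the IEEE-754 binary64 byte pattern of x for exact comparison."""
    return struct.pack("<d", x)


def first_mismatch(fn, args, runs=3):
    """Call fn(*args) `runs` times and compare each run to the first.

    Returns (run_index, element_index) of the first bitwise mismatch,
    (run_index, None) if the output length changes, or None if every
    run reproduces the first one exactly.
    """
    reference = [bits(v) for v in fn(*args)]
    for run in range(1, runs):
        current = [bits(v) for v in fn(*args)]
        if len(current) != len(reference):
            return (run, None)
        for i, (ref, cur) in enumerate(zip(reference, current)):
            if ref != cur:
                return (run, i)
    return None


def stable_sum(values):
    """Deterministic stand-in: always accumulates in the same order."""
    acc = 0.0
    for v in values:
        acc += v
    return [acc]


class OrderFlippingSum:
    """Nondeterministic stand-in: reverses accumulation order on odd calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, values):
        order = values if self.calls % 2 == 0 else list(reversed(values))
        self.calls += 1
        acc = 0.0
        for v in order:
            acc += v
        return [acc]


data = [0.1, 0.2, 0.3]
print(first_mismatch(stable_sum, (data,)))
print(first_mismatch(OrderFlippingSum(), (data,)))
```

A bitwise check is stricter than a tolerance check, so it would only be appropriate for paths that are expected to be deterministic, such as repeated eager runs with the same inputs.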

