Skip to content

[hipDNN]: Initial RFC for Golden Reference Validation#7082

Open
bghimireamd wants to merge 107 commits into
developfrom
users/bghimire/rfc-0010-golden-reference-validation
Open

[hipDNN]: Initial RFC for Golden Reference Validation#7082
bghimireamd wants to merge 107 commits into
developfrom
users/bghimire/rfc-0010-golden-reference-validation

Conversation

@bghimireamd
Copy link
Copy Markdown
Contributor

@bghimireamd bghimireamd commented May 5, 2026

Introduces golden reference validation as a third verification mode for the integration test suite.

ALMIOPEN-1925

bghimireamd and others added 3 commits May 5, 2026 15:15
Introduces golden reference validation as a third verification mode
for the integration test suite. Covers the 7-step pipeline, manifest
format, graph-level correctness, external reference support, and
CI integration strategy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reword regression detection bullet for clarity
- Split 7-step table into two pipeline-specific tables
- Remove redundant step-by-step lists (tables cover it)
- Drop misleading "3 core verbs" claim
- Fix slowness bullet to be CPU-specific
- Simplify customer personas into "Who This Serves" bullets
- Reword prior-work reference to avoid "limitations"
- Clarify plugin-agnostic integration
- Note that directory naming mirrors existing test conventions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reword "No C++ ref kernel needed" to "Unblocks testing"
- Spell out "both sides" in regression detection bullet
- Merge two pipeline tables into one with Generation/Validation columns
- Remove all line-number references, link to executor files directly
- Fix slowness bullet to CPU-specific

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bghimireamd and others added 7 commits May 5, 2026 21:04
- Drop "7 steps" hardcoded framing throughout
- Rename "The 7 Steps" section to "Pipeline Overview"
- Soften intro line to "follows these steps"
- Reword spine reference to "same set of steps"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace dense paragraph with numbered list of all three modes
- Makes it immediately clear what exists vs what's new

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Soften "everything after it would be meaningless" to acknowledge the
check is conservative (may over-reject when only UIDs change). Note the
determinism assumption on graph.to_binary() and the semantic-comparison
fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Describe the function signature and dependency (OpenSSL EVP) without
inlining the full implementation -- keeps the RFC focused on design
decisions rather than boilerplate.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Require generator to refuse writing golden data if any tensor name is
empty or duplicated, protecting the name-based identity contract.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Duplicate names cause silent data loss in both the JSON manifest
(last key wins) and the name-to-UID map (overwrite). Document this
risk inline with the tensor identity contract.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 2 was the only pipeline step without acceptance criteria. Add
checks for reference executor flags, external reference file handling,
manifest metadata, and NaN/Inf output validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
bghimireamd and others added 11 commits May 6, 2026 17:22
- Generalize "PyTorch" to "agreed-upon known-good source" with link to
  Reference Sources section
- Merge regression detection into deterministic baselines bullet
- Add data-driven test addition as a new benefit
- Clarify graph is rebuilt from buildGraph(), graph.bin is for
  fingerprint only
- Merge Problem Statement into Executive Summary as "Why This Is Needed"
- Update table of contents

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reviewer feedback: Problem Statement should follow Executive Summary
as a separate top-level section, not a subsection. Restore the
standard RFC structure and renumber TOC.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace time-relative "same as today" with explicit buildGraph() reference
- Fix GoldenReferenceGpu.hpp relative path (was missing one parent level)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cases

- Overwrite protection: generator refuses to clobber existing golden data
  without --force-regenerate
- Atomic write protocol: generate into .tmp dir, rename on success
- Known Limitations section: document what comparison testing structurally
  cannot catch (correlated spec bugs, spec ambiguity, small-size gaps)
- Future Work items 9-11: invariant checks, hand-verified micro cases,
  statistical masking detection
- NaN/NaN and -0/+0 handling in Step 7 comparison
- Format version loading guard: refuse unrecognized manifest versions
- Four new Risk Register rows (frozen ref bug, crash, CRLF, silent overwrite)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strides are derived deterministically from dims via buildGraph(), so
storing them in the manifest creates a redundant source of truth with
no safety benefit. Also regenerate all three diagrams with Pillow:
straight arrows, proper spacing, no overlapping text, and correct
Mismatch/Match flow from the hash-comparison diamond.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The overview section was mixing pipeline-level explanation with
implementation details. Keep the fingerprint gate description in the
overview (it explains the diagram), move the graph.to_binary()
determinism requirement into Step 1 where graph construction lives,
and drop the redundant "shared by both pipelines" paragraph already
covered by Step 1's "Who calls it" field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No other RFC in the project has this section. The audience is already
obvious from the RFC content itself.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The graph fingerprint concept is already fully covered in the
Graph-Level Correctness section (Layer 1) and Step 4. The Pipeline
Overview should just show the diagrams without repeating details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
to_binary() delegates to the backend — hipDNN does not control the
serialization format. Rewrite the determinism requirement to cover
both non-determinism and format changes (e.g., flatbuffers removal),
with a concrete fallback (property-level comparison) instead of the
vague "semantic comparison".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the fingerprint discussion out of Step 1 (which is about graph
construction) into Layer 1 (where the fingerprint check is defined).
Rewrite Layer 1 with clear structure: Goal, How it works, Why
to_binary() is fragile (backend coupling), Phase 1 (accepted cost),
Phase 2 (graph-property hash that eliminates backend dependency).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FlatBuffers officially does not guarantee deterministic serialization.
In practice, same binary + same graph = same bytes (Graph::Pack uses
fixed field order), which is sufficient for Phase 1. Add a table of
what can break the fingerprint without a real graph change, with
likelihood and impact. Acknowledge this is the first consumer of
to_binary() byte stability in the test suite. Step 1 now cross-refs
the Layer 1 discussion instead of repeating it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bghimireamd and others added 2 commits May 12, 2026 13:21
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md
Comment thread projects/hipdnn/docs/rfcs/0010_GoldenReferenceValidation.md Outdated
bghimireamd and others added 2 commits May 14, 2026 13:36
- Shorten quick/smoke tier note
- Remove broken pipeline image references
- Add higher-precision generation note to pipeline
- Soften architecture portability language
- Clarify test discovery with verification modes cross-reference
- Drop per-tier minimum test count (use code review instead)
- Add error handling acceptance criteria (per-test fail, not suite)
- Replace computed/both modes with explicit gpu/cpu modes
- Clarify tolerance stopgap (graph provides lookup key, not tolerance)
- Drop SHA-256 checksums (DVC handles integrity)
- Add safetensors to Alternatives Considered
- Replace in-graph metadata with companion bundle.meta.json file

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@bghimireamd bghimireamd requested a review from adickin-amd May 20, 2026 20:54
Golden data is generated by Python scripts in [`reference_data_scripts/`](../../reference_data_scripts/). The reference source is configurable per operation (see [Reference Sources](#reference-sources)); the existing batchnorm generators use PyTorch. Each generator follows the same three-step pattern:

1. **Define** -- create graph and input tensors
2. **Compute** -- run the reference source in **higher precision** than the target data type (e.g., fp64 for fp32 bundles, fp32 for fp16 bundles) to minimize reference variance, then cast outputs down to the target type
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Note] Generating golden data at higher precision to eliminate variance would work well for shallow operations (batchnorm, pointwise), but is counterproductive for deep, nonlinear operations like SDPA backward.

SDPA backward has 5+ rounding events through softmax (exp, sum, div, matmul, accumulate). An FP64 reference rounds once at the end; a BF16 kernel accumulates rounding at every stage. The delta between them is dominated by accumulated precision loss, not kernel bugs — requiring significantly looser tolerance than simple operations to avoid false failures.

PyTorch's math backend with BF16 inputs auto-upcasts intermediates to FP32, which is a like-for-like match for AITER's a32 accumulator kernels. Using this instead of FP64 produces a reference that follows the same rounding trajectory as the kernel, enabling tighter tolerances.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that we need the golden data to match the kernel’s rounding behavior closely enough to be a useful comparison. For SDPA backward, I don’t think an FP64 reference that rounds once at the end is the right comparison for BF16/a32 kernels, because it can turn expected accumulated precision loss into apparent error.

The distinction I’d make is that internal compute can still be higher precision than the output, but it should match the implementation contract. For example, BF16 inputs with FP32 intermediates/accumulation is a reasonable reference for a32 accumulator kernels, while preserving the relevant casts at semantic boundaries.

I agree we need to be careful about how golden data is generated per operation. The reference path and tolerances should be tied to the operation details, accumulator mode, and expected rounding path, rather than using one FP64 golden strategy for all ops.

3. **Bundle-to-bundle comparison**: A standalone tool that loads two bundles and diffs their output tensors directly, matched by graph content rather than filename.
4. **Reproducible generation**: Fixed seeds for random input generation so that regenerating a bundle produces the same inputs, isolating output differences to the reference source change.
5. **Auto-tier classification**: The generator suggests the appropriate tier folder based on tensor element counts, matching the existing size conventions.
6. **KnobSettings coverage**: Validate the same golden bundles under different engine KnobSettings configurations to catch regressions from tuning changes.
Copy link
Copy Markdown
Contributor

@AnaghaRaoAMD AnaghaRaoAMD May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-deterministic engines

The golden validation model assumes: given input X, engine always produces output Y (within tolerance). This breaks for engines using non-deterministic accumulation — e.g., AITER's v3_atomic_fp32 dQ path uses atomic FP32 additions where thread-block scheduling varies run-to-run, producing bit-different results from floating-point non-associativity.

A single golden snapshot captures one realization of the output distribution. Comparing against it produces flaky tests — sometimes passing, sometimes failing depending on GPU scheduling.

Common resolutions: (1) statistical tolerance banding from N runs, (2) envelope min/max bounds

Suggestions:

  1. Add a deterministic field to meta.json (boolean or enum) so the test runner can apply different validation strategies
  2. Recommend determinism enforcement as the default — generate golden data with the deterministic kernel variant, validate CI against the same variant
  3. Add "golden comparison fails intermittently for non-deterministic engines" to the Risk Register with the chosen mitigation

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think forcing the deterministic variant is a valid mitigation if the goal is to validate the non-deterministic engine. It could be useful for generating a stable reference, but if the production/selected engine uses atomic accumulation, then CI needs to exercise that path directly. The mitigation should be tolerance/error bands that cover the valid run-to-run variance, not swapping the implementation under test.

For non-deterministic accumulation, single-output comparison is valid only if the error bands account for the output distribution. That could mean derived tolerances from repeated runs, or an envelope/min-max policy, but the important part is that the validation tolerance covers expected scheduling/order variation.

I think tracking determinism on the bundle could still be useful for helping triage test flakiness, though. I just wouldn’t treat that metadata as the mitigation by itself.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point — I struck out the deterministic variant suggestion since that defeats the purpose of validating the actual engine path. To clarify, my concern is about the golden reference data itself, not the engine under test. If golden snapshots are captured using a non-deterministic GPU reference, the golden data is just one sample from an output distribution, introducing flakiness from the reference side.

Today this isn't an issue — our references are CPU/PyTorch-based and deterministic. But if we ever add GPU-based references to the golden data pipeline, we should be wary of non-deterministic accumulation contaminating the reference. I do not see this happening but it's worth flagging now. At that point we'd either need to constrain golden generation to deterministic paths (CPU or order-enforcing GPU variants), or use statistical tolerance banding (envelope/min-max from N runs).

```

Only batchnorm has generators and data today. Generators for the remaining operations will be added incrementally.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, it seems Forward and backward are independent bundles (ConvFwd, ConvBwd), but backward operations consume saved tensors from the forward pass (O, LSE for SDPA; saved mean/variance for batchnorm; mask for dropout). These inputs must come from a mathematically consistent forward computation — random values produce invalid backward bundles.

For SDPA backward, O and LSE generation has two options: (A) PyTorch forward — architecture-independent but CPU-rounded (inflates tolerance for complex operators), or (B) GPU forward — production-realistic but ties bundles to a specific architecture (SDPAhas platform specific fwd kernels).

Suggestions:

  1. Document the forward-backward coupling pattern — backward bundles have a generation constraint that applies to any op with a backward pass
  2. Add a forward_source field to meta.json for backward bundles (e.g., {"method": "pytorch_math_backend", "version": "2.3.0"})
  3. Recommend Option A (PyTorch forward) as default for architecture-independent bundles; document Option B as an advanced alternative
  4. Have the pre-commit verifier check that backward bundles include forward_source in metadata for respective operators

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Will add a "Forward-Backward Generation Constraint" section to the RFC covering:

  1. Backward bundles' saved-tensor inputs (O, LSE for SDPA; saved_mean, inv_variance for batchnorm) are actual outputs of the forward reference, not random values
  2. Generation script must produce both bundles in a single pipeline run, regenerating forward alone silently stales the backward bundle (passes CI but validates against a dead reference)
  3. forward_source metadata in backward bundles with CI cross-check against the forward bundle's metadata for enforcement

Good call on the forward_source field.

Comment on lines +623 to +624

**Decided: DVC pull at CI time.** DVC is already in the repo for other large binary assets, content-addressing provides integrity guarantees, and selective fetch by path avoids pulling data for operations not under test. CI jobs run `dvc pull golden_reference_data/{tier}/` before test execution to fetch only the data needed for the tier under test. Fallback: `--verification-mode cpu` (or `gpu`) runs without golden data if DVC is unavailable, so CI is never fully blocked by a storage outage.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bundle sizes could vary dramatically across operations. Batchnorm bundles (Phase 1) are under 1 MB. SDPA backward bundles contain 9 tensors and scale as O(B×H×S×D):

Shape Bundle Size
{1, 1, 256, 128} ~513 KB
{2, 8, 1024, 128} ~32 MB
{4, 32, 4096, 128} ~1 GB
{8, 64, 8192, 128} ~8 GB
  1. Is there a per-bundle size cap for quick/ tier? (e.g., < 10 MB to avoid DVC dependency in pre-submit CI)
  2. What is the total storage budget per tier as operations are onboarded? (6 ops × 2 directions × 3 dtypes × 2 layouts = 72+ bundles)
  3. Should the auto-tier classifier enforce size limits or just warn?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is less of a cap on bundle size and more of a cap on the runtime for each category.

Here is the doc for the test filtering the rock is bringing in. https://github.com/ROCm/TheRock/blob/main/docs/development/test_filtering.md#types-of-filters

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im also making the assumption right now that were going to put all these tests into dvc rather than checking them into the repo since its a ton of binary files. I guess it might be better to check in the graph/meta files normally and just DVC the .bin files.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also think if we have a ton of graphs that have 8GB bundles that might be kinda nuts so maybe we need some kind of cap so we stay within a reasonable limit. If we need these super large bundles, we probably want to explore hosting these elsewhere since if you have to pull these down for local development you simply wont have the disc space available. If we had a ton of ~8GB tests wed quickly use up 1TB of storage.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this isn't an immediate concern. We've been limiting tensor sizes for SDPA since CPU reference generation is slow, so we can stick with medium-sized tensors for golden data as well. That said, even with moderate shapes, SDPA backward has 9 tensors — bundle sizes in the hundreds of MB to ~1 GB range are something we'll realistically hit as we onboard full test coverage. Worth keeping the storage budget in mind as more operations come online.


For a **graph-only bundle** (`.json` only, no `.bin` files -- transitional): the runner loads the graph, generates inputs, runs the engine under test and a reference source, and compares their outputs at runtime.

Tolerance is looked up at runtime from the graph content: the runner reads the operation type and data type from the JSON, then follows the [tolerance priority chain](#two-questions-two-levels) (TOML override -> per-operation default). No per-operation test class needed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m increasingly thinking tolerance needs to live in the test bundle as graph-specific metadata.

A global tolerance model for all operations seems unlikely to work well. It will either be conservative enough that some checks become too loose, or tight enough that numerically valid cases become flaky. Instead, we should embed tolerances with the graph and have a generation/calibration tool set them by running the relevant reference and engine paths repeatedly to measure expected variance.

That gives us tight tolerances for simple ops and calibrated tolerances for deeper or non-deterministic graphs like SDPA backward.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems easy enough to add tolerances to the metadata file. This also makes it so the metadata file is actually used and valuable and not just a way to trace where binary data came from.

For tolerances we currently have: Per engine toml tolerance(if set) > hardcoded tolerance in c++ .

This would make it so we have: Per engine toml tolerance(if set) > New Bundle Tolerance(metadata file) > hardcoded tolerance in c++ .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants