[hipDNN]: Initial RFC for Golden Reference Validation#7082
[hipDNN]: Initial RFC for Golden Reference Validation#7082bghimireamd wants to merge 107 commits into
Conversation
Introduces golden reference validation as a third verification mode for the integration test suite. Covers the 7-step pipeline, manifest format, graph-level correctness, external reference support, and CI integration strategy. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reword regression detection bullet for clarity - Split 7-step table into two pipeline-specific tables - Remove redundant step-by-step lists (tables cover it) - Drop misleading "3 core verbs" claim - Fix slowness bullet to be CPU-specific - Simplify customer personas into "Who This Serves" bullets - Reword prior-work reference to avoid "limitations" - Clarify plugin-agnostic integration - Note that directory naming mirrors existing test conventions Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Reword "No C++ ref kernel needed" to "Unblocks testing" - Spell out "both sides" in regression detection bullet - Merge two pipeline tables into one with Generation/Validation columns - Remove all line-number references, link to executor files directly - Fix slowness bullet to CPU-specific Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Drop "7 steps" hardcoded framing throughout - Rename "The 7 Steps" section to "Pipeline Overview" - Soften intro line to "follows these steps" - Reword spine reference to "same set of steps" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace dense paragraph with numbered list of all three modes - Makes it immediately clear what exists vs what's new Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Soften "everything after it would be meaningless" to acknowledge the check is conservative (may over-reject when only UIDs change). Note the determinism assumption on graph.to_binary() and the semantic-comparison fallback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Describe the function signature and dependency (OpenSSL EVP) without inlining the full implementation -- keeps the RFC focused on design decisions rather than boilerplate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Require generator to refuse writing golden data if any tensor name is empty or duplicated, protecting the name-based identity contract. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Duplicate names cause silent data loss in both the JSON manifest (last key wins) and the name-to-UID map (overwrite). Document this risk inline with the tensor identity contract. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Step 2 was the only pipeline step without acceptance criteria. Add checks for reference executor flags, external reference file handling, manifest metadata, and NaN/Inf output validation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Generalize "PyTorch" to "agreed-upon known-good source" with link to Reference Sources section - Merge regression detection into deterministic baselines bullet - Add data-driven test addition as a new benefit - Clarify graph is rebuilt from buildGraph(), graph.bin is for fingerprint only - Merge Problem Statement into Executive Summary as "Why This Is Needed" - Update table of contents Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reviewer feedback: Problem Statement should follow Executive Summary as a separate top-level section, not a subsection. Restore the standard RFC structure and renumber TOC. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace time-relative "same as today" with explicit buildGraph() reference - Fix GoldenReferenceGpu.hpp relative path (was missing one parent level) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cases - Overwrite protection: generator refuses to clobber existing golden data without --force-regenerate - Atomic write protocol: generate into .tmp dir, rename on success - Known Limitations section: document what comparison testing structurally cannot catch (correlated spec bugs, spec ambiguity, small-size gaps) - Future Work items 9-11: invariant checks, hand-verified micro cases, statistical masking detection - NaN/NaN and -0/+0 handling in Step 7 comparison - Format version loading guard: refuse unrecognized manifest versions - Four new Risk Register rows (frozen ref bug, crash, CRLF, silent overwrite) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strides are derived deterministically from dims via buildGraph(), so storing them in the manifest creates a redundant source of truth with no safety benefit. Also regenerate all three diagrams with Pillow: straight arrows, proper spacing, no overlapping text, and correct Mismatch/Match flow from the hash-comparison diamond. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The overview section was mixing pipeline-level explanation with implementation details. Keep the fingerprint gate description in the overview (it explains the diagram), move the graph.to_binary() determinism requirement into Step 1 where graph construction lives, and drop the redundant "shared by both pipelines" paragraph already covered by Step 1's "Who calls it" field. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No other RFC in the project has this section. The audience is already obvious from the RFC content itself. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The graph fingerprint concept is already fully covered in the Graph-Level Correctness section (Layer 1) and Step 4. The Pipeline Overview should just show the diagrams without repeating details. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
to_binary() delegates to the backend — hipDNN does not control the serialization format. Rewrite the determinism requirement to cover both non-determinism and format changes (e.g., flatbuffers removal), with a concrete fallback (property-level comparison) instead of the vague "semantic comparison". Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the fingerprint discussion out of Step 1 (which is about graph construction) into Layer 1 (where the fingerprint check is defined). Rewrite Layer 1 with clear structure: Goal, How it works, Why to_binary() is fragile (backend coupling), Phase 1 (accepted cost), Phase 2 (graph-property hash that eliminates backend dependency). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
FlatBuffers officially does not guarantee deterministic serialization. In practice, same binary + same graph = same bytes (Graph::Pack uses fixed field order), which is sufficient for Phase 1. Add a table of what can break the fingerprint without a real graph change, with likelihood and impact. Acknowledge this is the first consumer of to_binary() byte stability in the test suite. Step 1 now cross-refs the Layer 1 discussion instead of repeating it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ghimire/rfc-0010-golden-reference-validation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Shorten quick/smoke tier note - Remove broken pipeline image references - Add higher-precision generation note to pipeline - Soften architecture portability language - Clarify test discovery with verification modes cross-reference - Drop per-tier minimum test count (use code review instead) - Add error handling acceptance criteria (per-test fail, not suite) - Replace computed/both modes with explicit gpu/cpu modes - Clarify tolerance stopgap (graph provides lookup key, not tolerance) - Drop SHA-256 checksums (DVC handles integrity) - Add safetensors to Alternatives Considered - Replace in-graph metadata with companion bundle.meta.json file Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| Golden data is generated by Python scripts in [`reference_data_scripts/`](../../reference_data_scripts/). The reference source is configurable per operation (see [Reference Sources](#reference-sources)); the existing batchnorm generators use PyTorch. Each generator follows the same three-step pattern: | ||
|
|
||
| 1. **Define** -- create graph and input tensors | ||
| 2. **Compute** -- run the reference source in **higher precision** than the target data type (e.g., fp64 for fp32 bundles, fp32 for fp16 bundles) to minimize reference variance, then cast outputs down to the target type |
There was a problem hiding this comment.
[Note] Generating golden data at higher precision to eliminate variance would work well for shallow operations (batchnorm, pointwise), but is counterproductive for deep, nonlinear operations like SDPA backward.
SDPA backward has 5+ rounding events through softmax (exp, sum, div, matmul, accumulate). An FP64 reference rounds once at the end; a BF16 kernel accumulates rounding at every stage. The delta between them is dominated by accumulated precision loss, not kernel bugs — requiring significantly looser tolerance than simple operations to avoid false failures.
PyTorch's math backend with BF16 inputs auto-upcasts intermediates to FP32, which is a like-for-like match for AITER's a32 accumulator kernels. Using this instead of FP64 produces a reference that follows the same rounding trajectory as the kernel, enabling tighter tolerances.
There was a problem hiding this comment.
Agreed that we need the golden data to match the kernel’s rounding behavior closely enough to be a useful comparison. For SDPA backward, I don’t think an FP64 reference that rounds once at the end is the right comparison for BF16/a32 kernels, because it can turn expected accumulated precision loss into apparent error.
The distinction I’d make is that internal compute can still be higher precision than the output, but it should match the implementation contract. For example, BF16 inputs with FP32 intermediates/accumulation is a reasonable reference for a32 accumulator kernels, while preserving the relevant casts at semantic boundaries.
I agree we need to be careful about how golden data is generated per operation. The reference path and tolerances should be tied to the operation details, accumulator mode, and expected rounding path, rather than using one FP64 golden strategy for all ops.
| 3. **Bundle-to-bundle comparison**: A standalone tool that loads two bundles and diffs their output tensors directly, matched by graph content rather than filename. | ||
| 4. **Reproducible generation**: Fixed seeds for random input generation so that regenerating a bundle produces the same inputs, isolating output differences to the reference source change. | ||
| 5. **Auto-tier classification**: The generator suggests the appropriate tier folder based on tensor element counts, matching the existing size conventions. | ||
| 6. **KnobSettings coverage**: Validate the same golden bundles under different engine KnobSettings configurations to catch regressions from tuning changes. |
There was a problem hiding this comment.
Non-deterministic engines
The golden validation model assumes: given input X, engine always produces output Y (within tolerance). This breaks for engines using non-deterministic accumulation — e.g., AITER's v3_atomic_fp32 dQ path uses atomic FP32 additions where thread-block scheduling varies run-to-run, producing bit-different results from floating-point non-associativity.
A single golden snapshot captures one realization of the output distribution. Comparing against it produces flaky tests — sometimes passing, sometimes failing depending on GPU scheduling.
Common resolutions: (1) statistical tolerance banding from N runs, (2) envelope min/max bounds
Suggestions:
- Add a
deterministicfield tometa.json(boolean or enum) so the test runner can apply different validation strategies - Recommend determinism enforcement as the default — generate golden data with the deterministic kernel variant,
validate CI against the same variant - Add "golden comparison fails intermittently for non-deterministic engines" to the Risk Register with the chosen mitigation
There was a problem hiding this comment.
I don’t think forcing the deterministic variant is a valid mitigation if the goal is to validate the non-deterministic engine. It could be useful for generating a stable reference, but if the production/selected engine uses atomic accumulation, then CI needs to exercise that path directly. The mitigation should be tolerance/error bands that cover the valid run-to-run variance, not swapping the implementation under test.
For non-deterministic accumulation, single-output comparison is valid only if the error bands account for the output distribution. That could mean derived tolerances from repeated runs, or an envelope/min-max policy, but the important part is that the validation tolerance covers expected scheduling/order variation.
I think tracking determinism on the bundle could still be useful for helping triage test flakiness, though. I just wouldn’t treat that metadata as the mitigation by itself.
There was a problem hiding this comment.
Good point — I struck out the deterministic variant suggestion since that defeats the purpose of validating the actual engine path. To clarify, my concern is about the golden reference data itself, not the engine under test. If golden snapshots are captured using a non-deterministic GPU reference, the golden data is just one sample from an output distribution, introducing flakiness from the reference side.
Today this isn't an issue — our references are CPU/PyTorch-based and deterministic. But if we ever add GPU-based references to the golden data pipeline, we should be wary of non-deterministic accumulation contaminating the reference. I do not see this happening but it's worth flagging now. At that point we'd either need to constrain golden generation to deterministic paths (CPU or order-enforcing GPU variants), or use statistical tolerance banding (envelope/min-max from N runs).
| ``` | ||
|
|
||
| Only batchnorm has generators and data today. Generators for the remaining operations will be added incrementally. | ||
|
|
There was a problem hiding this comment.
Right now, it seems Forward and backward are independent bundles (ConvFwd, ConvBwd), but backward operations consume saved tensors from the forward pass (O, LSE for SDPA; saved mean/variance for batchnorm; mask for dropout). These inputs must come from a mathematically consistent forward computation — random values produce invalid backward bundles.
For SDPA backward, O and LSE generation has two options: (A) PyTorch forward — architecture-independent but CPU-rounded (inflates tolerance for complex operators), or (B) GPU forward — production-realistic but ties bundles to a specific architecture (SDPAhas platform specific fwd kernels).
Suggestions:
- Document the forward-backward coupling pattern — backward bundles have a generation constraint that applies to any op with a backward pass
- Add a
forward_sourcefield tometa.jsonfor backward bundles (e.g.,{"method": "pytorch_math_backend", "version": "2.3.0"}) - Recommend Option A (PyTorch forward) as default for architecture-independent bundles; document Option B as an advanced alternative
- Have the pre-commit verifier check that backward bundles include
forward_sourcein metadata for respective operators
There was a problem hiding this comment.
Agreed. Will add a "Forward-Backward Generation Constraint" section to the RFC covering:
- Backward bundles' saved-tensor inputs (O, LSE for
SDPA; saved_mean, inv_variance forbatchnorm) are actual outputs of the forward reference, not random values - Generation script must produce both bundles in a single pipeline run, regenerating forward alone silently stales the backward bundle (passes CI but validates against a dead reference)
forward_sourcemetadata in backward bundles with CI cross-check against the forward bundle's metadata for enforcement
Good call on the forward_source field.
|
|
||
| **Decided: DVC pull at CI time.** DVC is already in the repo for other large binary assets, content-addressing provides integrity guarantees, and selective fetch by path avoids pulling data for operations not under test. CI jobs run `dvc pull golden_reference_data/{tier}/` before test execution to fetch only the data needed for the tier under test. Fallback: `--verification-mode cpu` (or `gpu`) runs without golden data if DVC is unavailable, so CI is never fully blocked by a storage outage. |
There was a problem hiding this comment.
Bundle sizes could vary dramatically across operations. Batchnorm bundles (Phase 1) are under 1 MB. SDPA backward bundles contain 9 tensors and scale as O(B×H×S×D):
| Shape | Bundle Size |
|---|---|
| {1, 1, 256, 128} | ~513 KB |
| {2, 8, 1024, 128} | ~32 MB |
| {4, 32, 4096, 128} | ~1 GB |
| {8, 64, 8192, 128} | ~8 GB |
- Is there a per-bundle size cap for
quick/tier? (e.g., < 10 MB to avoid DVC dependency in pre-submit CI) - What is the total storage budget per tier as operations are onboarded? (6 ops × 2 directions × 3 dtypes × 2 layouts = 72+ bundles)
- Should the auto-tier classifier enforce size limits or just warn?
There was a problem hiding this comment.
there is less of a cap on bundle size and more of a cap on the runtime for each category.
Here is the doc for the test filtering the rock is bringing in. https://github.com/ROCm/TheRock/blob/main/docs/development/test_filtering.md#types-of-filters
There was a problem hiding this comment.
Im also making the assumption right now that were going to put all these tests into dvc rather than checking them into the repo since its a ton of binary files. I guess it might be better to check in the graph/meta files normally and just DVC the .bin files.
There was a problem hiding this comment.
I also think if we have a ton of graphs that have 8GB bundles that might be kinda nuts so maybe we need some kind of cap so we stay within a reasonable limit. If we need these super large bundles, we probably want to explore hosting these elsewhere since if you have to pull these down for local development you simply wont have the disc space available. If we had a ton of ~8GB tests wed quickly use up 1TB of storage.
There was a problem hiding this comment.
Agreed this isn't an immediate concern. We've been limiting tensor sizes for SDPA since CPU reference generation is slow, so we can stick with medium-sized tensors for golden data as well. That said, even with moderate shapes, SDPA backward has 9 tensors — bundle sizes in the hundreds of MB to ~1 GB range are something we'll realistically hit as we onboard full test coverage. Worth keeping the storage budget in mind as more operations come online.
|
|
||
| For a **graph-only bundle** (`.json` only, no `.bin` files -- transitional): the runner loads the graph, generates inputs, runs the engine under test and a reference source, and compares their outputs at runtime. | ||
|
|
||
| Tolerance is looked up at runtime from the graph content: the runner reads the operation type and data type from the JSON, then follows the [tolerance priority chain](#two-questions-two-levels) (TOML override -> per-operation default). No per-operation test class needed. |
There was a problem hiding this comment.
I’m increasingly thinking tolerance needs to live in the test bundle as graph-specific metadata.
A global tolerance model for all operations seems unlikely to work well. It will either be conservative enough that some checks become too loose, or tight enough that numerically valid cases become flaky. Instead, we should embed tolerances with the graph and have a generation/calibration tool set them by running the relevant reference and engine paths repeatedly to measure expected variance.
That gives us tight tolerances for simple ops and calibrated tolerances for deeper or non-deterministic graphs like SDPA backward.
There was a problem hiding this comment.
seems easy enough to add tolerances to the metadata file. This also makes it so the metadata file is actually used and valuable and not just a way to trace where binary data came from.
For tolerances we currently have: Per engine toml tolerance(if set) > hardcoded tolerance in c++ .
This would make it so we have: Per engine toml tolerance(if set) > New Bundle Tolerance(metadata file) > hardcoded tolerance in c++ .
Introduces golden reference validation as a third verification mode for the integration test suite.
ALMIOPEN-1925