AI Notes for my development [Do not merge] by csarofeen · Pull Request #4296 · NVIDIA/Fuser

csarofeen · 2025-04-23T19:25:49Z

As I develop with Cursor, I find it helpful to take careful notes on a system so that it can implement things thoughtfully and correctly. I intend to keep this as a running collection of notes as I develop.


github-actions · 2025-04-23T19:26:30Z

Review updated until commit 11086b5

Description

Added detailed documentation for nvFuser, including compiler concepts and C++ API example.
Provided test instructions for building and running nvFuser tests and examples within Docker.
Documented crash analysis for pre-segmenter during parallel test runs, identifying potential race conditions.
Explained the pre-segmenter pass infrastructure and execution flow.
Detailed changes and validation for scalar segmentation in the fusion process.

Changes walkthrough 📝

Relevant files

Documentation

| File | Change | Path | Description | Lines |
|---|---|---|---|---|
| nvfuser_description.md | `Add nvFuser description and C++ API example` | ai_agent_notes/nvfuser_description.md | Introduced detailed description of nvFuser, its key concepts, and integration with PyTorch. Included a C++ API example demonstrating fusion kernel creation and scheduling. | +161/-0 |
| nvfuser_test_notes.md | `Add nvFuser test instructions` | ai_agent_notes/nvfuser_test_notes.md | Provided step-by-step instructions for building and running nvFuser tests and examples within Docker. | +50/-0 |
| presegmenter_crash_analysis.md | `Document pre-segmenter crash analysis` | ai_agent_notes/presegmenter_crash_analysis.md | Documented analysis of pre-segmenter crashes during parallel test runs, identifying potential causes. | +43/-0 |
| presegmenter_pass_infra.md | `Document pre-segmenter pass infrastructure` | ai_agent_notes/presegmenter_pass_infra.md | Explained the infrastructure for pre-segmenter passes, including key classes and execution flow. | +67/-0 |
| scalar_segmentation_changes.md | `Document scalar segmentation changes` | ai_agent_notes/scalar_segmentation_changes.md | Detailed changes and validation for scalar segmentation in the fusion process. Included notes on modifications to `deriveSchedulerType`, `buildInitialSegments`, and `inferOutputSizes`. | +112/-0 |

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 No relevant tests
⚡ Recommended focus areas for review

<details><summary><strong>Container Name Placeholder</strong> The instructions for finding the container name use a placeholder (`nvfuser-dev:csarofeen`) that may not be applicable to all users. Consider making this more generic or providing a way to dynamically find the correct container name.</summary>

````markdown
1. Find container name:
```bash
docker ps
# Look for container running nvfuser-dev:csarofeen image
````

</details>

<details><summary><a href='https://github.com/NVIDIA/Fuser/pull/4296/files#diff-3568b03ac7201bff34079e8a95190082b083f80173c4e2d0916a0204acd3f307R6-R29'><strong>Race Condition Suspected</strong></a> The crash analysis suggests a race condition or memory corruption issue related to concurrency. Further investigation is needed to identify the root cause, especially given the inconsistent occurrence across different GPUs and runs.</summary>

````markdown
When running the test suite with the filter `Scheduler` in parallel across 4 GPUs using the `run_multiple_times.sh` script, segfaults (SIGSEGV) or bus errors (SIGBUS) were observed intermittently on GPUs 1, 2, and 3. GPU 0 consistently completed without crashing.

The crashes consistently occurred during the execution of the `ResizeTest.SliceReduceScheduler2` test case.

Based on the added debug logging, the crash point was isolated to occur during the execution of the pre-segmenter passes, specifically within the call stack originating from:

```c++
// In FusionKernelRuntime::FusionKernelRuntime constructor
preseg_passes::OptimizationPass<preseg_passes::PreSegmenter>::runPass(fusion.get());
```

The crash happens after the `[RUNTIME CONSTRUCTOR] After NVF_ERROR` log and before the first `[PreSegmenter] Running ...` log message, indicating the failure is either in the setup of `OptimizationPass::runPass` or very early in the first pass executed by `PreSegmenter`.

* Initial Crash Point: In the first iteration observed, the crash on GPUs 1, 2, and 3 consistently occurred during the execution of the `TranslateRepeatToExpand` pass (i.e., after `[PreSegmenter] Running TranslateRepeatToExpand...` but before `[PreSegmenter] Finished TranslateRepeatToExpand.`).
* Consistent Crash Point: A second observation confirmed that the crash on GPUs 1, 2, and 3 again occurred during the `TranslateRepeatToExpand` pass. This strongly suggests the issue lies within this specific pass or its interaction with concurrent execution.
* Shifted Crash Point (Run 3): After adding detailed logging within `TranslateRepeatToExpand`, the logs from GPU 1 (which crashed) showed that all pre-segmenter passes, including `TranslateRepeatToExpand`, completed successfully for the `ResizeTest.SliceReduceScheduler2` fusion. The crash occurred after the line `[RUNTIME CONSTRUCTOR] After preseg_passes::OptimizationPass<preseg_passes::PreSegmenter>::runPass` but before the next major step logged (`[RUNTIME CONSTRUCTOR] Preparing runtime order.`). This pinpoints the issue to the transition between the pre-segmenter phase and the runtime preparation phase within the `FusionKernelRuntime` constructor.
* Increased Variability (Run 3): In this run, GPU 3 passed the `ResizeTest.SliceReduceScheduler2` test. GPU 2 failed with an assertion in a different, earlier test (`ResizeSchedulerTest.PropagateMultipleSlicesToInputs6`) and did not reach the target test.

Analysis

* Inconsistent Occurrence: The crash does not happen on every GPU or every run, suggesting a race condition or memory corruption issue related to concurrency. The variability increased in the latest run.
* Non-Parallel Test Failure: The specific test `ResizeTest.SliceReduceScheduler2` likely has only one segment, meaning it does not utilize the intra-fusion parallel compilation thread pool. However, the crash still occurs when the global parallel compilation setting is enabled.
* Inter-Process Interference: Since tests run in separate processes for each GPU, direct shared memory between the tests is unlikely. However, the concurrency might be causing issues through:
````

</details>

<details><summary><a href='https://github.com/NVIDIA/Fuser/pull/4296/files#diff-b254a7e392319157c418b06dca94d1d804ba5c8935a9f642b703f8cbb84d307fR51-R53'><strong>Orphaned Placeholder Groups</strong></a> The presence of orphaned placeholder groups for original scalar inputs is noted as technical debt. Implementing logic to remove these groups should be considered to clean up the segmentation process.</summary>

```markdown
* Error Handling in `inferOutputSizes`: The current approach throws an error if a scalar output cannot be evaluated during `inferOutputSizes`. While correct for this test (where inputs are concrete), consider if a fallback to a default value with a warning might be more robust in scenarios with unevaluated symbolic inputs, or if the error is acceptable.
* Broader Testing: Validate with more complex fusions involving different scalar types and interactions.
```

</details>
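The trade-off raised in the `inferOutputSizes` error-handling note (throw when a scalar output cannot be evaluated, or fall back to a default with a warning) can be sketched in isolation. The following is a minimal, self-contained illustration and not nvFuser code: `resolveScalarOutput`, `OnUnevaluated`, and the use of `std::optional<int64_t>` as a stand-in for an evaluation result are all hypothetical names chosen for this example.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical policy switch; not part of nvFuser.
enum class OnUnevaluated { Throw, WarnAndDefault };

// Hypothetical helper. `evaluated` stands in for the result of trying to
// evaluate a scalar output; it is empty when no value could be computed.
inline int64_t resolveScalarOutput(
    const std::optional<int64_t>& evaluated,
    const std::string& name,
    OnUnevaluated policy,
    int64_t fallback = 0) {
  if (evaluated.has_value()) {
    return *evaluated;
  }
  if (policy == OnUnevaluated::Throw) {
    // Strict behavior: fail loudly on an unevaluated scalar.
    throw std::runtime_error("Could not evaluate scalar output: " + name);
  }
  // Lenient behavior: report the problem and continue with a default.
  std::cerr << "Warning: scalar output '" << name
            << "' could not be evaluated; using default " << fallback << "\n";
  return fallback;
}
```

Throwing keeps failures loud, which matches the current behavior described in the note. The warn-and-default path trades that for robustness and risks carrying a wrong size into later stages, so it would likely need to be opt-in.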


csarofeen added 5 commits April 8, 2025 10:40

Add an ai agent note about how to build and run tests with my docker setup.

776e1ad

Updated understanding of SegmentedEdge removal challenges

652b03f

Update notes based on segmenter_helpers branch.

6f797d9

Update notes for next phase of refactoring segmenter.

e5ab0c1

Update segmentation notes.

f4e5f92

csarofeen added 7 commits April 23, 2025 12:38

update.

1837f72

Updating segmentation for scalar outputs.

7065b6e

Update notes.

e77d8f5

Merge branch 'main' of https://github.com/NVIDIA/Fuser into cs_ai_agent_notes

750cec3

Update notes from working on segmentation.

bf3fe09

Update.

3ae505d

Remove segmenter and polymorphic value notes as they're moved to the wip docs PR and moved to doc dir. Update test instructions to show how to build and run the stand alone example.

11086b5

csarofeen closed this Jan 12, 2026

csarofeen wants to merge 12 commits into `main` from `cs_ai_agent_notes`
