
AI Notes for my development [Do not merge] #4296

Closed
csarofeen wants to merge 12 commits into main from cs_ai_agent_notes


@csarofeen
Collaborator

As I develop with Cursor, I find it helpful to keep careful notes on the system so that the agent can implement changes thoughtfully and correctly. I intend to keep this PR as a running collection of notes as I develop.

@github-actions

github-actions bot commented Apr 23, 2025

Review updated until commit 11086b5

Description

  • Added detailed documentation for nvFuser, including compiler concepts and a C++ API example.

  • Provided test instructions for building and running nvFuser tests and examples within Docker.

  • Documented an analysis of pre-segmenter crashes during parallel test runs, identifying potential race conditions.

  • Explained the pre-segmenter pass infrastructure and execution flow.

  • Detailed changes and validation for scalar segmentation in the fusion process.


Changes walkthrough 📝

Relevant files

Documentation

nvfuser_description.md (+161/-0)
Add nvFuser description and C++ API example
ai_agent_notes/nvfuser_description.md
  • Introduced a detailed description of nvFuser, its key concepts, and its integration with PyTorch.
  • Included a C++ API example demonstrating fusion kernel creation and scheduling.

nvfuser_test_notes.md (+50/-0)
Add nvFuser test instructions
ai_agent_notes/nvfuser_test_notes.md
  • Provided step-by-step instructions for building and running nvFuser tests and examples within Docker.

presegmenter_crash_analysis.md (+43/-0)
Document pre-segmenter crash analysis
ai_agent_notes/presegmenter_crash_analysis.md
  • Documented an analysis of pre-segmenter crashes during parallel test runs, identifying potential causes.

presegmenter_pass_infra.md (+67/-0)
Document pre-segmenter pass infrastructure
ai_agent_notes/presegmenter_pass_infra.md
  • Explained the infrastructure for pre-segmenter passes, including key classes and execution flow.

scalar_segmentation_changes.md (+112/-0)
Document scalar segmentation changes
ai_agent_notes/scalar_segmentation_changes.md
  • Detailed changes and validation for scalar segmentation in the fusion process.
  • Included notes on modifications to deriveSchedulerType, buildInitialSegments, and inferOutputSizes.

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review

    Container Name Placeholder

    The instructions for finding the container name use a user-specific image name (nvfuser-dev:csarofeen) that may not apply to other users. Consider making the instructions generic or explaining how to find the correct container name dynamically.

    1. Find container name:
    ```bash
    docker ps
    # Look for container running nvfuser-dev:csarofeen image
    ```
    </details>
    
    <details><summary><a href='https://github.com/NVIDIA/Fuser/pull/4296/files#diff-3568b03ac7201bff34079e8a95190082b083f80173c4e2d0916a0204acd3f307R6-R29'><strong>Race Condition Suspected</strong></a>
    
    The crash analysis suggests a race condition or memory corruption issue related to concurrency. Further investigation is needed to identify the root cause, especially given the inconsistent occurrence across different GPUs and runs.</summary>
    
    When running the test suite with the filter `*Scheduler*` in parallel across 4 GPUs using the `run_multiple_times.sh` script, segfaults (SIGSEGV) or bus errors (SIGBUS) were observed intermittently on GPUs 1, 2, and 3. GPU 0 consistently completed without crashing.
    
    The crashes consistently occurred during the execution of the `ResizeTest.SliceReduceScheduler2` test case.
    
    Based on the added debug logging, the crash point was isolated to occur *during* the execution of the pre-segmenter passes, specifically within the call stack originating from:
    
    ```c++
    // In FusionKernelRuntime::FusionKernelRuntime constructor
    preseg_passes::OptimizationPass<preseg_passes::PreSegmenter>::runPass(fusion.get());
    ```

    The crash happens after the `[RUNTIME CONSTRUCTOR] After NVF_ERROR` log message and before the first `[PreSegmenter] Running ...` log message, indicating that the failure is either in the setup of `OptimizationPass::runPass` or very early in the first pass executed by `PreSegmenter`.
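The log prefixes quoted in these notes suggest a driver that brackets each pre-segmenter sub-pass with a "Running" and a "Finished" message, so a crash between the two lines falls inside that sub-pass. The sketch below illustrates that pattern under stated assumptions: it treats `OptimizationPass<Derived>` as a CRTP base whose static `runPass` forwards to the derived pass (inferred only from the call `OptimizationPass<PreSegmenter>::runPass(fusion.get())`), while `runPassImpl`, `runLogged`, `HypotheticalSecondPass`, the extra `log` parameter, and the empty `Fusion` struct are all hypothetical. It is not nvFuser's implementation.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Stand-in for nvFuser's Fusion IR container (hypothetical, empty here).
struct Fusion {};

// Hypothetical CRTP driver modeled on OptimizationPass<PreSegmenter>::runPass:
// the static entry point forwards to DerivedPass::runPassImpl. The `log`
// parameter is an addition for this sketch so output can be captured.
template <typename DerivedPass>
class OptimizationPass {
 public:
  static void runPass(Fusion* fusion, std::ostream& log) {
    DerivedPass::runPassImpl(fusion, log);
  }
};

// Brackets one sub-pass with "Running"/"Finished" lines. std::endl flushes
// after each line, so the last message is not lost to buffering if the
// process crashes inside the sub-pass.
template <typename SubPass>
void runLogged(Fusion* fusion, std::ostream& log) {
  log << "[PreSegmenter] Running " << SubPass::name() << "..." << std::endl;
  OptimizationPass<SubPass>::runPass(fusion, log);
  log << "[PreSegmenter] Finished " << SubPass::name() << "." << std::endl;
}

// Pass name taken from the notes; the body is an empty placeholder.
class TranslateRepeatToExpand
    : public OptimizationPass<TranslateRepeatToExpand> {
 public:
  static std::string name() { return "TranslateRepeatToExpand"; }
  static void runPassImpl(Fusion* /*fusion*/, std::ostream& /*log*/) {}
};

// Hypothetical second pass, present only to show sequencing.
class HypotheticalSecondPass
    : public OptimizationPass<HypotheticalSecondPass> {
 public:
  static std::string name() { return "HypotheticalSecondPass"; }
  static void runPassImpl(Fusion* /*fusion*/, std::ostream& /*log*/) {}
};

// The pre-segmenter is itself a pass whose body runs sub-passes in order.
class PreSegmenter : public OptimizationPass<PreSegmenter> {
 public:
  static void runPassImpl(Fusion* fusion, std::ostream& log) {
    runLogged<TranslateRepeatToExpand>(fusion, log);
    runLogged<HypotheticalSecondPass>(fusion, log);
  }
};
```

Calling `OptimizationPass<PreSegmenter>::runPass(&fusion, std::cerr)` in this model prints one Running/Finished pair per sub-pass, so a crash in the setup of `runPass` would leave no `[PreSegmenter]` line at all, while a crash inside a sub-pass would leave an unmatched "Running" line.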

    • Initial Crash Point: In the first iteration observed, the crash on GPUs 1, 2, and 3 consistently occurred during the execution of the `TranslateRepeatToExpand` pass (i.e., after `[PreSegmenter] Running TranslateRepeatToExpand...` but before `[PreSegmenter] Finished TranslateRepeatToExpand.`).
    • Consistent Crash Point: A second observation confirmed that the crash on GPUs 1, 2, and 3 again occurred during the `TranslateRepeatToExpand` pass. This strongly suggests the issue lies within this specific pass or its interaction with concurrent execution.
    • Shifted Crash Point (Run 3): After adding detailed logging within `TranslateRepeatToExpand`, the logs from GPU 1 (which crashed) showed that all pre-segmenter passes, including `TranslateRepeatToExpand`, completed successfully for the `ResizeTest.SliceReduceScheduler2` fusion. The crash occurred after the line `[RUNTIME CONSTRUCTOR] After preseg_passes::OptimizationPass<preseg_passes::PreSegmenter>::runPass` but before the next major logged step (`[RUNTIME CONSTRUCTOR] Preparing runtime order.`). This narrows the issue to the transition between the pre-segmenter phase and the runtime preparation phase within the `FusionKernelRuntime` constructor.
    • Increased Variability (Run 3): In this run, GPU 3 passed the `ResizeTest.SliceReduceScheduler2` test. GPU 2 failed with an assertion in a different, earlier test (`ResizeSchedulerTest.PropagateMultipleSlicesToInputs6`) and did not reach the target test.
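Each observation above applies the same rule: a "Running X..." line without a matching "Finished X." line places the crash inside X; otherwise the crash happened after the last line that reached the log. A minimal sketch of that rule, assuming non-nested steps and a hypothetical `crashWindow` helper. It is only meaningful if every log line is flushed before the crash, for example by writing to `std::cerr` (unit-buffered) or ending lines with `std::endl`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of nvFuser): given the log lines a crashed
// process managed to flush, report where the crash must have happened.
// Assumes steps are not nested: each "Running X..." is followed by its
// "Finished X." before the next step starts.
std::string crashWindow(const std::vector<std::string>& lines) {
  const std::string running = "Running ";
  const std::string finished = "Finished ";
  std::string open_step;  // step that has started but not yet finished
  for (const std::string& line : lines) {
    std::size_t pos = line.find(running);
    if (pos != std::string::npos) {
      open_step = line.substr(pos + running.size());
      // Drop the trailing "..." from "Running X...".
      if (open_step.size() >= 3 &&
          open_step.compare(open_step.size() - 3, 3, "...") == 0) {
        open_step.resize(open_step.size() - 3);
      }
    } else if (line.find(finished) != std::string::npos) {
      open_step.clear();  // the open step completed
    }
  }
  if (!open_step.empty()) {
    return "inside " + open_step;
  }
  if (lines.empty()) {
    return "before the first log line";
  }
  return "after: " + lines.back();
}
```

With the Run 1 and Run 2 logs this reports `inside TranslateRepeatToExpand`; with the Run 3 log, whose last flushed line is the `[RUNTIME CONSTRUCTOR] After ...runPass` message, it reports a window after that line, matching the shifted crash point.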

    Analysis

    • Inconsistent Occurrence: The crash does not happen on every GPU or in every run, suggesting a race condition or memory corruption issue related to concurrency. The variability increased in the latest run.
    • Non-Parallel Test Failure: The failing test, `ResizeTest.SliceReduceScheduler2`, likely has only one segment, so it does not use the intra-fusion parallel compilation thread pool. However, the crash still occurs when the global parallel compilation setting is enabled.
    • Inter-Process Interference: Because the tests for each GPU run in separate processes, direct memory sharing between tests is unlikely. However, the concurrency might be causing issues through:
    
    </details>
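The "Non-Parallel Test Failure" point rests on when the intra-fusion compilation thread pool is actually used. A minimal sketch of that dispatch rule, assuming (as the notes do) that parallel compilation only applies to fusions with more than one segment. `compileSegments` and its parameters are hypothetical, and `std::async` stands in for whatever thread pool nvFuser really uses.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Hypothetical sketch, not nvFuser code. Segments are compiled on worker
// threads only when parallel compilation is enabled AND there is more than
// one segment; a single-segment fusion always takes the serial path.
// `compile` must be thread-safe (e.g. update shared state via std::atomic).
// Returns true if the parallel path was taken.
bool compileSegments(const std::vector<std::string>& segments,
                     bool parallel_compile_enabled,
                     const std::function<void(const std::string&)>& compile) {
  if (!parallel_compile_enabled || segments.size() <= 1) {
    for (const std::string& segment : segments) {
      compile(segment);  // serial path: runs on the calling thread
    }
    return false;
  }
  std::vector<std::future<void>> pending;
  pending.reserve(segments.size());
  for (const std::string& segment : segments) {
    // std::cref passes the segment by reference instead of copying it.
    pending.push_back(
        std::async(std::launch::async, compile, std::cref(segment)));
  }
  for (std::future<void>& task : pending) {
    task.get();  // waits, and rethrows any exception from a worker
  }
  return true;
}
```

Under this model a single-segment fusion such as `ResizeTest.SliceReduceScheduler2` never leaves the calling thread even with the global setting enabled, which is why the notes look for interference outside the intra-fusion thread pool.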
    
    <details><summary><a href='https://github.com/NVIDIA/Fuser/pull/4296/files#diff-b254a7e392319157c418b06dca94d1d804ba5c8935a9f642b703f8cbb84d307fR51-R53'><strong>Orphaned Placeholder Groups</strong></a>
    
    The notes flag orphaned placeholder groups for original scalar inputs as technical debt. Consider implementing logic to remove these groups to clean up the segmentation process.</summary>
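One way to implement the suggested cleanup is to erase, after segmentation, every group that holds no expressions and feeds no other group. A minimal sketch, assuming a simplified hypothetical `Group` struct rather than nvFuser's actual segmented-group classes; `removeOrphanedPlaceholders` is also hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a segmented group. A placeholder
// created for an original scalar input has no expressions; it is orphaned
// once no other group consumes it.
struct Group {
  std::string name;
  int num_exprs = 0;      // expressions assigned to this group
  int num_consumers = 0;  // groups that read values produced here
};

// Erases groups that hold no expressions and feed no other group,
// preserving the order of the remaining groups. Returns how many were
// removed. A real implementation would also keep groups whose values are
// fusion outputs, which this simplified model does not track.
std::size_t removeOrphanedPlaceholders(std::vector<Group>& groups) {
  auto is_orphan = [](const Group& g) {
    return g.num_exprs == 0 && g.num_consumers == 0;
  };
  auto new_end = std::remove_if(groups.begin(), groups.end(), is_orphan);
  auto removed =
      static_cast<std::size_t>(std::distance(new_end, groups.end()));
  groups.erase(new_end, groups.end());
  return removed;
}
```

`std::remove_if` followed by `erase` keeps the surviving groups in their original order, which matters if later stages assume a stable group ordering.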
    
    • Error Handling in `inferOutputSizes`: The current approach throws an error if a scalar output cannot be evaluated during `inferOutputSizes`. This is correct for this test, where the inputs are concrete, but consider whether falling back to a default value with a warning would be more robust when symbolic inputs remain unevaluated, or whether the error is acceptable.
    • Broader Testing: Validate with more complex fusions involving different scalar types and interactions.
    
    </details>
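The trade-off in the error-handling note can be made explicit as a policy switch. A minimal sketch, assuming a hypothetical `resolveScalarOutput` helper in which an empty `std::optional` means the evaluator could not bind the scalar; it does not reflect the actual signature or behavior of nvFuser's `inferOutputSizes`.

```cpp
#include <cassert>
#include <iostream>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical policy switch for a scalar output that could not be
// evaluated from the known inputs.
enum class UnevaluatedScalarPolicy { kThrow, kWarnAndDefault };

// `evaluated` is empty when the value could not be computed.
long long resolveScalarOutput(const std::string& name,
                              std::optional<long long> evaluated,
                              UnevaluatedScalarPolicy policy,
                              long long default_value = 0) {
  if (evaluated.has_value()) {
    return *evaluated;
  }
  if (policy == UnevaluatedScalarPolicy::kThrow) {
    // Behavior described in the notes: fail loudly.
    throw std::runtime_error("Could not evaluate scalar output " + name);
  }
  // Alternative raised in the notes: continue with a warning.
  std::cerr << "Warning: scalar output " << name
            << " is unevaluated; using default " << default_value << "\n";
  return default_value;
}
```

Throwing surfaces missing bindings immediately; the warning path keeps execution going but can hide a wrong default, which is the acceptability question the note leaves open.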
    

    @csarofeen csarofeen closed this Jan 12, 2026

    Labels

    None yet

    Projects

    None yet


    1 participant