**File:** `ai_agent_notes/nvfuser_description.md`

# nvFuser: A Deep Learning Compiler for PyTorch on NVIDIA GPUs

**nvFuser** is a specialized compiler developed by NVIDIA to accelerate PyTorch deep learning models running on NVIDIA GPUs (Volta architecture and newer). It achieves this by automatically generating highly optimized, custom "fusion" kernels using just-in-time (JIT) compilation.

**Key Concepts:**

1. **Compiler:** `nvFuser` translates sequences of PyTorch operations into efficient low-level GPU code.
2. **Operation Fusion:** It intelligently combines multiple compatible PyTorch operations (e.g., element-wise math, reductions, normalizations) into a single, unified GPU kernel. This fusion process minimizes the overhead associated with launching multiple kernels and reduces data transfers between the GPU's global memory and its processing units. The result is often significant performance gains, particularly for operations limited by memory bandwidth.
3. **Just-in-Time (JIT) Compilation:** The optimized kernels are compiled during the program's execution ("just-in-time"). This allows `nvFuser` to create kernels specifically tailored to the actual input shapes, data types, and hardware characteristics encountered at runtime, providing flexibility for models with dynamic inputs.
4. **PyTorch Integration:** Modern high-performance PyTorch frameworks utilize `nvFuser` as a backend optimization engine. A prominent example is **`Lightning Thunder`**. `Thunder` acts as a source-to-source compiler for PyTorch, analyzing Python code, capturing the computational graph, and dispatching segments of this graph to specialized backends like `nvFuser` for optimized execution on NVIDIA GPUs.

**In Summary:**

`nvFuser` is the core NVIDIA technology providing JIT fusion capabilities to accelerate PyTorch workloads on GPUs. High-level frameworks like `Lightning Thunder` leverage `nvFuser` (and other tools) to automatically optimize PyTorch programs for maximum performance.

**Source Code:** [https://github.com/NVIDIA/Fuser](https://github.com/NVIDIA/Fuser)

## C++ API Example

Below is a simplified C++ example, adapted from the test suite, showing how to define and schedule an element-wise operation with the nvFuser API.

```cpp
#include <fusion.h>
#include <ir/builder.h>
#include <ops/all_ops.h>
#include <scheduler/utils.h>

// ... other necessary includes ...

void simple_pointwise_example() {
  // 1. Create a Fusion object to hold the computation graph.
  nvfuser::Fusion fusion;
  // 2. Use FusionGuard to set this fusion as the active one for IR building.
  nvfuser::FusionGuard fg(&fusion);

  // 3. Define input tensors:
  //    Create symbolic tensor views representing the inputs.
  //    'makeContigTensor(nDims)' creates a contiguous nDims-dimensional
  //    tensor with the default float32 type.
  int nDims = 2;
  nvfuser::TensorView* tv0 = nvfuser::makeContigTensor(nDims);
  nvfuser::TensorView* tv1 = nvfuser::makeContigTensor(nDims);

  // 4. Register inputs:
  //    Mark the created tensors as inputs to the fusion.
  fusion.addInput(tv0);
  fusion.addInput(tv1);

  // 5. Define the computation:
  //    Perform element-wise operations. 'add' creates an addition node;
  //    'IrBuilder::create<Val>(2.0)' creates a scalar constant.
  nvfuser::TensorView* tv2 =
      nvfuser::add(tv1, nvfuser::IrBuilder::create<nvfuser::Val>(2.0));
  nvfuser::TensorView* tv3 = nvfuser::add(tv0, tv2);

  // 6. Register the output:
  //    Mark the final tensor as an output of the fusion.
  fusion.addOutput(tv3);

  // 7. Scheduling transformations:
  //    Apply transformations to optimize execution on the GPU.
  //    Transformations operate on the domains of the TensorViews.

  // Merge the two dimensions of the output tensor tv3 into one.
  // Original: [I0, I1] -> after merge(0): [I0*I1]
  tv3->merge(0);

  // Split the merged dimension into blocks of size 128.
  // [I0*I1] -> split(0, 128): [ceilDiv(I0*I1, 128), 128]
  tv3->split(0, 128);

  // Optionally apply further splits, e.g. for unrolling.
  // [ceilDiv(I0*I1, 128), 128] -> split(0, 4): [ceilDiv(I0*I1, 128*4), 4, 128]
  tv3->split(0, 4);

  // 8. Apply computeAt and inlining (common scheduling steps):
  //    Define where intermediate tensors are computed relative to the output.
  //    This controls loop nesting and fusion. 'ComputeAtMode::MostInlined'
  //    inlines producers as deeply as possible into the consumer's loop nest.
  nvfuser::TensorView* tv_inputs[] = {tv0, tv1};
  for (auto* tv_input : tv_inputs) {
    tv_input->computeAt(tv3, -1, nvfuser::ComputeAtMode::MostInlined);
  }

  // Inline element-wise operations where possible.
  nvfuser::inlineMost();

  // 9. Parallelization:
  //    Map tensor axes to GPU hardware dimensions (blockIdx, threadIdx).
  tv3->axis(0)->parallelize(nvfuser::ParallelType::BIDx);   // outer dim -> blockIdx.x
  tv3->axis(1)->parallelize(nvfuser::ParallelType::Unroll); // middle dim -> loop unrolling
  tv3->axis(2)->parallelize(nvfuser::ParallelType::TIDx);   // inner dim -> threadIdx.x

  // Propagate parallelization to producers (often needed after computeAt).
  nvfuser::scheduler_utils::parallelizeAllLike(tv3);

  // 10. Compilation and execution (conceptual):
  //     The fusion definition is now ready. The next steps (not shown here)
  //     would involve:
  //     - lowering the Fusion IR to Kernel IR (GpuLower),
  //     - generating CUDA code (codegen::generateCudaKernel),
  //     - compiling the generated CUDA code (NVRTC), and
  //     - executing the compiled kernel with actual inputs (KernelExecutor).
}
```

### How Scheduling Optimizes the Code

The scheduling steps (`merge`, `split`, `computeAt`, `inlineMost`, `parallelize`) are crucial for performance. They transform the initial, straightforward representation of the computation into a structure optimized for GPU execution. Here's a conceptual breakdown:

* **Without Scheduling (Conceptual):** nvFuser might generate separate, simple kernels for each operation:

```cuda
// Kernel 1: Compute tv2 = tv1 + 2.0
__global__ void kernel1(float* tv1, float* tv2, /*...sizes...*/) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < total_elements) {
    tv2[idx] = tv1[idx] + 2.0f;
  }
}

// Kernel 2: Compute tv3 = tv0 + tv2
__global__ void kernel2(float* tv0, float* tv2, float* tv3, /*...sizes...*/) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < total_elements) {
    tv3[idx] = tv0[idx] + tv2[idx];
  }
}

// Host code would launch kernel1, wait, then launch kernel2.
// tv2 is written to and read back from global memory.
```

* **With Scheduling (Conceptual):** The scheduling steps aim to create a single, optimized, fused kernel:

```cuda
__global__ void fused_kernel(float* tv0, float* tv1, float* tv3, /*...sizes...*/) {
  // blockIdx.x comes from tv3->axis(0)->parallelize(BIDx);
  // threadIdx.x comes from tv3->axis(2)->parallelize(TIDx).
  // The loop for the unrolled axis (tv3->axis(1)) is expanded by the compiler.

  for (int unroll_idx = 0; unroll_idx < 4; ++unroll_idx) { // loop from tv3->axis(1) (Unroll)
    // Calculate the global index from blockIdx, threadIdx, and unroll_idx.
    int idx = calculate_global_index(blockIdx.x, threadIdx.x, unroll_idx, /*...sizes...*/);

    if (idx < total_elements) {
      // Inlining/fusion via computeAt:
      // tv2 = tv1 + 2.0 is computed directly here, potentially in a register.
      float tv2_val = tv1[idx] + 2.0f;
      // tv3 = tv0 + tv2 uses the register value.
      tv3[idx] = tv0[idx] + tv2_val;
    }
  }
}

// Host code launches just one kernel.
// The intermediate tv2 potentially lives only in registers, avoiding global memory.
```

**Summary of Impacts:**

1. **`merge`/`split`:** Directly define the **loop structure** (nesting, bounds) in the CUDA kernel.
2. **`computeAt`/`inlineMost`:** Enable **operator fusion**, putting multiple operations within the same loop nest. This reduces kernel launches and keeps intermediate data in **registers/shared memory**, minimizing slow global memory access.
3. **`parallelize`:** Maps the abstract loops defined by merge/split onto the physical **GPU threads and blocks** (`threadIdx`, `blockIdx`) and utilizes hardware features like **vectorization** (`float4`) or **loop unrolling**.
**File:** `ai_agent_notes/nvfuser_test_notes.md`

# NVFuser Test Process

## Three Required Steps

1. Find container name:
```bash
docker ps
# Look for container running nvfuser-dev:csarofeen image
```

2. Build (if needed):
```bash
docker exec <container_name> /bin/bash -c "cd /opt/pytorch/Fuser && pip install . -v"
```

3. Run test with output capture:
```bash
docker exec <container_name> /bin/bash -c "/opt/pytorch/Fuser/bin/test_nvfuser --gtest_filter='TEST_FILTER' ; echo '=== TEST COMPLETE ==='" > local_test_log.txt 2>&1
```

## Building and Running the sinh_libtorch Example

1. **Find the Docker Container**:
```bash
docker ps
# Look for container running nvfuser-dev:csarofeen image
```

2. **Check the Example Directory**: note that each `docker exec` starts a fresh shell, so a standalone `cd` does not persist; the `cd` is therefore repeated in the build and run commands below.
```bash
docker exec <container_name> /bin/bash -c "ls /opt/pytorch/Fuser/examples/sinh_libtorch"
```

3. **Build the Example**:
```bash
docker exec <container_name> /bin/bash -c "cd /opt/pytorch/Fuser/examples/sinh_libtorch && make"
```

4. **Run the Example**:
```bash
docker exec <container_name> /bin/bash -c "cd /opt/pytorch/Fuser/examples/sinh_libtorch && ./sinh_example"
```

5. **Verify Output**:
- Ensure the output matches the expected results as shown in the console.

## Notes
- Replace `TEST_FILTER` with the specific test name (e.g. `NVFuserTest.FusionMagicSchedulerInstanceNormalization_CUDA`)
- Replace `<container_name>` with the container name reported by `docker ps`
- Output will be saved to `local_test_log.txt`
- Ensure the correct installation paths for Torch and nvFuser are set in the environment. This may involve checking the `CMAKE_PREFIX_PATH`, `Torch_DIR`, and `Nvfuser_DIR` variables in the CMake configuration.
**File:** `ai_agent_notes/presegmenter_crash_analysis.md`

# Analysis of Pre-Segmenter Crash during Parallel Test Runs

Date: 2024-07-16

## Observation

When running the test suite with the filter `*Scheduler*` in parallel across 4 GPUs using the `run_multiple_times.sh` script, segfaults (SIGSEGV) or bus errors (SIGBUS) were observed intermittently on GPUs 1, 2, and 3. GPU 0 consistently completed without crashing.

The crashes consistently occurred during the execution of the `ResizeTest.SliceReduceScheduler2` test case.

Based on the added debug logging, the crash point was isolated to occur *during* the execution of the pre-segmenter passes, specifically within the call stack originating from:

```c++
// In FusionKernelRuntime::FusionKernelRuntime constructor
preseg_passes::OptimizationPass<preseg_passes::PreSegmenter>::runPass(fusion.get());
```

The crash happens after the `[RUNTIME CONSTRUCTOR] After NVF_ERROR` log and before the first `[PreSegmenter] Running ...` log message, indicating the failure is either in the setup of `OptimizationPass::runPass` or very early in the first pass executed by `PreSegmenter`.

* **Initial Crash Point:** In the first iteration observed, the crash on GPUs 1, 2, and 3 consistently occurred *during* the execution of the `TranslateRepeatToExpand` pass (i.e., after `[PreSegmenter] Running TranslateRepeatToExpand...` but before `[PreSegmenter] Finished TranslateRepeatToExpand.`).
* **Consistent Crash Point:** A second observation confirmed that the crash on GPUs 1, 2, and 3 again occurred during the `TranslateRepeatToExpand` pass. This strongly suggests the issue lies within this specific pass or its interaction with concurrent execution.
* **Shifted Crash Point (Run 3):** After adding detailed logging within `TranslateRepeatToExpand`, the logs from GPU 1 (which crashed) showed that all pre-segmenter passes, including `TranslateRepeatToExpand`, completed successfully for the `ResizeTest.SliceReduceScheduler2` fusion. The crash occurred *after* the line `[RUNTIME CONSTRUCTOR] After preseg_passes::OptimizationPass<preseg_passes::PreSegmenter>::runPass` but *before* the next major step logged (`[RUNTIME CONSTRUCTOR] Preparing runtime order.`). This pinpoints the issue to the transition between the pre-segmenter phase and the runtime preparation phase within the `FusionKernelRuntime` constructor.
* **Increased Variability (Run 3):** In this run, GPU 3 *passed* the `ResizeTest.SliceReduceScheduler2` test. GPU 2 failed with an assertion in a different, earlier test (`ResizeSchedulerTest.PropagateMultipleSlicesToInputs6`) and did not reach the target test.

## Analysis

* **Inconsistent Occurrence:** The crash does not happen on every GPU or every run, suggesting a race condition or memory corruption issue related to concurrency. The variability increased in the latest run.
* **Non-Parallel Test Failure:** The specific test `ResizeTest.SliceReduceScheduler2` likely has only one segment, meaning it does *not* utilize the intra-fusion parallel compilation thread pool. However, the crash still occurs when the *global* parallel compilation setting is enabled.
* **Inter-Process Interference:** Since tests run in separate processes for each GPU, direct shared memory between the tests is unlikely. However, the concurrency might be causing issues through:
* **Driver/Runtime Contention:** Concurrent interactions with the CUDA driver, NVRTC, or CUDA context management might be non-thread-safe or expose driver bugs.
* **Filesystem/Resource Contention:** Less likely, but contention for temporary files or other system resources could play a role.
* **Shared State:** Although processes are separate, there might be unforeseen shared state at a lower level (e.g., within the driver, system libraries, or potentially static initialization issues within nvFuser itself if not handled carefully across processes, though this is less common).
* **Initialization Order / Timing:** The act of enabling the parallel compile feature globally (even if not used by the specific crashing test) might alter the initialization order or timing of certain components, exposing a latent bug in the pre-segmenter passes or the resources they access.
* **State Corruption:** The latest logs strongly suggest that the pre-segmenter passes themselves complete, but potentially leave the `Fusion` object or related state (e.g., memory managed by the `Fusion` object) in a corrupted state. This corruption leads to a crash when subsequent operations in the runtime constructor (like `prepareRuntimeOrder`) attempt to use this state.

## Next Steps

1. **Instrument `FusionKernelRuntime` Constructor:** Add logging immediately *before* and *after* the call to `preseg_passes::OptimizationPass<preseg_passes::PreSegmenter>::runPass(fusion.get());`.
2. **Instrument `prepareRuntimeOrder`:** Add logging at the very beginning of the `prepareRuntimeOrder` method within the `FusionKernelRuntime` constructor. This will help confirm if the crash occurs exactly between the pre-segmenter execution and the start of runtime preparation.

## Separate Issue

The tests `NVFuserTest.FusionMagicSchedulerLayerNormalization_CUDA` and `NVFuserTest.FusionMagicSchedulerRMSNormalization_CUDA` consistently fail on all GPUs with a scheduling logic error (`Could not schedule fusion with the SchedulerType: inner_persistent`). This appears unrelated to the segfault and needs separate investigation.