terrykong · Aug 21, 2025 · Aug 22, 2025
diff --git a/.coderabbit.yaml b/.coderabbit.yaml
@@ -0,0 +1,101 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
+# https://docs.coderabbit.ai/getting-started/configure-coderabbit/
+# Validator https://docs.coderabbit.ai/configuration/yaml-validator#yaml-validator
+# In PR, comment "@coderabbitai configuration" to get the full config including defaults
+# Set the language for reviews by using the corresponding ISO language code.
+# Default: "en-US"
+language: "en-US"
+# Settings related to reviews.
+# Default: {}
+reviews:
+  # Set the profile for reviews. The assertive profile yields more feedback, some of which may be considered nitpicky.
+  # Options: chill, assertive
+  # Default: "chill"
+  profile: chill
+  # Add this keyword in the PR/MR title to auto-generate the title.
+  # Default: "@coderabbitai"
+  auto_title_placeholder: '@coderabbitai title'
+  # Auto Title Instructions - Custom instructions for auto-generating the PR/MR title.
+  # Default: ""
+  auto_title_instructions: 'Format: "<category>: <title>". Category must be one of: feat, fix, docs, style, refactor, perf, test, build, ci, chore, revert, cp. The category must be followed by a colon. Title should be concise (<= 80 chars). Example: "feat: Add logit_bias support".' # current: ''
+  # Set the commit status to 'pending' when the review is in progress and 'success' when it is complete.
+  # Default: true
+  commit_status: false
+  # Generate walkthrough in a markdown collapsible section.
+  # Default: false
+  collapse_walkthrough: true
+  # Generate an assessment of how well the changes address the linked issues in the walkthrough.
+  # Default: true
+  assess_linked_issues: true
+  # Include possibly related issues in the walkthrough.
+  # Default: true
+  related_issues: true
+  # Related PRs - Include possibly related pull requests in the walkthrough.
+  # Default: true
+  related_prs: true
+  # Suggest labels based on the changes in the pull request in the walkthrough.
+  # Default: true
+  suggested_labels: true
+  # Suggest reviewers based on the changes in the pull request in the walkthrough.
+  # Default: true
+  suggested_reviewers: true
+  # Generate a poem in the walkthrough comment.
+  # Default: true
+  poem: false # current: true
+  # Post review details on each review. Additionally, post a review status when a review is skipped in certain cases.
+  # Default: true
+  review_status: false # current: true
+  # Configuration for pre merge checks
+  # Default: {}
+  pre_merge_checks:
+    # Custom Pre-merge Checks - Add unique checks to enforce your team's standards before merging a pull request. Each check must have a unique name (up to 50 characters) and clear instructions (up to 10000 characters). Use these to automatically verify coding, security, documentation, or business rules and maintain code quality.
+    # Default: []
+    custom_checks:
+      - name: "Test Results for Major Changes"
+        mode: "warning"  # or "error" to block merges
+        instructions: |
+          If this PR contains major changes (such as new features, breaking changes, or significant refactoring), verify that the PR description includes test results or testing information.
+          If a change could affect numerics or convergence, the PR description should include information demonstrating that there is no regression.
+          If a change could affect performance, the PR description should include before-and-after performance numbers, as well as the configuration and context in which they apply.
+          Pass if test results are documented or if the changes are minor.
+  # Configuration for auto review
+  # Default: {}
+  auto_review:
+    # Automatic Incremental Review - Automatic incremental code review on each push
+    # Default: true
+    auto_incremental_review: false # current: true
+    # Review draft PRs/MRs.
+    # Default: false
+    drafts: false
+    # Base branches (other than the default branch) to review. Accepts regex patterns. Use '.*' to match all branches.
+    # Default: []
+    base_branches: ["main", "r[0-9].*"] # current: []
+# Configuration for knowledge base
+# Default: {}
+knowledge_base:
+  # CodeRabbit will analyse and learn from your organization's code guidelines, which you can mention in the file patterns section. These guidelines will then be used to conduct thorough code reviews.
+  # Default: {}
+  code_guidelines:
+    # Enabled - Enable CodeRabbit to enforce your organization's coding standards during reviews.
+    # Default: true
+    enabled: true
+    # File Patterns - Specify files for your coding guideline documents in this section. CodeRabbit will scan these files to understand your team's standards and apply them during code reviews. Multiple files supported. File names are case-sensitive. Common files like: (**/.cursorrules, .github/copilot-instructions.md, .github/instructions/*.instructions.md, **/CLAUDE.md, **/GEMINI.md, **/.cursor/rules/*, **/.windsurfrules, **/.clinerules/*, **/.rules/*, **/AGENT.md, **/AGENTS.md) are included by default.
+    # Default: []
+    filePatterns: # current: []
+      - "**/CODING_GUIDELINES.md"
+      - "**/.cursor/rules/*"
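
The `base_branches` entries in `.coderabbit.yaml` above are regex patterns, so it can help to check which branch names they would select. The sketch below is illustrative only: it assumes each pattern must match the whole branch name (hence `re.fullmatch`), which the config itself does not state, and `is_reviewed_base` is a hypothetical helper, not part of CodeRabbit.

```python
import re

# Patterns copied from `base_branches` in .coderabbit.yaml.
BASE_BRANCH_PATTERNS = ["main", "r[0-9].*"]


def is_reviewed_base(branch: str) -> bool:
    """Return True if `branch` matches any configured base-branch pattern.

    Assumption: a pattern has to match the entire branch name. If the real
    matcher does an unanchored search instead, more names would match.
    """
    return any(re.fullmatch(pattern, branch) for pattern in BASE_BRANCH_PATTERNS)


if __name__ == "__main__":
    for name in ["main", "r0.4.0", "release", "dev"]:
        print(f"{name}: {is_reviewed_base(name)}")
```

Under the whole-name assumption, a release branch such as `r0.4.0` is selected while `release` is not, because `r` must be followed by a digit.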
diff --git a/.dockerignore b/.dockerignore
@@ -1,6 +1,8 @@
 # Adding to .gitignore helps reduce the size of your working_dir
 
-.git
+# Note: removing .git from .dockerignore since it is valuable to have the git history to
+#       know where this container was built
+# .git
 *.out
 *.log
 *.tar

diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
diff --git a/.github/RENOVATE_SETUP.md b/.github/RENOVATE_SETUP.md
@@ -0,0 +1,180 @@
+# Renovate Setup Documentation
+
+This repository uses [Renovate](https://docs.renovatebot.com/) to automatically update dependencies, including git submodules and Python packages managed in `pyproject.toml`.
+
+## What Renovate Does
+
+Renovate automatically:
+1. **Updates git submodules** by tracking the configured branches
+2. **Updates a small allowlist of Python dependencies** in `pyproject.toml`:
+   - `vllm`, `torch`, and `ray` for the core training stack
+   - `transformer-engine` and `flash-attn` for xformers compatibility
+   - `transformers` so we can track upstream releases
+   - _Everything else is frozen unless explicitly requested._
+3. **Syncs `3rdparty/*/setup.py` files** with their corresponding submodule dependencies
+4. **Regenerates `uv.lock`** after dependency updates
+5. **Pre-clones git submodules with full history** so Renovate can check out new commits (works around `shallow=true` in `.gitmodules`)
+6. **Creates a single PR** that automatically triggers the full CI pipeline (`cicd-main.yml`)
+
+## Setup Requirements
+
+You need to set up authentication for Renovate. Choose one of the following options:
+
+### Option 1: Personal Access Token (PAT) - Quick Start
+
+**This is the easiest way to get started:**
+
+1. Create a GitHub Personal Access Token (PAT):
+   - Go to GitHub Settings → Developer settings → Personal access tokens → Tokens (classic)
+   - Click "Generate new token (classic)"
+   - Give it a descriptive name (e.g., "Renovate Bot")
+   - Select scopes:
+     - ✅ `repo` (Full control of private repositories)
+     - ✅ `workflow` (Update GitHub Action workflows - required for github-actions manager)
+   - Click "Generate token" and copy it
+
+2. Add the token as a repository secret:
+   - Go to your repository → Settings → Secrets and variables → Actions
+   - Click "New repository secret"
+   - Name: `RENOVATE_TOKEN`
+   - Value: Paste your PAT
+   - Click "Add secret"
+
+3. You're done! The workflow will use the PAT automatically.
+
+### Option 2: GitHub App (Recommended for Organizations)
+
+**Better for rate limits and security, but requires more setup:**
+
+1. Create a GitHub App:
+   - Go to Organization Settings → Developer settings → GitHub Apps → New GitHub App
+   - Or use an existing Renovate GitHub App
+
+2. Configure the app with these permissions:
+   - Repository permissions:
+     - Contents: Read & Write
+     - Pull requests: Read & Write
+     - Workflows: Read & Write (if using github-actions manager)
+     - Metadata: Read-only
+
+3. Install the app on your repository
+
+4. Add these secrets to your repository:
+   - `RENOVATE_APP_ID`: The app ID (found on the app's settings page)
+   - `RENOVATE_APP_PRIVATE_KEY`: The app's private key (PEM format)
+
+5. The workflow will automatically detect and use the GitHub App token
+
+### Grant Workflow Permissions
+
+Ensure the Renovate workflow has permission to:
+- Create and update pull requests
+- Read and write to the repository
+- Access secrets
+
+This can be configured in: `Settings` → `Actions` → `General` → `Workflow permissions`
+
+## Configuration Files
+
+### `.github/renovate.json`
+Main configuration file that defines:
+- Update schedule (daily during business hours PST)
+- Package grouping rules
+- Branch naming conventions
+- PR labels (`dependencies`, `CI:L2`)
+
+### `.github/workflows/renovate.yml`
+GitHub Actions workflow that:
+- Runs daily at 9 AM UTC (1 AM PST / 2 AM PDT)
+- Can be manually triggered with `workflow_dispatch`
+- Sets up the environment (Python, uv)
+- Executes Renovate with proper credentials
+
+### `.github/scripts/sync_submodule_dependencies.py`
+Python script that:
+- Reads dependencies from `3rdparty/*/pyproject.toml` files in submodules
+- Updates `CACHED_DEPENDENCIES` in corresponding `setup.py` files
+- Ensures consistency between submodule requirements and wrapper packages
+
+### `.github/scripts/renovate_post_update.sh`
+Bash script that runs after Renovate updates dependencies:
+1. Syncs submodule dependencies to setup.py files
+2. Runs `uv lock` to regenerate the lock file
+3. Stages changes for commit
+
+## Manual Workflow Trigger
+
+You can manually trigger Renovate at any time:
+
+1. Go to `Actions` → `Renovate` in GitHub
+2. Click `Run workflow`
+3. Optional parameters:
+   - **Log level**: Set to `debug` for verbose output
+   - **Dry run**: Enable to preview changes without creating PRs
+
+## Update Strategy
+
+Renovate now produces **one consolidated PR at a time**:
+
+| Branch prefix | Contents | Notes |
+|---------------|----------|-------|
+| `renovate/allowlist-…` | Git submodules, Docker/GitHub Action updates, and the allowlisted Python packages above | Runs on the configured weekday schedule; no other dependencies are touched until explicitly re-enabled. Renovate's built-in vulnerability PRs are disabled so everything funnels through this branch. |
+
+## Debug vs. Production Settings
+
+- `prHourlyLimit` is currently `0` **only while debugging** so Renovate can recreate PRs immediately. Set it back to `1` once we're satisfied with the configuration to avoid noisy PR bursts.
+- `prConcurrentLimit` stays at `1` to preserve the "one PR at a time" contract; raise it temporarily if you ever need parallel testing.
+
+## CI Integration
+
+When Renovate creates a PR:
+1. The PR is automatically labeled with `CI:L2` to trigger full CI testing
+2. `cicd-main.yml` runs the complete test suite
+3. All L2 tests must pass before the PR can be merged
+4. The lock file and setup.py changes are included in the PR
+
+## Troubleshooting
+
+### Renovate workflow fails
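+- Check that the secrets for your chosen authentication option are set: `RENOVATE_TOKEN` (Option 1) or `RENOVATE_APP_ID` and `RENOVATE_APP_PRIVATE_KEY` (Option 2)
+- If you use the GitHub App, verify it is installed on the repository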
+- Check workflow logs for specific error messages
+
+### Dependencies not syncing
+- Ensure submodules are properly initialized
+- Check `.github/scripts/sync_submodule_dependencies.py` logs
+- Verify that submodule `pyproject.toml` files exist and are valid
+
+### uv lock fails
+- Ensure `uv` version in workflow matches project requirements
+- Check for dependency conflicts in the update
+- Review the post-update script logs
+
+### PRs not triggering CI
+- Verify PR has the `CI:L2` label
+- Check `cicd-main.yml` configuration
+- Ensure PR is targeting the `main` branch
+
+## Customization
+
+To modify Renovate behavior:
+1. Edit `.github/renovate.json` for scheduling, grouping, or update rules
+2. Update `.github/workflows/renovate.yml` for workflow settings
+3. Modify `.github/scripts/renovate_post_update.sh` for custom post-update logic
+
+## Testing Changes
+
+Before committing Renovate config changes:
+1. Use the workflow's dry-run mode to test
+2. Check the Renovate logs for validation errors
+3. Test the post-update script locally:
+   ```bash
+   .github/scripts/renovate_post_update.sh
+   ```
+
+## References
+
+- [Renovate Documentation](https://docs.renovatebot.com/)
+- [Renovate Configuration Options](https://docs.renovatebot.com/configuration-options/)
+- [GitHub Action for Renovate](https://github.com/renovatebot/github-action)
+