himanshushukla12 · May 7, 2026 · Mar 17, 2026
diff --git a/.ai/AGENTS.md b/.ai/AGENTS.md
@@ -0,0 +1,97 @@
+# AGENTS.md
+
+## Repository-specific guidance
+
+### Main code vs experimental code
+
+The repository is separated into **main code** and **experimental code**.
+
+* **Main code** should remain stable, consistent, and well-tested.
+* **Experimental code** may be less stable and may contain inconsistent patterns or limited testing.
+
+Small non-invasive improvements that make experimental code more consistent with the main codebase are encouraged, but avoid large refactors.
+
+### Paper implementations
+
+If a PR implements a method, algorithm, or training approach from a research paper, it must also add a corresponding subsection to `paper_index.md`.
+
+When reviewing such PRs, ensure that `paper_index.md` was updated.
+
+### Code duplication and consistency
+
+Trainers in this repository are **self-contained by design**. Shared logic (generation, reward computation, metric logging, weight syncing, etc.) is deliberately duplicated across trainers rather than abstracted into a shared base class.
+
+This is intentional: each trainer must be readable, modifiable, and evolvable in isolation. The base class (`_BaseTrainer`) provides only minimal utilities (model card generation). Everything else — vLLM generation paths, `_get_per_token_logps_and_entropies`, `_calculate_rewards`, `_prepare_inputs`, metric logging — is copied in full.
+
+**The tradeoff**: duplication is accepted, but **consistency is mandatory**. When the same logic appears in multiple trainers, the duplicated blocks must stay aligned:
+
+- Same variable names (`self._last_loaded_step`, `self._metrics[mode]`, …)
+- Same control flow structure (if/elif/else branches in the same order)
+- Same comments (word-for-word when the logic is identical)
+- Divergences only where the trainer's semantics require it (e.g., GRPO extracts logprobs from vLLM, RLOO discards them)
+
+**Consistency over correctness**: this is a strong requirement. When duplicating code, reproduce it exactly — even if you believe the original has a bug. Do not silently fix the issue in your copy. Instead, keep your copy consistent with the source and report the problem so it can be fixed across all trainers in a dedicated PR. A correct-but-inconsistent codebase is harder to maintain than a consistently-wrong one that can be fixed in a single sweep.
+
+**When modifying duplicated code**: if you change a pattern that exists in multiple trainers (e.g., the vLLM generation path in `_generate_single_turn`), apply the same change to all other trainers. A fix in GRPO often implies the same fix in RLOO, and vice versa. Not propagating a change is a bug.
+
+**When reviewing**: if a PR touches duplicated logic, verify that all copies are updated consistently. A common mistake is fixing one trainer and forgetting the others.
+
+### Simplicity
+
+This codebase values **leanness and simplicity above all**. Prefer straightforward, inline code over abstractions, helpers, or utilities — even at the cost of some robustness or generality.
+
+Concretely:
+
+- Do not add layers of indirection (registries, factory patterns, plugin systems). A contributor should be able to read a trainer top to bottom and understand the full flow.
+- Prefer a simple implementation that covers 90% of cases over a complex one that covers 100%. A function that handles the common path in 20 lines is better than a catch-all that handles every edge case in 80.
+- Do not add defensive code, fallback paths, or configuration options "just in case". Only handle cases that actually exist today.
+- Avoid `hasattr` and `getattr`. Their use is almost always a symptom of overly defensive programming or a disguised version check (e.g., "this attribute was added in version X"). Instead, either drop the conditional entirely or express the version check explicitly with a version comparison. There is nearly always a cleaner alternative.
+- When in doubt, prefer less code. Every new function, parameter, or branch is maintenance burden. The best abstraction is often no abstraction.
+
+## Documentation
+
+### Docstrings
+
+Docstrings must follow the repository format below. Do **not** convert docstrings to other styles (Google, NumPy, etc.).
+
+Rules:
+
+* Types appear in backticks inside parentheses: (`str`)
+* Optional parameters are marked with `*optional*`
+* Defaults are written as: `defaults to <value>`
+* When the default is `None`, prefer ```(`str`, *optional*)``` instead of ```(`str` or `None`, *optional*, defaults to `None`)```
+* Union types use `or`: `str` or `None`
+* References to classes use the format: [`~transformers.PreTrainedModel`]
+* Class docstrings may group parameters using headers such as: `> Parameters for X:`
+
+Example:
+
+````python
+def method(self, param1: str, param2: int = 1, param3: float | None = None):
+    """
+    Brief one-line description of what this does.
+
+    Args:
+        param1 (`str`):
+            Description of required param.
+        param2 (`int`, *optional*, defaults to `1`):
+            Description of optional param with default.
+        param3 (`float`, *optional*):
+            Description of optional param without explicit default.
+
+    Returns:
+        `dict` with keys:
+            - `key1` (`list[int]`):
+                Description of this key.
+
+    Examples:
+
+    ```python
+    >>> my_func("hello")
+    ```
+    """
+````
+
+### Links to papers
+
+When linking to papers, use `https://huggingface.co/papers/<id>` instead of `https://arxiv.org/abs/<id>` (both sites use the same paper ID).
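A note on the `hasattr`/`getattr` guidance in the Simplicity section of `.ai/AGENTS.md` above: the sketch below contrasts a disguised version check with an explicit one. All names here (`parse_version`, `has_feature_defensive`, `has_feature_explicit`, the `new_option` attribute, and the `(4, 56)` threshold) are hypothetical and not taken from the TRL codebase. In real code `packaging.version.Version` would likely replace the hand-rolled parser.

```python
import re


def parse_version(version: str) -> tuple[int, ...]:
    """Keep only the leading numeric release segment, e.g. "4.56.0.dev0" -> (4, 56, 0)."""
    match = re.match(r"\d+(\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def has_feature_defensive(config: object) -> bool:
    # Discouraged style: the attribute probe hides the real question,
    # which is "is the installed library new enough?"
    return hasattr(config, "new_option")


def has_feature_explicit(installed_version: str) -> bool:
    # Preferred style: the version requirement is stated outright
    # (hypothetical threshold, chosen only for illustration).
    return parse_version(installed_version) >= (4, 56)
```

The explicit form also makes the branch easy to delete later: once the minimum supported version passes the threshold, the condition is visibly dead code.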
diff --git a/.cursor/BUGBOT.md b/.cursor/BUGBOT.md
@@ -0,0 +1 @@
+../.ai/AGENTS.md
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
@@ -14,18 +14,22 @@ Once you're done, someone will review your PR shortly. They may suggest changes
 
 Fixes # (issue)
 
-
 ## Before submitting
+
 - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
-- [ ] Did you read the [contributor guideline](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md#create-a-pull-request),
-      Pull Request section?
-- [ ] Was this discussed/approved via a GitHub issue? Please add a link
-      to it if that's the case.
+- [ ] Did you read the [contributor guideline](https://github.com/huggingface/trl/blob/main/CONTRIBUTING.md#create-a-pull-request), Pull Request section?
+- [ ] Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
 - [ ] Did you make sure to update the documentation with your changes?
 - [ ] Did you write any new necessary tests?
 
+## AI writing disclosure
+
+We welcome the use of AI tools to help with contributions. For transparency and to help us improve our review process, please indicate the level of AI involvement in this PR.
+
+- [ ] No AI usage: the PR was written entirely by a human.
+- [ ] AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
+- [ ] AI-generated: the PR was mostly or fully generated by an AI tool.
 
 ## Who can review?
 
-Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
-members/contributors who may be interested in your PR.
+Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
diff --git a/.github/workflows/build_documentation.yml b/.github/workflows/build_documentation.yml
@@ -7,13 +7,15 @@ on:
       - doc-builder*
       - v*-release
 
+env:
+  TRL_EXPERIMENTAL_SILENCE: 1
+
 jobs:
    build:
-    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
+    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391  # main
     with:
       commit_sha: ${{ github.sha }}
       package: trl
       version_tag_suffix: ""
-      custom_container: huggingface/transformers-doc-builder
     secrets:
       hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
diff --git a/.github/workflows/build_pr_documentation.yml b/.github/workflows/build_pr_documentation.yml
@@ -3,17 +3,19 @@ name: Build PR Documentation
 on:
   pull_request:
 
+env:
+  TRL_EXPERIMENTAL_SILENCE: 1
+
 concurrency:
   group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
   cancel-in-progress: true
 
 jobs:
   build:
     if: github.event.pull_request.draft == false
-    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
+    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@2430c1ec91d04667414e2fa31ecfc36c153ea391  # main
     with:
       commit_sha: ${{ github.event.pull_request.head.sha }}
       pr_number: ${{ github.event.number }}
       package: trl
       version_tag_suffix: ""
-      custom_container: huggingface/transformers-doc-builder
diff --git a/.github/workflows/clear_cache.yml b/.github/workflows/clear_cache.yml
@@ -10,7 +10,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Check out code
-        uses: actions/checkout@v4
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
 
       - name: Cleanup
         run: |

diff --git a/.github/workflows/codeQL.yml b/.github/workflows/codeQL.yml
@@ -14,13 +14,13 @@ jobs:
 
     steps:
       - name: "Checkout repository"
-        uses: actions/checkout@v4
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
 
       - name: "Initialize CodeQL"
-        uses: github/codeql-action/init@v2
+        uses: github/codeql-action/init@b8d3b6e8af63cde30bdc382c0bc28114f4346c88  # v2
         with:
           languages: "yaml"
           queries: +security-and-quality, ./.github/codeql/custom-queries.qls
 
       - name: "Perform CodeQL Analysis"
-        uses: github/codeql-action/analyze@v2
+        uses: github/codeql-action/analyze@b8d3b6e8af63cde30bdc382c0bc28114f4346c88  # v2
diff --git a/.github/workflows/docker-build.yml b/.github/workflows/docker-build.yml
@@ -1,95 +1,86 @@
-name: Build Docker images (scheduled)
+name: Build TRL Docker image
 
 on:
+  push:
+    branches:
+      - main
   workflow_dispatch:
-  workflow_call:
-  schedule:
-    - cron: "0 1 * * *"
 
 concurrency:
   group: docker-image-builds
   cancel-in-progress: false
 
-env:
-  CI_SLACK_CHANNEL: ${{ secrets.CI_DOCKER_CHANNEL }}
-
 jobs:
-  trl-latest:
-    name: "Latest TRL GPU"
-    runs-on: ubuntu-latest
+  trl:
+    name: "Build and push TRL Docker image"
+    runs-on:
+      group: aws-general-8-plus
     steps:
-      - name: Cleanup disk
+      - name: Checkout code
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
+
+      - name: Get TRL version from PyPI
         run: |
-          sudo ls -l /usr/local/lib/
-          sudo ls -l /usr/share/
-          sudo du -sh /usr/local/lib/
-          sudo du -sh /usr/share/
-          sudo rm -rf /usr/local/lib/android
-          sudo rm -rf /usr/share/dotnet
-          sudo du -sh /usr/local/lib/
-          sudo du -sh /usr/share/
+          VERSION=$(curl -s https://pypi.org/pypi/trl/json | jq -r .info.version)
+          echo "VERSION=$VERSION" >> $GITHUB_ENV
+
       - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v1
-      - name: Check out code
-        uses: actions/checkout@v4
+        uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f  # v3
+
       - name: Login to DockerHub
-        uses: docker/login-action@v1
+        uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9  # v3
         with:
           username: ${{ secrets.DOCKERHUB_USERNAME }}
           password: ${{ secrets.DOCKERHUB_PASSWORD }}
 
-      - name: Build and Push GPU
-        uses: docker/build-push-action@v4
+      - name: Build and Push
+        uses: docker/build-push-action@10e90e3645eae34f1e60eeb005ba3a3d33f178e8  # v6
         with:
-          context: ./docker/trl-latest-gpu
+          context: docker/trl
           push: true
-          tags: huggingface/trl-latest-gpu
+          tags: |
+            huggingface/trl:${{ env.VERSION }}
+            huggingface/trl
 
       - name: Post to Slack
         if: always()
-        uses: huggingface/hf-workflows/.github/actions/post-slack@main
+        uses: huggingface/hf-workflows/.github/actions/post-slack@a88e7fa2eaee28de5a4d6142381b1fb792349b67  # main
         with:
-          slack_channel: ${{ env.CI_SLACK_CHANNEL }}
-          title: 🤗 Results of the trl-latest-gpu Docker Image build
+          slack_channel: ${{ secrets.CI_DOCKER_CHANNEL }}
+          title: 🤗 Results of the TRL Docker Image build
           status: ${{ job.status }}
           slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
 
-  trl-source:
-    name: "Latest TRL + HF ecosystem from source"
-    runs-on: ubuntu-latest
+  trl-dev:
+    name: "Build and push TRL Dev Docker image"
+    runs-on:
+      group: aws-general-8-plus
     steps:
-      - name: Cleanup disk
-        run: |
-          sudo ls -l /usr/local/lib/
-          sudo ls -l /usr/share/
-          sudo du -sh /usr/local/lib/
-          sudo du -sh /usr/share/
-          sudo rm -rf /usr/local/lib/android
-          sudo rm -rf /usr/share/dotnet
-          sudo du -sh /usr/local/lib/
-          sudo du -sh /usr/share/
+      - name: Checkout code
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
+
       - name: Set up Docker Buildx
-        uses: docker/setup-buildx-action@v1
-      - name: Check out code
-        uses: actions/checkout@v4
+        uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f  # v3
+
       - name: Login to DockerHub
-        uses: docker/login-action@v1
+        uses: docker/login-action@c94ce9fb468520275223c153574b00df6fe4bcc9  # v3
         with:
           username: ${{ secrets.DOCKERHUB_USERNAME }}
           password: ${{ secrets.DOCKERHUB_PASSWORD }}
 
-      - name: Build and Push GPU
-        uses: docker/build-push-action@v4
+      - name: Build and Push
+        uses: docker/build-push-action@10e90e3645eae34f1e60eeb005ba3a3d33f178e8  # v6
         with:
-          context: ./docker/trl-source-gpu
+          context: docker/trl-dev
           push: true
-          tags: huggingface/trl-source-gpu
+          tags: |
+            huggingface/trl:dev
 
       - name: Post to Slack
         if: always()
-        uses: huggingface/hf-workflows/.github/actions/post-slack@main
+        uses: huggingface/hf-workflows/.github/actions/post-slack@a88e7fa2eaee28de5a4d6142381b1fb792349b67  # main
         with:
-          slack_channel: ${{ env.CI_SLACK_CHANNEL }}
-          title: 🤗 Results of the trl-source-gpu Docker Image build
+          slack_channel: ${{ secrets.CI_DOCKER_CHANNEL }}
+          title: 🤗 Results of the TRL Dev Docker Image build
           status: ${{ job.status }}
-          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}  
+          slack_token: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
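The `Get TRL version from PyPI` step in the workflow above reads `.info.version` from the PyPI JSON response, and the `Build and Push` step turns it into a versioned tag plus an untagged one. A minimal Python sketch of that logic follows, run against a hard-coded sample payload rather than the network. The function name `image_tags` and the sample version `1.2.3` are illustrative assumptions, not part of the workflow.

```python
import json


def image_tags(pypi_payload: str, repository: str = "huggingface/trl") -> list[str]:
    """Mirror `jq -r .info.version` and the two-entry `tags:` list from the workflow."""
    version = json.loads(pypi_payload)["info"]["version"]
    # First tag carries the released version; the second is the bare repository name.
    return [f"{repository}:{version}", repository]


# Hypothetical, trimmed-down payload; the real response has many more fields.
sample_payload = '{"info": {"name": "trl", "version": "1.2.3"}}'
print(image_tags(sample_payload))
```

Because the version comes from PyPI rather than from the checked-out source, the versioned tag tracks the latest published release, not the commit that triggered the build.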
diff --git a/.github/workflows/issue_auto_labeller.yml b/.github/workflows/issue_auto_labeller.yml
@@ -9,7 +9,7 @@ jobs:
     permissions:
       issues: write
     steps:
-      - uses: actions/checkout@v3
-      - uses: August-murr/auto-labeler@main
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
+      - uses: August-murr/auto-labeler@0.0.1
         with:
             hf-api-key: ${{ secrets.CI_HF_API_TOKEN }}
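Most workflows in this diff pin each `uses:` entry to a full 40-character commit SHA with the tag in a trailing comment, while at least one entry in this file still references a tag. A minimal sketch of a check that distinguishes the two forms is shown below. The helper name `is_sha_pinned` and the regular expression are assumptions for illustration, not tooling that exists in the repository.

```python
import re

# Matches "uses: owner/repo[/path]@<40 lowercase hex chars>" followed by whitespace or end of line.
_PINNED = re.compile(r"uses:\s*\S+@[0-9a-f]{40}(\s|$)")


def is_sha_pinned(line: str) -> bool:
    """Return True when a workflow `uses:` line references a full commit SHA."""
    return _PINNED.search(line) is not None


lines = [
    "uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2",
    "- uses: August-murr/auto-labeler@0.0.1",
]
for line in lines:
    print(is_sha_pinned(line), line)
```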
diff --git a/.github/workflows/pr_style_bot.yml b/.github/workflows/pr_style_bot.yml
@@ -18,7 +18,7 @@ jobs:
     steps:
       - name: Extract PR details
         id: pr_info
-        uses: actions/github-script@v6
+        uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd  # v8
         with:
           script: |
             const prNumber = context.payload.issue.number;
@@ -35,7 +35,7 @@ jobs:
             core.setOutput("headRepoFullName", pr.head.repo.full_name);
 
       - name: Check out PR branch
-        uses: actions/checkout@v3
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd  # v6.0.2
         env: 
           HEADREPOFULLNAME: ${{ steps.pr_info.outputs.headRepoFullName }}
           HEADREF: ${{ steps.pr_info.outputs.headRef }}
@@ -58,7 +58,7 @@ jobs:
           echo "Head Repo Full Name: ${{ env.HEADREPOFULLNAME }}"
 
       - name: Set up Python
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405  # v6
 
       - name: Install dependencies
         run: |
@@ -111,7 +111,7 @@ jobs:
 
       - name: Comment on PR with workflow run link
         if: steps.commit_and_push.outputs.changes_pushed == 'true'
-        uses: actions/github-script@v6
+        uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd  # v8
         with:
           script: |
             const prNumber = parseInt(process.env.prNumber, 10);