Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
```python
for server_idx in range(n_servers):
    server_cmd, num_server_tasks = get_server_command(**server_config, cluster_config=cluster_config)
    server_executor = get_executor(
        cluster_config=cluster_config,
        container=server_container,
        num_nodes=server_config["num_nodes"],
        tasks_per_node=num_server_tasks,
        gpus_per_node=server_config["num_gpus"],
        partition=partition,
        dependencies=dependencies,
        job_name=task_name,
        log_dir=log_dir,
        log_prefix=f"server_{server_idx}" if n_servers > 1 else "server",
        extra_package_dirs=extra_package_dirs,
        sbatch_kwargs=sbatch_kwargs,
        heterogeneous=heterogeneous,
        het_group=het_group,
        total_het_groups=total_het_groups,
        overlap=(not client_num_gpus),  # Only overlap when the main task does not have gpus
        with_ray=False,
        ray_template=ray_template,
    )
    cmd_to_add = server_cmd
    if cluster_config["executor"] != "slurm" and num_server_tasks > 1:
        cmd_to_add = f"mpirun --allow-run-as-root -np {num_server_tasks} bash -c {shlex.quote(server_cmd)}"
    commands.append(cmd_to_add)
    executors.append(server_executor)
    het_group_indices.append(het_group)
    het_group += 1
    LOG.info("Server %d command: %s", server_idx, server_cmd)
```
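The non-SLURM branch wraps the per-server command for `mpirun`; `shlex.quote` keeps the inner command as a single `bash -c` argument even when it contains spaces or quotes. A minimal sketch (the command string here is hypothetical):

```python
import shlex

# Hypothetical server command containing spaces and embedded double quotes.
server_cmd = 'python -m my_server --port 5000 --name "judge 0"'
num_server_tasks = 4

# Without shlex.quote, bash -c would only receive the first token of server_cmd.
wrapped = f"mpirun --allow-run-as-root -np {num_server_tasks} bash -c {shlex.quote(server_cmd)}"
print(wrapped)
```

`shlex.quote` wraps the whole string in single quotes (escaping any embedded single quotes), so the shell passes it through verbatim.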
All servers launched in the loop use the same `server_port` from `server_config`, causing port conflicts when `n_servers > 1`. Each server instance needs a unique port.
Suggested change:

```python
for server_idx in range(n_servers):
    # Get a unique port for each server if launching multiple
    current_server_config = server_config.copy()
    if n_servers > 1:
        current_server_config["server_port"] = get_free_port(strategy="random")
    server_cmd, num_server_tasks = get_server_command(**current_server_config, cluster_config=cluster_config)
    server_executor = get_executor(
        cluster_config=cluster_config,
        container=server_container,
        num_nodes=current_server_config["num_nodes"],
        tasks_per_node=num_server_tasks,
        gpus_per_node=current_server_config["num_gpus"],
        partition=partition,
        dependencies=dependencies,
        job_name=task_name,
        log_dir=log_dir,
        log_prefix=f"server_{server_idx}" if n_servers > 1 else "server",
        extra_package_dirs=extra_package_dirs,
        sbatch_kwargs=sbatch_kwargs,
        heterogeneous=heterogeneous,
        het_group=het_group,
        total_het_groups=total_het_groups,
        overlap=(not client_num_gpus),  # Only overlap when the main task does not have gpus
        with_ray=False,
        ray_template=ray_template,
    )
```
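`get_free_port(strategy="random")` is the repo's helper; a generic way to obtain a free TCP port (a sketch; the helper's actual strategy may differ) is to bind a socket to port 0 and read back the OS-assigned port:

```python
import socket

def find_free_port() -> int:
    # Bind to port 0: the OS picks an unused ephemeral port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = find_free_port()
```

Note the inherent race: the port can be reclaimed between the check and the actual server launch, which is one reason to allocate and record all server ports up front before scheduling.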
📝 Walkthrough

Adds LLM-as-a-judge server support to the GRPO workflow: new CLI server options, server_config/client_server_args construction for local or remote hosting, port allocation, environment injection, and multi-server scheduling integrated into task creation and ordering.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as GRPO CLI
    participant Scheduler as Task Scheduler
    participant GPU as GPU Allocator
    participant Server as Judge Server
    participant Trainer as Training Task
    CLI->>CLI: Parse server CLI params
    CLI->>Scheduler: add_task(server_config, n_servers=N)
    Scheduler->>GPU: Query client_num_gpus
    GPU-->>Scheduler: GPU availability
    alt Server needs GPUs & Client has none
        Scheduler->>Scheduler: schedule server tasks FIRST
        loop N servers
            Scheduler->>Server: create server executor (port/model/gpus)
            Server->>Server: allocate GPUs
        end
        Scheduler->>Trainer: schedule training after servers
    else Client has GPUs
        Scheduler->>Trainer: schedule training task
        loop N servers
            Scheduler->>Server: add server tasks (post-main)
        end
    end
    Scheduler->>Scheduler: set log_prefix, het_groups, overlap flags
    Scheduler-->>CLI: return configured tasks
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@nemo_skills/pipeline/nemo_rl/grpo.py`:
- Around line 405-420: When server_type is provided but you intend to host the
model (server_address is None), ensure server_model is required and non-empty to
avoid passing model_path=None into get_server_command; add a validation check
(e.g., assert or raise ValueError) before building server_config that
server_model is not None/empty, referencing the variables server_type,
server_address, and server_model and the block that constructs server_config so
the code fails fast with a clear message instead of producing "None" in the
command.
In `@nemo_skills/pipeline/utils/exp.py`:
- Around line 530-534: The code should fail-fast on a missing num_gpus key:
replace int(server_config.get("num_gpus", 0)) with
int(server_config["num_gpus"]) in the server_needs_gpus calculation (keep the
existing server_config is not None check), so server_needs_gpus and
server_goes_first reflect the required presence of server_config["num_gpus"]
used later by get_server_command; this ensures a clear KeyError rather than
silently using 0.
- Around line 536-572: The add_server_tasks function currently mutates the
shared server_config by calling server_config.pop("container", ...); instead,
make a shallow copy of server_config at the start of add_server_tasks (e.g.,
local_server_cfg = server_config.copy()), read the container with
local_server_cfg.get("container") and if you need to remove it from what you
pass onward, delete it from the copy (del local_server_cfg["container"]) before
calling get_server_command(**local_server_cfg, cluster_config=cluster_config)
and passing local_server_cfg to other helpers (or pass container separately) so
the original server_config remains unchanged and unexpected keys are not
forwarded to get_server_command.
🧹 Nitpick comments (3)

nemo_skills/pipeline/utils/exp.py (2)

594-612: `client_num_gpus` is reassigned with different semantics; consider a distinct variable name.

Line 531 sets `client_num_gpus = num_gpus or 0` to decide server ordering/overlap. Line 594 reassigns it to `0` when `server_config is not None and num_nodes == 1`, changing the meaning from "does the client need GPUs at all" to "how many GPUs to allocate for the main SLURM srun." This shadowing makes the control flow hard to follow: a reader (or future editor) must track which assignment is live at each usage site. Consider using a distinct name like `main_task_gpus` for the line 594 assignment.

436-436: Consider validating `n_servers >= 1` when `server_config` is provided.

If `n_servers=0` is passed with a non-None `server_config`, no server tasks are created, yet `server_config` is still consumed (e.g., popping `"container"`, computing `server_goes_first`). This would silently produce a misconfigured job. A simple guard early in the function would prevent this.

nemo_skills/pipeline/nemo_rl/grpo.py (1)

409-411: Consider using `raise ValueError` instead of `assert` for argument validation.

`assert` statements can be disabled with `python -O`. While unlikely in a CLI context, using explicit `raise ValueError(...)` or `typer.BadParameter(...)` is more robust and consistent with the existing validation patterns in this file (e.g., line 399).

Also applies to: 428-429
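The distinction matters because `assert` statements are compiled out when Python runs with `-O`, while an explicit `raise` is always enforced. A quick sketch of the two patterns:

```python
def validate_with_assert(server_gpus):
    # Silently skipped when the interpreter runs with -O (asserts stripped).
    assert server_gpus is not None, "Need to specify server_gpus if hosting the model"

def validate_with_raise(server_gpus):
    # Always enforced, regardless of interpreter flags.
    if server_gpus is None:
        raise ValueError("Need to specify server_gpus if hosting the model")
```

Under `python -O`, the first function accepts `None` without complaint; the second fails fast either way.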
```python
# Server configuration for LLM-as-a-judge
server_config = None
if server_type is not None:
    get_random_port = should_get_random_port(server_gpus, exclusive)
    if server_address is None:  # we need to host the model
        assert server_gpus is not None, "Need to specify server_gpus if hosting the model"
        server_port = get_free_port(strategy="random") if get_random_port else 5000

        server_config = {
            "model_path": server_model,
            "server_type": server_type,
            "num_gpus": server_gpus,
            "num_nodes": server_nodes,
            "server_args": server_args,
            "server_port": server_port,
        }
```
Missing validation: `server_model` should be required when `server_type` is specified.

When `server_type` is provided but `server_model` is omitted (defaults to `None`), `model_path=None` is passed into `get_server_command`, producing a command string containing the literal string `"None"`. This would fail at runtime with a confusing error.
Proposed fix

```diff
 server_config = None
 if server_type is not None:
+    if server_model is None:
+        raise ValueError("server_model is required when server_type is specified")
     get_random_port = should_get_random_port(server_gpus, exclusive)
```

🤖 Prompt for AI Agents
In `@nemo_skills/pipeline/nemo_rl/grpo.py` around lines 405 - 420, When
server_type is provided but you intend to host the model (server_address is
None), ensure server_model is required and non-empty to avoid passing
model_path=None into get_server_command; add a validation check (e.g., assert or
raise ValueError) before building server_config that server_model is not
None/empty, referencing the variables server_type, server_address, and
server_model and the block that constructs server_config so the code fails fast
with a clear message instead of producing "None" in the command.
```diff
@@ -433,6 +433,7 @@ def add_task(
     keep_mounts_for_sandbox=False,
```
@gwarmstrong do we need to update the declarative code path to reflect these changes?
It will need to be updated when we want to use the feature on the declarative path, but at the moment I'm not sure there is value in adding it to the declarative path purely for parity's sake.

Is there any way to ensure it is covered by some test case (GPU or SLURM, probably?), so that when we convert to declarative we can make sure the functionality isn't dropped?
```python
with temporary_env_update(cluster_config, {"NEMO_SKILLS_SANDBOX_PORT": sandbox_port}):
    cur_cmd = install_packages_wrap(cur_cmd, installation_command)
    commands.append(cur_cmd)
client_num_gpus = num_gpus if (server_config is None or num_nodes > 1) else 0
```
`client_num_gpus` is calculated here inside the loop, but was already defined at line 531. This shadows the outer variable and is calculated in the wrong scope (it should sit outside the `for` loop at lines 588-617).

```python
client_num_gpus = num_gpus if (server_config is None or num_nodes > 1) else 0
```

Move this line before line 588 (before the `for cur_idx, (cur_cmd...` loop starts).
Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com>
Actionable comments posted: 3
♻️ Duplicate comments (1)

nemo_skills/pipeline/utils/exp.py (1)

541-541: ⚠️ Potential issue | 🟠 Major

Use direct key access for required `server_config["num_gpus"]`.

Line 541 currently masks missing `num_gpus` and can route scheduling logic incorrectly before failing later.

Proposed fix

```diff
-server_needs_gpus = server_config is not None and int(server_config.get("num_gpus", 0)) > 0
+server_needs_gpus = server_config is not None and int(server_config["num_gpus"]) > 0
```

As per coding guidelines, "Don't use `.get()` for accessing dictionary keys if the code expects them to be present; use direct access `data[key_name]` to fail with a clear error instead of silently corrupting data".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/exp.py` at line 541, The code currently uses server_config.get("num_gpus", 0) which silently treats missing keys as zero; change the expression that computes server_needs_gpus to use direct key access so missing data fails fast: keep the None check on server_config, then use int(server_config["num_gpus"]) > 0 when computing server_needs_gpus (variable name: server_needs_gpus, object: server_config) so a missing "num_gpus" raises a KeyError instead of masking the problem.
🧹 Nitpick comments (1)

nemo_skills/pipeline/utils/exp.py (1)

539-546: Add a SLURM regression test for server ordering with `n_servers > 1`.

Given the new ordering and heterogeneous-group logic, add coverage for:

- the server-first path (`client_num_gpus=0`, server GPUs > 0),
- the server-last path (client GPUs > 0),
- optional sandbox enabled.

This should prevent future regressions in het-group ordering and resource assignment.

Based on learnings, "When enabling new modality or adding complicated evaluation/metrics logic in benchmarks, consider adding the dataset into slurm tests for comprehensive evaluation".

Also applies to: 589-698
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nemo_skills/pipeline/utils/exp.py` around lines 539 - 546, Add a SLURM regression test that validates het-group ordering when n_servers > 1 by exercising both server-first and server-last paths and toggling sandbox: create tests that set server_config with num_gpus>0 and client num_gpus=0 (to assert server_goes_first true), set client num_gpus>0 with server_config num_gpus>0 (to assert server_goes_first false), and run each with sandbox enabled/disabled; verify the resulting job heterogenous-group ordering and resource allocation match expectations (use the same code paths that compute server_needs_gpus, client_num_gpus, and server_goes_first) and ensure the test fails on regressions.
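A sketch of what such a regression test could check, with the ordering predicate extracted into a pure function (names are hypothetical; a real test would go through `add_task` against a SLURM cluster config):

```python
def server_goes_first(client_num_gpus: int, server_num_gpus: int) -> bool:
    # Mirrors the ordering logic in add_task: server tasks are scheduled
    # first only when they need GPUs and the client does not.
    server_needs_gpus = server_num_gpus > 0
    return server_needs_gpus and not client_num_gpus

# Server-first path: GPU-less client, GPU server.
assert server_goes_first(client_num_gpus=0, server_num_gpus=8)
# Server-last path: the client needs GPUs too.
assert not server_goes_first(client_num_gpus=8, server_num_gpus=8)
# No server GPUs: ordering doesn't matter, server goes last.
assert not server_goes_first(client_num_gpus=0, server_num_gpus=0)
```

Parametrizing these three cases (plus the sandbox-enabled variant) in the SLURM test suite would lock in the het-group ordering behavior.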
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nemo_skills/pipeline/utils/exp.py`:
- Around line 589-592: The server-first insertion (server_goes_first ->
add_server_tasks) mutates executors so later code that computes sandbox
num_nodes from executors[0] can pick the server's node count; compute the
sandbox node count from the intended main-task executor before potentially
calling add_server_tasks (or locate the first non-server/main executor instead
of using executors[0]) so num_nodes is derived from the main task; update the
logic around server_goes_first, add_server_tasks, and the sandbox num_nodes
calculation to use that precomputed/main-task executor reference (referencing
symbols: server_goes_first, add_server_tasks, executors, and num_nodes).
- Line 447: The parameter n_servers is accepted but can be zero/negative or
ignored when server_config is None (e.g., code using range(n_servers)), so add
input validation at the start of the function that uses n_servers: if n_servers
is not a positive int raise a ValueError; additionally, if n_servers > 0 ensure
server_config is not None and raise a ValueError if it is, so user-specified
servers are never silently ignored; update any code paths that iterate with
range(n_servers) to rely on this validation.
- Around line 542-546: Compute the effective client GPU count once (e.g.,
rename/assign client_num_gpus -> effective_client_gpus using the same
num_gpus/server_needs_gpus inputs) and reuse that variable everywhere instead of
recomputing; update the server ordering boolean (server_goes_first) to use
effective_client_gpus and replace the later GPU-allocation logic that currently
recomputes client GPUs (the block around the "main executor GPU allocation"
code) to reference effective_client_gpus so ordering, overlap, and allocation
decisions are consistent across client_num_gpus, server_goes_first,
server_needs_gpus and the main executor allocation logic.
---
Duplicate comments:
In `@nemo_skills/pipeline/utils/exp.py`:
- Line 541: The code currently uses server_config.get("num_gpus", 0) which
silently treats missing keys as zero; change the expression that computes
server_needs_gpus to use direct key access so missing data fails fast: keep the
None check on server_config, then use int(server_config["num_gpus"]) > 0 when
computing server_needs_gpus (variable name: server_needs_gpus, object:
server_config) so a missing "num_gpus" raises a KeyError instead of masking the
problem.
---
Nitpick comments:
In `@nemo_skills/pipeline/utils/exp.py`:
- Around line 539-546: Add a SLURM regression test that validates het-group
ordering when n_servers > 1 by exercising both server-first and server-last
paths and toggling sandbox: create tests that set server_config with num_gpus>0
and client num_gpus=0 (to assert server_goes_first true), set client num_gpus>0
with server_config num_gpus>0 (to assert server_goes_first false), and run each
with sandbox enabled/disabled; verify the resulting job heterogenous-group
ordering and resource allocation match expectations (use the same code paths
that compute server_needs_gpus, client_num_gpus, and server_goes_first) and
ensure the test fails on regressions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7f322e1c-a31f-4ad6-aea2-6254af182f0e
📒 Files selected for processing (1)
nemo_skills/pipeline/utils/exp.py
```python
    keep_mounts_for_sandbox=False,
    sandbox_port: int | None = None,
    server_config=None,
    n_servers: int = 1,
```
Validate `n_servers` inputs to avoid silent no-op behavior.

`range(n_servers)` silently skips server creation for 0/negative values, and `n_servers` is currently accepted even when `server_config` is None.

Proposed fix

```diff
 def add_task(
 @@
     ray_template: str | None = None,
 ):
 @@
+    if n_servers < 1:
+        raise ValueError("n_servers must be >= 1")
+    if server_config is None and n_servers != 1:
+        raise ValueError("n_servers is only supported when server_config is provided")
```

As per coding guidelines, "Avoid cases where user-passed parameters are unused; code should fail if user specifies an unsupported argument or if a required argument is missing."
Also applies to: 557-557
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/utils/exp.py` at line 447, The parameter n_servers is
accepted but can be zero/negative or ignored when server_config is None (e.g.,
code using range(n_servers)), so add input validation at the start of the
function that uses n_servers: if n_servers is not a positive int raise a
ValueError; additionally, if n_servers > 0 ensure server_config is not None and
raise a ValueError if it is, so user-specified servers are never silently
ignored; update any code paths that iterate with range(n_servers) to rely on
this validation.
```python
client_num_gpus = num_gpus or 0
# For ray heterogenous jobs, nemo-run assumes the first het group is the main task
# So we send the server last if the job needs gpus
server_goes_first = server_needs_gpus and not client_num_gpus
```
Compute effective client GPU count once and reuse it.

`server_goes_first`/server overlap are decided from line 542, but main executor GPU allocation is recomputed at line 609 with different logic. This can make ordering and overlap inconsistent with actual resource requests.

Proposed fix

```diff
-client_num_gpus = num_gpus or 0
+client_num_gpus = (num_gpus or 0) if (server_config is None or num_nodes > 1) else 0
 # For ray heterogenous jobs, nemo-run assumes the first het group is the main task
 # So we send the server last if the job needs gpus
 server_goes_first = server_needs_gpus and not client_num_gpus
 ...
-client_num_gpus = num_gpus if (server_config is None or num_nodes > 1) else 0
 executors.append(
     get_executor(
         ...
         gpus_per_node=client_num_gpus,
         ...
         overlap=(not client_num_gpus),  # Only when the main task does not have gpus
```

As per coding guidelines, "Keep code simple and elegant; reuse/extend existing functionality when possible, minimize conditional checks..."
Also applies to: 576-576, 609-629
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/utils/exp.py` around lines 542 - 546, Compute the
effective client GPU count once (e.g., rename/assign client_num_gpus ->
effective_client_gpus using the same num_gpus/server_needs_gpus inputs) and
reuse that variable everywhere instead of recomputing; update the server
ordering boolean (server_goes_first) to use effective_client_gpus and replace
the later GPU-allocation logic that currently recomputes client GPUs (the block
around the "main executor GPU allocation" code) to reference
effective_client_gpus so ordering, overlap, and allocation decisions are
consistent across client_num_gpus, server_goes_first, server_needs_gpus and the
main executor allocation logic.
```python
# If client doesn't need GPUs but server does, add server first so SLURM allocates GPU partition
if server_goes_first:
    add_server_tasks()
```
Server-first ordering breaks sandbox node selection assumptions.

When line 590 runs, `executors[0]` becomes a server executor, but line 659 still derives sandbox `num_nodes` from `executors[0]`. That can size the sandbox by server nodes instead of main-task nodes.

Proposed fix

```diff
-            num_nodes=executors[0].nodes if cluster_config["executor"] == "slurm" else 1,
+            num_nodes=num_nodes if cluster_config["executor"] == "slurm" else 1,
```

Also applies to: 659-659
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@nemo_skills/pipeline/utils/exp.py` around lines 589 - 592, The server-first
insertion (server_goes_first -> add_server_tasks) mutates executors so later
code that computes sandbox num_nodes from executors[0] can pick the server's
node count; compute the sandbox node count from the intended main-task executor
before potentially calling add_server_tasks (or locate the first non-server/main
executor instead of using executors[0]) so num_nodes is derived from the main
task; update the logic around server_goes_first, add_server_tasks, and the
sandbox num_nodes calculation to use that precomputed/main-task executor
reference (referencing symbols: server_goes_first, add_server_tasks, executors,
and num_nodes).
```text
a5da597  2026-03-06  Igor Gitman        Revert "Eval kit support (#1239)" (#1294)
b237e33  2026-03-06  George             Eval kit support (#1239)
dc28bbf  2026-03-05  George Armstrong   Python direct tool calling without MCP (#1286)
12454dd  2026-03-04  Sadegh Mahdavi     Allow het servers for nemo-rl jobs (#1223)
8884a68  2026-03-04  Prasoon Varshney   Support source_lang param for translation recipe (#1290)
4618b19  2026-03-04  Meriem B.          Add MMLU-Pro 10% optimized subset for checkpoint selection (#1285)
5ac8609  2026-03-04  Talor Abramovich   Add SPEED-Bench (within repo) (#1279)
c31eec5  2026-03-03  George Armstrong   Fix os.getlogin() crash in ns setup (#1289)
c228e66  2026-03-03  George Armstrong   Fix streaming TypeError when delta.content is None (#1267) (#1288)
aa47923  2026-03-02  Matvei Novikov     Add LibTrace recipe for generating domain-specific reasoning data (#1224)
313cad7  2026-03-02  Stephen Ge         fix: clean parse-failure retries in prover (#1284)
813cfa3  2026-03-02  George Armstrong   tst: rollback inference-api to integrate (#1287)
31735f9  2026-03-02  Valentin Mendelev  Add backend-agnostic unified inference server with NeMo ASR and TTS backends (#1250)
d4ef8c0  2026-02-27  George             Update promt_config to working with openai format + inline setup (#1210)
e879cbc  2026-02-27  George Armstrong   Update noc tutorial (#1282)
f6e3505  2026-02-27  George Armstrong   Add noc reasoning tutorial (#1278)
fc2072a  2026-02-27  Jiacheng Xu        CritPt generation add prompt_format=None (#1280)
c8abe5d  2026-02-27  Igor Gitman        New slurm customization parameters (account, containers) (#1209)
2b38cce  2026-02-25  George Armstrong   Add nemo-skills-core subpackage for lightweight installs (#1229)
9fa8e83  2026-02-25  Dheeraj Peri       feat: add custom judge type support for external repo integration (#1274)
8a32b13  2026-02-24  Igor Gitman        Exclude numb3rs form test_eval.py (#1275)
6da2219  2026-02-23  George             Numb3rs ds addition (#1174)
ad034b5  2026-02-22  Suriya Gunasekar   Add DSBench-DA evaluation (#1254)
7593ab3  2026-02-20  Jiacheng Xu        Add CritPt benchmark (#1200)
58c31b2  2026-02-20  Suriya Gunasekar   Fix no_answer metric overcounting in _compute_pass_at_k (#1245)
1f1a2e7  2026-02-20  Igor Gitman        Fix incorrect prompt tokens count due to HF api update (#1264)
8ebc6f5  2026-02-20  Igor Gitman        Remove deprecated dataset group (#1263)
ea4177f  2026-02-19  Yongqiang Wang     fix deps (#1258)
60905a7  2026-02-20  Minho Ryu          Add aime26 (#1256)
b28afc5  2026-02-19  Igor Gitman        Rename custom -> external benchmarks (#1262)
6cc9c45  2026-02-19  Igor Gitman        Add reference to internal benchmarks repo (#1261)
5202af6  2026-02-19  Igor Gitman        Remove incorrect presence-penalty setting (#1259)
144c70b  2026-02-19  Igor Gitman        Adding an option to store benchmarks in external repo (#1240)
10e6e39  2026-02-19  George             update vllm miltimodal for api calls convenience (#1213)
1ba4219  2026-02-18  Nick Ludwig        Fix --server_container not being applied to dependent jobs (#1244)
9517614  2026-02-16  Wasi Ahmad         Support mini-swe-agent as agent harness (#1212)
a3d44dc  2026-02-13  Suriya Gunasekar   Add --installation_command support to prepare_data (#1243)
e80d524  2026-02-12  George Armstrong   Fix CI disk space for Docker image builds (#1241)
d22236c  2026-02-11  Sadegh Mahdavi     Fix answerbench prompt parsing (#1235)
2401628  2026-02-11  George Armstrong   feat: add lockfiles for
```
reproducible sandbox builds (#1233) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 5a0a84d Author: Wasi Ahmad <wasiahmad@ucla.edu> Date: Wed Feb 11 13:30:03 2026 -0800 removing datasets version restriction for LCB eval (#1230) Signed-off-by: wasiahmad <wasiahmad@ucla.edu> commit ef0a890 Author: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> Date: Wed Feb 11 12:03:16 2026 +0400 Gnalbandyan/add physics (#1214) Signed-off-by: Grigor Nalbandyan <gnalbandyan@nvidia.com> Signed-off-by: gnalbandyan <153070076+gnalbandyan@users.noreply.github.com> commit bd9d30c Author: Wasi Ahmad <wasiahmad@ucla.edu> Date: Tue Feb 10 15:13:27 2026 -0800 LCB generic prompting (#1215) Signed-off-by: wasiahmad <wasiahmad@ucla.edu> commit 7d6c49a Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Sat Feb 7 08:45:46 2026 -0800 Add support for different variations of nemo-rl (#1220) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit b19ba96 Author: George Armstrong <georgea@nvidia.com> Date: Fri Feb 6 21:40:56 2026 -0800 Add multi-node sandbox support for SLURM clusters (#1218) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 8950bb0 Author: anowaczynski-nvidia <anowaczynski@nvidia.com> Date: Sat Feb 7 01:38:00 2026 +0100 support structured outputs in hle judge for optional AA compatibility (#1186) Signed-off-by: Arkadiusz Nowaczynski <anowaczynski@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit b84f7a2 Author: Igor Gitman <igitman@nvidia.com> Date: Fri Feb 6 14:51:02 2026 -0800 A small update on running tests docs (#1219) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 8e838e1 Author: George Armstrong <georgea@nvidia.com> Date: Thu Feb 5 18:01:35 2026 -0800 feat: add flag to disable sandbox replay (#1217) Signed-off-by: George Armstrong <georgea@nvidia.com> commit 5fd9085 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Feb 5 
15:57:01 2026 -0800 Add an option to limit number of tool calls (#1216) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit d820200 Author: Igor Gitman <igitman@nvidia.com> Date: Tue Feb 3 10:43:55 2026 -0800 Add arena-hard v2 (#1205) Signed-off-by: bzantium <ryumin93@gmail.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: bzantium <ryumin93@gmail.com> commit a30920e Author: Igor Gitman <igitman@nvidia.com> Date: Mon Feb 2 10:53:55 2026 -0800 Fix mkdocs warnings (#1204) Signed-off-by: Igor Gitman <igitman@nvidia.com> commit 19d7788 Author: Ivan <imoshkov@nvidia.com> Date: Mon Feb 2 23:25:13 2026 +0500 Fix infinite wait in sandbox.wait_for_sandbox (#1206) Signed-off-by: i-vainn <imoshkov@nvidia.com> commit 3e65fbf Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Fri Jan 30 19:38:38 2026 -0800 Improve tts (#1203) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit 250c862 Author: Nick Ludwig <nliudvig@nvidia.com> Date: Fri Jan 30 22:12:29 2026 +0400 SWE-bench: fix SWE-agent hanging, adjust expected scores (#1202) Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com> commit 7ded756 Author: Ivan <imoshkov@nvidia.com> Date: Fri Jan 30 09:57:41 2026 +0500 Add proper token counting to code execution model (#1184) Signed-off-by: i-vainn <imoshkov@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com> commit b986304 Author: Igor Gitman <igitman@nvidia.com> Date: Thu Jan 29 17:57:07 2026 -0800 Upgrade containers (#1198) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Signed-off-by: Igor Gitman <igitman@nvidia.com> Co-authored-by: Sadegh Mahdavi <smahdavi@nvidia.com> commit 3b44f02 Author: Dan Lord <blahblahasdf@gmail.com> Date: Thu Jan 29 16:40:47 2026 -0800 Fix incorrect string format (#1199) Signed-off-by: dlord <dlord@nvidia.com> commit c4854b8 Author: Sadegh Mahdavi <smahdavi4@gmail.com> Date: Thu Jan 29 13:43:36 2026 -0800 Update nemo-rl to latest (#1087) Signed-off-by: Sadegh Mahdavi <smahdavi@nvidia.com> Signed-off-by: Igor 
Gitman <igitman@nvidia.com> Co-authored-by: Igor Gitman <igitman@nvidia.com>
Allow heterogeneous servers for nemo-rl jobs
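The review comment above notes that when `n_servers > 1`, every server launched in the loop reuses the same `server_port` from the shared `server_config`, so the instances collide. A minimal sketch of one possible fix is below: derive a per-instance config with a unique port before building each server command. The helper name `assign_server_ports` and the `base_port` default are illustrative assumptions, not part of the actual codebase.

```python
import copy


def assign_server_ports(server_config: dict, n_servers: int, base_port: int = 5000) -> list[dict]:
    """Return one config per server instance, each with a unique port.

    Deep-copies the shared config so the caller's dict is never mutated,
    then offsets the port by the server index (base_port is an assumed default).
    """
    configs = []
    for server_idx in range(n_servers):
        cfg = copy.deepcopy(server_config)
        cfg["server_port"] = base_port + server_idx  # unique port per instance
        configs.append(cfg)
    return configs


if __name__ == "__main__":
    shared = {"num_nodes": 1, "num_gpus": 8, "server_port": 5000}
    per_server = assign_server_ports(shared, n_servers=3)
    print([cfg["server_port"] for cfg in per_server])  # [5000, 5001, 5002]
```

Each per-instance config would then be passed to `get_server_command(**cfg, cluster_config=cluster_config)` inside the loop, so no two servers bind the same port.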
Summary by CodeRabbit