Conversation

@nnshah1
Contributor

@nnshah1 nnshah1 commented Sep 8, 2025

Overview:

K8s-based fault-injection tests for recovery and resilience.

Details:

Adds test infrastructure for deploying Dynamo on K8s and injecting faults (pod delete, process kill).

Where should the reviewer start?

managed_deployment.py
test_deployment.py
readme.md

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added a fault-tolerance test suite with configurable scenarios, load generation, and failure injection for Kubernetes deployments.
    • Introduced a log parsing CLI to summarize latency, success rates, SLA violations, and recovery times from test outputs.
    • Provided a Kubernetes deployment helper to configure, deploy, monitor, and collect logs/metrics from services.
  • Documentation

    • Added a README detailing the test architecture, scenarios, workflows, artifacts, and quick start.
  • Tests

    • Added end-to-end tests that spin up deployments, run concurrent clients, inject failures, and produce summarized results.

Signed-off-by: nnshah1 <[email protected]>
@nnshah1 nnshah1 requested review from a team as code owners September 8, 2025 23:20
@copy-pr-bot

copy-pr-bot bot commented Sep 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Sep 8, 2025

Walkthrough

Adds a Kubernetes-based fault-tolerance testing harness under tests/, including a managed deployment utility, pytest fixtures, scenario definitions, a multiprocessing client with port-forwarding, failure injection logic, result parsing/reporting CLI, and documentation. Also adds SPDX headers to `__init__.py`.

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| **Fault-tolerance docs**<br>`tests/fault_tolerance/deploy/README.md` | New README detailing test architecture, scenarios, workflow, logs, and result interpretation. |
| **Module init/metadata**<br>`tests/fault_tolerance/deploy/__init__.py` | Added SPDX license headers; no behavior changes. |
| **Pytest CLI and fixtures**<br>`tests/fault_tolerance/deploy/conftest.py` | New pytest options (`--image`, `--namespace`) and fixtures (`image`, `namespace`). |
| **K8s managed deployment utility**<br>`tests/utils/managed_deployment.py` | New utility with `DeploymentSpec`/`ServiceSpec`, `PodProcess`, and `ManagedDeployment` for CR lifecycle, readiness, logs, metrics, and port-forwarding. |
| **Test harness: scenarios & client**<br>`tests/fault_tolerance/deploy/scenarios.py`, `tests/fault_tolerance/deploy/client.py` | New dataclasses for `Load`/`Failure`/`Scenario`; predefined deployment and failure matrices. Client sends paced requests via rotating port-forwards to frontend pods, with retries and per-request logging. |
| **E2E test and failure injection**<br>`tests/fault_tolerance/deploy/test_deployment.py` | New pytest e2e test: deploys a scenario, spawns clients, injects failures (process kill or pod delete), aggregates logs, and renders tables via the parser. |
| **Results parsing/reporting**<br>`tests/fault_tolerance/deploy/parse_results.py` | New parser CLI: ingests logs, computes success/failure, latency, SLA, and recovery; prints grouped tables. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant PyTest as PyTest (test_deployment)
  participant MD as ManagedDeployment
  participant K8s as Kubernetes API
  participant Pods as Frontend/Workers Pods
  participant Clients as Client Processes
  participant Parser as parse_results CLI

  PyTest->>MD: init & __aenter__()
  MD->>K8s: create Custom Resource
  MD-->>PyTest: wait for ready (status)
  PyTest->>Clients: spawn N clients (multiprocessing)
  loop For each request
    Clients->>MD: port-forward to a ready Frontend pod
    Clients->>Pods: POST /v1/chat/completions
    Pods-->>Clients: response (timing, status)
    Clients-->>Clients: throttle to max_request_rate
  end
  par Fault injection
    PyTest->>K8s: delete pod(s) OR
    PyTest->>Pods: kill process (signal)
  and Logging
    MD->>K8s: fetch pod logs/metrics
    MD-->>PyTest: store logs in log_dir
  end
  PyTest->>MD: __aexit__() (cleanup)
  PyTest->>Parser: parse logs -> summary tables
  Parser-->>PyTest: results (latency, success/failure, recovery)
```

```mermaid
sequenceDiagram
  autonumber
  participant Client as Client worker
  participant PF as Port-forward tunnel
  participant FE as Frontend Pod
  rect rgb(240,248,255)
  note right of Client: Per-request with retries
  Client->>PF: ensure tunnel to FE:port
  Client->>FE: POST completion
  alt success
    FE-->>Client: 200 + JSON
  else transient error
    FE-->>Client: error / timeout
    loop up to max_retries
      Client->>FE: retry after delay
      FE-->>Client: response
    end
  end
  end
```
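
For readers who want the per-request flow above as code, here is a minimal sketch of a retry loop of this shape. It assumes the `requests` library (which the client uses for its POSTs); the function name and the `max_retries`/`retry_delay` parameters are illustrative, not the PR's actual API.

```python
import time

import requests


def post_with_retries(url, payload, max_retries=3, retry_delay=1.0, timeout=30):
    """Send one POST, retrying transient errors up to max_retries times."""
    last_error = None
    for _attempt in range(1 + max_retries):
        try:
            response = requests.post(url, json=payload, timeout=timeout)
            if response.status_code == 200:
                return response.json()
            last_error = f"HTTP {response.status_code}"
        except requests.RequestException as exc:  # timeout, dropped tunnel, ...
            last_error = exc
        time.sleep(retry_delay)  # pause before retrying a transient error
    raise RuntimeError(f"request failed after {max_retries} retries: {last_error}")
```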

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

Thump-thump goes my tester’s heart, so brave,
I launch the pods, then surf the wave—
A sudden zap! a pod goodnight—
I nibble logs by moonlit light.
Requests hop on, retries in tow,
Recovery blooms—onwards we go! 🐇🚀

Pre-merge checks

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 32.88%, which is below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The PR title "test: fault injection tests for k8s" concisely reflects the main change — adding Kubernetes fault-injection tests — and is specific enough for a reviewer scanning history to understand the primary purpose. It uses a conventional "test:" prefix and avoids extraneous file lists or emojis. It could be marginally clearer by using "Kubernetes" instead of the abbreviation "k8s", but that is optional. |
| Description Check | ✅ Passed | The PR description follows the repository template and includes the required sections: Overview, Details, Where should the reviewer start, and Related Issues, and it succinctly summarizes the addition of Kubernetes fault-injection tests and points reviewers to the core files. However, the "Where should the reviewer start?" section references "readme.md" while the new doc is at tests/fault_tolerance/deploy/README.md, and the Related Issues line contains a placeholder "#xxx", so those should be corrected for clarity. Overall the description is sufficient to understand intent and scope. |

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

```yaml
reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).
```


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/fault_tolerance/deploy/parse_results.py (1)

438-454: CLI bug: wrong argument order when calling main() (TypeError at runtime).

main(args.log_dir, tablefmt, args.sla) passes a float as log_paths, then iterates it. Must pass sla by name and fix default for log_paths.

Apply this patch (also fixes mutable default arg for log_paths):

-def main(logs_dir, tablefmt, log_paths=[], sla=None):
+def main(logs_dir, tablefmt, sla=None, log_paths=None):
@@
-    if log_paths:
+    if log_paths:
         for log_path in log_paths:
             result = process_test_directory(log_path, sla)
             if result:
                 results.append(result)
@@
-if __name__ == "__main__":
+if __name__ == "__main__":
@@
-    main(args.log_dir, tablefmt, args.sla)
+    main(args.log_dir, tablefmt, sla=args.sla)
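
For anyone unfamiliar with the second half of this fix, here is a standalone illustration (not the project's code) of both hazards: positional binding to the wrong parameter, and a mutable default shared across calls.

```python
def main(logs_dir, tablefmt, log_paths=[], sla=None):  # buggy signature
    # main(args.log_dir, tablefmt, args.sla) binds the float args.sla to
    # log_paths positionally; iterating it then raises TypeError.
    ...

def collect(item, bucket=[]):  # the mutable-default pitfall in isolation
    bucket.append(item)        # mutates the single list created at def time
    return bucket

print(collect("a"))  # ['a']
print(collect("b"))  # ['a', 'b']  <- state leaks between calls

def collect_fixed(item, bucket=None):
    if bucket is None:
        bucket = []            # fresh list per call
    bucket.append(item)
    return bucket
```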
🧹 Nitpick comments (22)
tests/fault_tolerance/deploy/README.md (6)

20-35: Fix typos and clarity in intro/architecture.

  • resilency → resiliency
  • unforseen → unforeseen
  • environemnt → environment
  • “Test pass / failure … It only indicates is the test was executed…” → grammar pass

Apply this minimal patch:

-As a large scale distributed inference serving framework in addition
-to providing high throughput and low latency, Dynamo needs to
-provide fault detection, resilency, and quick recovery in the face of
-unforseen failures. In order to test Dynamo we are developing a test
+As a large-scale distributed inference serving framework, in addition
+to providing high throughput and low latency, Dynamo needs to
+provide fault detection, resiliency, and quick recovery in the face of
+unforeseen failures. To evaluate this, we are developing a test
 suite to inject and measure the impact of different types of failure
 conditions.
@@
-configurations that launch typical dynamo deployments in a Kubernetes
-environemnt and then inject failures by terminating processes or
+configurations that launch typical Dynamo deployments in a Kubernetes
+environment and then inject failures by terminating processes or
 pods. To test the recovery time and impact of failures a set number of
 clients are launched in parallel. Each client sends a set number of
 synchronous requests. Log files are stored for each pod as well as for
 each client and inspected using a post-processing script.
@@
-> Test pass / failure is not an indication of SLA for recovery or resilience
-> It only indicates is the test was executed and data was collected
+> Test pass/fail is not an indication of an SLA for recovery or resilience.
+> It only indicates the test executed and data was collected.

41-44: Remove extra space after heading marker.

“### Test Sequence Diagram” has two spaces after the hashes, tripping MD019.

-###  Test Sequence Diagram
+### Test Sequence Diagram

121-126: Add fenced code languages and fix minor typos in results section.

  • Add languages to code fences (bash, text) to satisfy MD040.
  • Correct “Succes/nBefore”, “Numoer”, “Aftere”, “VllmDecodeWroer”.
-```
+```bash
 pytest tests/fault_tolerance/deploy/test_deployment.py -s -v --namespace ${NAMESPACE}
@@
-```
+```text
 test_fault_scenario[agg-tp-1-dp-1-none]
 .
 ├── client_0.log.txt
@@
-| **Succes/nBefore**   | Numoer of client requests that succeeded before fault injection                                   |
+| **Success Before**   | Number of client requests that succeeded before fault injection                                   |
-| **Failed/nBefore**    | Number of client requests that failed or were invalid before fault injection                      |
+| **Failed Before**    | Number of client requests that failed or were invalid before fault injection                      |
-| **Success/nAftere**   | Number of client requests that  succeeded after fault injection |
+| **Success After**    | Number of client requests that succeeded after fault injection |
@@
-| **{Service}/*.log                  | Container log for pod at end of test (Frontend, VllmDecodeWroer, etc.)                                                                         |
+| **{Service}/*.log**                | Container log for pod at end of test (Frontend, VllmDecodeWorker, etc.)                                 |
-| **{Service}/*.previous.log**       | Previous container log for pod in case of crash / exit. (Frontend, VllmDecodeWroer, etc.). Empty if N/A.               |
+| **{Service}/*.previous.log**       | Previous container log for pod in case of crash/exit (Frontend, VllmDecodeWorker, etc.). Empty if N/A.   |

Also applies to: 131-156, 171-188


235-256: Deduplicate and adjust headings; add code fence languages.

  • Duplicate “Results:”/“Summary:” headings (MD024) and trailing colon (MD026).
  • Add language to code fences (MD040).
-#### Results:
+#### Results
-```
+```text
 ...

-#### Summary:
+#### Summary



Also applies to: 295-316, 454-482


338-346: Replace hard tabs with spaces in Mermaid blocks.

Tabs trigger MD010; convert to spaces for consistent rendering.

-   		end
+        end
@@
-	Frontend_2["Frontend 2"]
+    Frontend_2["Frontend 2"]
@@
-	Frontend_2 <--> DecodePool
+    Frontend_2 <--> DecodePool
@@
-		end
+        end
@@
-	DecodePool --> PrefillPool
+    DecodePool --> PrefillPool

Also applies to: 410-452


490-512: Add languages to Quick Start code fences.

Specify bash to satisfy MD040.

-```
+```bash
 pytest tests/fault_tolerance/deploy/test_deployment.py -s -v --namespace ${NAMESPACE} --image ${IMAGE}

@@
 In case you have multiple configs
-```
+```bash
 export KUBECONFIG=~/.kube/dynamo-kubeconfig

Also applies to: 521-529

tests/fault_tolerance/deploy/parse_results.py (3)

209-221: Make non-JSON log parsing robust to "[TEST] …" prefixes.

Fallback assumes the timestamp is the first token; with `[TEST]` prefix that fails. Extract ISO timestamp via regex.

-                        parts = clean_line.split()
-                        if len(parts) < 2:
-                            continue
-
-                        try:
-                            # Parse timestamp (remove 'Z' for naive datetime)
-                            timestamp = datetime.fromisoformat(
-                                parts[0].replace("Z", "")
-                            )
-                        except ValueError:
-                            continue
-
-                        log_message = " ".join(parts[1:])
+                        # Extract ISO 8601 timestamp anywhere in the line
+                        m = re.search(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?", clean_line)
+                        if not m:
+                            continue
+                        ts = m.group(0).replace("Z", "")
+                        try:
+                            timestamp = datetime.fromisoformat(ts.replace("T", " "))
+                        except ValueError:
+                            continue
+                        log_message = clean_line[m.end():].strip()
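
As a quick check that the suggested pattern behaves as intended, a self-contained run on a `[TEST]`-prefixed line (sample line invented for illustration):

```python
import re
from datetime import datetime

line = "[TEST] 2025-09-08T23:20:11.123Z worker restarted"
m = re.search(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?", line)
if m:
    ts = datetime.fromisoformat(m.group(0).replace("Z", "").replace("T", " "))
    message = line[m.end():].strip()
    print(ts, "|", message)  # 2025-09-08 23:20:11.123000 | worker restarted
```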

267-269: Avoid treating zero‑second recovery as “no recovery.”

If recovery aligns to the same second as the fault, 0 returns None. Use <= 0.

-    if last_recovery_time == 0:
+    if last_recovery_time <= 0:
         return None
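
A tiny worked example of the collision (timestamps invented for illustration): when the fault and the first successful request share a wall-clock second, the computed recovery is exactly zero and is indistinguishable from the "never recovered" sentinel.

```python
from datetime import datetime

fault = datetime.fromisoformat("2025-09-08 23:20:11")
recovered = datetime.fromisoformat("2025-09-08 23:20:11")  # same second
print((recovered - fault).total_seconds())  # 0.0 -- collides with the 0
# sentinel used for "no recovery observed", hence the suggested guard.
```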

56-75: Harden client log parsing against corrupt lines.

A single bad JSON line will currently crash parsing. Wrap json.loads and skip on failure.

-                for line in f:
+                for line in f:
                     request_number += 1
-                    data = json.loads(line.strip())
+                    try:
+                        data = json.loads(line.strip())
+                    except json.JSONDecodeError:
+                        continue
tests/utils/managed_deployment.py (3)

398-404: Typo in method name and callers: _scale_statfulset → _scale_statefulset.

Minor correctness/readability; fix name and references.

-    async def _scale_statfulset(self, name, label, replicas):
+    async def _scale_statefulset(self, name, label, replicas):
@@
-        await self._apps_v1.patch_namespaced_stateful_set_scale(
+        await self._apps_v1.patch_namespaced_stateful_set_scale(
             name, self.namespace, body
         )
         await self._wait_for_pods(label, replicas)
@@
-        await self._scale_statfulset(name, label, 0)
+        await self._scale_statefulset(name, label, 0)
@@
-        await self._scale_statfulset(name, label, 1)
+        await self._scale_statefulset(name, label, 1)

Also applies to: 406-420, 500-505


582-604: Use logging.exception and narrow excepts while collecting logs/metrics.

Broad except Exception with error() hides stack traces; prefer exception() and narrow where practical.

-        except Exception as e:
-            self._logger.error(e)
+        except Exception:
+            self._logger.exception("Failed to write pod YAML")
@@
-        except Exception as e:
-            self._logger.error(e)
+        except Exception:
+            self._logger.exception("Failed to write current pod logs")
@@
-        except Exception as e:
-            self._logger.debug(e)
+        except Exception:
+            self._logger.debug("No previous logs available", exc_info=True)
@@
-        except Exception as e:
-            self._logger.error(str(e))
+        except requests.RequestException:
+            self._logger.exception("Error fetching /metrics")

Also applies to: 641-649
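
The difference is easy to see in a standalone snippet:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("demo")

try:
    1 / 0
except ZeroDivisionError:
    log.error("plain error")         # message only; the traceback is lost
    log.exception("with traceback")  # same level (ERROR) plus the stack trace
```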


705-717: Consider gating disruptive infra restarts behind a flag.

Restarting etcd and NATS on every test run is heavy and can cause flakiness. Add a restart_infra: bool = False flag and conditionally call _restart_etcd()/_restart_nats().

Happy to propose a small diff adding the flag if you confirm this won’t be needed for every scenario.
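
For concreteness, a minimal sketch of the proposed gate (the flag name comes from this comment; the class body is a trimmed stand-in with stubbed restart methods, not the PR's actual implementation):

```python
import asyncio


class ManagedDeployment:
    def __init__(self, namespace: str, restart_infra: bool = False):
        self.namespace = namespace
        self.restart_infra = restart_infra  # opt-in; default leaves etcd/NATS alone

    async def _restart_etcd(self):  # stub for illustration
        print("restarting etcd")

    async def _restart_nats(self):  # stub for illustration
        print("restarting NATS")

    async def _maybe_restart_infra(self):
        if self.restart_infra:  # only opted-in scenarios pay the restart cost
            await self._restart_etcd()
            await self._restart_nats()


asyncio.run(ManagedDeployment("ft-test", restart_infra=True)._maybe_restart_infra())
```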

tests/fault_tolerance/deploy/conftest.py (1)

19-31: LGTM; add help text for CLI options for discoverability.

Minor ergonomics: provide help= for --image and --namespace.

-    parser.addoption("--image", type=str, default=None)
-    parser.addoption("--namespace", type=str, default="fault-tolerance-test")
+    parser.addoption("--image", type=str, default=None, help="Container image override for services")
+    parser.addoption("--namespace", type=str, default="fault-tolerance-test", help="Kubernetes namespace to use")
tests/fault_tolerance/deploy/test_deployment.py (3)

129-175: Don’t block the event loop in an async test; move failure injection to a thread.

time.sleep inside async def blocks the loop for long periods. Run the injector in a thread.

Apply this diff:

@@
+import asyncio
@@
         with _clients(
@@
-        ):
-            _inject_failures(scenario.failures, logger, deployment)
+        ):
+            await asyncio.to_thread(_inject_failures, scenario.failures, logger, deployment)
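
A standalone demonstration (Python 3.9+ for `asyncio.to_thread`) of why this matters: the blocking sleep below runs in a worker thread, so a concurrent coroutine still makes progress instead of queueing behind it.

```python
import asyncio
import time


def blocking_injector():
    time.sleep(2)  # stands in for the sleep-heavy failure-injection loop


async def main():
    t0 = time.monotonic()
    await asyncio.gather(
        asyncio.to_thread(blocking_injector),  # off the event loop
        asyncio.sleep(2),                      # proceeds concurrently
    )
    print(f"elapsed ~{time.monotonic() - t0:.1f}s (about 2s, not 4s)")


asyncio.run(main())
```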

61-65: Join with timeout and terminate on hang to avoid deadlocks.

If a client process wedges, join() will block indefinitely during teardown.

Apply this diff:

     for proc in procs:
         logger.debug(f"{proc} waiting for join")
-        proc.join()
+        proc.join(timeout=60)
+        if proc.is_alive():
+            logger.warning("Terminating hung client process: %s", proc.pid)
+            proc.terminate()
+            proc.join(timeout=10)
         logger.debug(f"{proc} joined")
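
The pattern in isolation, with a deliberately wedged worker (illustrative, not the PR's client):

```python
import multiprocessing
import time


def worker():
    time.sleep(3600)  # simulates a client that never exits


if __name__ == "__main__":
    proc = multiprocessing.Process(target=worker)
    proc.start()
    proc.join(timeout=1)       # bounded wait instead of blocking teardown forever
    if proc.is_alive():
        proc.terminate()       # SIGTERM the straggler
        proc.join(timeout=10)  # reap it so the test run can finish
```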

67-67: Remove unused noqa F811 directives.

These suppressions are unnecessary and flagged by Ruff.

Apply this diff:

-def _inject_failures(failures, logger, deployment: ManagedDeployment):  # noqa: F811
+def _inject_failures(failures, logger, deployment: ManagedDeployment):
@@
-@pytest.fixture(autouse=True)
-def results_table(request, scenario):  # noqa: F811
+@pytest.fixture(autouse=True)
+def results_table(request, scenario):
@@
-async def test_fault_scenario(
-    scenario,  # noqa: F811
+async def test_fault_scenario(
+    scenario,

Also applies to: 105-105, 130-130

tests/fault_tolerance/deploy/client.py (3)

49-53: Avoid configuring global logging in an importable module.

basicConfig at import time can interfere with pytest logging. Consider moving it under `if __name__ == "__main__":` or letting the test harness configure logging.
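
A common shape for this (a generic sketch, not the PR's code):

```python
import logging

logger = logging.getLogger(__name__)  # module-scoped; no config on import


def do_work():
    logger.debug("work started")  # inherits whatever the host app configured


if __name__ == "__main__":
    # Configure the root logger only when run as a script, never on import,
    # so pytest (or any embedding harness) keeps control of logging.
    logging.basicConfig(level=logging.DEBUG)
    do_work()
```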


61-72: Use logger or drop the arg; tighten retry loop and timestamp.

  • logger is unused (Ruff ARG001). Either use it or remove it.
  • Prefer timezone-aware ISO timestamp for consistency.

Apply this diff:

 def _single_request(
     url,
     pod,
     payload,
     model,
-    logger,
+    logger,
     retry_attempts=1,
@@
-    start_time = time.time()
+    start_time = time.time()
     results = []
@@
             end_time = time.time()
 
             content = None
@@
             results.append(
                 {
                     "status": response.status_code,
                     "result": content,
                     "request_elapsed_time": end_time - start_request_time,
                     "url": url,
                     "pod": pod,
                 }
             )
+            if logger:
+                logger.debug("POST %s -> %s in %.3fs", url, response.status_code, end_time - start_request_time)
@@
     return {
-        "time": datetime.now().strftime("%Y-%m-%dT%H:%M:%S"),
+        "time": datetime.now().astimezone().isoformat(timespec="seconds"),
         "results": results,
         "total_time": time.time() - start_time,
         "url": url,
         "pod": pod,
     }

Also applies to: 83-117, 118-131, 132-139


216-218: Broadened exception handling; stop port-forwards and log stack traces.

Use logger.exception and always stop port-forward tunnels.

Apply this diff:

-    except Exception as e:
-        logger.error(str(e))
-    logger.info("Exiting")
+    except Exception:
+        logger.exception("Client error")
+    finally:
+        for pf in pod_ports.values():
+            try:
+                pf.stop()
+            except Exception:
+                logger.debug("Port-forward stop failed", exc_info=True)
+        logger.info("Exiting")
tests/fault_tolerance/deploy/scenarios.py (3)

16-19: Tighten types with Optional; fix minor doc typos below.

Use Optional for nullable fields to improve editor/type-checker help.

Apply this diff:

-from dataclasses import dataclass
+from dataclasses import dataclass
+from typing import Optional
@@
 class Load:
@@
-    sla: float = None
+    sla: Optional[float] = None
@@
 class Scenario:
     deployment: DeploymentSpec
     load: Load
     failures: list[Failure]
-    model: str = None
+    model: Optional[str] = None

Also applies to: 21-30, 41-47


93-93: Fix typos in comments (readability).

Replicats → Replicas; scenaro → scenario; pervious → previous.

Apply this diff:

-# Derivative Specs With Incremented Replicats
+# Derivative specs with incremented replicas
@@
-# Each failure scenaro contains a list of failure injections
-# Each failure injection has a time in seconds after the pervious injection and
+# Each failure scenario contains a list of failure injections
+# Each failure injection has a time in seconds after the previous injection and

Also applies to: 109-119


62-76: Consistency: prefer set_tensor_parallel over direct property mutation.

Use the helper to avoid diverging behaviors across specs.

Example:

deployment_specs["disagg-prefill-tp-2-decode-tp-2-dp-1"].set_tensor_parallel(
    2, ["VllmPrefillWorker", "VllmDecodeWorker"]
)

Also applies to: 83-92

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a73b3c and ff83fe7.

📒 Files selected for processing (8)
  • tests/fault_tolerance/deploy/README.md (1 hunks)
  • tests/fault_tolerance/deploy/__init__.py (1 hunks)
  • tests/fault_tolerance/deploy/client.py (1 hunks)
  • tests/fault_tolerance/deploy/conftest.py (1 hunks)
  • tests/fault_tolerance/deploy/parse_results.py (1 hunks)
  • tests/fault_tolerance/deploy/scenarios.py (1 hunks)
  • tests/fault_tolerance/deploy/test_deployment.py (1 hunks)
  • tests/utils/managed_deployment.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
tests/fault_tolerance/deploy/parse_results.py (1)
tests/utils/managed_deployment.py (1)
  • main (727-758)
tests/fault_tolerance/deploy/conftest.py (1)
tests/utils/managed_deployment.py (4)
  • image (36-41)
  • image (44-49)
  • namespace (201-203)
  • namespace (206-207)
tests/fault_tolerance/deploy/test_deployment.py (4)
tests/fault_tolerance/deploy/client.py (1)
  • client (141-218)
tests/fault_tolerance/deploy/parse_results.py (1)
  • main (320-435)
tests/utils/managed_deployment.py (21)
  • main (727-758)
  • ManagedDeployment (350-724)
  • namespace (201-203)
  • namespace (206-207)
  • model (61-72)
  • model (75-99)
  • name (30-32)
  • name (178-180)
  • name (197-198)
  • get_pods (554-580)
  • replicas (53-54)
  • replicas (57-58)
  • get_pod_logs (582-605)
  • get_processes (541-547)
  • kill (322-331)
  • image (36-41)
  • image (44-49)
  • disable_grove (209-214)
  • set_image (224-230)
  • set_model (216-222)
  • set_logging (248-270)
tests/fault_tolerance/deploy/conftest.py (2)
  • namespace (30-31)
  • image (25-26)
tests/fault_tolerance/deploy/client.py (1)
tests/utils/managed_deployment.py (10)
  • ManagedDeployment (350-724)
  • namespace (201-203)
  • namespace (206-207)
  • get_pods (554-580)
  • port (183-185)
  • name (30-32)
  • name (178-180)
  • name (197-198)
  • port_forward (667-697)
  • endpoint (193-194)
tests/utils/managed_deployment.py (2)
benchmarks/profiler/utils/config.py (1)
  • Service (54-58)
tests/fault_tolerance/deploy/client.py (1)
  • client (141-218)
tests/fault_tolerance/deploy/scenarios.py (1)
tests/utils/managed_deployment.py (8)
  • DeploymentSpec (166-311)
  • replicas (53-54)
  • replicas (57-58)
  • model (61-72)
  • model (75-99)
  • set_tensor_parallel (232-246)
  • tensor_parallel_size (118-130)
  • tensor_parallel_size (133-163)
🪛 Ruff (0.12.2)
tests/fault_tolerance/deploy/parse_results.py

241-241: Unused function argument: failure_type

(ARG001)


260-260: Loop control variable process not used within loop body

(B007)


261-261: Loop control variable replica not used within loop body

Rename unused replica to _replica

(B007)


320-320: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

tests/fault_tolerance/deploy/test_deployment.py

67-67: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)


105-105: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)


130-130: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)

tests/fault_tolerance/deploy/client.py

58-58: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


66-66: Unused function argument: logger

(ARG001)


216-216: Do not catch blind exception: Exception

(BLE001)


217-217: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

tests/utils/managed_deployment.py

344-344: Do not catch blind exception: Exception

(BLE001)


372-372: Do not catch blind exception: Exception

(BLE001)


396-396: Create your own exception

(TRY002)


396-396: Avoid specifying long messages outside the exception class

(TRY003)


487-487: Do not catch blind exception: Exception

(BLE001)


492-492: Avoid specifying long messages outside the exception class

(TRY003)


589-589: Do not catch blind exception: Exception

(BLE001)


590-590: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


594-594: Do not catch blind exception: Exception

(BLE001)


595-595: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


602-602: Do not catch blind exception: Exception

(BLE001)


611-611: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


641-641: Do not catch blind exception: Exception

(BLE001)


642-642: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


672-672: Possible binding to all interfaces

(S104)

🪛 markdownlint-cli2 (0.17.2)
tests/fault_tolerance/deploy/README.md

41-41: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)


119-119: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


131-131: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


170-170: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


235-235: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


237-237: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


257-257: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


295-295: Multiple headings with the same content

(MD024, no-duplicate-heading)


295-295: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


296-296: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


316-316: Multiple headings with the same content

(MD024, no-duplicate-heading)


316-316: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


344-344: Hard tabs
Column: 4

(MD010, no-hard-tabs)


365-365: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


367-367: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


394-394: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


414-414: Hard tabs
Column: 1

(MD010, no-hard-tabs)


420-420: Hard tabs
Column: 1

(MD010, no-hard-tabs)


428-428: Hard tabs
Column: 1

(MD010, no-hard-tabs)


432-432: Hard tabs
Column: 1

(MD010, no-hard-tabs)


446-446: Hard tabs
Column: 1

(MD010, no-hard-tabs)


454-454: Multiple headings with the same content

(MD024, no-duplicate-heading)


454-454: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


456-456: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


483-483: Multiple headings with the same content

(MD024, no-duplicate-heading)


483-483: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


504-504: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


510-510: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


521-521: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
tests/fault_tolerance/deploy/__init__.py (1)

1-2: LGTM on SPDX headers.

Header is correct and non-invasive.

tests/fault_tolerance/deploy/test_deployment.py (1)

156-160: Great: deployment context setup is clear and bounded.

tests/fault_tolerance/deploy/client.py (1)

140-156: LGTM: client scaffolding is clear and aligns with ManagedDeployment.

tests/fault_tolerance/deploy/scenarios.py (1)

146-152: Scenario generation filter is correct.

Skipping prefill failures for non-disagg deployments looks good.

Comment on lines +126 to +134
```python
@pytest.mark.e2e
@pytest.mark.slow
@pytest.mark.filterwarnings("ignore::DeprecationWarning")
async def test_fault_scenario(
    scenario,  # noqa: F811
    request,
    image,
    namespace,
):
```

💡 Verification agent

🧩 Analysis chain

Verify pytest-asyncio is available to run async tests.

Without the plugin, async tests won’t be awaited.


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
python - <<'PY'
import importlib, sys
ok = importlib.util.find_spec("pytest_asyncio") is not None
print("pytest-asyncio:", "FOUND" if ok else "MISSING")
sys.exit(0 if ok else 1)
PY
```

Length of output: 131


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "python binary: $(command -v python || echo 'none')"
python -V || true
python - <<'PY'
import sys, pkgutil, subprocess
def detect():
    results = {}
    try:
        import pytest_asyncio
        results['import'] = 'FOUND'
    except Exception as e:
        results['import'] = f'MISSING ({e!r})'
    try:
        import importlib.util
        spec = importlib.util.find_spec("pytest_asyncio")
        results['importlib.util.find_spec'] = 'FOUND' if spec else 'MISSING'
    except Exception as e:
        results['importlib.util.find_spec'] = f'ERROR ({e!r})'
    try:
        found = any(m.name == "pytest_asyncio" for m in pkgutil.iter_modules())
        results['pkgutil.iter_modules'] = 'FOUND' if found else 'MISSING'
    except Exception as e:
        results['pkgutil.iter_modules'] = f'ERROR ({e!r})'
    try:
        r = subprocess.run([sys.executable, "-m", "pip", "show", "pytest-asyncio"], capture_output=True, text=True)
        results['pip_show_rc'] = r.returncode
        results['pip_show_out'] = (r.stdout.strip() or r.stderr.strip())[:1000]
    except Exception as e:
        results['pip_show'] = f'ERROR ({e!r})'
    try:
        r = subprocess.run([sys.executable, "-m", "pytest", "--version"], capture_output=True, text=True)
        results['pytest_version_rc'] = r.returncode
        results['pytest_version_out'] = (r.stdout.strip() or r.stderr.strip())[:1000]
    except Exception as e:
        results['pytest_version'] = f'ERROR ({e!r})'
    for k,v in results.items():
        print(f"{k}: {v}")
detect()
PY
```

Length of output: 223


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "python binary: $(command -v python || echo 'none')"
python -V || true

python - <<'PY'
import sys, pkgutil, subprocess
def sp(k,v):
    print(str(k) + ": " + str(v))
# import check
try:
    import pytest_asyncio
    sp("import", "FOUND")
except Exception as e:
    sp("import", "MISSING " + type(e).__name__ + " " + str(e))
# importlib.util.find_spec
try:
    import importlib.util as iu
    spec = iu.find_spec("pytest_asyncio")
    sp("importlib.util.find_spec", "FOUND" if spec else "MISSING")
except Exception as e:
    sp("importlib.util.find_spec", "ERROR " + type(e).__name__ + " " + str(e))
# pkgutil.iter_modules
try:
    found = any(m.name == "pytest_asyncio" for m in pkgutil.iter_modules())
    sp("pkgutil.iter_modules", "FOUND" if found else "MISSING")
except Exception as e:
    sp("pkgutil.iter_modules", "ERROR " + type(e).__name__ + " " + str(e))
# pip show
try:
    r = subprocess.run([sys.executable, "-m", "pip", "show", "pytest-asyncio"], capture_output=True, text=True)
    sp("pip_show_rc", r.returncode)
    out = (r.stdout.strip() or r.stderr.strip())
    sp("pip_show_out", out.splitlines()[0] if out else "<empty>")
except Exception as e:
    sp("pip_show", "ERROR " + type(e).__name__ + " " + str(e))
# pytest --version
try:
    r = subprocess.run([sys.executable, "-m", "pytest", "--version"], capture_output=True, text=True)
    sp("pytest_version_rc", r.returncode)
    out = (r.stdout.strip() or r.stderr.strip())
    sp("pytest_version_out", out.splitlines()[0] if out else "<empty>")
except Exception as e:
    sp("pytest_version", "ERROR " + type(e).__name__ + " " + str(e))
PY
```

Length of output: 306


Install pytest-asyncio (required for async tests)
Verification shows pytest-asyncio is MISSING while pytest 8.4.2 is installed; add pytest-asyncio to your test/dev dependencies and CI so async tests in tests/fault_tolerance/deploy/test_deployment.py (L126-L134) are actually awaited.
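
Once the plugin is installed, one common setup is to mark async tests explicitly (a generic sketch; the project may instead set `asyncio_mode = auto` in its pytest config):

```python
import pytest


@pytest.mark.asyncio
async def test_fault_scenario_example():
    # With pytest-asyncio installed, this coroutine is actually awaited
    # rather than silently returning an un-run coroutine object.
    assert True
```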

🧰 Tools
🪛 Ruff (0.12.2)

130-130: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents
In tests/fault_tolerance/deploy/test_deployment.py around lines 126-134, async
tests are used but pytest-asyncio is not installed; add pytest-asyncio to the
project's test/dev dependencies (e.g., pyproject.toml or requirements-dev.txt)
with a version compatible with pytest 8.4.x (for example
pytest-asyncio>=0.21.0), update CI pipeline to install those dev/test deps
before running pytest so async tests are awaited, and re-run CI to verify the
async tests execute correctly.

@rmccorm4
Contributor

Closed in favor of #3194

@rmccorm4 rmccorm4 closed this Sep 24, 2025