Conversation

@nnshah1
Contributor

@nnshah1 nnshah1 commented Sep 8, 2025

Overview:

K8s-based fault-injection tests for recovery and resilience.

Details:

Adds test infrastructure for deploying Dynamo on K8s and injecting faults (pod delete, process kill).

Where should the reviewer start?

managed_deployment.py
test_deployment.py
readme.md

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Added a fault-tolerance test suite with configurable scenarios, load generation, and failure injection for Kubernetes deployments.
    • Introduced a log parsing CLI to summarize latency, success rates, SLA violations, and recovery times from test outputs.
    • Provided a Kubernetes deployment helper to configure, deploy, monitor, and collect logs/metrics from services.
  • Documentation

    • Added a README detailing the test architecture, scenarios, workflows, artifacts, and quick start.
  • Tests

    • Added end-to-end tests that spin up deployments, run concurrent clients, inject failures, and produce summarized results.

Signed-off-by: nnshah1 <[email protected]>
@nnshah1 nnshah1 requested review from a team as code owners September 8, 2025 23:20
@copy-pr-bot

copy-pr-bot bot commented Sep 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Sep 8, 2025

Walkthrough

Adds a Kubernetes-based fault-tolerance testing harness under tests/, including a managed deployment utility, pytest fixtures, scenario definitions, a multiprocessing client with port-forwarding, failure injection logic, result parsing/reporting CLI, and documentation. Also adds SPDX headers to `__init__.py`.

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| **Fault-tolerance docs**<br>`tests/fault_tolerance/deploy/README.md` | New README detailing test architecture, scenarios, workflow, logs, and result interpretation. |
| **Module init/metadata**<br>`tests/fault_tolerance/deploy/__init__.py` | Added SPDX license headers; no behavior changes. |
| **Pytest CLI and fixtures**<br>`tests/fault_tolerance/deploy/conftest.py` | New pytest options (`--image`, `--namespace`) and fixtures (`image`, `namespace`). |
| **K8s managed deployment utility**<br>`tests/utils/managed_deployment.py` | New utility with `DeploymentSpec`/`ServiceSpec`, `PodProcess`, and `ManagedDeployment` for CR lifecycle, readiness, logs, metrics, and port-forwarding. |
| **Test harness: scenarios & client**<br>`tests/fault_tolerance/deploy/scenarios.py`, `tests/fault_tolerance/deploy/client.py` | New dataclasses for `Load`/`Failure`/`Scenario`; predefined deployment and failure matrices. Client sends paced requests via rotating port-forwards to frontend pods, with retries and per-request logging. |
| **E2E test and failure injection**<br>`tests/fault_tolerance/deploy/test_deployment.py` | New pytest e2e test: deploys a scenario, spawns clients, injects failures (process kill or pod delete), aggregates logs, and renders tables via the parser. |
| **Results parsing/reporting**<br>`tests/fault_tolerance/deploy/parse_results.py` | New parser CLI: ingests logs, computes success/failure, latency, SLA, and recovery; prints grouped tables. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant PyTest as PyTest (test_deployment)
  participant MD as ManagedDeployment
  participant K8s as Kubernetes API
  participant Pods as Frontend/Workers Pods
  participant Clients as Client Processes
  participant Parser as parse_results CLI

  PyTest->>MD: init & __aenter__()
  MD->>K8s: create Custom Resource
  MD-->>PyTest: wait for ready (status)
  PyTest->>Clients: spawn N clients (multiprocessing)
  loop For each request
    Clients->>MD: port-forward to a ready Frontend pod
    Clients->>Pods: POST /v1/chat/completions
    Pods-->>Clients: response (timing, status)
    Clients-->>Clients: throttle to max_request_rate
  end
  par Fault injection
    PyTest->>K8s: delete pod(s) OR
    PyTest->>Pods: kill process (signal)
  and Logging
    MD->>K8s: fetch pod logs/metrics
    MD-->>PyTest: store logs in log_dir
  end
  PyTest->>MD: __aexit__() (cleanup)
  PyTest->>Parser: parse logs -> summary tables
  Parser-->>PyTest: results (latency, success/failure, recovery)
```

```mermaid
sequenceDiagram
  autonumber
  participant Client as Client worker
  participant PF as Port-forward tunnel
  participant FE as Frontend Pod
  rect rgb(240,248,255)
  note right of Client: Per-request with retries
  Client->>PF: ensure tunnel to FE:port
  Client->>FE: POST completion
  alt success
    FE-->>Client: 200 + JSON
  else transient error
    FE-->>Client: error / timeout
    loop up to max_retries
      Client->>FE: retry after delay
      FE-->>Client: response
    end
  end
  end
```
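
For readers who want the per-request flow above as code, here is a minimal sketch of a retry loop of this shape. It assumes the `requests` library (which the client uses for its POSTs); the function name and the `max_retries`/`retry_delay` parameters are illustrative, not the PR's actual API.

```python
import time

import requests


def post_with_retries(url, payload, max_retries=3, retry_delay=1.0, timeout=30):
    """Send one POST, retrying transient errors up to max_retries times."""
    last_error = None
    for _attempt in range(1 + max_retries):
        try:
            response = requests.post(url, json=payload, timeout=timeout)
            if response.status_code == 200:
                return response.json()
            last_error = f"HTTP {response.status_code}"
        except requests.RequestException as exc:  # timeout, dropped tunnel, ...
            last_error = exc
        time.sleep(retry_delay)  # pause before retrying a transient error
    raise RuntimeError(f"request failed after {max_retries} retries: {last_error}")
```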

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

Thump-thump goes my tester’s heart, so brave,
I launch the pods, then surf the wave—
A sudden zap! a pod goodnight—
I nibble logs by moonlit light.
Requests hop on, retries in tow,
Recovery blooms—onwards we go! 🐇🚀

Pre-merge checks

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 32.88%, which is below the required threshold of 80.00%. | Run `@coderabbitai generate docstrings` to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The PR title "test: fault injection tests for k8s" concisely reflects the main change — adding Kubernetes fault-injection tests — and is specific enough for a reviewer scanning history to understand the primary purpose. It uses a conventional "test:" prefix and avoids extraneous file lists or emojis. It could be marginally clearer by using "Kubernetes" instead of the abbreviation "k8s", but that is optional. |
| Description Check | ✅ Passed | The PR description follows the repository template and includes the required sections: Overview, Details, Where should the reviewer start, and Related Issues, and it succinctly summarizes the addition of Kubernetes fault-injection tests and points reviewers to the core files. However, the "Where should the reviewer start?" section references "readme.md" while the new doc is at tests/fault_tolerance/deploy/README.md, and the Related Issues line contains a placeholder "#xxx", so those should be corrected for clarity. Overall the description is sufficient to understand intent and scope. |

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

```yaml
reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).
```


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/fault_tolerance/deploy/parse_results.py (1)

438-454: CLI bug: wrong argument order when calling main() (TypeError at runtime).

main(args.log_dir, tablefmt, args.sla) passes a float as log_paths, then iterates it. Must pass sla by name and fix default for log_paths.

Apply this patch (also fixes mutable default arg for log_paths):

-def main(logs_dir, tablefmt, log_paths=[], sla=None):
+def main(logs_dir, tablefmt, sla=None, log_paths=None):
@@
-    if log_paths:
+    if log_paths:
         for log_path in log_paths:
             result = process_test_directory(log_path, sla)
             if result:
                 results.append(result)
@@
-if __name__ == "__main__":
+if __name__ == "__main__":
@@
-    main(args.log_dir, tablefmt, args.sla)
+    main(args.log_dir, tablefmt, sla=args.sla)
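
For anyone unfamiliar with the second half of this fix, here is a standalone illustration (not the project's code) of both hazards: positional binding to the wrong parameter, and a mutable default shared across calls.

```python
def main(logs_dir, tablefmt, log_paths=[], sla=None):  # buggy signature
    # main(args.log_dir, tablefmt, args.sla) binds the float args.sla to
    # log_paths positionally; iterating it then raises TypeError.
    ...

def collect(item, bucket=[]):  # the mutable-default pitfall in isolation
    bucket.append(item)        # mutates the single list created at def time
    return bucket

print(collect("a"))  # ['a']
print(collect("b"))  # ['a', 'b']  <- state leaks between calls

def collect_fixed(item, bucket=None):
    if bucket is None:
        bucket = []            # fresh list per call
    bucket.append(item)
    return bucket
```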
🧹 Nitpick comments (22)
tests/fault_tolerance/deploy/README.md (6)

20-35: Fix typos and clarity in intro/architecture.

  • resilency → resiliency
  • unforseen → unforeseen
  • environemnt → environment
  • “Test pass / failure … It only indicates is the test was executed…” → grammar pass

Apply this minimal patch:

-As a large scale distributed inference serving framework in addition
-to providing high throughput and low latency, Dynamo needs to
-provide fault detection, resilency, and quick recovery in the face of
-unforseen failures. In order to test Dynamo we are developing a test
+As a large-scale distributed inference serving framework, in addition
+to providing high throughput and low latency, Dynamo needs to
+provide fault detection, resiliency, and quick recovery in the face of
+unforeseen failures. To evaluate this, we are developing a test
 suite to inject and measure the impact of different types of failure
 conditions.
@@
-configurations that launch typical dynamo deployments in a Kubernetes
-environemnt and then inject failures by terminating processes or
+configurations that launch typical Dynamo deployments in a Kubernetes
+environment and then inject failures by terminating processes or
 pods. To test the recovery time and impact of failures a set number of
 clients are launched in parallel. Each client sends a set number of
 synchronous requests. Log files are stored for each pod as well as for
 each client and inspected using a post-processing script.
@@
-> Test pass / failure is not an indication of SLA for recovery or resilience
-> It only indicates is the test was executed and data was collected
+> Test pass/fail is not an indication of an SLA for recovery or resilience.
+> It only indicates the test executed and data was collected.

41-44: Remove extra space after heading marker.

“### Test Sequence Diagram” has two spaces after the hashes, tripping MD019.

-###  Test Sequence Diagram
+### Test Sequence Diagram

121-126: Add fenced code languages and fix minor typos in results section.

  • Add languages to code fences (bash, text) to satisfy MD040.
  • Correct “Succes/nBefore”, “Numoer”, “Aftere”, “VllmDecodeWroer”.
-```
+```bash
 pytest tests/fault_tolerance/deploy/test_deployment.py -s -v --namespace ${NAMESPACE}
@@
-```
+```text
 test_fault_scenario[agg-tp-1-dp-1-none]
 .
 ├── client_0.log.txt
@@
-| **Succes/nBefore**   | Numoer of client requests that succeeded before fault injection                                   |
+| **Success Before**   | Number of client requests that succeeded before fault injection                                   |
-| **Failed/nBefore**    | Number of client requests that failed or were invalid before fault injection                      |
+| **Failed Before**    | Number of client requests that failed or were invalid before fault injection                      |
-| **Success/nAftere**   | Number of client requests that  succeeded after fault injection |
+| **Success After**    | Number of client requests that succeeded after fault injection |
@@
-| **{Service}/*.log                  | Container log for pod at end of test (Frontend, VllmDecodeWroer, etc.)                                                                         |
+| **{Service}/*.log**                | Container log for pod at end of test (Frontend, VllmDecodeWorker, etc.)                                 |
-| **{Service}/*.previous.log**       | Previous container log for pod in case of crash / exit. (Frontend, VllmDecodeWroer, etc.). Empty if N/A.               |
+| **{Service}/*.previous.log**       | Previous container log for pod in case of crash/exit (Frontend, VllmDecodeWorker, etc.). Empty if N/A.   |

Also applies to: 131-156, 171-188


235-256: Deduplicate and adjust headings; add code fence languages.

  • Duplicate “Results:”/“Summary:” headings (MD024) and trailing colon (MD026).
  • Add language to code fences (MD040).
-#### Results:
+#### Results
-```
+```text
 ...

-#### Summary:
+#### Summary



Also applies to: 295-316, 454-482


338-346: Replace hard tabs with spaces in Mermaid blocks.

Tabs trigger MD010; convert to spaces for consistent rendering.

-   		end
+        end
@@
-	Frontend_2["Frontend 2"]
+    Frontend_2["Frontend 2"]
@@
-	Frontend_2 <--> DecodePool
+    Frontend_2 <--> DecodePool
@@
-		end
+        end
@@
-	DecodePool --> PrefillPool
+    DecodePool --> PrefillPool

Also applies to: 410-452


490-512: Add languages to Quick Start code fences.

Specify bash to satisfy MD040.

-```
+```bash
 pytest tests/fault_tolerance/deploy/test_deployment.py -s -v --namespace ${NAMESPACE} --image ${IMAGE}

@@
 In case you have multiple configs
-```
+```bash
 export KUBECONFIG=~/.kube/dynamo-kubeconfig

Also applies to: 521-529

tests/fault_tolerance/deploy/parse_results.py (3)

209-221: Make non-JSON log parsing robust to "[TEST] …" prefixes.

Fallback assumes the timestamp is the first token; with `[TEST]` prefix that fails. Extract ISO timestamp via regex.

-                        parts = clean_line.split()
-                        if len(parts) < 2:
-                            continue
-
-                        try:
-                            # Parse timestamp (remove 'Z' for naive datetime)
-                            timestamp = datetime.fromisoformat(
-                                parts[0].replace("Z", "")
-                            )
-                        except ValueError:
-                            continue
-
-                        log_message = " ".join(parts[1:])
+                        # Extract ISO 8601 timestamp anywhere in the line
+                        m = re.search(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?", clean_line)
+                        if not m:
+                            continue
+                        ts = m.group(0).replace("Z", "")
+                        try:
+                            timestamp = datetime.fromisoformat(ts.replace("T", " "))
+                        except ValueError:
+                            continue
+                        log_message = clean_line[m.end():].strip()
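
As a quick check that the suggested pattern behaves as intended, a self-contained run on a `[TEST]`-prefixed line (sample line invented for illustration):

```python
import re
from datetime import datetime

line = "[TEST] 2025-09-08T23:20:11.123Z worker restarted"
m = re.search(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?", line)
if m:
    ts = datetime.fromisoformat(m.group(0).replace("Z", "").replace("T", " "))
    message = line[m.end():].strip()
    print(ts, "|", message)  # 2025-09-08 23:20:11.123000 | worker restarted
```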

267-269: Avoid treating zero‑second recovery as “no recovery.”

If recovery aligns to the same second as the fault, 0 returns None. Use <= 0.

-    if last_recovery_time == 0:
+    if last_recovery_time <= 0:
         return None
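
A tiny worked example of the collision (timestamps invented for illustration): when the fault and the first successful request share a wall-clock second, the computed recovery is exactly zero and is indistinguishable from the "never recovered" sentinel.

```python
from datetime import datetime

fault = datetime.fromisoformat("2025-09-08 23:20:11")
recovered = datetime.fromisoformat("2025-09-08 23:20:11")  # same second
print((recovered - fault).total_seconds())  # 0.0 -- collides with the 0
# sentinel used for "no recovery observed", hence the suggested guard.
```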

56-75: Harden client log parsing against corrupt lines.

A single bad JSON line will currently crash parsing. Wrap json.loads and skip on failure.

-                for line in f:
+                for line in f:
                     request_number += 1
-                    data = json.loads(line.strip())
+                    try:
+                        data = json.loads(line.strip())
+                    except json.JSONDecodeError:
+                        continue
tests/utils/managed_deployment.py (3)

398-404: Typo in method name and callers: _scale_statfulset → _scale_statefulset.

Minor correctness/readability; fix name and references.

-    async def _scale_statfulset(self, name, label, replicas):
+    async def _scale_statefulset(self, name, label, replicas):
@@
-        await self._apps_v1.patch_namespaced_stateful_set_scale(
+        await self._apps_v1.patch_namespaced_stateful_set_scale(
             name, self.namespace, body
         )
         await self._wait_for_pods(label, replicas)
@@
-        await self._scale_statfulset(name, label, 0)
+        await self._scale_statefulset(name, label, 0)
@@
-        await self._scale_statfulset(name, label, 1)
+        await self._scale_statefulset(name, label, 1)

Also applies to: 406-420, 500-505


582-604: Use logging.exception and narrow excepts while collecting logs/metrics.

Broad except Exception with error() hides stack traces; prefer exception() and narrow where practical.

-        except Exception as e:
-            self._logger.error(e)
+        except Exception:
+            self._logger.exception("Failed to write pod YAML")
@@
-        except Exception as e:
-            self._logger.error(e)
+        except Exception:
+            self._logger.exception("Failed to write current pod logs")
@@
-        except Exception as e:
-            self._logger.debug(e)
+        except Exception:
+            self._logger.debug("No previous logs available", exc_info=True)
@@
-        except Exception as e:
-            self._logger.error(str(e))
+        except requests.RequestException:
+            self._logger.exception("Error fetching /metrics")

Also applies to: 641-649
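
The difference is easy to see in a standalone snippet:

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("demo")

try:
    1 / 0
except ZeroDivisionError:
    log.error("plain error")         # message only; the traceback is lost
    log.exception("with traceback")  # same level (ERROR) plus the stack trace
```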


705-717: Consider gating disruptive infra restarts behind a flag.

Restarting etcd and NATS on every test run is heavy and can cause flakiness. Add a restart_infra: bool = False flag and conditionally call _restart_etcd()/_restart_nats().

Happy to propose a small diff adding the flag if you confirm this won’t be needed for every scenario.
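
For concreteness, a minimal sketch of the proposed gate (the flag name comes from this comment; the class body is a trimmed stand-in with stubbed restart methods, not the PR's actual implementation):

```python
import asyncio


class ManagedDeployment:
    def __init__(self, namespace: str, restart_infra: bool = False):
        self.namespace = namespace
        self.restart_infra = restart_infra  # opt-in; default leaves etcd/NATS alone

    async def _restart_etcd(self):  # stub for illustration
        print("restarting etcd")

    async def _restart_nats(self):  # stub for illustration
        print("restarting NATS")

    async def _maybe_restart_infra(self):
        if self.restart_infra:  # only opted-in scenarios pay the restart cost
            await self._restart_etcd()
            await self._restart_nats()


asyncio.run(ManagedDeployment("ft-test", restart_infra=True)._maybe_restart_infra())
```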

tests/fault_tolerance/deploy/conftest.py (1)

19-31: LGTM; add help text for CLI options for discoverability.

Minor ergonomics: provide help= for --image and --namespace.

-    parser.addoption("--image", type=str, default=None)
-    parser.addoption("--namespace", type=str, default="fault-tolerance-test")
+    parser.addoption("--image", type=str, default=None, help="Container image override for services")
+    parser.addoption("--namespace", type=str, default="fault-tolerance-test", help="Kubernetes namespace to use")
tests/fault_tolerance/deploy/test_deployment.py (3)

129-175: Don’t block the event loop in an async test; move failure injection to a thread.

time.sleep inside async def blocks the loop for long periods. Run the injector in a thread.

Apply this diff:

@@
+import asyncio
@@
         with _clients(
@@
-        ):
-            _inject_failures(scenario.failures, logger, deployment)
+        ):
+            await asyncio.to_thread(_inject_failures, scenario.failures, logger, deployment)
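
A standalone demonstration (Python 3.9+ for `asyncio.to_thread`) of why this matters: the blocking sleep below runs in a worker thread, so a concurrent coroutine still makes progress instead of queueing behind it.

```python
import asyncio
import time


def blocking_injector():
    time.sleep(2)  # stands in for the sleep-heavy failure-injection loop


async def main():
    t0 = time.monotonic()
    await asyncio.gather(
        asyncio.to_thread(blocking_injector),  # off the event loop
        asyncio.sleep(2),                      # proceeds concurrently
    )
    print(f"elapsed ~{time.monotonic() - t0:.1f}s (about 2s, not 4s)")


asyncio.run(main())
```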

61-65: Join with timeout and terminate on hang to avoid deadlocks.

If a client process wedges, join() will block indefinitely during teardown.

Apply this diff:

     for proc in procs:
         logger.debug(f"{proc} waiting for join")
-        proc.join()
+        proc.join(timeout=60)
+        if proc.is_alive():
+            logger.warning("Terminating hung client process: %s", proc.pid)
+            proc.terminate()
+            proc.join(timeout=10)
         logger.debug(f"{proc} joined")
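
The pattern in isolation, with a deliberately wedged worker (illustrative, not the PR's client):

```python
import multiprocessing
import time


def worker():
    time.sleep(3600)  # simulates a client that never exits


if __name__ == "__main__":
    proc = multiprocessing.Process(target=worker)
    proc.start()
    proc.join(timeout=1)       # bounded wait instead of blocking teardown forever
    if proc.is_alive():
        proc.terminate()       # SIGTERM the straggler
        proc.join(timeout=10)  # reap it so the test run can finish
```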

67-67: Remove unused noqa F811 directives.

These suppressions are unnecessary and flagged by Ruff.

Apply this diff:

-def _inject_failures(failures, logger, deployment: ManagedDeployment):  # noqa: F811
+def _inject_failures(failures, logger, deployment: ManagedDeployment):
@@
-@pytest.fixture(autouse=True)
-def results_table(request, scenario):  # noqa: F811
+@pytest.fixture(autouse=True)
+def results_table(request, scenario):
@@
-async def test_fault_scenario(
-    scenario,  # noqa: F811
+async def test_fault_scenario(
+    scenario,

Also applies to: 105-105, 130-130

tests/fault_tolerance/deploy/client.py (3)

49-53: Avoid configuring global logging in an importable module.

basicConfig at import time can interfere with pytest logging. Consider moving it under `if __name__ == "__main__":` or letting the test harness configure logging.
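
A common shape for this (a generic sketch, not the PR's code):

```python
import logging

logger = logging.getLogger(__name__)  # module-scoped; no config on import


def do_work():
    logger.debug("work started")  # inherits whatever the host app configured


if __name__ == "__main__":
    # Configure the root logger only when run as a script, never on import,
    # so pytest (or any embedding harness) keeps control of logging.
    logging.basicConfig(level=logging.DEBUG)
    do_work()
```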


61-72: Use logger or drop the arg; tighten retry loop and timestamp.

  • logger is unused (Ruff ARG001). Either use it or remove it.
  • Prefer timezone-aware ISO timestamp for consistency.

Apply this diff:

 def _single_request(
     url,
     pod,
     payload,
     model,
-    logger,
+    logger,
     retry_attempts=1,
@@
-    start_time = time.time()
+    start_time = time.time()
     results = []
@@
             end_time = time.time()
 
             content = None
@@
             results.append(
                 {
                     "status": response.status_code,
                     "result": content,
                     "request_elapsed_time": end_time - start_request_time,
                     "url": url,
                     "pod": pod,
                 }
             )
+            if logger:
+                logger.debug("POST %s -> %s in %.3fs", url, response.status_code, end_time - start_request_time)
@@
     return {
-        "time": datetime.now().strftime("%Y-%m-%dT%H:%M:%S"),
+        "time": datetime.now().astimezone().isoformat(timespec="seconds"),
         "results": results,
         "total_time": time.time() - start_time,
         "url": url,
         "pod": pod,
     }

Also applies to: 83-117, 118-131, 132-139


216-218: Broadened exception handling; stop port-forwards and log stack traces.

Use logger.exception and always stop port-forward tunnels.

Apply this diff:

-    except Exception as e:
-        logger.error(str(e))
-    logger.info("Exiting")
+    except Exception:
+        logger.exception("Client error")
+    finally:
+        for pf in pod_ports.values():
+            try:
+                pf.stop()
+            except Exception:
+                logger.debug("Port-forward stop failed", exc_info=True)
+        logger.info("Exiting")
tests/fault_tolerance/deploy/scenarios.py (3)

16-19: Tighten types with Optional; fix minor doc typos below.

Use Optional for nullable fields to improve editor/type-checker help.

Apply this diff:

-from dataclasses import dataclass
+from dataclasses import dataclass
+from typing import Optional
@@
 class Load:
@@
-    sla: float = None
+    sla: Optional[float] = None
@@
 class Scenario:
     deployment: DeploymentSpec
     load: Load
     failures: list[Failure]
-    model: str = None
+    model: Optional[str] = None

Also applies to: 21-30, 41-47


93-93: Fix typos in comments (readability).

Replicats → Replicas; scenaro → scenario; pervious → previous.

Apply this diff:

-# Derivative Specs With Incremented Replicats
+# Derivative specs with incremented replicas
@@
-# Each failure scenaro contains a list of failure injections
-# Each failure injection has a time in seconds after the pervious injection and
+# Each failure scenario contains a list of failure injections
+# Each failure injection has a time in seconds after the previous injection and

Also applies to: 109-119


62-76: Consistency: prefer set_tensor_parallel over direct property mutation.

Use the helper to avoid diverging behaviors across specs.

Example:

deployment_specs["disagg-prefill-tp-2-decode-tp-2-dp-1"].set_tensor_parallel(
    2, ["VllmPrefillWorker", "VllmDecodeWorker"]
)

Also applies to: 83-92

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a73b3c and ff83fe7.

📒 Files selected for processing (8)
  • tests/fault_tolerance/deploy/README.md (1 hunks)
  • tests/fault_tolerance/deploy/__init__.py (1 hunks)
  • tests/fault_tolerance/deploy/client.py (1 hunks)
  • tests/fault_tolerance/deploy/conftest.py (1 hunks)
  • tests/fault_tolerance/deploy/parse_results.py (1 hunks)
  • tests/fault_tolerance/deploy/scenarios.py (1 hunks)
  • tests/fault_tolerance/deploy/test_deployment.py (1 hunks)
  • tests/utils/managed_deployment.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
tests/fault_tolerance/deploy/parse_results.py (1)
tests/utils/managed_deployment.py (1)
  • main (727-758)
tests/fault_tolerance/deploy/conftest.py (1)
tests/utils/managed_deployment.py (4)
  • image (36-41)
  • image (44-49)
  • namespace (201-203)
  • namespace (206-207)
tests/fault_tolerance/deploy/test_deployment.py (4)
tests/fault_tolerance/deploy/client.py (1)
  • client (141-218)
tests/fault_tolerance/deploy/parse_results.py (1)
  • main (320-435)
tests/utils/managed_deployment.py (21)
  • main (727-758)
  • ManagedDeployment (350-724)
  • namespace (201-203)
  • namespace (206-207)
  • model (61-72)
  • model (75-99)
  • name (30-32)
  • name (178-180)
  • name (197-198)
  • get_pods (554-580)
  • replicas (53-54)
  • replicas (57-58)
  • get_pod_logs (582-605)
  • get_processes (541-547)
  • kill (322-331)
  • image (36-41)
  • image (44-49)
  • disable_grove (209-214)
  • set_image (224-230)
  • set_model (216-222)
  • set_logging (248-270)
tests/fault_tolerance/deploy/conftest.py (2)
  • namespace (30-31)
  • image (25-26)
tests/fault_tolerance/deploy/client.py (1)
tests/utils/managed_deployment.py (10)
  • ManagedDeployment (350-724)
  • namespace (201-203)
  • namespace (206-207)
  • get_pods (554-580)
  • port (183-185)
  • name (30-32)
  • name (178-180)
  • name (197-198)
  • port_forward (667-697)
  • endpoint (193-194)
tests/utils/managed_deployment.py (2)
benchmarks/profiler/utils/config.py (1)
  • Service (54-58)
tests/fault_tolerance/deploy/client.py (1)
  • client (141-218)
tests/fault_tolerance/deploy/scenarios.py (1)
tests/utils/managed_deployment.py (8)
  • DeploymentSpec (166-311)
  • replicas (53-54)
  • replicas (57-58)
  • model (61-72)
  • model (75-99)
  • set_tensor_parallel (232-246)
  • tensor_parallel_size (118-130)
  • tensor_parallel_size (133-163)
🪛 Ruff (0.12.2)
tests/fault_tolerance/deploy/parse_results.py

241-241: Unused function argument: failure_type

(ARG001)


260-260: Loop control variable process not used within loop body

(B007)


261-261: Loop control variable replica not used within loop body

Rename unused replica to _replica

(B007)


320-320: Do not use mutable data structures for argument defaults

Replace with None; initialize within function

(B006)

tests/fault_tolerance/deploy/test_deployment.py

67-67: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)


105-105: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)


130-130: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)

tests/fault_tolerance/deploy/client.py

58-58: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)


66-66: Unused function argument: logger

(ARG001)


216-216: Do not catch blind exception: Exception

(BLE001)


217-217: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

tests/utils/managed_deployment.py

344-344: Do not catch blind exception: Exception

(BLE001)


372-372: Do not catch blind exception: Exception

(BLE001)


396-396: Create your own exception

(TRY002)


396-396: Avoid specifying long messages outside the exception class

(TRY003)


487-487: Do not catch blind exception: Exception

(BLE001)


492-492: Avoid specifying long messages outside the exception class

(TRY003)


589-589: Do not catch blind exception: Exception

(BLE001)


590-590: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


594-594: Do not catch blind exception: Exception

(BLE001)


595-595: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


602-602: Do not catch blind exception: Exception

(BLE001)


611-611: Loop control variable i not used within loop body

Rename unused i to _i

(B007)


641-641: Do not catch blind exception: Exception

(BLE001)


642-642: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


672-672: Possible binding to all interfaces

(S104)

🪛 markdownlint-cli2 (0.17.2)
tests/fault_tolerance/deploy/README.md

41-41: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)


119-119: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


131-131: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


170-170: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


235-235: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


237-237: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


257-257: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


295-295: Multiple headings with the same content

(MD024, no-duplicate-heading)


295-295: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


296-296: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


316-316: Multiple headings with the same content

(MD024, no-duplicate-heading)


316-316: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


344-344: Hard tabs
Column: 4

(MD010, no-hard-tabs)


365-365: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


367-367: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


394-394: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


414-414: Hard tabs
Column: 1

(MD010, no-hard-tabs)


420-420: Hard tabs
Column: 1

(MD010, no-hard-tabs)


428-428: Hard tabs
Column: 1

(MD010, no-hard-tabs)


432-432: Hard tabs
Column: 1

(MD010, no-hard-tabs)


446-446: Hard tabs
Column: 1

(MD010, no-hard-tabs)


454-454: Multiple headings with the same content

(MD024, no-duplicate-heading)


454-454: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


456-456: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


483-483: Multiple headings with the same content

(MD024, no-duplicate-heading)


483-483: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


504-504: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


510-510: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


521-521: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
tests/fault_tolerance/deploy/__init__.py (1)

1-2: LGTM on SPDX headers.

Header is correct and non-invasive.

tests/fault_tolerance/deploy/test_deployment.py (1)

156-160: Great: deployment context setup is clear and bounded.

tests/fault_tolerance/deploy/client.py (1)

140-156: LGTM: client scaffolding is clear and aligns with ManagedDeployment.

tests/fault_tolerance/deploy/scenarios.py (1)

146-152: Scenario generation filter is correct.

Skipping prefill failures for non-disagg deployments looks good.

Comment on lines +126 to +134
```python
@pytest.mark.e2e
@pytest.mark.slow
@pytest.mark.filterwarnings("ignore::DeprecationWarning")
async def test_fault_scenario(
    scenario,  # noqa: F811
    request,
    image,
    namespace,
):
```

💡 Verification agent

🧩 Analysis chain

Verify pytest-asyncio is available to run async tests.

Without the plugin, async tests won’t be awaited.


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
python - <<'PY'
import importlib, sys
ok = importlib.util.find_spec("pytest_asyncio") is not None
print("pytest-asyncio:", "FOUND" if ok else "MISSING")
sys.exit(0 if ok else 1)
PY
```

Length of output: 131


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "python binary: $(command -v python || echo 'none')"
python -V || true
python - <<'PY'
import sys, pkgutil, subprocess
def detect():
    results = {}
    try:
        import pytest_asyncio
        results['import'] = 'FOUND'
    except Exception as e:
        results['import'] = f'MISSING ({e!r})'
    try:
        import importlib.util
        spec = importlib.util.find_spec("pytest_asyncio")
        results['importlib.util.find_spec'] = 'FOUND' if spec else 'MISSING'
    except Exception as e:
        results['importlib.util.find_spec'] = f'ERROR ({e!r})'
    try:
        found = any(m.name == "pytest_asyncio" for m in pkgutil.iter_modules())
        results['pkgutil.iter_modules'] = 'FOUND' if found else 'MISSING'
    except Exception as e:
        results['pkgutil.iter_modules'] = f'ERROR ({e!r})'
    try:
        r = subprocess.run([sys.executable, "-m", "pip", "show", "pytest-asyncio"], capture_output=True, text=True)
        results['pip_show_rc'] = r.returncode
        results['pip_show_out'] = (r.stdout.strip() or r.stderr.strip())[:1000]
    except Exception as e:
        results['pip_show'] = f'ERROR ({e!r})'
    try:
        r = subprocess.run([sys.executable, "-m", "pytest", "--version"], capture_output=True, text=True)
        results['pytest_version_rc'] = r.returncode
        results['pytest_version_out'] = (r.stdout.strip() or r.stderr.strip())[:1000]
    except Exception as e:
        results['pytest_version'] = f'ERROR ({e!r})'
    for k,v in results.items():
        print(f"{k}: {v}")
detect()
PY
```

Length of output: 223


🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "python binary: $(command -v python || echo 'none')"
python -V || true

python - <<'PY'
import sys, pkgutil, subprocess
def sp(k,v):
    print(str(k) + ": " + str(v))
# import check
try:
    import pytest_asyncio
    sp("import", "FOUND")
except Exception as e:
    sp("import", "MISSING " + type(e).__name__ + " " + str(e))
# importlib.util.find_spec
try:
    import importlib.util as iu
    spec = iu.find_spec("pytest_asyncio")
    sp("importlib.util.find_spec", "FOUND" if spec else "MISSING")
except Exception as e:
    sp("importlib.util.find_spec", "ERROR " + type(e).__name__ + " " + str(e))
# pkgutil.iter_modules
try:
    found = any(m.name == "pytest_asyncio" for m in pkgutil.iter_modules())
    sp("pkgutil.iter_modules", "FOUND" if found else "MISSING")
except Exception as e:
    sp("pkgutil.iter_modules", "ERROR " + type(e).__name__ + " " + str(e))
# pip show
try:
    r = subprocess.run([sys.executable, "-m", "pip", "show", "pytest-asyncio"], capture_output=True, text=True)
    sp("pip_show_rc", r.returncode)
    out = (r.stdout.strip() or r.stderr.strip())
    sp("pip_show_out", out.splitlines()[0] if out else "<empty>")
except Exception as e:
    sp("pip_show", "ERROR " + type(e).__name__ + " " + str(e))
# pytest --version
try:
    r = subprocess.run([sys.executable, "-m", "pytest", "--version"], capture_output=True, text=True)
    sp("pytest_version_rc", r.returncode)
    out = (r.stdout.strip() or r.stderr.strip())
    sp("pytest_version_out", out.splitlines()[0] if out else "<empty>")
except Exception as e:
    sp("pytest_version", "ERROR " + type(e).__name__ + " " + str(e))
PY
```

Length of output: 306


Install pytest-asyncio (required for async tests)
Verification shows pytest-asyncio is MISSING while pytest 8.4.2 is installed; add pytest-asyncio to your test/dev dependencies and CI so async tests in tests/fault_tolerance/deploy/test_deployment.py (L126-L134) are actually awaited.
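
Once the plugin is installed, one common setup is to mark async tests explicitly (a generic sketch; the project may instead set `asyncio_mode = auto` in its pytest config):

```python
import pytest


@pytest.mark.asyncio
async def test_fault_scenario_example():
    # With pytest-asyncio installed, this coroutine is actually awaited
    # rather than silently returning an un-run coroutine object.
    assert True
```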

🧰 Tools
🪛 Ruff (0.12.2)

130-130: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents
In tests/fault_tolerance/deploy/test_deployment.py around lines 126-134, async
tests are used but pytest-asyncio is not installed; add pytest-asyncio to the
project's test/dev dependencies (e.g., pyproject.toml or requirements-dev.txt)
with a version compatible with pytest 8.4.x (for example
pytest-asyncio>=0.21.0), update CI pipeline to install those dev/test deps
before running pytest so async tests are awaited, and re-run CI to verify the
async tests execute correctly.

@rmccorm4
Contributor

Closed in favor of #3194

@rmccorm4 rmccorm4 closed this Sep 24, 2025