test: fault injection tests for k8s #3194

tzulingk · 2025-09-24T04:22:19Z

Overview:

K8s based fault injection tests for recovery and resilience.
(cherry pick PR2942, help nnshah1 to merge it and fix mypy errors)

Details:

Adds in test infrastructure for deploying dynamo on K8s and injecting faults (pod delete, process kill).

Where should the reviewer start?

managed_deployment.py
test_deployment.py
readme.md

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

DIS-276

Summary by CodeRabbit

Documentation
- Added a comprehensive guide for the Fault Tolerance Test Suite, including architecture, scenarios, failure types, result formats, and a Quick Start with setup and run instructions.
Tests
- Introduced an end-to-end fault-tolerance testing framework: scenario registry, client load generator with retries and rate controls, and automated failure injection.
- Added result parsing and reporting with grouped tables, latency metrics, recovery-time analysis, and optional SLA evaluation.
- Provided configurable test options (e.g., image and namespace) and a Kubernetes-based deployment manager to automate setup, logging, and cleanup.

Signed-off-by: nnshah1 <[email protected]>

Signed-off-by: [email protected] <[email protected]>

coderabbitai · 2025-09-24T04:29:27Z

Walkthrough

Adds a new fault-tolerance test suite for Dynamo deployments: documentation, pytest config, deployment orchestration utilities, scenario catalog, client load generator with retries and port-forwarding, failure injection, end-to-end test runner, and result parsing/reporting CLI. Introduces Kubernetes-managed deployment lifecycle, log/metrics collection, and grouped summary tables.

Changes

Cohort / File(s)	Summary of Changes
Documentation `tests/fault_tolerance/deploy/README.md`	New README describing test architecture, deployments, failures, scenarios, run instructions, diagrams, and results structure.
Pytest configuration `tests/fault_tolerance/deploy/conftest.py`	Adds pytest options `--image`, `--namespace` and fixtures `image`, `namespace`.
Client load generator `tests/fault_tolerance/deploy/client.py`	New client with GPT-style payload template, random prompts, HTTP POST with retries, per-request timing/status, pod-aware port-forwarding, rate limiting, and JSONL logging. Exposes `client(...)`.
Scenario catalog `tests/fault_tolerance/deploy/scenarios.py`	Adds dataclasses `Load`, `Failure`, `Scenario`; defines multiple deployment specs (agg/disagg, TP variants), failure profiles, default load/model, and builds `scenarios` mapping with filtering for prefill failures.
E2E fault tests `tests/fault_tolerance/deploy/test_deployment.py`	Parametrized test over `scenarios`; spawns multiple client processes; schedules and injects failures (pod/process); manages deployment lifecycle; autouse fixtures to parse and summarize results.
Results parsing/reporting `tests/fault_tolerance/deploy/parse_results.py`	Parser producing grouped tables: extracts startup/fault times, aggregates client logs, computes latency/success/SLA metrics, parses process readiness, derives recovery time, and provides CLI (`--log-dir`, `--format`, `--sla`).
Deployment management utilities `tests/utils/managed_deployment.py`	Introduces `ServiceSpec`, `DeploymentSpec`, `PodProcess`, and async `ManagedDeployment` to create/delete CR, wait for pods, scale/restart components, fetch logs/metrics, and port-forward. Includes example `main()`.
Package plumbing `tests/fault_tolerance/deploy/__init__.py`	Adds SPDX license-only `__init__.py`.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Tester
  participant Pytest as Pytest Runner
  participant MD as ManagedDeployment
  participant Clients as Client Procs
  participant K8s as Kubernetes
  participant Parser as Results Parser

  Tester->>Pytest: run test_fault_scenario(params)
  Pytest->>MD: __aenter__ (create CR, wait ready)
  MD->>K8s: create DynamographDeployment CR
  K8s-->>MD: pods/services ready
  Pytest->>Clients: spawn N processes (client(...))
  Note over Clients: Each selects ready pod, port-forwards if needed,<br/>sends HTTP requests with retries and logs JSONL
  Pytest->>K8s: inject failures (pod delete / process kill) at T
  Clients-->>Pytest: write per-request logs
  Pytest->>MD: collect pod logs/metrics
  Pytest->>MD: __aexit__ (delete CR, cleanup)
  Pytest->>Parser: parse_results (logs_dir, SLA)
  Parser-->>Pytest: grouped tables (startup, recovery, metrics)

sequenceDiagram
  autonumber
  participant Client as Client
  participant MD as ManagedDeployment
  participant Pod as Target Pod
  participant PF as Port-forward
  participant Svc as Model Service

  loop For each request
    Client->>MD: get ready pods
    alt no local PF for selected pod
      Client->>Pod: establish port-forward
      Pod-->>Client: PF established (local port)
    end
    Client->>Svc: POST /v1/chat/completions via PF
    alt success
      Svc-->>Client: 200 + payload
    else transient error
      Client->>Svc: retry (<= max_retries, with delay)
      Svc-->>Client: response or final error
    end
    Client-->>Client: log time, status, elapsed
    Client-->>Client: sleep to enforce max_request_rate
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~70 minutes

Possibly related PRs

test: fault tolerance tests #1444 — Touches the same fault-tolerance suite (client, scenarios, parse_results, conftest); likely complementary or overlapping implementations.
chore: updating and removing tests #2130 — Removes many of the same fault_tolerance files added here; directly conflicting/overlapping changes.

Poem

I tap my paws on cluster ground,
Pods hop up, then fall back down.
With whiskered logs and JSON light,
I chase the faults through starry night.
Requests retry, metrics chime—
Recover! Recover! Right on time.
Carrots for green pipelines! 🥕✨

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 32.88% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title "test: fault injection tests for k8s" accurately and concisely describes the primary change—adding Kubernetes-based fault-injection tests and related infrastructure—so it is directly related to the changeset. It is short and focused, though slightly redundant ("test" and "tests") and uses the abbreviation "k8s" which could be expanded for clarity.
Description Check	✅ Passed	The PR description includes the required template sections (Overview, Details, Where should the reviewer start, Related Issues) and succinctly states the intent to add Kubernetes-based fault injection tests and supporting infrastructure. It points reviewers to key files (managed_deployment.py, test_deployment.py, readme.md) and references the related issue. The description is concise and follows the repository template, though brief.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

tests/fault_tolerance/deploy/test_deployment.py (1)
159-173: Await the async failure injector

After making _inject_failures async, await it here.
         with _clients(
             logger,
             scenario.load.clients,
             request,
             scenario.deployment,
             namespace,
             model,
             scenario.load.requests_per_client,
             scenario.load.input_token_length,
             scenario.load.output_token_length,
             scenario.load.max_retries,
             scenario.load.max_request_rate,
         ):
-            _inject_failures(scenario.failures, logger, deployment)
+            await _inject_failures(scenario.failures, logger, deployment)
tests/fault_tolerance/deploy/parse_results.py (1)
453-454: CLI bug: args.sla passed as log_paths positional

main(logs_dir, tablefmt, log_paths=None, sla=None) is called with sla as the third positional arg. Pass by keyword.
-    main(args.log_dir, tablefmt, args.sla)
+    main(args.log_dir, tablefmt, sla=args.sla)

🧹 Nitpick comments (19)

tests/utils/managed_deployment.py (5)
724-727: Avoid bare except, log the exception

Catching all exceptions silently can hide root causes. Catch Exception and log traceback before cleanup.

Apply this diff:
-        except:
-            await self._cleanup()
-            raise
+        except Exception:
+            self._logger.exception("Error during deployment setup; cleaning up")
+            await self._cleanup()
+            raise
734-769: Don’t block the event loop with time.sleep in async context

Use await asyncio.sleep(...) inside async with. Also prefer logger over print for consistency.

Apply this diff:
-    print(deployment_spec._deployment_spec)
+    logging.getLogger(__name__).info("Spec: %s", deployment_spec._deployment_spec)
@@
-    print(f"Logging config: {deployment_spec.get_logging_config()}")
+    logging.getLogger(__name__).info("Logging config: %s", deployment_spec.get_logging_config())
@@
-    async with ManagedDeployment(
+    async with ManagedDeployment(
         namespace="test", log_dir=".", deployment_spec=deployment_spec
     ):
-        time.sleep(60)
+        await asyncio.sleep(60)
255-271: Harden set_logging against missing spec and use exception logging

Guard against missing spec and prefer structured logging.

Apply this diff:
     def set_logging(self, enable_jsonl: bool = True, log_level: str = "debug"):
@@
-        spec = self._deployment_spec
-        if "envs" not in spec["spec"]:
-            spec["spec"]["envs"] = []
+        spec = self._deployment_spec
+        spec.setdefault("spec", {})
+        spec["spec"].setdefault("envs", [])
381-399: Avoid generic Exception with long message; define a specific timeout error

Raise a specific exception type and carry structured context.

Apply this diff:
-        raise Exception(f"Didn't Reach Expected Pod Count {label}=={expected}")
+        raise TimeoutError(f"Timed out waiting for pods with {label} ready={expected}")
10-12: Dataclass logger default and API client reuse

Use field(default_factory=...) for _logger to avoid calling at import time.

Reuse the same ApiClient for AppsV1Api to be consistent.

Apply this diff:
-from dataclasses import dataclass
+from dataclasses import dataclass, field
@@
-    _logger: logging.Logger = logging.getLogger()
+    _logger: logging.Logger = field(default_factory=lambda: logging.getLogger(__name__))
@@
-        self._apps_v1 = client.AppsV1Api()
+        self._apps_v1 = client.AppsV1Api(k8s_client)
Also applies to: 359-360, 376-380
tests/fault_tolerance/deploy/conftest.py (1)
19-22: Add help text for CLI options

Small UX improvement for discoverability.

Apply this diff:
 def pytest_addoption(parser):
-    parser.addoption("--image", type=str, default=None)
-    parser.addoption("--namespace", type=str, default="fault-tolerance-test")
+    parser.addoption("--image", type=str, default=None, help="Container image to use for services")
+    parser.addoption("--namespace", type=str, default="fault-tolerance-test", help="Kubernetes namespace for tests")
tests/fault_tolerance/deploy/README.md (4)
20-36: Fix typos/wording in docs

Several typos reduce clarity (e.g., resilency→resiliency, environemnt→environment, unforseen→unforeseen, Wroer→Worker, Numoer→Number, Aftere→After, Redunancy→Redundancy, simmple→simple, disaaggregated→disaggregated, Over Provisoned→Over Provisioned, “an cluster”→“a cluster”, “in to”→“into”). Please correct across these sections.

Also applies to: 158-166, 191-201, 207-213, 324-333, 396-401, 515-528

131-131: Annotate fenced code blocks with language

Add language specifiers to satisfy markdownlint and improve rendering.

Apply this diff (representative samples):
-```
+```text
@@
-```bash
+```bash
@@
-```
+```bash
@@
-```
+```bash
@@
-```
+```bash
Also applies to: 170-170, 237-237, 296-296, 367-367, 504-504, 510-510, 521-521

---

`295-296`: **Avoid duplicate headings at same level**

Rename repeated “Results”/“Summary” headings to be unique (e.g., “Results (agg-tp-1-dp-2)”).



Also applies to: 316-316, 454-456, 483-483

---

`344-344`: **Replace hard tabs with spaces**

Tabs in markdown break alignment in some renderers; replace with spaces.



Also applies to: 414-414, 420-420, 428-428, 432-432, 446-446

</blockquote></details>
<details>
<summary>tests/fault_tolerance/deploy/test_deployment.py (2)</summary><blockquote>

`59-63`: **Client processes may hang indefinitely; add join timeouts and termination**

Guard joins to avoid deadlocks; terminate stragglers and surface failures via exit codes.

```diff
     for proc in procs:
         logger.debug(f"{proc} waiting for join")
-        proc.join()
+        proc.join(timeout=60)
+        if proc.is_alive():
+            logger.warning(f"{proc} did not exit in time; terminating")
+            proc.terminate()
+            proc.join(timeout=10)
+        if proc.exitcode not in (0, None):
+            logger.error(f"{proc} exited with code {proc.exitcode}")
         logger.debug(f"{proc} joined")
65-65: Remove unused noqa: F811 suppressions

Ruff flags these as unused.
-def _inject_failures(failures, logger, deployment: ManagedDeployment):  # noqa: F811
+def _inject_failures(failures, logger, deployment: ManagedDeployment):
@@
-@pytest.fixture(autouse=True)
-def results_table(request, scenario):  # noqa: F811
+@pytest.fixture(autouse=True)
+def results_table(request, scenario):
@@
-async def test_fault_scenario(
-    scenario,  # noqa: F811
+async def test_fault_scenario(
+    scenario,
Also applies to: 103-103, 128-128
tests/fault_tolerance/deploy/scenarios.py (3)
53-60: Hardcoded absolute paths; parameterize for portability

Using /workspace/... will break outside the container. Support an environment variable or repo-relative paths.
+import os
+from pathlib import Path
@@
-deployment_specs = {
-    "agg-tp-1-dp-1": (
-        DeploymentSpec("/workspace/components/backends/vllm/deploy/agg.yaml")
-    ),
-    "disagg-tp-1-dp-1": (
-        DeploymentSpec("/workspace/components/backends/vllm/deploy/disagg.yaml")
-    ),
-}
+DEPLOY_ROOT = Path(os.environ.get("DYNAMO_DEPLOY_ROOT", "/workspace/components/backends/vllm/deploy"))
+deployment_specs = {
+    "agg-tp-1-dp-1": DeploymentSpec(str(DEPLOY_ROOT / "agg.yaml")),
+    "disagg-tp-1-dp-1": DeploymentSpec(str(DEPLOY_ROOT / "disagg.yaml")),
+}
62-77: Minor duplication in TP-2 setups

You can reduce repetition by looping over sizes and services; not blocking.

94-108: Typos in comments (“Replicats”, “scenaro”, “pervious”)

Fix spelling to keep docs clean.
-# Derivative Specs With Incremented Replicats
+# Derivative specs with incremented replicas
@@
-# Each failure scenaro contains a list of failure injections
-# Each failure injection has a time in seconds after the pervious injection and
+# Each failure scenario contains a list of failure injections
+# Each failure injection has a time in seconds after the previous injection and
tests/fault_tolerance/deploy/parse_results.py (4)
241-270: Unused parameter ‘failure_type’; clean up signature and call

Either use it or drop it. Simplest: remove from calculate_recovery_time and its call.
-def calculate_recovery_time(test_dir, failure_type, fault_time):
+def calculate_recovery_time(test_dir, fault_time):
@@
-    recovery_time = calculate_recovery_time(test_dir, failure_type, fault_time)
+    recovery_time = calculate_recovery_time(test_dir, fault_time)
Also applies to: 299-299

254-266: Minor: clarify loop variable names to appease linters

Rename variables for readability and to avoid B007 warnings.
-    for process in processes:
-        starts = parse_process_log(os.path.join(test_dir, process), process)
+    for proc_name in processes:
+        starts = parse_process_log(os.path.join(test_dir, proc_name), proc_name)
@@
-    for process, replicas in process_start.items():
-        for replica, container_starts in replicas.items():
+    for proc_name, replicas in process_start.items():
+        for _replica, container_starts in replicas.items():
27-54: parse_test_log tokenization is brittle; consider robust timestamp parsing

Splitting by space and taking index 1 assumes a specific log format. If LOG_FORMAT changes, this breaks. Consider regex to extract ISO timestamps within brackets.

56-105: parse_client_logs: guard JSON decode and partial writes

A single corrupt line will raise and drop the whole client file. Wrap json.loads in try/except and continue; also coerce fields safely. Optional but improves resilience.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6243bcb and 2959170.

📒 Files selected for processing (8)

tests/fault_tolerance/deploy/README.md (1 hunks)
tests/fault_tolerance/deploy/__init__.py (1 hunks)
tests/fault_tolerance/deploy/client.py (1 hunks)
tests/fault_tolerance/deploy/conftest.py (1 hunks)
tests/fault_tolerance/deploy/parse_results.py (1 hunks)
tests/fault_tolerance/deploy/scenarios.py (1 hunks)
tests/fault_tolerance/deploy/test_deployment.py (1 hunks)
tests/utils/managed_deployment.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (6)

tests/fault_tolerance/deploy/conftest.py (1)

tests/utils/managed_deployment.py (4)

image (36-41)

image (44-49)

namespace (201-203)

namespace (206-207)

tests/fault_tolerance/deploy/parse_results.py (2)

tests/utils/managed_deployment.py (1)

main (733-764)

tests/utils/managed_process.py (1)

log_path (98-100)

tests/fault_tolerance/deploy/scenarios.py (1)

tests/utils/managed_deployment.py (8)

DeploymentSpec (166-311)

replicas (53-54)

replicas (57-58)

model (61-72)

model (75-99)

set_tensor_parallel (232-246)

tensor_parallel_size (118-130)

tensor_parallel_size (133-163)

tests/fault_tolerance/deploy/client.py (1)

tests/utils/managed_deployment.py (10)

ManagedDeployment (350-730)

namespace (201-203)

namespace (206-207)

get_pods (560-586)

port (187-189)

name (30-32)

name (178-180)

name (183-184)

port_forward (673-703)

endpoint (197-198)

tests/fault_tolerance/deploy/test_deployment.py (4)

tests/fault_tolerance/deploy/client.py (1)

client (142-220)

tests/fault_tolerance/deploy/parse_results.py (1)

main (320-435)

tests/utils/managed_deployment.py (18)

main (733-764)

ManagedDeployment (350-730)

namespace (201-203)

namespace (206-207)

model (61-72)

model (75-99)

name (30-32)

name (178-180)

name (183-184)

get_pods (560-586)

get_pod_logs (588-611)

get_processes (547-553)

image (36-41)

image (44-49)

disable_grove (209-214)

set_image (224-230)

set_model (216-222)

set_logging (248-270)

tests/fault_tolerance/deploy/conftest.py (2)

namespace (30-31)

image (25-26)

tests/utils/managed_deployment.py (2)

benchmarks/profiler/utils/config.py (1)

Service (57-61)

tests/fault_tolerance/deploy/client.py (1)

client (142-220)

🪛 Ruff (0.13.1)

tests/fault_tolerance/deploy/parse_results.py

241-241: Unused function argument: failure_type

(ARG001)

260-260: Loop control variable process not used within loop body

(B007)

261-261: Loop control variable replica not used within loop body

Rename unused replica to _replica

(B007)

tests/fault_tolerance/deploy/client.py

59-59: Standard pseudo-random generators are not suitable for cryptographic purposes

(S311)

67-67: Unused function argument: logger

(ARG001)

218-218: Do not catch blind exception: Exception

(BLE001)

219-219: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

tests/fault_tolerance/deploy/test_deployment.py

65-65: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)

103-103: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)

128-128: Unused noqa directive (unused: F811)

Remove unused noqa directive

(RUF100)

tests/utils/managed_deployment.py

344-344: Do not catch blind exception: Exception

(BLE001)

359-359: Do not perform function call logging.getLogger in dataclass defaults

(RUF009)

373-373: Do not catch blind exception: Exception

(BLE001)

398-398: Create your own exception

(TRY002)

398-398: Avoid specifying long messages outside the exception class

(TRY003)

492-492: Do not catch blind exception: Exception

(BLE001)

497-497: Avoid specifying long messages outside the exception class

(TRY003)

595-595: Do not catch blind exception: Exception

(BLE001)

596-596: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

600-600: Do not catch blind exception: Exception

(BLE001)

601-601: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

608-608: Do not catch blind exception: Exception

(BLE001)

617-617: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

647-647: Do not catch blind exception: Exception

(BLE001)

648-648: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

678-678: Possible binding to all interfaces

(S104)

🪛 markdownlint-cli2 (0.18.1)

tests/fault_tolerance/deploy/README.md

41-41: Multiple spaces after hash on atx style heading

(MD019, no-multiple-space-atx)

131-131: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

170-170: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

237-237: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

295-295: Multiple headings with the same content

(MD024, no-duplicate-heading)

296-296: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

316-316: Multiple headings with the same content

(MD024, no-duplicate-heading)

344-344: Hard tabs
Column: 4

(MD010, no-hard-tabs)

367-367: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

414-414: Hard tabs
Column: 1

(MD010, no-hard-tabs)

420-420: Hard tabs
Column: 1

(MD010, no-hard-tabs)

428-428: Hard tabs
Column: 1

(MD010, no-hard-tabs)

432-432: Hard tabs
Column: 1

(MD010, no-hard-tabs)

446-446: Hard tabs
Column: 1

(MD010, no-hard-tabs)

454-454: Multiple headings with the same content

(MD024, no-duplicate-heading)

456-456: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

483-483: Multiple headings with the same content

(MD024, no-duplicate-heading)

504-504: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

510-510: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

521-521: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)

GitHub Check: Build and Test - sglang
GitHub Check: Build and Test - trtllm
GitHub Check: Build and Test - vllm
GitHub Check: Build and Test - dynamo

🔇 Additional comments (2)

tests/fault_tolerance/deploy/__init__.py (1)

1-2: LGTM on package init and licensing

Header looks correct and consistent.

tests/fault_tolerance/deploy/scenarios.py (1)

147-153: Scenario filtering logic looks correct

Skipping prefill failures for non-disaggregated deployments is sensible.

tests/fault_tolerance/deploy/client.py

tests/fault_tolerance/deploy/test_deployment.py

tests/utils/managed_deployment.py

…r various code paths Signed-off-by: [email protected] <[email protected]>

rmccorm4

Approving as a formality. This was originally Neelay's PR which Tzu-Ling tested and approved, but Tzu-Ling needed to take over this PR while @nnshah1 is unable to access git/devbox. @indrajit96 was able to reproduce running the tests (manually) on both minikube and Azure as well.

Signed-off-by: nnshah1 <[email protected]> Signed-off-by: [email protected] <[email protected]> Co-authored-by: nnshah1 <[email protected]> Signed-off-by: Jason Zhou <[email protected]>

Signed-off-by: nnshah1 <[email protected]> Signed-off-by: [email protected] <[email protected]> Co-authored-by: nnshah1 <[email protected]> Signed-off-by: Kyle H <[email protected]>

nnshah1 and others added 4 commits September 23, 2025 21:20

updated

7e0f279

Signed-off-by: nnshah1 <[email protected]>

local updates

97a8a28

updated to parse json

fd7a25f

Signed-off-by: nnshah1 <[email protected]>

Address comments and fix mypy errors.

2959170

Signed-off-by: [email protected] <[email protected]>

tzulingk requested review from a team as code owners September 24, 2025 04:22

pull-request-size bot added the size/XXL label Sep 24, 2025

github-actions bot added the test label Sep 24, 2025

tzulingk requested review from indrajit96, ishandhanani and rmccorm4 September 24, 2025 04:23

coderabbitai bot reviewed Sep 24, 2025

View reviewed changes

tzulingk requested review from dillon-cullinan and nv-kmcgill53 September 24, 2025 17:06

tzulingk enabled auto-merge (squash) September 24, 2025 17:50

deepcopy(payload) to avoid modifying global payload and add guards fo…

7705e9d

…r various code paths Signed-off-by: [email protected] <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB September 24, 2025 18:03 Inactive

copy-pr-bot bot temporarily deployed to GITLAB September 24, 2025 18:08 Inactive

rmccorm4 approved these changes Sep 24, 2025

View reviewed changes

tzulingk merged commit 52c7536 into main Sep 24, 2025
14 of 15 checks passed

tzulingk deleted the merge_pr2942 branch September 24, 2025 18:37

rmccorm4 mentioned this pull request Sep 24, 2025

test: fault injection tests for k8s #2942

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test: fault injection tests for k8s #3194

test: fault injection tests for k8s #3194

Uh oh!

tzulingk commented Sep 24, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Sep 24, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rmccorm4 left a comment •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

test: fault injection tests for k8s #3194

test: fault injection tests for k8s #3194

Uh oh!

Conversation

tzulingk commented Sep 24, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

rmccorm4 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tzulingk commented Sep 24, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 24, 2025 •

edited

Loading

rmccorm4 left a comment •

edited

Loading