-
Notifications
You must be signed in to change notification settings - Fork 688
test: fault injection tests for k8s #3194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: nnshah1 <[email protected]>
Signed-off-by: nnshah1 <[email protected]>
Signed-off-by: [email protected] <[email protected]>
WalkthroughAdds a new fault-tolerance test suite for Dynamo deployments: documentation, pytest config, deployment orchestration utilities, scenario catalog, client load generator with retries and port-forwarding, failure injection, end-to-end test runner, and result parsing/reporting CLI. Introduces Kubernetes-managed deployment lifecycle, log/metrics collection, and grouped summary tables. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Tester
participant Pytest as Pytest Runner
participant MD as ManagedDeployment
participant Clients as Client Procs
participant K8s as Kubernetes
participant Parser as Results Parser
Tester->>Pytest: run test_fault_scenario(params)
Pytest->>MD: __aenter__ (create CR, wait ready)
MD->>K8s: create DynamographDeployment CR
K8s-->>MD: pods/services ready
Pytest->>Clients: spawn N processes (client(...))
Note over Clients: Each selects ready pod, port-forwards if needed,<br/>sends HTTP requests with retries and logs JSONL
Pytest->>K8s: inject failures (pod delete / process kill) at T
Clients-->>Pytest: write per-request logs
Pytest->>MD: collect pod logs/metrics
Pytest->>MD: __aexit__ (delete CR, cleanup)
Pytest->>Parser: parse_results (logs_dir, SLA)
Parser-->>Pytest: grouped tables (startup, recovery, metrics)
sequenceDiagram
autonumber
participant Client as Client
participant MD as ManagedDeployment
participant Pod as Target Pod
participant PF as Port-forward
participant Svc as Model Service
loop For each request
Client->>MD: get ready pods
alt no local PF for selected pod
Client->>Pod: establish port-forward
Pod-->>Client: PF established (local port)
end
Client->>Svc: POST /v1/chat/completions via PF
alt success
Svc-->>Client: 200 + payload
else transient error
Client->>Svc: retry (<= max_retries, with delay)
Svc-->>Client: response or final error
end
Client-->>Client: log time, status, elapsed
Client-->>Client: sleep to enforce max_request_rate
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~70 minutes Possibly related PRs
Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/fault_tolerance/deploy/test_deployment.py (1)
159-173: Await the async failure injectorAfter making _inject_failures async, await it here.
with _clients( logger, scenario.load.clients, request, scenario.deployment, namespace, model, scenario.load.requests_per_client, scenario.load.input_token_length, scenario.load.output_token_length, scenario.load.max_retries, scenario.load.max_request_rate, ): - _inject_failures(scenario.failures, logger, deployment) + await _inject_failures(scenario.failures, logger, deployment)tests/fault_tolerance/deploy/parse_results.py (1)
453-454: CLI bug: args.sla passed as log_paths positionalmain(logs_dir, tablefmt, log_paths=None, sla=None) is called with sla as the third positional arg. Pass by keyword.
- main(args.log_dir, tablefmt, args.sla) + main(args.log_dir, tablefmt, sla=args.sla)
🧹 Nitpick comments (19)
tests/utils/managed_deployment.py (5)
724-727: Avoid bareexcept, log the exceptionCatching all exceptions silently can hide root causes. Catch
Exceptionand log traceback before cleanup.Apply this diff:
- except: - await self._cleanup() - raise + except Exception: + self._logger.exception("Error during deployment setup; cleaning up") + await self._cleanup() + raise
734-769: Don’t block the event loop withtime.sleepin async contextUse
await asyncio.sleep(...)insideasync with. Also prefer logger over print for consistency.Apply this diff:
- print(deployment_spec._deployment_spec) + logging.getLogger(__name__).info("Spec: %s", deployment_spec._deployment_spec) @@ - print(f"Logging config: {deployment_spec.get_logging_config()}") + logging.getLogger(__name__).info("Logging config: %s", deployment_spec.get_logging_config()) @@ - async with ManagedDeployment( + async with ManagedDeployment( namespace="test", log_dir=".", deployment_spec=deployment_spec ): - time.sleep(60) + await asyncio.sleep(60)
255-271: Harden set_logging against missing spec and use exception loggingGuard against missing
specand prefer structured logging.Apply this diff:
def set_logging(self, enable_jsonl: bool = True, log_level: str = "debug"): @@ - spec = self._deployment_spec - if "envs" not in spec["spec"]: - spec["spec"]["envs"] = [] + spec = self._deployment_spec + spec.setdefault("spec", {}) + spec["spec"].setdefault("envs", [])
381-399: Avoid generic Exception with long message; define a specific timeout errorRaise a specific exception type and carry structured context.
Apply this diff:
- raise Exception(f"Didn't Reach Expected Pod Count {label}=={expected}") + raise TimeoutError(f"Timed out waiting for pods with {label} ready={expected}")
10-12: Dataclass logger default and API client reuse
- Use
field(default_factory=...)for_loggerto avoid calling at import time.- Reuse the same ApiClient for AppsV1Api to be consistent.
Apply this diff:
-from dataclasses import dataclass +from dataclasses import dataclass, field @@ - _logger: logging.Logger = logging.getLogger() + _logger: logging.Logger = field(default_factory=lambda: logging.getLogger(__name__)) @@ - self._apps_v1 = client.AppsV1Api() + self._apps_v1 = client.AppsV1Api(k8s_client)Also applies to: 359-360, 376-380
tests/fault_tolerance/deploy/conftest.py (1)
19-22: Add help text for CLI optionsSmall UX improvement for discoverability.
Apply this diff:
def pytest_addoption(parser): - parser.addoption("--image", type=str, default=None) - parser.addoption("--namespace", type=str, default="fault-tolerance-test") + parser.addoption("--image", type=str, default=None, help="Container image to use for services") + parser.addoption("--namespace", type=str, default="fault-tolerance-test", help="Kubernetes namespace for tests")tests/fault_tolerance/deploy/README.md (4)
20-36: Fix typos/wording in docsSeveral typos reduce clarity (e.g., resilency→resiliency, environemnt→environment, unforseen→unforeseen, Wroer→Worker, Numoer→Number, Aftere→After, Redunancy→Redundancy, simmple→simple, disaaggregated→disaggregated, Over Provisoned→Over Provisioned, “an cluster”→“a cluster”, “in to”→“into”). Please correct across these sections.
Also applies to: 158-166, 191-201, 207-213, 324-333, 396-401, 515-528
131-131: Annotate fenced code blocks with languageAdd language specifiers to satisfy markdownlint and improve rendering.
Apply this diff (representative samples):
-``` +```text @@ -```bash +```bash @@ -``` +```bash @@ -``` +```bash @@ -``` +```bashAlso applies to: 170-170, 237-237, 296-296, 367-367, 504-504, 510-510, 521-521 --- `295-296`: **Avoid duplicate headings at same level** Rename repeated “Results”/“Summary” headings to be unique (e.g., “Results (agg-tp-1-dp-2)”). Also applies to: 316-316, 454-456, 483-483 --- `344-344`: **Replace hard tabs with spaces** Tabs in markdown break alignment in some renderers; replace with spaces. Also applies to: 414-414, 420-420, 428-428, 432-432, 446-446 </blockquote></details> <details> <summary>tests/fault_tolerance/deploy/test_deployment.py (2)</summary><blockquote> `59-63`: **Client processes may hang indefinitely; add join timeouts and termination** Guard joins to avoid deadlocks; terminate stragglers and surface failures via exit codes. ```diff for proc in procs: logger.debug(f"{proc} waiting for join") - proc.join() + proc.join(timeout=60) + if proc.is_alive(): + logger.warning(f"{proc} did not exit in time; terminating") + proc.terminate() + proc.join(timeout=10) + if proc.exitcode not in (0, None): + logger.error(f"{proc} exited with code {proc.exitcode}") logger.debug(f"{proc} joined")
65-65: Remove unused noqa: F811 suppressionsRuff flags these as unused.
-def _inject_failures(failures, logger, deployment: ManagedDeployment): # noqa: F811 +def _inject_failures(failures, logger, deployment: ManagedDeployment): @@ -@pytest.fixture(autouse=True) -def results_table(request, scenario): # noqa: F811 +@pytest.fixture(autouse=True) +def results_table(request, scenario): @@ -async def test_fault_scenario( - scenario, # noqa: F811 +async def test_fault_scenario( + scenario,Also applies to: 103-103, 128-128
tests/fault_tolerance/deploy/scenarios.py (3)
53-60: Hardcoded absolute paths; parameterize for portabilityUsing /workspace/... will break outside the container. Support an environment variable or repo-relative paths.
+import os +from pathlib import Path @@ -deployment_specs = { - "agg-tp-1-dp-1": ( - DeploymentSpec("/workspace/components/backends/vllm/deploy/agg.yaml") - ), - "disagg-tp-1-dp-1": ( - DeploymentSpec("/workspace/components/backends/vllm/deploy/disagg.yaml") - ), -} +DEPLOY_ROOT = Path(os.environ.get("DYNAMO_DEPLOY_ROOT", "/workspace/components/backends/vllm/deploy")) +deployment_specs = { + "agg-tp-1-dp-1": DeploymentSpec(str(DEPLOY_ROOT / "agg.yaml")), + "disagg-tp-1-dp-1": DeploymentSpec(str(DEPLOY_ROOT / "disagg.yaml")), +}
62-77: Minor duplication in TP-2 setupsYou can reduce repetition by looping over sizes and services; not blocking.
94-108: Typos in comments (“Replicats”, “scenaro”, “pervious”)Fix spelling to keep docs clean.
-# Derivative Specs With Incremented Replicats +# Derivative specs with incremented replicas @@ -# Each failure scenaro contains a list of failure injections -# Each failure injection has a time in seconds after the pervious injection and +# Each failure scenario contains a list of failure injections +# Each failure injection has a time in seconds after the previous injection andtests/fault_tolerance/deploy/parse_results.py (4)
241-270: Unused parameter ‘failure_type’; clean up signature and callEither use it or drop it. Simplest: remove from calculate_recovery_time and its call.
-def calculate_recovery_time(test_dir, failure_type, fault_time): +def calculate_recovery_time(test_dir, fault_time): @@ - recovery_time = calculate_recovery_time(test_dir, failure_type, fault_time) + recovery_time = calculate_recovery_time(test_dir, fault_time)Also applies to: 299-299
254-266: Minor: clarify loop variable names to appease lintersRename variables for readability and to avoid B007 warnings.
- for process in processes: - starts = parse_process_log(os.path.join(test_dir, process), process) + for proc_name in processes: + starts = parse_process_log(os.path.join(test_dir, proc_name), proc_name) @@ - for process, replicas in process_start.items(): - for replica, container_starts in replicas.items(): + for proc_name, replicas in process_start.items(): + for _replica, container_starts in replicas.items():
27-54: parse_test_log tokenization is brittle; consider robust timestamp parsingSplitting by space and taking index 1 assumes a specific log format. If LOG_FORMAT changes, this breaks. Consider regex to extract ISO timestamps within brackets.
56-105: parse_client_logs: guard JSON decode and partial writesA single corrupt line will raise and drop the whole client file. Wrap json.loads in try/except and continue; also coerce fields safely. Optional but improves resilience.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
tests/fault_tolerance/deploy/README.md(1 hunks)tests/fault_tolerance/deploy/__init__.py(1 hunks)tests/fault_tolerance/deploy/client.py(1 hunks)tests/fault_tolerance/deploy/conftest.py(1 hunks)tests/fault_tolerance/deploy/parse_results.py(1 hunks)tests/fault_tolerance/deploy/scenarios.py(1 hunks)tests/fault_tolerance/deploy/test_deployment.py(1 hunks)tests/utils/managed_deployment.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
tests/fault_tolerance/deploy/conftest.py (1)
tests/utils/managed_deployment.py (4)
image(36-41)image(44-49)namespace(201-203)namespace(206-207)
tests/fault_tolerance/deploy/parse_results.py (2)
tests/utils/managed_deployment.py (1)
main(733-764)tests/utils/managed_process.py (1)
log_path(98-100)
tests/fault_tolerance/deploy/scenarios.py (1)
tests/utils/managed_deployment.py (8)
DeploymentSpec(166-311)replicas(53-54)replicas(57-58)model(61-72)model(75-99)set_tensor_parallel(232-246)tensor_parallel_size(118-130)tensor_parallel_size(133-163)
tests/fault_tolerance/deploy/client.py (1)
tests/utils/managed_deployment.py (10)
ManagedDeployment(350-730)namespace(201-203)namespace(206-207)get_pods(560-586)port(187-189)name(30-32)name(178-180)name(183-184)port_forward(673-703)endpoint(197-198)
tests/fault_tolerance/deploy/test_deployment.py (4)
tests/fault_tolerance/deploy/client.py (1)
client(142-220)tests/fault_tolerance/deploy/parse_results.py (1)
main(320-435)tests/utils/managed_deployment.py (18)
main(733-764)ManagedDeployment(350-730)namespace(201-203)namespace(206-207)model(61-72)model(75-99)name(30-32)name(178-180)name(183-184)get_pods(560-586)get_pod_logs(588-611)get_processes(547-553)image(36-41)image(44-49)disable_grove(209-214)set_image(224-230)set_model(216-222)set_logging(248-270)tests/fault_tolerance/deploy/conftest.py (2)
namespace(30-31)image(25-26)
tests/utils/managed_deployment.py (2)
benchmarks/profiler/utils/config.py (1)
Service(57-61)tests/fault_tolerance/deploy/client.py (1)
client(142-220)
🪛 Ruff (0.13.1)
tests/fault_tolerance/deploy/parse_results.py
241-241: Unused function argument: failure_type
(ARG001)
260-260: Loop control variable process not used within loop body
(B007)
261-261: Loop control variable replica not used within loop body
Rename unused replica to _replica
(B007)
tests/fault_tolerance/deploy/client.py
59-59: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
67-67: Unused function argument: logger
(ARG001)
218-218: Do not catch blind exception: Exception
(BLE001)
219-219: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
tests/fault_tolerance/deploy/test_deployment.py
65-65: Unused noqa directive (unused: F811)
Remove unused noqa directive
(RUF100)
103-103: Unused noqa directive (unused: F811)
Remove unused noqa directive
(RUF100)
128-128: Unused noqa directive (unused: F811)
Remove unused noqa directive
(RUF100)
tests/utils/managed_deployment.py
344-344: Do not catch blind exception: Exception
(BLE001)
359-359: Do not perform function call logging.getLogger in dataclass defaults
(RUF009)
373-373: Do not catch blind exception: Exception
(BLE001)
398-398: Create your own exception
(TRY002)
398-398: Avoid specifying long messages outside the exception class
(TRY003)
492-492: Do not catch blind exception: Exception
(BLE001)
497-497: Avoid specifying long messages outside the exception class
(TRY003)
595-595: Do not catch blind exception: Exception
(BLE001)
596-596: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
600-600: Do not catch blind exception: Exception
(BLE001)
601-601: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
608-608: Do not catch blind exception: Exception
(BLE001)
617-617: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
647-647: Do not catch blind exception: Exception
(BLE001)
648-648: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
678-678: Possible binding to all interfaces
(S104)
🪛 markdownlint-cli2 (0.18.1)
tests/fault_tolerance/deploy/README.md
41-41: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
131-131: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
170-170: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
237-237: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
295-295: Multiple headings with the same content
(MD024, no-duplicate-heading)
296-296: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
316-316: Multiple headings with the same content
(MD024, no-duplicate-heading)
344-344: Hard tabs
Column: 4
(MD010, no-hard-tabs)
367-367: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
414-414: Hard tabs
Column: 1
(MD010, no-hard-tabs)
420-420: Hard tabs
Column: 1
(MD010, no-hard-tabs)
428-428: Hard tabs
Column: 1
(MD010, no-hard-tabs)
432-432: Hard tabs
Column: 1
(MD010, no-hard-tabs)
446-446: Hard tabs
Column: 1
(MD010, no-hard-tabs)
454-454: Multiple headings with the same content
(MD024, no-duplicate-heading)
456-456: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
483-483: Multiple headings with the same content
(MD024, no-duplicate-heading)
504-504: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
510-510: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
521-521: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - sglang
- GitHub Check: Build and Test - trtllm
- GitHub Check: Build and Test - vllm
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
tests/fault_tolerance/deploy/__init__.py (1)
1-2: LGTM on package init and licensingHeader looks correct and consistent.
tests/fault_tolerance/deploy/scenarios.py (1)
147-153: Scenario filtering logic looks correctSkipping prefill failures for non-disaggregated deployments is sensible.
…r various code paths Signed-off-by: [email protected] <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving as a formality. This was originally Neelay's PR which Tzu-Ling tested and approved, but Tzu-Ling needed to take over this PR while @nnshah1 is unable to access git/devbox. @indrajit96 was able to reproduce running the tests (manually) on both minikube and Azure as well.
Signed-off-by: nnshah1 <[email protected]> Signed-off-by: [email protected] <[email protected]> Co-authored-by: nnshah1 <[email protected]> Signed-off-by: Jason Zhou <[email protected]>
Signed-off-by: nnshah1 <[email protected]> Signed-off-by: [email protected] <[email protected]> Co-authored-by: nnshah1 <[email protected]> Signed-off-by: Jason Zhou <[email protected]>
Signed-off-by: nnshah1 <[email protected]> Signed-off-by: [email protected] <[email protected]> Co-authored-by: nnshah1 <[email protected]> Signed-off-by: Kyle H <[email protected]>
Overview:
K8s based fault injection tests for recovery and resilience.
(cherry pick PR2942, help nnshah1 to merge it and fix mypy errors)
Details:
Adds in test infrastructure for deploying dynamo on K8s and injecting faults (pod delete, process kill).
Where should the reviewer start?
managed_deployment.py
test_deployment.py
readme.md
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
DIS-276
Summary by CodeRabbit
Documentation
Tests