test: fault injection tests for k8s #2942
Conversation
Signed-off-by: nnshah1 <[email protected]>
**Walkthrough**

Adds a Kubernetes-based fault-tolerance testing harness under tests/, including a managed deployment utility, pytest fixtures, scenario definitions, a multiprocessing client with port-forwarding, failure injection logic, a result parsing/reporting CLI, and documentation. Also adds SPDX headers to `__init__.py`.

**Changes**
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
autonumber
participant PyTest as PyTest (test_deployment)
participant MD as ManagedDeployment
participant K8s as Kubernetes API
participant Pods as Frontend/Workers Pods
participant Clients as Client Processes
participant Parser as parse_results CLI
PyTest->>MD: init & __aenter__()
MD->>K8s: create Custom Resource
MD-->>PyTest: wait for ready (status)
PyTest->>Clients: spawn N clients (multiprocessing)
loop For each request
Clients->>MD: port-forward to a ready Frontend pod
Clients->>Pods: POST /v1/chat/completions
Pods-->>Clients: response (timing, status)
Clients-->>Clients: throttle to max_request_rate
end
par Fault injection
PyTest->>K8s: delete pod(s) OR
PyTest->>Pods: kill process (signal)
and Logging
MD->>K8s: fetch pod logs/metrics
MD-->>PyTest: store logs in log_dir
end
PyTest->>MD: __aexit__() (cleanup)
PyTest->>Parser: parse logs -> summary tables
Parser-->>PyTest: results (latency, success/failure, recovery)
```
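For orientation, a minimal self-contained sketch of the fan-out step (3) in the diagram above; the worker body, client count, and request count are placeholders, not the PR's actual client.

```python
import multiprocessing

def client(client_id: int, num_requests: int) -> None:
    # Placeholder for the real client: port-forward to a ready Frontend pod,
    # POST /v1/chat/completions, throttle to max_request_rate, log results.
    for _ in range(num_requests):
        pass

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=client, args=(i, 10)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # teardown waits for every client to finish
```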
```mermaid
sequenceDiagram
autonumber
participant Client as Client worker
participant PF as Port-forward tunnel
participant FE as Frontend Pod
rect rgb(240,248,255)
note right of Client: Per-request with retries
Client->>PF: ensure tunnel to FE:port
Client->>FE: POST completion
alt success
FE-->>Client: 200 + JSON
else transient error
FE-->>Client: error / timeout
loop up to max_retries
Client->>FE: retry after delay
FE-->>Client: response
end
end
end
```
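A self-contained sketch of the retry loop this diagram describes, assuming a plain `requests` client; the endpoint, retry count, and delay are illustrative values, not the PR's.

```python
import time
import requests

def post_with_retries(url, payload, max_retries=3, delay=1.0, timeout=30):
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            response = requests.post(url, json=payload, timeout=timeout)
            if response.status_code == 200:
                return response.json()               # success branch of the "alt"
            last_error = f"HTTP {response.status_code}"  # transient HTTP error
        except requests.RequestException as exc:
            last_error = exc                         # connection error or timeout
        if attempt < max_retries:
            time.sleep(delay)                        # wait before the next retry
    raise RuntimeError(f"request failed after {max_retries} retries: {last_error}")
```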
**Estimated code review effort**

🎯 4 (Complex) | ⏱️ ~60 minutes

**Possibly related PRs**
**Pre-merge checks**

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Signed-off-by: nnshah1 <[email protected]>
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/fault_tolerance/deploy/parse_results.py (1)
**438-454: CLI bug: wrong argument order when calling `main()` (TypeError at runtime).**

`main(args.log_dir, tablefmt, args.sla)` passes a float as `log_paths`, then iterates it. Must pass `sla` by name and fix the default for `log_paths`. Apply this patch (also fixes the mutable default arg for `log_paths`):

```diff
-def main(logs_dir, tablefmt, log_paths=[], sla=None):
+def main(logs_dir, tablefmt, sla=None, log_paths=None):
@@
     if log_paths:
         for log_path in log_paths:
             result = process_test_directory(log_path, sla)
             if result:
                 results.append(result)
@@
 if __name__ == "__main__":
@@
-    main(args.log_dir, tablefmt, args.sla)
+    main(args.log_dir, tablefmt, sla=args.sla)
```
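For illustration, a toy reproduction of the failure mode (standalone, not project code): the float intended for `sla` lands in `log_paths`, and iterating it raises at runtime.

```python
def main(logs_dir, tablefmt, log_paths=[], sla=None):
    for log_path in log_paths:   # iterates whatever arrived third
        print(log_path)

main("logs", "github", 0.95)     # 0.95 was meant to be sla
# TypeError: 'float' object is not iterable
```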
🧹 Nitpick comments (22)
tests/fault_tolerance/deploy/README.md (6)
20-35: Fix typos and clarity in intro/architecture.
- resilency → resiliency
- unforseen → unforeseen
- environemnt → environment
- “Test pass / failure … It only indicates is the test was executed…” → grammar pass
Apply this minimal patch:
```diff
-As a large scale distributed inference serving framework in addition
-to providing high throughput and low latency, Dynamo needs to
-provide fault detection, resilency, and quick recovery in the face of
-unforseen failures. In order to test Dynamo we are developing a test
+As a large-scale distributed inference serving framework, in addition
+to providing high throughput and low latency, Dynamo needs to
+provide fault detection, resiliency, and quick recovery in the face of
+unforeseen failures. To evaluate this, we are developing a test
 suite to inject and measure the impact of different types of failure
 conditions.
@@
-configurations that launch typical dynamo deployments in a Kubernetes
-environemnt and then inject failures by terminating processes or
+configurations that launch typical Dynamo deployments in a Kubernetes
+environment and then inject failures by terminating processes or
 pods. To test the recovery time and impact of failures a set number
 of clients are launched in parallel. Each client sends a set number
 of synchronous requests. Log files are stored for each pod as well as
 for each client and inspected using a post-processing script.
@@
-> Test pass / failure is not an indication of SLA for recovery or resilience
-> It only indicates is the test was executed and data was collected
+> Test pass/fail is not an indication of an SLA for recovery or resilience.
+> It only indicates the test executed and data was collected.
```
**41-44: Remove extra space after heading marker.**

“### Test Sequence Diagram” has two spaces after the hashes, tripping MD019.

```diff
-###  Test Sequence Diagram
+### Test Sequence Diagram
```
121-126: Add fenced code languages and fix minor typos in results section.
- Add languages to code fences (bash, text) to satisfy MD040.
- Correct “Succes/nBefore”, “Numoer”, “Aftere”, “VllmDecodeWroer”.
````diff
-```
+```bash
 pytest tests/fault_tolerance/deploy/test_deployment.py -s -v --namespace ${NAMESPACE}
@@
-```
+```text
 test_fault_scenario[agg-tp-1-dp-1-none]
 .
 ├── client_0.log.txt
@@
-| **Succes/nBefore** | Numoer of client requests that succeeded before fault injection |
+| **Success Before** | Number of client requests that succeeded before fault injection |
-| **Failed/nBefore** | Number of client requests that failed or were invalid before fault injection |
+| **Failed Before** | Number of client requests that failed or were invalid before fault injection |
-| **Success/nAftere** | Number of client requests that succeeded after fault injection |
+| **Success After** | Number of client requests that succeeded after fault injection |
@@
-| **{Service}/*.log | Container log for pod at end of test (Frontend, VllmDecodeWroer, etc.) |
+| **{Service}/*.log** | Container log for pod at end of test (Frontend, VllmDecodeWorker, etc.) |
-| **{Service}/*.previous.log** | Previous container log for pod in case of crash / exit. (Frontend, VllmDecodeWroer, etc.). Empty if N/A. |
+| **{Service}/*.previous.log** | Previous container log for pod in case of crash/exit (Frontend, VllmDecodeWorker, etc.). Empty if N/A. |
````

Also applies to: 131-156, 171-188
235-256: Deduplicate and adjust headings; add code fence languages.
- Duplicate “Results:”/“Summary:” headings (MD024) and trailing colon (MD026).
- Add language to code fences (MD040).
````diff
-#### Results:
+#### Results
@@
-```
+```text
 ...
@@
-#### Summary:
+#### Summary
````

Also applies to: 295-316, 454-482

---

`338-346`: **Replace hard tabs with spaces in Mermaid blocks.**

Tabs trigger MD010; convert to spaces for consistent rendering.

```diff
-	end
+    end
@@
-	Frontend_2["Frontend 2"]
+    Frontend_2["Frontend 2"]
@@
-	Frontend_2 <--> DecodePool
+    Frontend_2 <--> DecodePool
@@
-	end
+    end
@@
-	DecodePool --> PrefillPool
+    DecodePool --> PrefillPool
```

Also applies to: 410-452
**490-512: Add languages to Quick Start code fences.**

Specify bash to satisfy MD040.

````diff
-```
+```bash
 pytest tests/fault_tolerance/deploy/test_deployment.py -s -v --namespace ${NAMESPACE} --image ${IMAGE}
````

In case you have multiple configs:

````diff
-```
+```bash
 export KUBECONFIG=~/.kube/dynamo-kubeconfig
````

Also applies to: 521-529

tests/fault_tolerance/deploy/parse_results.py (3)

**209-221: Make non‑JSON log parsing robust to “[TEST] …” prefixes.**

Fallback assumes the timestamp is the first token; with `[TEST]` prefix that fails. Extract the ISO timestamp via regex.

```diff
-                    parts = clean_line.split()
-                    if len(parts) < 2:
-                        continue
-
-                    try:
-                        # Parse timestamp (remove 'Z' for naive datetime)
-                        timestamp = datetime.fromisoformat(
-                            parts[0].replace("Z", "")
-                        )
-                    except ValueError:
-                        continue
-
-                    log_message = " ".join(parts[1:])
+                    # Extract ISO 8601 timestamp anywhere in the line
+                    m = re.search(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?", clean_line)
+                    if not m:
+                        continue
+                    ts = m.group(0).replace("Z", "")
+                    try:
+                        timestamp = datetime.fromisoformat(ts.replace("T", " "))
+                    except ValueError:
+                        continue
+                    log_message = clean_line[m.end():].strip()
```
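A quick standalone check of the suggested regex against a prefixed line (the sample log line is made up):

```python
import re
from datetime import datetime

line = "[TEST] 2024-05-01T12:34:56.789Z worker restarted"
m = re.search(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?", line)
assert m is not None  # first-token parsing would have choked on "[TEST]"
ts = datetime.fromisoformat(m.group(0).replace("Z", "").replace("T", " "))
print(ts, "|", line[m.end():].strip())
# 2024-05-01 12:34:56.789000 | worker restarted
```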
**267-269: Avoid treating zero‑second recovery as “no recovery.”**

If recovery aligns to the same second as the fault, `0` returns `None`. Use `<= 0`.

```diff
-        if last_recovery_time == 0:
+        if last_recovery_time <= 0:
             return None
```
**56-75: Harden client log parsing against corrupt lines.**

A single bad JSON line will currently crash parsing. Wrap `json.loads` and skip on failure.

```diff
         for line in f:
             request_number += 1
-            data = json.loads(line.strip())
+            try:
+                data = json.loads(line.strip())
+            except json.JSONDecodeError:
+                continue
```

tests/utils/managed_deployment.py (3)
**398-404: Typo in method name and callers: `_scale_statfulset` → `_scale_statefulset`.**

Minor correctness/readability; fix the name and its references.

```diff
-    async def _scale_statfulset(self, name, label, replicas):
+    async def _scale_statefulset(self, name, label, replicas):
@@
         await self._apps_v1.patch_namespaced_stateful_set_scale(
             name, self.namespace, body
         )
         await self._wait_for_pods(label, replicas)
@@
-        await self._scale_statfulset(name, label, 0)
+        await self._scale_statefulset(name, label, 0)
@@
-        await self._scale_statfulset(name, label, 1)
+        await self._scale_statefulset(name, label, 1)
```

Also applies to: 406-420, 500-505
**582-604: Use `logging.exception` and narrow excepts while collecting logs/metrics.**

Broad `except Exception` with `error()` hides stack traces; prefer `exception()` and narrow where practical.

```diff
-        except Exception as e:
-            self._logger.error(e)
+        except Exception:
+            self._logger.exception("Failed to write pod YAML")
@@
-        except Exception as e:
-            self._logger.error(e)
+        except Exception:
+            self._logger.exception("Failed to write current pod logs")
@@
-        except Exception as e:
-            self._logger.debug(e)
+        except Exception:
+            self._logger.debug("No previous logs available", exc_info=True)
@@
-        except Exception as e:
-            self._logger.error(str(e))
+        except requests.RequestException:
+            self._logger.exception("Error fetching /metrics")
```

Also applies to: 641-649
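For contrast, a standalone toy example of what the swap buys: `logger.exception` records the message plus the full traceback, while `logger.error(e)` records only `str(e)`. Not project code.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("demo")

try:
    1 / 0
except Exception:
    # Must be called from an except block; emits the message and the traceback.
    logger.exception("Failed to write pod YAML")
```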
**705-717: Consider gating disruptive infra restarts behind a flag.**

Restarting etcd and NATS on every test run is heavy and can cause flakiness. Add a `restart_infra: bool = False` flag and conditionally call `_restart_etcd()` / `_restart_nats()`.

Happy to propose a small diff adding the flag if you confirm this won’t be needed for every scenario.
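One possible shape for that gate, sketched standalone; the constructor signature is assumed, and `_restart_etcd` / `_restart_nats` are stubbed here only so the snippet runs.

```python
class ManagedDeployment:
    def __init__(self, *, restart_infra: bool = False):
        self.restart_infra = restart_infra  # disruptive restarts are opt-in

    async def _restart_etcd(self): ...
    async def _restart_nats(self): ...

    async def _start_infra(self):
        if self.restart_infra:
            await self._restart_etcd()
            await self._restart_nats()
```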
tests/fault_tolerance/deploy/conftest.py (1)

**19-31: LGTM; add help text for CLI options for discoverability.**

Minor ergonomics: provide `help=` for `--image` and `--namespace`.

```diff
-    parser.addoption("--image", type=str, default=None)
-    parser.addoption("--namespace", type=str, default="fault-tolerance-test")
+    parser.addoption("--image", type=str, default=None, help="Container image override for services")
+    parser.addoption("--namespace", type=str, default="fault-tolerance-test", help="Kubernetes namespace to use")
```

tests/fault_tolerance/deploy/test_deployment.py (3)
**129-175: Don’t block the event loop in an async test; move failure injection to a thread.**

`time.sleep` inside `async def` blocks the loop for long periods. Run the injector in a thread.

Apply this diff:

```diff
+import asyncio
@@
         with _clients(
@@
         ):
-            _inject_failures(scenario.failures, logger, deployment)
+            await asyncio.to_thread(_inject_failures, scenario.failures, logger, deployment)
```
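A standalone illustration of the pattern (toy functions, not the PR's): `asyncio.to_thread` runs a blocking callable in a worker thread so other tasks on the loop keep making progress.

```python
import asyncio
import time

def blocking_injector():
    time.sleep(1)  # stands in for sleep-paced fault injection
    return "injected"

async def main():
    heartbeat = asyncio.create_task(asyncio.sleep(0.1))  # loop stays responsive
    result = await asyncio.to_thread(blocking_injector)
    await heartbeat
    print(result)

asyncio.run(main())
```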
**61-65: Join with timeout and terminate on hang to avoid deadlocks.**

If a client process wedges, `join()` will block indefinitely during teardown.

Apply this diff:

```diff
     for proc in procs:
         logger.debug(f"{proc} waiting for join")
-        proc.join()
+        proc.join(timeout=60)
+        if proc.is_alive():
+            logger.warning("Terminating hung client process: %s", proc.pid)
+            proc.terminate()
+            proc.join(timeout=10)
         logger.debug(f"{proc} joined")
```
**67-67: Remove unused `noqa: F811` directives.**

These suppressions are unnecessary and flagged by Ruff.

Apply this diff:

```diff
-def _inject_failures(failures, logger, deployment: ManagedDeployment):  # noqa: F811
+def _inject_failures(failures, logger, deployment: ManagedDeployment):
@@
 @pytest.fixture(autouse=True)
-def results_table(request, scenario):  # noqa: F811
+def results_table(request, scenario):
@@
 async def test_fault_scenario(
-    scenario,  # noqa: F811
+    scenario,
```

Also applies to: 105-105, 130-130
tests/fault_tolerance/deploy/client.py (3)

**49-53: Avoid configuring global logging in an importable module.**

`basicConfig` at import time can interfere with pytest logging. Consider moving it under `if __name__ == "__main__":` or letting the test harness configure logging.
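The conventional alternative looks like this (toy module, not the PR's client): create the module logger at import time, configure handlers only in the entry point.

```python
import logging

logger = logging.getLogger(__name__)  # safe at import time, no global config

def do_work():
    logger.info("working")

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)  # configured only when run directly
    do_work()
```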
**61-72: Use the logger or drop the arg; tighten retry loop and timestamp.**

- `logger` is unused (Ruff ARG001). Either use it or remove it.
- Prefer a timezone-aware ISO timestamp for consistency.

Apply this diff:

```diff
 def _single_request(
     url,
     pod,
     payload,
     model,
     logger,
     retry_attempts=1,
@@
     start_time = time.time()
     results = []
@@
         end_time = time.time()
         content = None
@@
         results.append(
             {
                 "status": response.status_code,
                 "result": content,
                 "request_elapsed_time": end_time - start_request_time,
                 "url": url,
                 "pod": pod,
             }
         )
+        if logger:
+            logger.debug("POST %s -> %s in %.3fs", url, response.status_code, end_time - start_request_time)
@@
     return {
-        "time": datetime.now().strftime("%Y-%m-%dT%H:%M:%S"),
+        "time": datetime.now().astimezone().isoformat(timespec="seconds"),
         "results": results,
         "total_time": time.time() - start_time,
         "url": url,
         "pod": pod,
     }
```

Also applies to: 83-117, 118-131, 132-139
**216-218: Broadened exception handling; stop port-forwards and log stack traces.**

Use `logger.exception` and always stop port-forward tunnels.

Apply this diff:

```diff
-    except Exception as e:
-        logger.error(str(e))
-    logger.info("Exiting")
+    except Exception:
+        logger.exception("Client error")
+    finally:
+        for pf in pod_ports.values():
+            try:
+                pf.stop()
+            except Exception:
+                logger.debug("Port-forward stop failed", exc_info=True)
+        logger.info("Exiting")
```

tests/fault_tolerance/deploy/scenarios.py (3)
**16-19: Tighten types with `Optional`; fix minor doc typos below.**

Use `Optional` for nullable fields to improve editor/type-checker help.

Apply this diff:

```diff
 from dataclasses import dataclass
+from typing import Optional
@@
 class Load:
@@
-    sla: float = None
+    sla: Optional[float] = None
@@
 class Scenario:
     deployment: DeploymentSpec
     load: Load
     failures: list[Failure]
-    model: str = None
+    model: Optional[str] = None
```

Also applies to: 21-30, 41-47
**93-93: Fix typos in comments (readability).**

Replicats → Replicas; scenaro → scenario; pervious → previous.

Apply this diff:

```diff
-# Derivative Specs With Incremented Replicats
+# Derivative specs with incremented replicas
@@
-# Each failure scenaro contains a list of failure injections
-# Each failure injection has a time in seconds after the pervious injection and
+# Each failure scenario contains a list of failure injections
+# Each failure injection has a time in seconds after the previous injection and
```

Also applies to: 109-119
**62-76: Consistency: prefer `set_tensor_parallel` over direct property mutation.**

Use the helper to avoid diverging behaviors across specs. Example:

```python
deployment_specs["disagg-prefill-tp-2-decode-tp-2-dp-1"].set_tensor_parallel(
    2, ["VllmPrefillWorker", "VllmDecodeWorker"]
)
```

Also applies to: 83-92
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- `tests/fault_tolerance/deploy/README.md` (1 hunks)
- `tests/fault_tolerance/deploy/__init__.py` (1 hunks)
- `tests/fault_tolerance/deploy/client.py` (1 hunks)
- `tests/fault_tolerance/deploy/conftest.py` (1 hunks)
- `tests/fault_tolerance/deploy/parse_results.py` (1 hunks)
- `tests/fault_tolerance/deploy/scenarios.py` (1 hunks)
- `tests/fault_tolerance/deploy/test_deployment.py` (1 hunks)
- `tests/utils/managed_deployment.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
- `tests/fault_tolerance/deploy/parse_results.py` (1)
  - `tests/utils/managed_deployment.py` (1): `main` (727-758)
- `tests/fault_tolerance/deploy/conftest.py` (1)
  - `tests/utils/managed_deployment.py` (4): `image` (36-41, 44-49), `namespace` (201-203, 206-207)
- `tests/fault_tolerance/deploy/test_deployment.py` (4)
  - `tests/fault_tolerance/deploy/client.py` (1): `client` (141-218)
  - `tests/fault_tolerance/deploy/parse_results.py` (1): `main` (320-435)
  - `tests/utils/managed_deployment.py` (21): `main` (727-758), `ManagedDeployment` (350-724), `namespace` (201-203, 206-207), `model` (61-72, 75-99), `name` (30-32, 178-180, 197-198), `get_pods` (554-580), `replicas` (53-54, 57-58), `get_pod_logs` (582-605), `get_processes` (541-547), `kill` (322-331), `image` (36-41, 44-49), `disable_grove` (209-214), `set_image` (224-230), `set_model` (216-222), `set_logging` (248-270)
  - `tests/fault_tolerance/deploy/conftest.py` (2): `namespace` (30-31), `image` (25-26)
- `tests/fault_tolerance/deploy/client.py` (1)
  - `tests/utils/managed_deployment.py` (10): `ManagedDeployment` (350-724), `namespace` (201-203, 206-207), `get_pods` (554-580), `port` (183-185), `name` (30-32, 178-180, 197-198), `port_forward` (667-697), `endpoint` (193-194)
- `tests/utils/managed_deployment.py` (2)
  - `benchmarks/profiler/utils/config.py` (1): `Service` (54-58)
  - `tests/fault_tolerance/deploy/client.py` (1): `client` (141-218)
- `tests/fault_tolerance/deploy/scenarios.py` (1)
  - `tests/utils/managed_deployment.py` (8): `DeploymentSpec` (166-311), `replicas` (53-54, 57-58), `model` (61-72, 75-99), `set_tensor_parallel` (232-246), `tensor_parallel_size` (118-130, 133-163)
🪛 Ruff (0.12.2)
tests/fault_tolerance/deploy/parse_results.py
241-241: Unused function argument: failure_type
(ARG001)
260-260: Loop control variable process not used within loop body
(B007)
261-261: Loop control variable replica not used within loop body
Rename unused replica to _replica
(B007)
320-320: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
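For reference, a standalone reproduction of the B006 pitfall flagged above: the default list is created once at definition time and shared across calls.

```python
def collect(item, bucket=[]):   # one shared list for every call
    bucket.append(item)
    return bucket

print(collect(1))  # [1]
print(collect(2))  # [1, 2]  <- surprising carry-over from the first call

def collect_fixed(item, bucket=None):
    bucket = [] if bucket is None else bucket  # fresh list per call
    bucket.append(item)
    return bucket
```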
tests/fault_tolerance/deploy/test_deployment.py
67-67: Unused noqa directive (unused: F811)
Remove unused noqa directive
(RUF100)
105-105: Unused noqa directive (unused: F811)
Remove unused noqa directive
(RUF100)
130-130: Unused noqa directive (unused: F811)
Remove unused noqa directive
(RUF100)
tests/fault_tolerance/deploy/client.py
58-58: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
66-66: Unused function argument: logger
(ARG001)
216-216: Do not catch blind exception: Exception
(BLE001)
217-217: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
tests/utils/managed_deployment.py
344-344: Do not catch blind exception: Exception
(BLE001)
372-372: Do not catch blind exception: Exception
(BLE001)
396-396: Create your own exception
(TRY002)
396-396: Avoid specifying long messages outside the exception class
(TRY003)
487-487: Do not catch blind exception: Exception
(BLE001)
492-492: Avoid specifying long messages outside the exception class
(TRY003)
589-589: Do not catch blind exception: Exception
(BLE001)
590-590: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
594-594: Do not catch blind exception: Exception
(BLE001)
595-595: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
602-602: Do not catch blind exception: Exception
(BLE001)
611-611: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
641-641: Do not catch blind exception: Exception
(BLE001)
642-642: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
672-672: Possible binding to all interfaces
(S104)
🪛 markdownlint-cli2 (0.17.2)
tests/fault_tolerance/deploy/README.md
41-41: Multiple spaces after hash on atx style heading
(MD019, no-multiple-space-atx)
119-119: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
131-131: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
170-170: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
235-235: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
237-237: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
257-257: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
295-295: Multiple headings with the same content
(MD024, no-duplicate-heading)
295-295: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
296-296: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
316-316: Multiple headings with the same content
(MD024, no-duplicate-heading)
316-316: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
344-344: Hard tabs
Column: 4
(MD010, no-hard-tabs)
365-365: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
367-367: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
394-394: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
414-414: Hard tabs
Column: 1
(MD010, no-hard-tabs)
420-420: Hard tabs
Column: 1
(MD010, no-hard-tabs)
428-428: Hard tabs
Column: 1
(MD010, no-hard-tabs)
432-432: Hard tabs
Column: 1
(MD010, no-hard-tabs)
446-446: Hard tabs
Column: 1
(MD010, no-hard-tabs)
454-454: Multiple headings with the same content
(MD024, no-duplicate-heading)
454-454: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
456-456: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
483-483: Multiple headings with the same content
(MD024, no-duplicate-heading)
483-483: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
504-504: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
510-510: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
521-521: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (4)
tests/fault_tolerance/deploy/__init__.py (1)

**1-2: LGTM on SPDX headers.** Header is correct and non-invasive.

tests/fault_tolerance/deploy/test_deployment.py (1)

**156-160: Great: deployment context setup is clear and bounded.**

tests/fault_tolerance/deploy/client.py (1)

**140-156: LGTM: client scaffolding is clear and aligns with ManagedDeployment.**

tests/fault_tolerance/deploy/scenarios.py (1)

**146-152: Scenario generation filter is correct.** Skipping prefill failures for non-disagg deployments looks good.
```python
@pytest.mark.e2e
@pytest.mark.slow
@pytest.mark.filterwarnings("ignore::DeprecationWarning")
async def test_fault_scenario(
    scenario,  # noqa: F811
    request,
    image,
    namespace,
):
```
💡 Verification agent
🧩 Analysis chain
Verify pytest-asyncio is available to run async tests.
Without the plugin, async tests won’t be awaited.
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
python - <<'PY'
import importlib.util, sys
ok = importlib.util.find_spec("pytest_asyncio") is not None
print("pytest-asyncio:", "FOUND" if ok else "MISSING")
sys.exit(0 if ok else 1)
PY
```

Length of output: 131
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "python binary: $(command -v python || echo 'none')"
python -V || true
python - <<'PY'
import sys, pkgutil, subprocess
def detect():
    results = {}
    try:
        import pytest_asyncio
        results['import'] = 'FOUND'
    except Exception as e:
        results['import'] = f'MISSING ({e!r})'
    try:
        import importlib.util
        spec = importlib.util.find_spec("pytest_asyncio")
        results['importlib.util.find_spec'] = 'FOUND' if spec else 'MISSING'
    except Exception as e:
        results['importlib.util.find_spec'] = f'ERROR ({e!r})'
    try:
        found = any(m.name == "pytest_asyncio" for m in pkgutil.iter_modules())
        results['pkgutil.iter_modules'] = 'FOUND' if found else 'MISSING'
    except Exception as e:
        results['pkgutil.iter_modules'] = f'ERROR ({e!r})'
    try:
        r = subprocess.run([sys.executable, "-m", "pip", "show", "pytest-asyncio"], capture_output=True, text=True)
        results['pip_show_rc'] = r.returncode
        results['pip_show_out'] = (r.stdout.strip() or r.stderr.strip())[:1000]
    except Exception as e:
        results['pip_show'] = f'ERROR ({e!r})'
    try:
        r = subprocess.run([sys.executable, "-m", "pytest", "--version"], capture_output=True, text=True)
        results['pytest_version_rc'] = r.returncode
        results['pytest_version_out'] = (r.stdout.strip() or r.stderr.strip())[:1000]
    except Exception as e:
        results['pytest_version'] = f'ERROR ({e!r})'
    for k, v in results.items():
        print(f"{k}: {v}")
detect()
PY
```

Length of output: 223
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "python binary: $(command -v python || echo 'none')"
python -V || true
python - <<'PY'
import sys, pkgutil, subprocess
def sp(k, v):
    print(str(k) + ": " + str(v))
# import check
try:
    import pytest_asyncio
    sp("import", "FOUND")
except Exception as e:
    sp("import", "MISSING " + type(e).__name__ + " " + str(e))
# importlib.util.find_spec
try:
    import importlib.util as iu
    spec = iu.find_spec("pytest_asyncio")
    sp("importlib.util.find_spec", "FOUND" if spec else "MISSING")
except Exception as e:
    sp("importlib.util.find_spec", "ERROR " + type(e).__name__ + " " + str(e))
# pkgutil.iter_modules
try:
    found = any(m.name == "pytest_asyncio" for m in pkgutil.iter_modules())
    sp("pkgutil.iter_modules", "FOUND" if found else "MISSING")
except Exception as e:
    sp("pkgutil.iter_modules", "ERROR " + type(e).__name__ + " " + str(e))
# pip show
try:
    r = subprocess.run([sys.executable, "-m", "pip", "show", "pytest-asyncio"], capture_output=True, text=True)
    sp("pip_show_rc", r.returncode)
    out = (r.stdout.strip() or r.stderr.strip())
    sp("pip_show_out", out.splitlines()[0] if out else "<empty>")
except Exception as e:
    sp("pip_show", "ERROR " + type(e).__name__ + " " + str(e))
# pytest --version
try:
    r = subprocess.run([sys.executable, "-m", "pytest", "--version"], capture_output=True, text=True)
    sp("pytest_version_rc", r.returncode)
    out = (r.stdout.strip() or r.stderr.strip())
    sp("pytest_version_out", out.splitlines()[0] if out else "<empty>")
except Exception as e:
    sp("pytest_version", "ERROR " + type(e).__name__ + " " + str(e))
PY
```

Length of output: 306
Install pytest-asyncio (required for async tests)
Verification shows pytest-asyncio is MISSING while pytest 8.4.2 is installed; add pytest-asyncio to your test/dev dependencies and CI so async tests in tests/fault_tolerance/deploy/test_deployment.py (L126-L134) are actually awaited.
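Once the plugin is installed, a minimal async test looks like this (illustrative; the marker is pytest-asyncio's and may be implicit if `asyncio_mode = auto` is configured):

```python
import asyncio
import pytest

@pytest.mark.asyncio  # provided by the pytest-asyncio plugin
async def test_smoke():
    await asyncio.sleep(0)  # only awaited when the plugin drives the test
    assert True
```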
🧰 Tools
🪛 Ruff (0.12.2)
130-130: Unused noqa directive (unused: F811)
Remove unused noqa directive
(RUF100)
🤖 Prompt for AI Agents
In tests/fault_tolerance/deploy/test_deployment.py around lines 126-134, async
tests are used but pytest-asyncio is not installed; add pytest-asyncio to the
project's test/dev dependencies (e.g., pyproject.toml or requirements-dev.txt)
with a version compatible with pytest 8.4.x (for example
pytest-asyncio>=0.21.0), update CI pipeline to install those dev/test deps
before running pytest so async tests are awaited, and re-run CI to verify the
async tests execute correctly.
Closed in favor of #3194 |
Overview:
K8s based fault injection tests for recovery and resilience.
Details:
Adds in test infrastructure for deploying dynamo on K8s and injecting faults (pod delete, process kill).
Where should the reviewer start?
managed_deployment.py
test_deployment.py
readme.md
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
**Summary by CodeRabbit**

- New Features
- Documentation
- Tests