fix(ci): wait for code-server before UI selenium tests#2720
Conversation
Podman health only checks Selenium :4444; code-server on :8080 can still be starting, which led to TimeoutException and 30m job timeouts (e.g. run 24068021216). - Poll HTTP on code-server after container health (skip 502/503/504) - Bound podman health wait; remove accidental always-true condition - Set page/script load timeouts on WebDriver; extend ui step to 60m related: ansible#2580
📝 WalkthroughWalkthroughThe changes increase the UI test timeout from 30 to 60 minutes and enhance the test startup sequence with direct HTTP polling to verify code-server readiness before running Selenium tests. Container health checks are now deadline-bounded, and Selenium driver timeouts are explicitly configured. Changes
Sequence Diagram(s)sequenceDiagram
participant Test as Test Framework
participant Container as Container System
participant HTTP as code-server HTTP
participant Driver as Selenium Driver
Test->>Container: Check container health
loop Until healthy or timeout
Container-->>Test: Health status
end
Test->>HTTP: Poll HTTP endpoint
loop Until 200 or timeout
HTTP-->>Test: HTTP response (202/503/etc)
end
HTTP-->>Test: HTTP 200 OK
Test->>Driver: Set page load timeout (300s)
Test->>Driver: Set script timeout (300s)
Driver-->>Test: Timeouts configured
Test->>Test: Begin UI tests
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR hardens the UI Selenium CI setup by ensuring the code-server endpoint is actually reachable before running browser tests, reducing flakiness and long timeouts in the task ui (selenium) workflow step.
Changes:
- Add polling for
http://127.0.0.1:8080/after the container reports healthy to ensure code-server is ready. - Add an upper bound to the container health wait (configurable via
UI_CONTAINER_HEALTH_TIMEOUT) and remove unconditional container recreation. - Increase the GitHub Actions timeout for the UI Selenium step from 30 to 60 minutes and set WebDriver page/script timeouts.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
test/ui/fixtures/ui_fixtures.py |
Adds bounded health waits and an HTTP readiness probe for code-server before initializing WebDriver. |
.github/workflows/ci.yaml |
Extends the task ui (selenium) step timeout to accommodate image pulls and readiness waits. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Upstream still starting — keep polling. Other responses mean HTTP is up. | ||
| if exc.code in (502, 503, 504): | ||
| pass | ||
| else: | ||
| log.info( | ||
| "code-server ready at %s (HTTP %s, attempt %s)", | ||
| url, | ||
| exc.code, | ||
| attempt, | ||
| ) | ||
| return |
There was a problem hiding this comment.
urllib.request.urlopen() raises HTTPError that still holds an open response file (exc.fp). In the retry path (502/503/504) this exception is not closed, so repeated polling can leak sockets/file descriptors in long CI runs. Close the HTTPError (e.g., exc.close() or ensure the response is closed via contextlib.closing).
| # Upstream still starting — keep polling. Other responses mean HTTP is up. | |
| if exc.code in (502, 503, 504): | |
| pass | |
| else: | |
| log.info( | |
| "code-server ready at %s (HTTP %s, attempt %s)", | |
| url, | |
| exc.code, | |
| attempt, | |
| ) | |
| return | |
| with contextlib.closing(exc): | |
| # Upstream still starting — keep polling. Other responses mean HTTP is up. | |
| if exc.code in (502, 503, 504): | |
| pass | |
| else: | |
| log.info( | |
| "code-server ready at %s (HTTP %s, attempt %s)", | |
| url, | |
| exc.code, | |
| attempt, | |
| ) | |
| return |
| health_deadline = time.time() + int( | ||
| os.environ.get("UI_CONTAINER_HEALTH_TIMEOUT", "900"), | ||
| ) |
There was a problem hiding this comment.
UI_CONTAINER_HEALTH_TIMEOUT is parsed with int(...) without validation. If it’s set to a non-integer value, the fixture will crash with a ValueError that doesn’t identify the offending env var. Consider parsing with a helper that either falls back to the default or raises a clearer message mentioning UI_CONTAINER_HEALTH_TIMEOUT.
| _wait_for_code_server_http( | ||
| timeout_sec=int(os.environ.get("UI_CODE_SERVER_TIMEOUT", "600")), | ||
| ) |
There was a problem hiding this comment.
UI_CODE_SERVER_TIMEOUT is parsed with int(...) without validation. If it’s set to a non-integer value, the session fixture will crash with a ValueError before any tests run, and the traceback won’t clearly indicate which env var was invalid. Consider validating/parsing this value with a clearer error message (or falling back to the default).
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/ui/fixtures/ui_fixtures.py`:
- Around line 125-143: The subprocess.run call inside is_container_healthy()
must include an explicit timeout to prevent the health-check command from
hanging; update is_container_healthy() to pass a timeout (e.g.,
timeout=int(os.environ.get("UI_CONTAINER_HEALTHCHECK_TIMEOUT","5"))) to
subprocess.run and catch subprocess.TimeoutExpired (return False and optionally
log) so the outer wait loop still respects UI_CONTAINER_HEALTH_TIMEOUT; ensure
you reference is_container_healthy and handle TimeoutExpired consistently with
other error paths.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 43f9470d-f018-4450-b84b-7a37a96e2006
📒 Files selected for processing (2)
.github/workflows/ci.yamltest/ui/fixtures/ui_fixtures.py
| health_deadline = time.time() + int( | ||
| os.environ.get("UI_CONTAINER_HEALTH_TIMEOUT", "900"), | ||
| ) | ||
| count = 0 | ||
| while True: | ||
| while time.time() < health_deadline: | ||
| if is_container_healthy(): | ||
| break | ||
| count += 1 | ||
| time.sleep(1) | ||
| log.info( | ||
| "Waiting for container %s to be healthy: %s", CONTAINER_NAME, count | ||
| if count <= 5 or count % 15 == 0: | ||
| log.info( | ||
| "Waiting for container %s to be healthy: %s", | ||
| CONTAINER_NAME, | ||
| count, | ||
| ) | ||
| else: | ||
| pytest.fail( | ||
| f"container {CONTAINER_NAME} did not become healthy in time", | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '42,52p;125,147p' test/ui/fixtures/ui_fixtures.py
rg -n -C3 'healthcheck|4444|8080' --iglob '*compose*.y*ml' || trueRepository: ansible/vscode-ansible
Length of output: 2414
Add an explicit timeout to the subprocess call in is_container_healthy().
While the docker-compose.yml defines a healthcheck with a 5-second timeout, the is_container_healthy() function calls subprocess.run() without a timeout argument. This means the subprocess itself could hang if the podman healthcheck run command stalls at the system level, potentially causing the entire wait loop to exceed UI_CONTAINER_HEALTH_TIMEOUT. Add an explicit timeout to the subprocess call for defensive programming and to ensure the behavior is fully under code control.
Patch sketch
-def is_container_healthy() -> bool:
+def is_container_healthy(timeout_sec: int = 10) -> bool:
"""Check if the selenium container is healthy."""
- result = subprocess.run(
- f"podman healthcheck run {CONTAINER_NAME}",
- shell=True,
- check=False,
- text=True,
- capture_output=True,
- )
+ try:
+ result = subprocess.run(
+ f"podman healthcheck run {CONTAINER_NAME}",
+ shell=True,
+ check=False,
+ text=True,
+ capture_output=True,
+ timeout=timeout_sec,
+ )
+ except subprocess.TimeoutExpired:
+ return False
return result.returncode == 0🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/ui/fixtures/ui_fixtures.py` around lines 125 - 143, The subprocess.run
call inside is_container_healthy() must include an explicit timeout to prevent
the health-check command from hanging; update is_container_healthy() to pass a
timeout (e.g.,
timeout=int(os.environ.get("UI_CONTAINER_HEALTHCHECK_TIMEOUT","5"))) to
subprocess.run and catch subprocess.TimeoutExpired (return False and optionally
log) so the outer wait loop still respects UI_CONTAINER_HEALTH_TIMEOUT; ensure
you reference is_container_healthy and handle TimeoutExpired consistently with
other error paths.
## Summary - **Remove the entire Python/Selenium UI test suite** (`test/ui/`, `docker-compose.yml`, Selenium/Podman Python deps) — eliminates the fragile `selenium-vscode` container, brittle XPath selectors, and DOM mismatches between `code-server` and desktop VS Code. - **Add a WebDriverIO test suite** (`test/wdio/`, `wdio.conf.ts`) using `wdio-vscode-service`, which tests against real VS Code Electron. 30 tests across 7 spec files: smoke, commands, terminal, sidebar, webviews (devfile, devcontainer, welcome page, LLM provider settings), MCP server toggle, and Lightspeed (backed by an Express mock server with injected auth session). - **Dedicated CI runner** — WDIO tests run on their own `test (linux-wdio)` matrix entry, independent of unit/e2e. No more cascading failures from unrelated e2e issues blocking UI tests. - **Update CI** (`ci.yaml`, `Taskfile.yml`) to run `task wdio` under `xvfb-run` instead of the Selenium container, removing the Podman dependency from Linux CI. - **Add onboarding docs** — Cursor skill (`.cursor/skills/wdio-testing/`) with architecture, patterns, and gotchas; updated `test/README.md` and `docs/development/test_code.md`. ## Why All 29 Selenium tests on `main` are currently **skipped** — zero tests actually execute: | File | Tests | Status | Skip Reason | |------|-------|--------|-------------| | `test_00_commands.py` | 2 | All skipped | Flaky on CI - extension activation timing | | `test_01_dev_webviews.py` | 2 | All skipped | Flaky on CI - extension activation timing | | `test_02_welcome_webviews.py` | 3 | All skipped | Flaky on CI - extension activation timing | | `test_03_llm_provider_webview.py` | 4 | All skipped | Flaky on CI - extension activation timing | | `test_50_mcp_server.py` | 4 | All skipped | Flaky on CI - extension activation timing | | `test_80_lightspeed.py` | 5 | All skipped | AAP-67210 | | `test_81_login.py` | 8 | All skipped | AAP-67210 | | `test_82_lightspeed_trial.py` | 1 | All skipped | AAP-67210 | | **Total** | **29** | **0 running** | | The Selenium suite had a **100% CI failure rate** for the last two weeks on `main` (W15-W16) before the tests were skipped, and an accelerating failure rate before that (5% → 16% → 38% → 100% over 8 weeks). Root causes: `code-server` DOM divergence from real VS Code, fragile XPath selectors, deeply nested iframe traversal, and container startup timing. See [full analysis](https://gist.github.com/cidrblock/eaa433824950742554bee78d217e3772). WDIO with `wdio-vscode-service` provides stable Page Objects against the real VS Code Electron shell, `browser.executeWorkbench()` for direct access to the VS Code API, and zero container overhead. ## AAP-67210 Coverage The 14 Lightspeed/SSO tests tracked by [AAP-67210](https://redhat.atlassian.net/browse/AAP-67210) were all skipped in Selenium. WDIO restores coverage for 5 of them using a mock server approach, and adds 2 new tests (error/timeout handling) that never existed: | AAP-67210 Selenium Test | WDIO Status | WDIO Test | |---|---|---| | `test_vscode_widget` | **Covered** | `should have Lightspeed commands registered` | | `test_vscode_playbook_explanation` | **Covered** | `should open playbook explanation webview` | | `test_vscode_playbook_generation` | **Covered** | `should open playbook generation webview` | | `test_vscode_role_generation` | **Covered** | `should open role generation webview` | | `test_vscode_lightspeed_explorer` | **Partial** | Explorer view covered via sidebar tests | | `test_vscode_trial_button` | Deferred | Excluded from initial PR | | `test_unsubed_login` | N/A | SSO flow — tests RH infra, not extension | | `test_unsubed_admin_login` | N/A | SSO flow | | `test_no_wca_user_login` | N/A | SSO flow | | `test_no_wca_admin_login` | N/A | SSO flow | | `test_login_page` | N/A | SSO flow | | `test_sso_auth_flow` | N/A | SSO flow | | `test_admin_portal_error` | N/A | SSO flow | | `test_vscode_rhsso_auth_flow` | N/A | SSO flow | The 8 SSO login tests exercise Red Hat SSO infrastructure (external OAuth pages), not extension code. They require live SSO credentials and a running Lightspeed backend, which is why they were never functional in CI. ## CI architecture ``` Before: test (linux) → task unit → task e2e → task ui (selenium, blocked by e2e) After: test (linux) → task unit → task e2e test (linux-wdio) → task build → task wdio ← independent runner ``` The WDIO job skips `task package` (no `.vsix` needed) and runs only `task build` → `task wdio`. Either job can be re-run independently. ## Test coverage comparison | Area | Selenium (before) | WDIO (after) | |---|---|---| | Smoke / activation | implicit | 5 explicit tests | | Commands | 1 (create playbook) | 1 (create playbook) | | Terminal | 1 (flaky, DOM scraping) | 2 (execSync + API) | | Webviews | 4 (devfile, devcontainer, welcome, LLM) | 9 (deeper assertions) | | MCP Server | 2 | 4 (enable/disable/settings/list) | | Sidebar | 0 | 3 (view, welcome link, settings) | | Lightspeed | 0 (all skipped/SSO-gated) | 6 (mock server: explain, generate, role, commands, errors, timeout) | | **Total** | **0 active (29 skipped)** | **30 (all passing)** | ## Extension source changes - `src/extension.ts`: registers `ansible.lightspeed.mockSession` command (gated to `ExtensionMode.Development`/`Test` only — not available in production) - `src/features/lightspeed/lightSpeedOAuthProvider.ts`: adds `setMockSession()` to inject fake auth sessions ## Test plan - [x] `task wdio` passes locally under `xvfb-run` - [x] CI `test (linux-wdio)` job passes - [x] `task lint` passes - [x] `task unit` / `task e2e` unaffected on `test (linux)` - [x] No Selenium/Podman references remain in CI or Taskfile Closes: ansible#2722 Closes: ansible#2743 Closes: ansible#2720 Closes: ansible#2752 related: AAP-67210
Problem
test (linux)job 70211403232 failed ontask ui (selenium):test_terminal,test_create_empty_playbook, andtest_devfile_webviewraisedselenium.common.exceptions.TimeoutException, and the step hit the 30-minute timeout.Cause
docker-compose.ymlhealthcheck only curls Selenium on port 4444. The container can report healthy before code-server on :8080 is ready, so tests hit long waits or timeouts.Also,
if not is_container_healthy() or True:always recreated the container, and the health loop had no upper bound.Changes
http://127.0.0.1:8080/until HTTP responds (retry on 502/503/504).UI_CONTAINER_HEALTH_TIMEOUT).or True; set page/script timeouts on the WebDriver.task ui (selenium)step limit from 30 to 60 minutes.related: #2580
Fixed CI timeouts in UI Selenium tests by adding HTTP polling to verify code-server readiness before running tests. Container health check now waits for code-server (port 8080) to respond, with capped timeouts. Removed a bug preventing proper container reuse and set WebDriver timeouts.
task ui (selenium)step timeout from 30 to 60 minutes_wait_for_code_server_http()function to poll code-server HTTP endpoint after container health checkor Trueconditionrelated: #2580