fix(ci): wait for code-server before UI selenium tests by ssbarnea · Pull Request #2720 · ansible/vscode-ansible

ssbarnea · 2026-04-07T10:06:42Z

Problem

test (linux) job 70211403232 failed on task ui (selenium): test_terminal, test_create_empty_playbook, and test_devfile_webview raised selenium.common.exceptions.TimeoutException, and the step hit the 30-minute timeout.

Cause

docker-compose.yml healthcheck only curls Selenium on port 4444. The container can report healthy before code-server on :8080 is ready, so tests hit long waits or timeouts.

Also, if not is_container_healthy() or True: always recreated the container, and the health loop had no upper bound.

Changes

After podman health passes, poll http://127.0.0.1:8080/ until HTTP responds (retry on 502/503/504).
Cap container health wait (default 900s, overridable via UI_CONTAINER_HEALTH_TIMEOUT).
Remove the accidental or True; set page/script timeouts on the WebDriver.
Raise the GitHub Actions task ui (selenium) step limit from 30 to 60 minutes.

related: #2580

Fixed CI timeouts in UI Selenium tests by adding HTTP polling to verify code-server readiness before running tests. Container health check now waits for code-server (port 8080) to respond, with capped timeouts. Removed a bug preventing proper container reuse and set WebDriver timeouts.

Increased GitHub Actions task ui (selenium) step timeout from 30 to 60 minutes
Added _wait_for_code_server_http() function to poll code-server HTTP endpoint after container health check
Fixed container recreation logic by removing accidental or True condition
Capped container health wait time (UI_CONTAINER_HEALTH_TIMEOUT, default 900 seconds)
Added configurable code-server startup timeout (UI_CODE_SERVER_TIMEOUT, default 600 seconds)
Set WebDriver page load and script timeouts to 300 seconds

related: #2580

Podman health only checks Selenium :4444; code-server on :8080 can still be starting, which led to TimeoutException and 30m job timeouts (e.g. run 24068021216). - Poll HTTP on code-server after container health (skip 502/503/504) - Bound podman health wait; remove accidental always-true condition - Set page/script load timeouts on WebDriver; extend ui step to 60m related: ansible#2580

coderabbitai · 2026-04-07T10:08:50Z

📝 Walkthrough

Walkthrough

The changes increase the UI test timeout from 30 to 60 minutes and enhance the test startup sequence with direct HTTP polling to verify code-server readiness before running Selenium tests. Container health checks are now deadline-bounded, and Selenium driver timeouts are explicitly configured.

Changes

Cohort / File(s)	Summary
CI Configuration `.github/workflows/ci.yaml`	Increased `task ui (selenium)` timeout from 30 to 60 minutes with clarifying comment about Selenium startup delays relative to code-server readiness.
UI Test Fixtures `test/ui/fixtures/ui_fixtures.py`	Added HTTP polling function `_wait_for_code_server_http()` to verify endpoint availability before tests run. Updated container health check from unbounded loop to deadline-bounded polling. Removed conditional `or True` branch to enforce actual health validation. Configured Selenium driver page load and script execution timeouts (300s each).

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Framework
    participant Container as Container System
    participant HTTP as code-server HTTP
    participant Driver as Selenium Driver

    Test->>Container: Check container health
    loop Until healthy or timeout
        Container-->>Test: Health status
    end
    
    Test->>HTTP: Poll HTTP endpoint
    loop Until 200 or timeout
        HTTP-->>Test: HTTP response (202/503/etc)
    end
    HTTP-->>Test: HTTP 200 OK
    
    Test->>Driver: Set page load timeout (300s)
    Test->>Driver: Set script timeout (300s)
    Driver-->>Test: Timeouts configured
    
    Test->>Test: Begin UI tests

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A timeout stretched, from half to whole,
Now HTTP polls to reach its goal,
No phantom starts that waste the time,
Just steady checks—a clean design!
Selenium waits with purpose true, ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main change: waiting for code-server before UI selenium tests. It directly reflects the primary objective of the PR, which addresses CI timeout issues by ensuring code-server is ready before tests run.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

⚔️ Resolve merge conflicts

Resolve merge conflict in branch fix/ui-wait-code-server-ci

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

This PR hardens the UI Selenium CI setup by ensuring the code-server endpoint is actually reachable before running browser tests, reducing flakiness and long timeouts in the task ui (selenium) workflow step.

Changes:

Add polling for http://127.0.0.1:8080/ after the container reports healthy to ensure code-server is ready.
Add an upper bound to the container health wait (configurable via UI_CONTAINER_HEALTH_TIMEOUT) and remove unconditional container recreation.
Increase the GitHub Actions timeout for the UI Selenium step from 30 to 60 minutes and set WebDriver page/script timeouts.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
`test/ui/fixtures/ui_fixtures.py`	Adds bounded health waits and an HTTP readiness probe for code-server before initializing WebDriver.
`.github/workflows/ci.yaml`	Extends the `task ui (selenium)` step timeout to accommodate image pulls and readiness waits.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-07T10:10:10Z

+            # Upstream still starting — keep polling. Other responses mean HTTP is up.
+            if exc.code in (502, 503, 504):
+                pass
+            else:
+                log.info(
+                    "code-server ready at %s (HTTP %s, attempt %s)",
+                    url,
+                    exc.code,
+                    attempt,
+                )
+                return


urllib.request.urlopen() raises HTTPError that still holds an open response file (exc.fp). In the retry path (502/503/504) this exception is not closed, so repeated polling can leak sockets/file descriptors in long CI runs. Close the HTTPError (e.g., exc.close() or ensure the response is closed via contextlib.closing).

Suggested change

# Upstream still starting — keep polling. Other responses mean HTTP is up.

if exc.code in (502, 503, 504):

pass

else:

log.info(

"code-server ready at %s (HTTP %s, attempt %s)",

url,

exc.code,

attempt,

)

return

with contextlib.closing(exc):

# Upstream still starting — keep polling. Other responses mean HTTP is up.

if exc.code in (502, 503, 504):

pass

else:

log.info(

"code-server ready at %s (HTTP %s, attempt %s)",

url,

exc.code,

attempt,

)

return

Copilot · 2026-04-07T10:10:11Z

+            health_deadline = time.time() + int(
+                os.environ.get("UI_CONTAINER_HEALTH_TIMEOUT", "900"),
+            )


UI_CONTAINER_HEALTH_TIMEOUT is parsed with int(...) without validation. If it’s set to a non-integer value, the fixture will crash with a ValueError that doesn’t identify the offending env var. Consider parsing with a helper that either falls back to the default or raises a clearer message mentioning UI_CONTAINER_HEALTH_TIMEOUT.

Copilot · 2026-04-07T10:10:11Z

+        _wait_for_code_server_http(
+            timeout_sec=int(os.environ.get("UI_CODE_SERVER_TIMEOUT", "600")),
+        )


UI_CODE_SERVER_TIMEOUT is parsed with int(...) without validation. If it’s set to a non-integer value, the session fixture will crash with a ValueError before any tests run, and the traceback won’t clearly indicate which env var was invalid. Consider validating/parsing this value with a clearer error message (or falling back to the default).

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/ui/fixtures/ui_fixtures.py`:
- Around line 125-143: The subprocess.run call inside is_container_healthy()
must include an explicit timeout to prevent the health-check command from
hanging; update is_container_healthy() to pass a timeout (e.g.,
timeout=int(os.environ.get("UI_CONTAINER_HEALTHCHECK_TIMEOUT","5"))) to
subprocess.run and catch subprocess.TimeoutExpired (return False and optionally
log) so the outer wait loop still respects UI_CONTAINER_HEALTH_TIMEOUT; ensure
you reference is_container_healthy and handle TimeoutExpired consistently with
other error paths.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 43f9470d-f018-4450-b84b-7a37a96e2006

📥 Commits

Reviewing files that changed from the base of the PR and between ea17e85 and 7a1b325.

📒 Files selected for processing (2)

.github/workflows/ci.yaml
test/ui/fixtures/ui_fixtures.py

coderabbitai · 2026-04-08T12:43:58Z

+            health_deadline = time.time() + int(
+                os.environ.get("UI_CONTAINER_HEALTH_TIMEOUT", "900"),
+            )
            count = 0
-            while True:
+            while time.time() < health_deadline:
                if is_container_healthy():
                    break
                count += 1
                time.sleep(1)
-                log.info(
-                    "Waiting for container %s to be healthy: %s", CONTAINER_NAME, count
+                if count <= 5 or count % 15 == 0:
+                    log.info(
+                        "Waiting for container %s to be healthy: %s",
+                        CONTAINER_NAME,
+                        count,
+                    )
+            else:
+                pytest.fail(
+                    f"container {CONTAINER_NAME} did not become healthy in time",
                )


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '42,52p;125,147p' test/ui/fixtures/ui_fixtures.py rg -n -C3 'healthcheck|4444|8080' --iglob '*compose*.y*ml' || true

Repository: ansible/vscode-ansible

Length of output: 2414

Add an explicit timeout to the subprocess call in is_container_healthy().

While the docker-compose.yml defines a healthcheck with a 5-second timeout, the is_container_healthy() function calls subprocess.run() without a timeout argument. This means the subprocess itself could hang if the podman healthcheck run command stalls at the system level, potentially causing the entire wait loop to exceed UI_CONTAINER_HEALTH_TIMEOUT. Add an explicit timeout to the subprocess call for defensive programming and to ensure the behavior is fully under code control.

Patch sketch

-def is_container_healthy() -> bool: +def is_container_healthy(timeout_sec: int = 10) -> bool: """Check if the selenium container is healthy.""" - result = subprocess.run( - f"podman healthcheck run {CONTAINER_NAME}", - shell=True, - check=False, - text=True, - capture_output=True, - ) + try: + result = subprocess.run( + f"podman healthcheck run {CONTAINER_NAME}", + shell=True, + check=False, + text=True, + capture_output=True, + timeout=timeout_sec, + ) + except subprocess.TimeoutExpired: + return False return result.returncode == 0

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/ui/fixtures/ui_fixtures.py` around lines 125 - 143, The subprocess.run call inside is_container_healthy() must include an explicit timeout to prevent the health-check command from hanging; update is_container_healthy() to pass a timeout (e.g., timeout=int(os.environ.get("UI_CONTAINER_HEALTHCHECK_TIMEOUT","5"))) to subprocess.run and catch subprocess.TimeoutExpired (return False and optionally log) so the outer wait loop still respects UI_CONTAINER_HEALTH_TIMEOUT; ensure you reference is_container_healthy and handle TimeoutExpired consistently with other error paths.

## Summary - **Remove the entire Python/Selenium UI test suite** (`test/ui/`, `docker-compose.yml`, Selenium/Podman Python deps) — eliminates the fragile `selenium-vscode` container, brittle XPath selectors, and DOM mismatches between `code-server` and desktop VS Code. - **Add a WebDriverIO test suite** (`test/wdio/`, `wdio.conf.ts`) using `wdio-vscode-service`, which tests against real VS Code Electron. 30 tests across 7 spec files: smoke, commands, terminal, sidebar, webviews (devfile, devcontainer, welcome page, LLM provider settings), MCP server toggle, and Lightspeed (backed by an Express mock server with injected auth session). - **Dedicated CI runner** — WDIO tests run on their own `test (linux-wdio)` matrix entry, independent of unit/e2e. No more cascading failures from unrelated e2e issues blocking UI tests. - **Update CI** (`ci.yaml`, `Taskfile.yml`) to run `task wdio` under `xvfb-run` instead of the Selenium container, removing the Podman dependency from Linux CI. - **Add onboarding docs** — Cursor skill (`.cursor/skills/wdio-testing/`) with architecture, patterns, and gotchas; updated `test/README.md` and `docs/development/test_code.md`. ## Why All 29 Selenium tests on `main` are currently **skipped** — zero tests actually execute: | File | Tests | Status | Skip Reason | |------|-------|--------|-------------| | `test_00_commands.py` | 2 | All skipped | Flaky on CI - extension activation timing | | `test_01_dev_webviews.py` | 2 | All skipped | Flaky on CI - extension activation timing | | `test_02_welcome_webviews.py` | 3 | All skipped | Flaky on CI - extension activation timing | | `test_03_llm_provider_webview.py` | 4 | All skipped | Flaky on CI - extension activation timing | | `test_50_mcp_server.py` | 4 | All skipped | Flaky on CI - extension activation timing | | `test_80_lightspeed.py` | 5 | All skipped | AAP-67210 | | `test_81_login.py` | 8 | All skipped | AAP-67210 | | `test_82_lightspeed_trial.py` | 1 | All skipped | AAP-67210 | | **Total** | **29** | **0 running** | | The Selenium suite had a **100% CI failure rate** for the last two weeks on `main` (W15-W16) before the tests were skipped, and an accelerating failure rate before that (5% → 16% → 38% → 100% over 8 weeks). Root causes: `code-server` DOM divergence from real VS Code, fragile XPath selectors, deeply nested iframe traversal, and container startup timing. See [full analysis](https://gist.github.com/cidrblock/eaa433824950742554bee78d217e3772). WDIO with `wdio-vscode-service` provides stable Page Objects against the real VS Code Electron shell, `browser.executeWorkbench()` for direct access to the VS Code API, and zero container overhead. ## AAP-67210 Coverage The 14 Lightspeed/SSO tests tracked by [AAP-67210](https://redhat.atlassian.net/browse/AAP-67210) were all skipped in Selenium. WDIO restores coverage for 5 of them using a mock server approach, and adds 2 new tests (error/timeout handling) that never existed: | AAP-67210 Selenium Test | WDIO Status | WDIO Test | |---|---|---| | `test_vscode_widget` | **Covered** | `should have Lightspeed commands registered` | | `test_vscode_playbook_explanation` | **Covered** | `should open playbook explanation webview` | | `test_vscode_playbook_generation` | **Covered** | `should open playbook generation webview` | | `test_vscode_role_generation` | **Covered** | `should open role generation webview` | | `test_vscode_lightspeed_explorer` | **Partial** | Explorer view covered via sidebar tests | | `test_vscode_trial_button` | Deferred | Excluded from initial PR | | `test_unsubed_login` | N/A | SSO flow — tests RH infra, not extension | | `test_unsubed_admin_login` | N/A | SSO flow | | `test_no_wca_user_login` | N/A | SSO flow | | `test_no_wca_admin_login` | N/A | SSO flow | | `test_login_page` | N/A | SSO flow | | `test_sso_auth_flow` | N/A | SSO flow | | `test_admin_portal_error` | N/A | SSO flow | | `test_vscode_rhsso_auth_flow` | N/A | SSO flow | The 8 SSO login tests exercise Red Hat SSO infrastructure (external OAuth pages), not extension code. They require live SSO credentials and a running Lightspeed backend, which is why they were never functional in CI. ## CI architecture ``` Before: test (linux) → task unit → task e2e → task ui (selenium, blocked by e2e) After: test (linux) → task unit → task e2e test (linux-wdio) → task build → task wdio ← independent runner ``` The WDIO job skips `task package` (no `.vsix` needed) and runs only `task build` → `task wdio`. Either job can be re-run independently. ## Test coverage comparison | Area | Selenium (before) | WDIO (after) | |---|---|---| | Smoke / activation | implicit | 5 explicit tests | | Commands | 1 (create playbook) | 1 (create playbook) | | Terminal | 1 (flaky, DOM scraping) | 2 (execSync + API) | | Webviews | 4 (devfile, devcontainer, welcome, LLM) | 9 (deeper assertions) | | MCP Server | 2 | 4 (enable/disable/settings/list) | | Sidebar | 0 | 3 (view, welcome link, settings) | | Lightspeed | 0 (all skipped/SSO-gated) | 6 (mock server: explain, generate, role, commands, errors, timeout) | | **Total** | **0 active (29 skipped)** | **30 (all passing)** | ## Extension source changes - `src/extension.ts`: registers `ansible.lightspeed.mockSession` command (gated to `ExtensionMode.Development`/`Test` only — not available in production) - `src/features/lightspeed/lightSpeedOAuthProvider.ts`: adds `setMockSession()` to inject fake auth sessions ## Test plan - [x] `task wdio` passes locally under `xvfb-run` - [x] CI `test (linux-wdio)` job passes - [x] `task lint` passes - [x] `task unit` / `task e2e` unaffected on `test (linux)` - [x] No Selenium/Podman references remain in CI or Taskfile Closes: ansible#2722 Closes: ansible#2743 Closes: ansible#2720 Closes: ansible#2752 related: AAP-67210

ssbarnea requested review from a team as code owners April 7, 2026 10:06

Copilot AI review requested due to automatic review settings April 7, 2026 10:06

github-project-automation Bot added this to 🧰 devtools project board Apr 7, 2026

ssbarnea temporarily deployed to ack April 7, 2026 10:06 — with GitHub Actions Inactive

github-actions Bot added the fix label Apr 7, 2026

Copilot started reviewing on behalf of ssbarnea April 7, 2026 10:07 View session

Copilot AI reviewed Apr 7, 2026

View reviewed changes

lalten mentioned this pull request Apr 7, 2026

fix: use persistent container for execution environment commands #2721

Open

Merge branch 'main' into fix/ui-wait-code-server-ci

7a1b325

coderabbitai Bot temporarily deployed to ack April 8, 2026 12:38 Inactive

github-actions Bot added fix and removed fix labels Apr 8, 2026

coderabbitai Bot requested changes Apr 8, 2026

View reviewed changes

github-project-automation Bot moved this to In Progress in 🧰 devtools project board Apr 8, 2026

This was referenced Apr 16, 2026

fix(ci): wait for code-server in healthcheck #2744

Closed

feat: replace Selenium UI tests with WebDriverIO #2746

Merged

cidrblock closed this in #2746 Apr 17, 2026

github-project-automation Bot moved this from In Progress to Done in 🧰 devtools project board Apr 17, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ci): wait for code-server before UI selenium tests#2720

fix(ci): wait for code-server before UI selenium tests#2720
ssbarnea wants to merge 2 commits into
ansible:mainfrom
ssbarnea:fix/ui-wait-code-server-ci

ssbarnea commented Apr 7, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 7, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

Copilot AI Apr 7, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ssbarnea commented Apr 7, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Cause

Changes

Uh oh!

coderabbitai Bot commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ssbarnea commented Apr 7, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 7, 2026 •

edited

Loading