Skip to content

fix(ci): wait for code-server before UI selenium tests#2720

Closed
ssbarnea wants to merge 2 commits into
ansible:mainfrom
ssbarnea:fix/ui-wait-code-server-ci
Closed

fix(ci): wait for code-server before UI selenium tests#2720
ssbarnea wants to merge 2 commits into
ansible:mainfrom
ssbarnea:fix/ui-wait-code-server-ci

Conversation

@ssbarnea
Copy link
Copy Markdown
Member

@ssbarnea ssbarnea commented Apr 7, 2026

Problem

test (linux) job 70211403232 failed on task ui (selenium): test_terminal, test_create_empty_playbook, and test_devfile_webview raised selenium.common.exceptions.TimeoutException, and the step hit the 30-minute timeout.

Cause

docker-compose.yml healthcheck only curls Selenium on port 4444. The container can report healthy before code-server on :8080 is ready, so tests hit long waits or timeouts.

Also, if not is_container_healthy() or True: always recreated the container, and the health loop had no upper bound.

Changes

  • After podman health passes, poll http://127.0.0.1:8080/ until HTTP responds (retry on 502/503/504).
  • Cap container health wait (default 900s, overridable via UI_CONTAINER_HEALTH_TIMEOUT).
  • Remove the accidental or True; set page/script timeouts on the WebDriver.
  • Raise the GitHub Actions task ui (selenium) step limit from 30 to 60 minutes.

related: #2580

Fixed CI timeouts in UI Selenium tests by adding HTTP polling to verify code-server readiness before running tests. Container health check now waits for code-server (port 8080) to respond, with capped timeouts. Removed a bug preventing proper container reuse and set WebDriver timeouts.

  • Increased GitHub Actions task ui (selenium) step timeout from 30 to 60 minutes
  • Added _wait_for_code_server_http() function to poll code-server HTTP endpoint after container health check
  • Fixed container recreation logic by removing accidental or True condition
  • Capped container health wait time (UI_CONTAINER_HEALTH_TIMEOUT, default 900 seconds)
  • Added configurable code-server startup timeout (UI_CODE_SERVER_TIMEOUT, default 600 seconds)
  • Set WebDriver page load and script timeouts to 300 seconds

related: #2580

Podman health only checks Selenium :4444; code-server on :8080 can still be starting, which led to TimeoutException and 30m job timeouts (e.g. run 24068021216).

- Poll HTTP on code-server after container health (skip 502/503/504)
- Bound podman health wait; remove accidental always-true condition
- Set page/script load timeouts on WebDriver; extend ui step to 60m

related: ansible#2580
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 7, 2026

📝 Walkthrough

Walkthrough

The changes increase the UI test timeout from 30 to 60 minutes and enhance the test startup sequence with direct HTTP polling to verify code-server readiness before running Selenium tests. Container health checks are now deadline-bounded, and Selenium driver timeouts are explicitly configured.

Changes

Cohort / File(s) Summary
CI Configuration
.github/workflows/ci.yaml
Increased task ui (selenium) timeout from 30 to 60 minutes with clarifying comment about Selenium startup delays relative to code-server readiness.
UI Test Fixtures
test/ui/fixtures/ui_fixtures.py
Added HTTP polling function _wait_for_code_server_http() to verify endpoint availability before tests run. Updated container health check from unbounded loop to deadline-bounded polling. Removed conditional or True branch to enforce actual health validation. Configured Selenium driver page load and script execution timeouts (300s each).

Sequence Diagram(s)

sequenceDiagram
    participant Test as Test Framework
    participant Container as Container System
    participant HTTP as code-server HTTP
    participant Driver as Selenium Driver

    Test->>Container: Check container health
    loop Until healthy or timeout
        Container-->>Test: Health status
    end
    
    Test->>HTTP: Poll HTTP endpoint
    loop Until 200 or timeout
        HTTP-->>Test: HTTP response (202/503/etc)
    end
    HTTP-->>Test: HTTP 200 OK
    
    Test->>Driver: Set page load timeout (300s)
    Test->>Driver: Set script timeout (300s)
    Driver-->>Test: Timeouts configured
    
    Test->>Test: Begin UI tests
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 A timeout stretched, from half to whole,
Now HTTP polls to reach its goal,
No phantom starts that waste the time,
Just steady checks—a clean design!
Selenium waits with purpose true,

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main change: waiting for code-server before UI selenium tests. It directly reflects the primary objective of the PR, which addresses CI timeout issues by ensuring code-server is ready before tests run.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch fix/ui-wait-code-server-ci

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the UI Selenium CI setup by ensuring the code-server endpoint is actually reachable before running browser tests, reducing flakiness and long timeouts in the task ui (selenium) workflow step.

Changes:

  • Add polling for http://127.0.0.1:8080/ after the container reports healthy to ensure code-server is ready.
  • Add an upper bound to the container health wait (configurable via UI_CONTAINER_HEALTH_TIMEOUT) and remove unconditional container recreation.
  • Increase the GitHub Actions timeout for the UI Selenium step from 30 to 60 minutes and set WebDriver page/script timeouts.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
test/ui/fixtures/ui_fixtures.py Adds bounded health waits and an HTTP readiness probe for code-server before initializing WebDriver.
.github/workflows/ci.yaml Extends the task ui (selenium) step timeout to accommodate image pulls and readiness waits.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +73 to +83
# Upstream still starting — keep polling. Other responses mean HTTP is up.
if exc.code in (502, 503, 504):
pass
else:
log.info(
"code-server ready at %s (HTTP %s, attempt %s)",
url,
exc.code,
attempt,
)
return
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

urllib.request.urlopen() raises HTTPError that still holds an open response file (exc.fp). In the retry path (502/503/504) this exception is not closed, so repeated polling can leak sockets/file descriptors in long CI runs. Close the HTTPError (e.g., exc.close() or ensure the response is closed via contextlib.closing).

Suggested change
# Upstream still starting — keep polling. Other responses mean HTTP is up.
if exc.code in (502, 503, 504):
pass
else:
log.info(
"code-server ready at %s (HTTP %s, attempt %s)",
url,
exc.code,
attempt,
)
return
with contextlib.closing(exc):
# Upstream still starting — keep polling. Other responses mean HTTP is up.
if exc.code in (502, 503, 504):
pass
else:
log.info(
"code-server ready at %s (HTTP %s, attempt %s)",
url,
exc.code,
attempt,
)
return

Copilot uses AI. Check for mistakes.
Comment on lines +125 to +127
health_deadline = time.time() + int(
os.environ.get("UI_CONTAINER_HEALTH_TIMEOUT", "900"),
)
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UI_CONTAINER_HEALTH_TIMEOUT is parsed with int(...) without validation. If it’s set to a non-integer value, the fixture will crash with a ValueError that doesn’t identify the offending env var. Consider parsing with a helper that either falls back to the default or raises a clearer message mentioning UI_CONTAINER_HEALTH_TIMEOUT.

Copilot uses AI. Check for mistakes.
Comment on lines +145 to +147
_wait_for_code_server_http(
timeout_sec=int(os.environ.get("UI_CODE_SERVER_TIMEOUT", "600")),
)
Copy link

Copilot AI Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UI_CODE_SERVER_TIMEOUT is parsed with int(...) without validation. If it’s set to a non-integer value, the session fixture will crash with a ValueError before any tests run, and the traceback won’t clearly indicate which env var was invalid. Consider validating/parsing this value with a clearer error message (or falling back to the default).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/ui/fixtures/ui_fixtures.py`:
- Around line 125-143: The subprocess.run call inside is_container_healthy()
must include an explicit timeout to prevent the health-check command from
hanging; update is_container_healthy() to pass a timeout (e.g.,
timeout=int(os.environ.get("UI_CONTAINER_HEALTHCHECK_TIMEOUT","5"))) to
subprocess.run and catch subprocess.TimeoutExpired (return False and optionally
log) so the outer wait loop still respects UI_CONTAINER_HEALTH_TIMEOUT; ensure
you reference is_container_healthy and handle TimeoutExpired consistently with
other error paths.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 43f9470d-f018-4450-b84b-7a37a96e2006

📥 Commits

Reviewing files that changed from the base of the PR and between ea17e85 and 7a1b325.

📒 Files selected for processing (2)
  • .github/workflows/ci.yaml
  • test/ui/fixtures/ui_fixtures.py

Comment on lines +125 to 143
health_deadline = time.time() + int(
os.environ.get("UI_CONTAINER_HEALTH_TIMEOUT", "900"),
)
count = 0
while True:
while time.time() < health_deadline:
if is_container_healthy():
break
count += 1
time.sleep(1)
log.info(
"Waiting for container %s to be healthy: %s", CONTAINER_NAME, count
if count <= 5 or count % 15 == 0:
log.info(
"Waiting for container %s to be healthy: %s",
CONTAINER_NAME,
count,
)
else:
pytest.fail(
f"container {CONTAINER_NAME} did not become healthy in time",
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '42,52p;125,147p' test/ui/fixtures/ui_fixtures.py
rg -n -C3 'healthcheck|4444|8080' --iglob '*compose*.y*ml' || true

Repository: ansible/vscode-ansible

Length of output: 2414


Add an explicit timeout to the subprocess call in is_container_healthy().

While the docker-compose.yml defines a healthcheck with a 5-second timeout, the is_container_healthy() function calls subprocess.run() without a timeout argument. This means the subprocess itself could hang if the podman healthcheck run command stalls at the system level, potentially causing the entire wait loop to exceed UI_CONTAINER_HEALTH_TIMEOUT. Add an explicit timeout to the subprocess call for defensive programming and to ensure the behavior is fully under code control.

Patch sketch
-def is_container_healthy() -> bool:
+def is_container_healthy(timeout_sec: int = 10) -> bool:
     """Check if the selenium container is healthy."""
-    result = subprocess.run(
-        f"podman healthcheck run {CONTAINER_NAME}",
-        shell=True,
-        check=False,
-        text=True,
-        capture_output=True,
-    )
+    try:
+        result = subprocess.run(
+            f"podman healthcheck run {CONTAINER_NAME}",
+            shell=True,
+            check=False,
+            text=True,
+            capture_output=True,
+            timeout=timeout_sec,
+        )
+    except subprocess.TimeoutExpired:
+        return False
     return result.returncode == 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/ui/fixtures/ui_fixtures.py` around lines 125 - 143, The subprocess.run
call inside is_container_healthy() must include an explicit timeout to prevent
the health-check command from hanging; update is_container_healthy() to pass a
timeout (e.g.,
timeout=int(os.environ.get("UI_CONTAINER_HEALTHCHECK_TIMEOUT","5"))) to
subprocess.run and catch subprocess.TimeoutExpired (return False and optionally
log) so the outer wait loop still respects UI_CONTAINER_HEALTH_TIMEOUT; ensure
you reference is_container_healthy and handle TimeoutExpired consistently with
other error paths.

@github-project-automation github-project-automation Bot moved this from In Progress to Done in 🧰 devtools project board Apr 17, 2026
rockygeekz pushed a commit to rockygeekz/vscode-ansible that referenced this pull request Apr 17, 2026
## Summary

- **Remove the entire Python/Selenium UI test suite** (`test/ui/`,
`docker-compose.yml`, Selenium/Podman Python deps) — eliminates the
fragile `selenium-vscode` container, brittle XPath selectors, and DOM
mismatches between `code-server` and desktop VS Code.
- **Add a WebDriverIO test suite** (`test/wdio/`, `wdio.conf.ts`) using
`wdio-vscode-service`, which tests against real VS Code Electron. 30
tests across 7 spec files: smoke, commands, terminal, sidebar, webviews
(devfile, devcontainer, welcome page, LLM provider settings), MCP server
toggle, and Lightspeed (backed by an Express mock server with injected
auth session).
- **Dedicated CI runner** — WDIO tests run on their own `test
(linux-wdio)` matrix entry, independent of unit/e2e. No more cascading
failures from unrelated e2e issues blocking UI tests.
- **Update CI** (`ci.yaml`, `Taskfile.yml`) to run `task wdio` under
`xvfb-run` instead of the Selenium container, removing the Podman
dependency from Linux CI.
- **Add onboarding docs** — Cursor skill
(`.cursor/skills/wdio-testing/`) with architecture, patterns, and
gotchas; updated `test/README.md` and `docs/development/test_code.md`.

## Why

All 29 Selenium tests on `main` are currently **skipped** — zero tests
actually execute:

| File | Tests | Status | Skip Reason |
|------|-------|--------|-------------|
| `test_00_commands.py` | 2 | All skipped | Flaky on CI - extension
activation timing |
| `test_01_dev_webviews.py` | 2 | All skipped | Flaky on CI - extension
activation timing |
| `test_02_welcome_webviews.py` | 3 | All skipped | Flaky on CI -
extension activation timing |
| `test_03_llm_provider_webview.py` | 4 | All skipped | Flaky on CI -
extension activation timing |
| `test_50_mcp_server.py` | 4 | All skipped | Flaky on CI - extension
activation timing |
| `test_80_lightspeed.py` | 5 | All skipped | AAP-67210 |
| `test_81_login.py` | 8 | All skipped | AAP-67210 |
| `test_82_lightspeed_trial.py` | 1 | All skipped | AAP-67210 |
| **Total** | **29** | **0 running** | |

The Selenium suite had a **100% CI failure rate** for the last two weeks
on `main` (W15-W16) before the tests were skipped, and an accelerating
failure rate before that (5% → 16% → 38% → 100% over 8 weeks). Root
causes: `code-server` DOM divergence from real VS Code, fragile XPath
selectors, deeply nested iframe traversal, and container startup timing.
See [full
analysis](https://gist.github.com/cidrblock/eaa433824950742554bee78d217e3772).

WDIO with `wdio-vscode-service` provides stable Page Objects against the
real VS Code Electron shell, `browser.executeWorkbench()` for direct
access to the VS Code API, and zero container overhead.

## AAP-67210 Coverage

The 14 Lightspeed/SSO tests tracked by
[AAP-67210](https://redhat.atlassian.net/browse/AAP-67210) were all
skipped in Selenium. WDIO restores coverage for 5 of them using a mock
server approach, and adds 2 new tests (error/timeout handling) that
never existed:

| AAP-67210 Selenium Test | WDIO Status | WDIO Test |
|---|---|---|
| `test_vscode_widget` | **Covered** | `should have Lightspeed commands
registered` |
| `test_vscode_playbook_explanation` | **Covered** | `should open
playbook explanation webview` |
| `test_vscode_playbook_generation` | **Covered** | `should open
playbook generation webview` |
| `test_vscode_role_generation` | **Covered** | `should open role
generation webview` |
| `test_vscode_lightspeed_explorer` | **Partial** | Explorer view
covered via sidebar tests |
| `test_vscode_trial_button` | Deferred | Excluded from initial PR |
| `test_unsubed_login` | N/A | SSO flow — tests RH infra, not extension
|
| `test_unsubed_admin_login` | N/A | SSO flow |
| `test_no_wca_user_login` | N/A | SSO flow |
| `test_no_wca_admin_login` | N/A | SSO flow |
| `test_login_page` | N/A | SSO flow |
| `test_sso_auth_flow` | N/A | SSO flow |
| `test_admin_portal_error` | N/A | SSO flow |
| `test_vscode_rhsso_auth_flow` | N/A | SSO flow |

The 8 SSO login tests exercise Red Hat SSO infrastructure (external
OAuth pages), not extension code. They require live SSO credentials and
a running Lightspeed backend, which is why they were never functional in
CI.

## CI architecture

```
Before:
  test (linux) → task unit → task e2e → task ui (selenium, blocked by e2e)

After:
  test (linux)      → task unit → task e2e
  test (linux-wdio) → task build → task wdio    ← independent runner
```

The WDIO job skips `task package` (no `.vsix` needed) and runs only
`task build` → `task wdio`. Either job can be re-run independently.

## Test coverage comparison

| Area | Selenium (before) | WDIO (after) |
|---|---|---|
| Smoke / activation | implicit | 5 explicit tests |
| Commands | 1 (create playbook) | 1 (create playbook) |
| Terminal | 1 (flaky, DOM scraping) | 2 (execSync + API) |
| Webviews | 4 (devfile, devcontainer, welcome, LLM) | 9 (deeper
assertions) |
| MCP Server | 2 | 4 (enable/disable/settings/list) |
| Sidebar | 0 | 3 (view, welcome link, settings) |
| Lightspeed | 0 (all skipped/SSO-gated) | 6 (mock server: explain,
generate, role, commands, errors, timeout) |
| **Total** | **0 active (29 skipped)** | **30 (all passing)** |

## Extension source changes

- `src/extension.ts`: registers `ansible.lightspeed.mockSession` command
(gated to `ExtensionMode.Development`/`Test` only — not available in
production)
- `src/features/lightspeed/lightSpeedOAuthProvider.ts`: adds
`setMockSession()` to inject fake auth sessions

## Test plan

- [x] `task wdio` passes locally under `xvfb-run`
- [x] CI `test (linux-wdio)` job passes
- [x] `task lint` passes
- [x] `task unit` / `task e2e` unaffected on `test (linux)`
- [x] No Selenium/Podman references remain in CI or Taskfile

Closes: ansible#2722
Closes: ansible#2743
Closes: ansible#2720
Closes: ansible#2752
related: AAP-67210
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants