Use Unix socket for Supervisor to Core communication#6590
Conversation
|
What's this waiting on btw @agners ? It looks like both PRs are in draft and there's a good number of comments on the other. Is this dependent on something else in the epic or is it just a time issue? |
Yeah it's waiting on Core part to complete. I've prioritized some release things and other bug fixes so just a time thing. I'll get back to it next week. |
c6c3917 to
bffc3e8
Compare
bffc3e8 to
9f5deeb
Compare
9f5deeb to
7c47347
Compare
There was a problem hiding this comment.
Pull request overview
Switches Supervisor→Home Assistant Core internal REST/WebSocket traffic from TCP (port 8123) to a Unix domain socket when Core is new enough, keeping TCP as a fallback for older versions and explicitly excluding the Landingpage “version” from version comparisons.
Changes:
- Add version-gated Unix-socket transport support to
HomeAssistantAPI(session + URL selection) and close the Unix session on Supervisor shutdown. - Update Core WebSocket client and proxy WebSocket code to support “trusted transport” connections (no auth handshake) vs TCP token-auth connections.
- Update Docker Core container configuration (mount + env var) and extend tests for mount/env behavior and WebSocket client attribute rename.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
supervisor/homeassistant/api.py |
Adds Unix-socket-aware session/url selection and uses it for API requests (token auth only for TCP). |
supervisor/homeassistant/websocket.py |
Splits WS connection logic into Unix-socket no-auth vs TCP auth; renames _client→client. |
supervisor/api/proxy.py |
Reuses WSClient for proxy WebSocket connection setup, including Unix-socket mode. |
supervisor/docker/const.py |
Introduces ENV_CORE_API_SOCKET and MOUNT_CORE_RUN; extends bind options serialization. |
supervisor/docker/homeassistant.py |
Mounts /run/supervisor into Core and sets socket path env var when supported. |
supervisor/const.py |
Adds SOCKET_CORE path constant used for Unix connector. |
supervisor/core.py |
Ensures the Unix-socket ClientSession is closed during Supervisor shutdown. |
tests/docker/test_homeassistant.py |
Updates expected mounts and adds coverage for the Core socket env var. |
tests/conftest.py |
Adjusts WebSocket client mocking for renamed attribute. |
tests/homeassistant/test_module.py |
Updates tests to use websocket.client instead of protected _client. |
7c47347 to
2e7c612
Compare
Switch internal Supervisor-to-Core HTTP and WebSocket communication from TCP (port 8123) to a Unix domain socket. The existing /run/supervisor directory on the host (already mounted at /run/os inside the Supervisor container) is bind-mounted into the Core container at /run/supervisor. Core receives the socket path via the SUPERVISOR_CORE_API_SOCKET environment variable, creates the socket there, and Supervisor connects to it via aiohttp.UnixConnector at /run/os/core.sock. Since the Unix socket is only reachable by processes on the same host, requests arriving over it are implicitly trusted and authenticated as the existing Supervisor system user. This removes the token round-trip where Supervisor had to obtain and send Bearer tokens on every Core API call. WebSocket connections are likewise authenticated implicitly, skipping the auth_required/auth handshake. Key design decisions: - Version-gated by CORE_UNIX_SOCKET_MIN_VERSION so older Core versions transparently continue using TCP with token auth - LANDINGPAGE is explicitly excluded (not a CalVer version) - Hard-fails with a clear error if the socket file is unexpectedly missing when Unix socket communication is expected - WSClient.connect() for Unix socket (no auth) and WSClient.connect_with_auth() for TCP (token auth) separate the two connection modes cleanly - Token refresh always uses the TCP websession since it is inherently a TCP/Bearer-auth operation - Logs which transport (Unix socket vs TCP) is being used on first request Closes #6626 Related Core PR: home-assistant/core#163907 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensure the underlying WebSocket connection is closed before raising when the handshake produces an unexpected message. Also validate that the first TCP message is auth_required before sending credentials. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split use_unix_socket into two properties to handle the Supervisor upgrade transition where Core is still running with a container started by the old Supervisor (without SUPERVISOR_CORE_API_SOCKET): - supports_unix_socket: version check only, used when creating the Core container to decide whether to set the env var - use_unix_socket: version check + running container env check, used for communication decisions This ensures TCP fallback during the upgrade transition while still hard-failing if the socket is missing after Supervisor configured Core to use it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove transport log from make_request that logged before Core container was attached, causing misleading connection logs - Log "Connected to Core via ..." once on first successful API response in get_api_state, when the transport is actually known - Remove explicit socket existence check from session property, let aiohttp UnixConnector produce natural connection errors during Core startup (same as TCP connection refused) - Add validation in get_core_state matching get_config pattern - Restore make_request docstring Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add is_running() check to make_request and connect_websocket so no HTTP or WebSocket connection is attempted when the Core container is not running. This avoids misleading connection attempts during Supervisor startup before Core is ready. Also make use_unix_socket raise if container metadata is not available instead of silently falling back to TCP. This is a defensive check since is_running() guards should prevent reaching this state. Add attached property to DockerInterface to expose whether container metadata has been loaded. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Listen for Core container STOPPED/FAILED events to reset the connection state: clear the _core_connected flag so the transport is logged again on next successful connection, and close any stale Unix socket session. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0b9c0bd to
b6c9704
Compare
89f833f to
c6d2a58
Compare
8ccd272 to
6d32b01
Compare
The is_running() guard in update_hass_panel is now redundant since make_request checks is_running() internally. Also mock is_running in the websession test fixture since tests using it need make_request to proceed past the container running check. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mdegat01
left a comment
There was a problem hiding this comment.
Some small changes, looks great though. Should definitely be a performance boost, have you measured it at all?
| except HomeAssistantAPIError as err: | ||
| _LOGGER.error("Error connecting to Home Assistant WebSocket: %s", err) | ||
| raise APIError() from err | ||
| except (RuntimeError, ValueError, TypeError) as err: |
There was a problem hiding this comment.
| except (RuntimeError, ValueError, TypeError) as err: | |
| except RuntimeError, ValueError, TypeError as err: |
Looks like the AI is still not doing what you want here despite the instructions change. Is this a mypy problem by any chance? When it did it to me I noticed it was driven from a mypy error about it which was really because mypy was out of date in that environment.
There was a problem hiding this comment.
Might be that this code predates my CLAUDE.md file change. That said, I wonder why we need to capture RuntimeError here, that usually indicates a logic error in our code. I wonder if we should remove this from the list 🤔
There was a problem hiding this comment.
Yeah with the centralized authentication logic in websocket.py:connect() we capture all exceptions there. Which is probably also not all that nice.
RuntimeError only really seems to be raised if the connection is closed. From what I can tell, this can't really happen as we open a new connection in that case. RuntimeError has been handled in proxy.py since it's inception, its not clear to me if that ever happened in practice.
Let's remove RuntimeError. In case it happens in practice, we'll discover it through Sentry then. I've also got rid of the broad except Exception in connect(), and convert those into HomeAssistantAPIError. I think with that we have quite a bit better exception handling than before.
|
|
||
| propagation: PropagationMode | None = None | ||
| read_only_non_recursive: bool | None = None | ||
| create_mountpoint: bool | None = None |
There was a problem hiding this comment.
It doesn't seem like this is used anywhere?
There was a problem hiding this comment.
Yeah this came from an earlier iteration, where I used different paths in /run. But we need to have access to that location in Supervisor too, so it must exist at time we use in Supervisor. Hence I don't see any use case for that create mountpoint flag in this PR anymore. I drop it.
| # Unreachable, but satisfies type checker | ||
| raise HomeAssistantAPIError("Failed to connect WebSocket") |
There was a problem hiding this comment.
It does but codecov won't like this, permanently uncoverable code. I can't really think of a better way to do this without unreachable code so maybe we just end with a comment that says there's no return because it can't reach here and tell the type checker to ignore here?
There was a problem hiding this comment.
We have just a 80% requirement now, and I think this would bypass it as well: # pragma: no cover 🤷 . I made this a RuntimeError so it would be noticed.
|
Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍 |
Home Assistant OS (as well as the Supervised run scripts) bind mount /run/supervisor to /run/os in Supervisor. Since we reuse this location for the communication socket between Supervisor and Core, we need to also bind mount /run/supervisor to Supervisor /run/os in CI.
Just did some rough test, this is the time it takes for the first connection (in Measuring 10'000 API calls barley shows a difference: {
"result": "ok",
"data": {
"transport": "unix_socket",
"requests": 10000,
"errors": 0,
"min_ms": 2.2,
"max_ms": 157.8,
"avg_ms": 2.9,
"median_ms": 2.8,
"p95_ms": 3.5
}
}
{
"result": "ok",
"data": {
"transport": "tcp",
"requests": 10000,
"errors": 0,
"min_ms": 2.3,
"max_ms": 124.8,
"avg_ms": 2.8,
"median_ms": 2.7,
"p95_ms": 3.3
}
}From literature, the expected latency difference is single digit microseconds, Python being interpreted is just much to slow for the transport to make a difference. I also tested throughput with a test endpoint returning 1GB of data: "throughput": {
"transport": "tcp",
"size_bytes": 1073741824,
"duration_ms": 1798.4,
"throughput_mb_s": 569.4
}
"throughput": {
"transport": "unix_socket",
"size_bytes": 1073741824,
"duration_ms": 2057.8,
"throughput_mb_s": 497.6
}It actually appears that Unix socket is slightly slower. Not sure if the aiohttp implementation is not quite optimized 🤔 . |
Unexpected exceptions during the WebSocket handshake (KeyError, ValueError, TypeError from malformed messages) are now wrapped in HomeAssistantAPIError inside WSClient.connect/connect_with_auth. This means callers only need to catch HomeAssistantAPIError. Remove the now-unnecessary except (RuntimeError, ValueError, TypeError) from proxy _websocket_client and add a proper error message to the APIError per review feedback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace broad `except Exception` with specific exception types that can actually occur during the WebSocket handshake: KeyError (missing dict keys), ValueError (bad JSON), TypeError (non-text WS message), aiohttp.ClientError (connection errors), and TimeoutError. This avoids silently wrapping programming errors into HomeAssistantAPIError. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The field was added but never used. The /run/supervisor host path is guaranteed to exist since HAOS creates it for the Supervisor container mount, so auto-creating the mountpoint is unnecessary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move token clear before the attempt check in connect_websocket so the stale token is always discarded, even when raising on the final attempt. Without this, the next call would reuse the cached bad token via _ensure_access_token's fast path, wasting a round-trip. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
d79df39 to
b289d9f
Compare
Add tests for the new Unix socket communication path and improve existing test coverage: - Version-based supports_unix_socket and env-based use_unix_socket - api_url/ws_url transport selection - Connection lifecycle: connected log after restart, ignoring unrelated container events - get_api_state/check_api_state parameterized across versions, responses, and error cases - make_request is_running guard and TCP flow with real token fetch - connect_websocket for both Unix and TCP (with token verification) - WSClient.connect/connect_with_auth handshake success, errors, cleanup on failure, and close with pending futures Consolidate existing tests into parameterized form and drop synthetic tests that covered very little. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
This looks good. Kind of thought we'd get more performance boost out of it but its a good change either way.
However I should note something. So I installed this on a local dev system to test. First off, I had no issues 👍 But it did take me a sec to figure out how to get it actually used. Just updating supervisor to the branch code isn't enough, you also have to rebuild the core container after the supervisor update. Otherwise the socket isn't mounted in core's container so it won't use it.
I fixed this for myself easily enough with ha core rebuild so I could do my testing. But our users aren't going to do that. Most likely the only time the core container is rebuilt for users is when they update core. Even a host reboot does not cause a rebuild of core container because unlike the addon containers we don't remove core's container automatically when we stop it. We only remove it if there's a recurring issue, an update or the user specifically asks us to with ha core rebuild or a safe mode restart.
This means you may not actually get real-world usage of this new code path for quite a while after we ship it unless core happens to need to ship a patch after. Maybe we want to consider a change to kind of force the issue there? Like something that detects the core container's config does not match expectations and sets a flag to make the next restart a rebuild automatically? This can be a separate PR if you want, its a related problem but not really the same task.
Yes, I am aware, and probably should have noted that in the PR description. It does have the added side effect that the feature will be used slowly more and more, like a slow rollout. Which in case we see issues, we can fix them before all users are affected, or worst case even revert.
So that would mean it will still take a while until the feature is rolled out. In my mind, we do point releases fairly often, so it will get adopted quite quickly. Not sure if it is really worth the additional complexity just for a slightly faster rollout. |
|
I was thinking restarts happen more frequently then updates but yea I guess that's fair. The main issue I see is just if you're hoping to collect info from beta before promoting to stable it may be tough to count on that. But I'm good with it either way, LGTM 👍 |
Proposed change
Switch internal Supervisor→Core HTTP and WebSocket communication from TCP (port 8123) to a Unix domain socket when the installed Core version supports it.
The existing
/run/supervisordirectory on the host (already mounted at/run/osinside the Supervisor container) is bind-mounted into the Core container as/run/supervisor. Core receives theSUPERVISOR_CORE_API_SOCKETenvironment variable with the socket path, creates the socket there, and Supervisor connects to it viaaiohttp.UnixConnectorat/run/os/core.sock.Since the Unix socket is only reachable by processes on the same host, requests arriving over it are implicitly trusted and authenticated as the existing "Supervisor" system user. This removes the current token round-trip where Core creates a refresh token, hands it to Supervisor, and Supervisor sends it back as a Bearer token on every Core API call. WebSocket connections are likewise authenticated implicitly, skipping the
auth_required/authhandshake.This reduces attack surface by removing the need for network-based communication between Supervisor and Core, avoids potential port conflicts or
http.server_hostconfiguration issues, and removes authentication overhead for internal IPC.Key design decisions:
HomeAssistantAPIas public properties (session,api_url,ws_url) used by WebSocket and proxy codeCORE_UNIX_SOCKET_MIN_VERSIONso older Core versions continue using TCP transparentlyWSClient.connect()for Unix socket (no auth) andWSClient.connect_with_auth()for TCP (token auth) cleanly separate the two connection modes_websocket_clientreusesWSClientinstead of duplicating the WebSocket auth handshake/run/os↔/run/supervisorhost mount rather than creating new pathsType of change
Additional information
Checklist
ruff format supervisor tests)If API endpoints or add-on configuration are added/changed: