Skip to content

Use Unix socket for Supervisor to Core communication#6590

Merged
agners merged 16 commits into
mainfrom
use-unix-socket-for-core-api
Apr 10, 2026
Merged

Use Unix socket for Supervisor to Core communication#6590
agners merged 16 commits into
mainfrom
use-unix-socket-for-core-api

Conversation

@agners
Copy link
Copy Markdown
Member

@agners agners commented Feb 23, 2026

Proposed change

Switch internal Supervisor→Core HTTP and WebSocket communication from TCP (port 8123) to a Unix domain socket when the installed Core version supports it.

The existing /run/supervisor directory on the host (already mounted at /run/os inside the Supervisor container) is bind-mounted into the Core container as /run/supervisor. Core receives the SUPERVISOR_CORE_API_SOCKET environment variable with the socket path, creates the socket there, and Supervisor connects to it via aiohttp.UnixConnector at /run/os/core.sock.

Since the Unix socket is only reachable by processes on the same host, requests arriving over it are implicitly trusted and authenticated as the existing "Supervisor" system user. This removes the current token round-trip where Core creates a refresh token, hands it to Supervisor, and Supervisor sends it back as a Bearer token on every Core API call. WebSocket connections are likewise authenticated implicitly, skipping the auth_required/auth handshake.

This reduces attack surface by removing the need for network-based communication between Supervisor and Core, avoids potential port conflicts or http.server_host configuration issues, and removes authentication overhead for internal IPC.

Key design decisions:

  • Session, URL, and transport management lives in HomeAssistantAPI as public properties (session, api_url, ws_url) used by WebSocket and proxy code
  • Version-gated with CORE_UNIX_SOCKET_MIN_VERSION so older Core versions continue using TCP transparently
  • LANDINGPAGE version is explicitly excluded from the version comparison (it's not CalVer)
  • Hard-fails with a clear error if the Unix socket file is unexpectedly missing when expected
  • WSClient.connect() for Unix socket (no auth) and WSClient.connect_with_auth() for TCP (token auth) cleanly separate the two connection modes
  • Token refresh always uses the TCP websession since it is inherently a TCP/Bearer-auth operation
  • Proxy _websocket_client reuses WSClient instead of duplicating the WebSocket auth handshake
  • Logs which transport (Unix socket vs TCP) is being used on first request
  • Reuses the existing /run/os/run/supervisor host mount rather than creating new paths

Type of change

  • Dependency upgrade
  • Bugfix (non-breaking change which fixes an issue)
  • New feature (which adds functionality to the supervisor)
  • Breaking change (fix/feature causing existing functionality to break)
  • Code quality improvements to existing code or addition of tests

Additional information

Checklist

  • The code change is tested and works locally.
  • Local tests pass. Your PR cannot be merged unless tests pass
  • There is no commented out code in this PR.
  • I have followed the development checklist
  • The code has been formatted using Ruff (ruff format supervisor tests)
  • Tests have been added to verify that the new code works.

If API endpoints or add-on configuration are added/changed:

@agners agners added the new-feature A new feature label Feb 23, 2026
@agners agners marked this pull request as draft February 23, 2026 23:22
@mdegat01
Copy link
Copy Markdown
Contributor

mdegat01 commented Mar 4, 2026

What's this waiting on btw @agners ? It looks like both PRs are in draft and there's a good number of comments on the other. Is this dependent on something else in the epic or is it just a time issue?

@agners
Copy link
Copy Markdown
Member Author

agners commented Mar 5, 2026

What's this waiting on btw @agners ? It looks like both PRs are in draft and there's a good number of comments on the other. Is this dependent on something else in the epic or is it just a time issue?

Yeah it's waiting on Core part to complete. I've prioritized some release things and other bug fixes so just a time thing. I'll get back to it next week.

@agners agners force-pushed the use-unix-socket-for-core-api branch from c6c3917 to bffc3e8 Compare March 11, 2026 18:07
@agners agners force-pushed the use-unix-socket-for-core-api branch from bffc3e8 to 9f5deeb Compare March 25, 2026 15:07
@agners agners force-pushed the use-unix-socket-for-core-api branch from 9f5deeb to 7c47347 Compare April 1, 2026 17:08
@agners agners requested a review from Copilot April 1, 2026 17:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Switches Supervisor→Home Assistant Core internal REST/WebSocket traffic from TCP (port 8123) to a Unix domain socket when Core is new enough, keeping TCP as a fallback for older versions and explicitly excluding the Landingpage “version” from version comparisons.

Changes:

  • Add version-gated Unix-socket transport support to HomeAssistantAPI (session + URL selection) and close the Unix session on Supervisor shutdown.
  • Update Core WebSocket client and proxy WebSocket code to support “trusted transport” connections (no auth handshake) vs TCP token-auth connections.
  • Update Docker Core container configuration (mount + env var) and extend tests for mount/env behavior and WebSocket client attribute rename.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
supervisor/homeassistant/api.py Adds Unix-socket-aware session/url selection and uses it for API requests (token auth only for TCP).
supervisor/homeassistant/websocket.py Splits WS connection logic into Unix-socket no-auth vs TCP auth; renames _clientclient.
supervisor/api/proxy.py Reuses WSClient for proxy WebSocket connection setup, including Unix-socket mode.
supervisor/docker/const.py Introduces ENV_CORE_API_SOCKET and MOUNT_CORE_RUN; extends bind options serialization.
supervisor/docker/homeassistant.py Mounts /run/supervisor into Core and sets socket path env var when supported.
supervisor/const.py Adds SOCKET_CORE path constant used for Unix connector.
supervisor/core.py Ensures the Unix-socket ClientSession is closed during Supervisor shutdown.
tests/docker/test_homeassistant.py Updates expected mounts and adds coverage for the Core socket env var.
tests/conftest.py Adjusts WebSocket client mocking for renamed attribute.
tests/homeassistant/test_module.py Updates tests to use websocket.client instead of protected _client.

Comment thread supervisor/homeassistant/websocket.py Outdated
Comment thread supervisor/homeassistant/websocket.py Outdated
Comment thread supervisor/api/proxy.py Outdated
Comment thread supervisor/docker/homeassistant.py Outdated
@agners agners force-pushed the use-unix-socket-for-core-api branch from 7c47347 to 2e7c612 Compare April 1, 2026 17:26
@agners agners requested review from mdegat01 and sairon April 1, 2026 17:48
@agners agners marked this pull request as ready for review April 1, 2026 17:48
agners and others added 7 commits April 8, 2026 16:50
Switch internal Supervisor-to-Core HTTP and WebSocket communication
from TCP (port 8123) to a Unix domain socket.

The existing /run/supervisor directory on the host (already mounted
at /run/os inside the Supervisor container) is bind-mounted into the
Core container at /run/supervisor. Core receives the socket path via
the SUPERVISOR_CORE_API_SOCKET environment variable, creates the
socket there, and Supervisor connects to it via aiohttp.UnixConnector
at /run/os/core.sock.

Since the Unix socket is only reachable by processes on the same host,
requests arriving over it are implicitly trusted and authenticated as
the existing Supervisor system user. This removes the token round-trip
where Supervisor had to obtain and send Bearer tokens on every Core
API call. WebSocket connections are likewise authenticated implicitly,
skipping the auth_required/auth handshake.

Key design decisions:
- Version-gated by CORE_UNIX_SOCKET_MIN_VERSION so older Core
  versions transparently continue using TCP with token auth
- LANDINGPAGE is explicitly excluded (not a CalVer version)
- Hard-fails with a clear error if the socket file is unexpectedly
  missing when Unix socket communication is expected
- WSClient.connect() for Unix socket (no auth) and
  WSClient.connect_with_auth() for TCP (token auth) separate the
  two connection modes cleanly
- Token refresh always uses the TCP websession since it is inherently
  a TCP/Bearer-auth operation
- Logs which transport (Unix socket vs TCP) is being used on first
  request

Closes #6626
Related Core PR: home-assistant/core#163907

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ensure the underlying WebSocket connection is closed before raising
when the handshake produces an unexpected message. Also validate that
the first TCP message is auth_required before sending credentials.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split use_unix_socket into two properties to handle the Supervisor
upgrade transition where Core is still running with a container
started by the old Supervisor (without SUPERVISOR_CORE_API_SOCKET):

- supports_unix_socket: version check only, used when creating the
  Core container to decide whether to set the env var
- use_unix_socket: version check + running container env check, used
  for communication decisions

This ensures TCP fallback during the upgrade transition while still
hard-failing if the socket is missing after Supervisor configured
Core to use it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove transport log from make_request that logged before Core
  container was attached, causing misleading connection logs
- Log "Connected to Core via ..." once on first successful API response
  in get_api_state, when the transport is actually known
- Remove explicit socket existence check from session property, let
  aiohttp UnixConnector produce natural connection errors during
  Core startup (same as TCP connection refused)
- Add validation in get_core_state matching get_config pattern
- Restore make_request docstring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add is_running() check to make_request and connect_websocket so no
HTTP or WebSocket connection is attempted when the Core container is
not running. This avoids misleading connection attempts during
Supervisor startup before Core is ready.

Also make use_unix_socket raise if container metadata is not available
instead of silently falling back to TCP. This is a defensive check
since is_running() guards should prevent reaching this state.

Add attached property to DockerInterface to expose whether container
metadata has been loaded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Listen for Core container STOPPED/FAILED events to reset the
connection state: clear the _core_connected flag so the transport
is logged again on next successful connection, and close any stale
Unix socket session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@agners agners force-pushed the use-unix-socket-for-core-api branch from 0b9c0bd to b6c9704 Compare April 8, 2026 16:51
@agners agners force-pushed the use-unix-socket-for-core-api branch from 89f833f to c6d2a58 Compare April 8, 2026 16:59
@agners agners force-pushed the use-unix-socket-for-core-api branch from 8ccd272 to 6d32b01 Compare April 8, 2026 18:13
The is_running() guard in update_hass_panel is now redundant since
make_request checks is_running() internally. Also mock is_running
in the websession test fixture since tests using it need make_request
to proceed past the container running check.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@mdegat01 mdegat01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small changes, looks great though. Should definitely be a performance boost, have you measured it at all?

Comment thread supervisor/api/proxy.py Outdated
Comment thread supervisor/api/proxy.py Outdated
Comment thread supervisor/api/proxy.py Outdated
except HomeAssistantAPIError as err:
_LOGGER.error("Error connecting to Home Assistant WebSocket: %s", err)
raise APIError() from err
except (RuntimeError, ValueError, TypeError) as err:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
except (RuntimeError, ValueError, TypeError) as err:
except RuntimeError, ValueError, TypeError as err:

Looks like the AI is still not doing what you want here despite the instructions change. Is this a mypy problem by any chance? When it did it to me I noticed it was driven from a mypy error about it which was really because mypy was out of date in that environment.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be that this code predates my CLAUDE.md file change. That said, I wonder why we need to capture RuntimeError here, that usually indicates a logic error in our code. I wonder if we should remove this from the list 🤔

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah with the centralized authentication logic in websocket.py:connect() we capture all exceptions there. Which is probably also not all that nice.

RuntimeError only really seems to be raised if the connection is closed. From what I can tell, this can't really happen as we open a new connection in that case. RuntimeError has been handled in proxy.py since it's inception, its not clear to me if that ever happened in practice.

Let's remove RuntimeError. In case it happens in practice, we'll discover it through Sentry then. I've also got rid of the broad except Exception in connect(), and convert those into HomeAssistantAPIError. I think with that we have quite a bit better exception handling than before.

Comment thread supervisor/docker/const.py Outdated

propagation: PropagationMode | None = None
read_only_non_recursive: bool | None = None
create_mountpoint: bool | None = None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem like this is used anywhere?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this came from an earlier iteration, where I used different paths in /run. But we need to have access to that location in Supervisor too, so it must exist at time we use in Supervisor. Hence I don't see any use case for that create mountpoint flag in this PR anymore. I drop it.

Comment thread supervisor/homeassistant/api.py Outdated
Comment thread supervisor/homeassistant/api.py Outdated
Comment on lines +227 to +228
# Unreachable, but satisfies type checker
raise HomeAssistantAPIError("Failed to connect WebSocket")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does but codecov won't like this, permanently uncoverable code. I can't really think of a better way to do this without unreachable code so maybe we just end with a comment that says there's no return because it can't reach here and tell the type checker to ignore here?

Copy link
Copy Markdown
Member Author

@agners agners Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have just a 80% requirement now, and I think this would bypass it as well: # pragma: no cover 🤷 . I made this a RuntimeError so it would be noticed.

@home-assistant home-assistant Bot marked this pull request as draft April 8, 2026 20:34
@home-assistant
Copy link
Copy Markdown

home-assistant Bot commented Apr 8, 2026

Please take a look at the requested changes, and use the Ready for review button when you are done, thanks 👍

Learn more about our pull request process.

Home Assistant OS (as well as the Supervised run scripts) bind mount
/run/supervisor to /run/os in Supervisor. Since we reuse this location
for the communication socket between Supervisor and Core, we need to
also bind mount /run/supervisor to Supervisor /run/os in CI.
@agners
Copy link
Copy Markdown
Member Author

agners commented Apr 8, 2026

Some small changes, looks great though. Should definitely be a performance boost, have you measured it at all?

Just did some rough test, this is the time it takes for the first connection (in get_api_state). The Unix socket doesn't need to do the auth dance, I guess that is mostly where the difference is coming from:

2026-04-08 23:07:17.317 INFO (MainThread) [supervisor.homeassistant.api] Connected to Core via Unix socket /run/os/core.sock (52ms)
2026-04-08 23:07:36.237 INFO (MainThread) [supervisor.homeassistant.api] Connected to Core via Unix socket /run/os/core.sock (31ms)
2026-04-08 23:07:48.488 INFO (MainThread) [supervisor.homeassistant.api] Connected to Core via Unix socket /run/os/core.sock (24ms)
2026-04-08 23:08:01.720 INFO (MainThread) [supervisor.homeassistant.api] Connected to Core via Unix socket /run/os/core.sock (45ms)
2026-04-08 23:08:56.262 INFO (MainThread) [supervisor.homeassistant.api] Connected to Core via TCP http://172.30.32.1:80 (87ms)
2026-04-08 23:09:45.408 INFO (MainThread) [supervisor.homeassistant.api] Connected to Core via TCP http://172.30.32.1:80 (44ms)
2026-04-08 23:10:04.411 INFO (MainThread) [supervisor.homeassistant.api] Connected to Core via TCP http://172.30.32.1:80 (85ms)
2026-04-08 23:10:18.631 INFO (MainThread) [supervisor.homeassistant.api] Connected to Core via TCP http://172.30.32.1:80 (53ms)

Measuring 10'000 API calls barley shows a difference:

{
  "result": "ok",
  "data": {
    "transport": "unix_socket",
    "requests": 10000,
    "errors": 0,
    "min_ms": 2.2,
    "max_ms": 157.8,
    "avg_ms": 2.9,
    "median_ms": 2.8,
    "p95_ms": 3.5
  }
}
{
  "result": "ok",
  "data": {
    "transport": "tcp",
    "requests": 10000,
    "errors": 0,
    "min_ms": 2.3,
    "max_ms": 124.8,
    "avg_ms": 2.8,
    "median_ms": 2.7,
    "p95_ms": 3.3
  }
}

From literature, the expected latency difference is single digit microseconds, Python being interpreted is just much to slow for the transport to make a difference.

I also tested throughput with a test endpoint returning 1GB of data:

    "throughput": {
      "transport": "tcp",
      "size_bytes": 1073741824,
      "duration_ms": 1798.4,
      "throughput_mb_s": 569.4
    }
    "throughput": {
      "transport": "unix_socket",
      "size_bytes": 1073741824,
      "duration_ms": 2057.8,
      "throughput_mb_s": 497.6
    }

It actually appears that Unix socket is slightly slower. Not sure if the aiohttp implementation is not quite optimized 🤔 .

agners and others added 4 commits April 9, 2026 13:39
Unexpected exceptions during the WebSocket handshake (KeyError,
ValueError, TypeError from malformed messages) are now wrapped in
HomeAssistantAPIError inside WSClient.connect/connect_with_auth.
This means callers only need to catch HomeAssistantAPIError.

Remove the now-unnecessary except (RuntimeError, ValueError,
TypeError) from proxy _websocket_client and add a proper error
message to the APIError per review feedback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace broad `except Exception` with specific exception types that
can actually occur during the WebSocket handshake: KeyError (missing
dict keys), ValueError (bad JSON), TypeError (non-text WS message),
aiohttp.ClientError (connection errors), and TimeoutError. This
avoids silently wrapping programming errors into HomeAssistantAPIError.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The field was added but never used. The /run/supervisor host path
is guaranteed to exist since HAOS creates it for the Supervisor
container mount, so auto-creating the mountpoint is unnecessary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move token clear before the attempt check in connect_websocket so
the stale token is always discarded, even when raising on the final
attempt. Without this, the next call would reuse the cached bad token
via _ensure_access_token's fast path, wasting a round-trip.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@agners agners force-pushed the use-unix-socket-for-core-api branch from d79df39 to b289d9f Compare April 9, 2026 12:47
Add tests for the new Unix socket communication path and improve
existing test coverage:

- Version-based supports_unix_socket and env-based use_unix_socket
- api_url/ws_url transport selection
- Connection lifecycle: connected log after restart, ignoring
  unrelated container events
- get_api_state/check_api_state parameterized across versions,
  responses, and error cases
- make_request is_running guard and TCP flow with real token fetch
- connect_websocket for both Unix and TCP (with token verification)
- WSClient.connect/connect_with_auth handshake success, errors,
  cleanup on failure, and close with pending futures

Consolidate existing tests into parameterized form and drop synthetic
tests that covered very little.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@agners agners marked this pull request as ready for review April 9, 2026 13:47
@home-assistant home-assistant Bot requested a review from mdegat01 April 9, 2026 13:47
Copy link
Copy Markdown
Contributor

@mdegat01 mdegat01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Kind of thought we'd get more performance boost out of it but its a good change either way.

However I should note something. So I installed this on a local dev system to test. First off, I had no issues 👍 But it did take me a sec to figure out how to get it actually used. Just updating supervisor to the branch code isn't enough, you also have to rebuild the core container after the supervisor update. Otherwise the socket isn't mounted in core's container so it won't use it.

I fixed this for myself easily enough with ha core rebuild so I could do my testing. But our users aren't going to do that. Most likely the only time the core container is rebuilt for users is when they update core. Even a host reboot does not cause a rebuild of core container because unlike the addon containers we don't remove core's container automatically when we stop it. We only remove it if there's a recurring issue, an update or the user specifically asks us to with ha core rebuild or a safe mode restart.

This means you may not actually get real-world usage of this new code path for quite a while after we ship it unless core happens to need to ship a patch after. Maybe we want to consider a change to kind of force the issue there? Like something that detects the core container's config does not match expectations and sets a flag to make the next restart a rebuild automatically? This can be a separate PR if you want, its a related problem but not really the same task.

@agners
Copy link
Copy Markdown
Member Author

agners commented Apr 10, 2026

Just updating supervisor to the branch code isn't enough, you also have to rebuild the core container after the supervisor update. Otherwise the socket isn't mounted in core's container so it won't use it.

Yes, I am aware, and probably should have noted that in the PR description.

It does have the added side effect that the feature will be used slowly more and more, like a slow rollout. Which in case we see issues, we can fix them before all users are affected, or worst case even revert.

. Maybe we want to consider a change to kind of force the issue there? Like something that detects the core container's config does not match expectations and sets a flag to make the next restart a rebuild automatically?

So that would mean it will still take a while until the feature is rolled out. In my mind, we do point releases fairly often, so it will get adopted quite quickly. Not sure if it is really worth the additional complexity just for a slightly faster rollout.

@mdegat01
Copy link
Copy Markdown
Contributor

I was thinking restarts happen more frequently then updates but yea I guess that's fair. The main issue I see is just if you're hoping to collect info from beta before promoting to stable it may be tough to count on that. But I'm good with it either way, LGTM 👍

@agners agners merged commit 28fa0b3 into main Apr 10, 2026
21 checks passed
@agners agners deleted the use-unix-socket-for-core-api branch April 10, 2026 13:09
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 12, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants