ci: enforce egress allow-list on jobs that consume HF_TOKEN #1474

Closed
adobrzyn wants to merge 1 commit into vllm-project:main from adobrzyn:feat/harden-runner-audit

adobrzyn (Collaborator) commented May 21, 2026

Summary

Adds step-security/harden-runner@v2.19.3 (SHA-pinned) as the first step of every CI job that consumes secrets.HF_TOKEN, configured with egress-policy: block and a curated allow-list of endpoints that the current build + test pipeline actually needs.

Allow-list (derived from code, not collected from runs)

Walked through .github/Dockerfile.ci, the three workflow YAMLs, and tests/full_tests/ci_e2e_discoverable_tests.sh:

| Purpose | Endpoints |
| --- | --- |
| GitHub Actions infra | api.github.com, github.com, codeload.github.com, objects.githubusercontent.com, raw.githubusercontent.com, release-assets.githubusercontent.com, *.actions.githubusercontent.com, results-receiver.actions.githubusercontent.com, ghcr.io, pkg-containers.githubusercontent.com, *.blob.core.windows.net |
| Docker base image (build phase) | vault.habana.ai |
| Python packages (build + test) | pypi.org, files.pythonhosted.org, download.pytorch.org |
| Model weights (test phase) | huggingface.co, cdn-lfs.huggingface.co, cdn-lfs.hf.co, cdn-lfs-us-1.hf.co, cas-bridge.xethub.hf.co, xet-lfs-us-1.hf.co |

If something legitimate gets blocked, the harden-runner check-run identifies the denied host and we add it in a follow-up PR.
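
When the check-run reports a denied host, it can help to confirm locally whether the host is really missing from the allow-list or is already covered by a wildcard entry. The sketch below is one way to do that and is not part of this PR. The name `is_allowed` is hypothetical, and the wildcard rule it applies (a leading `*.` matches one or more subdomain labels but not the bare parent domain) is an assumption that should be verified against harden-runner's actual behavior.

```python
# Hypothetical helper, not part of this PR: checks a host and port against
# the allow-list copied from the harden-runner snippet in this description.

ALLOWED_ENDPOINTS = """
api.github.com:443
github.com:443
codeload.github.com:443
objects.githubusercontent.com:443
raw.githubusercontent.com:443
release-assets.githubusercontent.com:443
*.actions.githubusercontent.com:443
results-receiver.actions.githubusercontent.com:443
ghcr.io:443
pkg-containers.githubusercontent.com:443
*.blob.core.windows.net:443
vault.habana.ai:443
pypi.org:443
files.pythonhosted.org:443
download.pytorch.org:443
huggingface.co:443
cdn-lfs.huggingface.co:443
cdn-lfs.hf.co:443
cdn-lfs-us-1.hf.co:443
cas-bridge.xethub.hf.co:443
xet-lfs-us-1.hf.co:443
""".split()


def is_allowed(host: str, port: int = 443, allowed=ALLOWED_ENDPOINTS) -> bool:
    """Return True if host:port matches an entry in the allow-list.

    Assumption: a leading "*." matches one or more subdomain labels but
    not the bare parent domain. This mirrors common wildcard semantics and
    has not been checked against harden-runner itself.
    """
    host = host.lower().rstrip(".")
    for entry in allowed:
        pattern, _, entry_port = entry.rpartition(":")
        if int(entry_port) != port:
            continue
        pattern = pattern.lower()
        if pattern.startswith("*."):
            suffix = pattern[1:]  # e.g. ".blob.core.windows.net"
            if host.endswith(suffix) and len(host) > len(suffix):
                return True
        elif host == pattern:
            return True
    return False


if __name__ == "__main__":
    for candidate in ("cdn-lfs.hf.co", "attacker.example.com"):
        print(candidate, is_allowed(candidate))
```

A host for which this returns `False` would be a candidate for the follow-up PR, provided the request that triggered it is one the pipeline legitimately needs.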

Why it covers the docker containers

Every test container is launched with --network=host, so the eBPF filter installed by harden-runner on the runner host sees and enforces on the container's outbound traffic — no per-container instrumentation needed.

Layered defense

With this PR and two related changes in place, a planted payload in a PR cannot:

  1. Run at all without maintainer approval → Add pre-merge-approval for execute_pre_merge #1471 (merged)
  2. Receive HF_TOKEN without environment approval → ci: route HF_TOKEN-using jobs through approved-workflow environment #1473 (open)
  3. Exfiltrate to an attacker-controlled host → this PR

Affected jobs (15 — same set as #1473)

| Workflow | Jobs |
| --- | --- |
| pre-merge.yaml | hpu_unit_tests, hpu_pd_tests, hpu_perf_tests, hpu_dp_tests, e2e, calibration_tests |
| hourly-ci.yaml | run_unit_tests, e2e, run_data_parallel_test, run_pd_disaggregate_test |
| create-release-branch.yaml | run_unit_tests, e2e, run_data_parallel_test, run_pd_disaggregate_test, run_hpu_perf_tests |
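
The requirement that harden-runner is the first step of every job that consumes secrets.HF_TOKEN could be checked mechanically so that a newly added job does not slip through. The following is a minimal sketch of such a check, not something this PR adds. It assumes the workflow has already been parsed into a plain dictionary (for example by a YAML loader), and both `jobs_missing_harden_runner` and `EXAMPLE_WORKFLOW` are hypothetical names and data made up for illustration.

```python
import json


def jobs_missing_harden_runner(workflow: dict) -> list[str]:
    """List jobs that reference secrets.HF_TOKEN but do not start with
    the step-security/harden-runner action as their first step."""
    missing = []
    for name, job in workflow.get("jobs", {}).items():
        steps = job.get("steps", [])
        # Serialize the whole job so the token reference is found whether it
        # appears in env, with, or run.
        uses_token = "secrets.HF_TOKEN" in json.dumps(job)
        first_uses = steps[0].get("uses", "") if steps else ""
        hardened = first_uses.startswith("step-security/harden-runner@")
        if uses_token and not hardened:
            missing.append(name)
    return missing


# Made-up workflow for illustration only; the real workflows are larger.
EXAMPLE_WORKFLOW = {
    "jobs": {
        "e2e": {
            "steps": [
                {"uses": "step-security/harden-runner@ab7a9404c0f3da075243ca237b5fac12c98deaa5"},
                {"run": "pytest", "env": {"HF_TOKEN": "${{ secrets.HF_TOKEN }}"}},
            ]
        },
        "hpu_unit_tests": {
            "steps": [
                {"run": "pytest", "env": {"HF_TOKEN": "${{ secrets.HF_TOKEN }}"}},
            ]
        },
        "lint": {"steps": [{"run": "ruff check ."}]},
    }
}

if __name__ == "__main__":
    print(jobs_missing_harden_runner(EXAMPLE_WORKFLOW))
```

In this made-up example only `hpu_unit_tests` is reported: `e2e` starts with the harden-runner step, and `lint` never references the token.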

Snippet inserted (identical in every job)

      - name: Harden runner (egress block)
        uses: step-security/harden-runner@ab7a9404c0f3da075243ca237b5fac12c98deaa5 # v2.19.3
        with:
          egress-policy: block
          disable-sudo: false
          allowed-endpoints: >
            api.github.com:443
            github.com:443
            codeload.github.com:443
            objects.githubusercontent.com:443
            raw.githubusercontent.com:443
            release-assets.githubusercontent.com:443
            *.actions.githubusercontent.com:443
            results-receiver.actions.githubusercontent.com:443
            ghcr.io:443
            pkg-containers.githubusercontent.com:443
            *.blob.core.windows.net:443
            vault.habana.ai:443
            pypi.org:443
            files.pythonhosted.org:443
            download.pytorch.org:443
            huggingface.co:443
            cdn-lfs.huggingface.co:443
            cdn-lfs.hf.co:443
            cdn-lfs-us-1.hf.co:443
            cas-bridge.xethub.hf.co:443
            xet-lfs-us-1.hf.co:443

Self-hosted runner notes

  • harden-runner installs a small monitoring agent on the runner host, which requires sudo (already available on the pr-ci / hourly-ci pools).
  • disable-sudo: false is kept because some CI steps need docker via group/sudo.
  • The --privileged flag on test containers means a sufficiently sophisticated payload could try to tamper with the host firewall from inside the container. This is a residual risk; closing it would require moving the harden-runner step inside the container or dropping --privileged. Out of scope for this PR.

cc reviewers of #1473

Adds 'step-security/harden-runner@v2.19.3' (SHA-pinned) as the first
step of every job that consumes secrets.HF_TOKEN, in audit mode.

Audit mode does not block any traffic; it observes and records every
outbound network connection made by the job (host + containers running
under --network=host, which is what all these jobs use). A per-job
'Network Insights' report is published as a check-run annotation,
making it possible to build an evidence-based egress allow-list before
flipping the policy to 'block' in a follow-up PR.

This is defense-in-depth, layered on top of:
  - pre-merge-trigger approval gate (vllm-project#1471)
  - approved-workflow environment for HF_TOKEN (vllm-project#1473)

Even with both of those in place, a planted payload that activates
inside a trusted job today has unrestricted egress; this PR closes
the detection gap and prepares the data needed to close the
enforcement gap.

Affected jobs (15 - same set as vllm-project#1473):
  pre-merge.yaml:           hpu_unit_tests, hpu_pd_tests, hpu_perf_tests,
                            hpu_dp_tests, e2e, calibration_tests
  hourly-ci.yaml:           run_unit_tests, e2e, run_data_parallel_test,
                            run_pd_disaggregate_test
  create-release-branch:    run_unit_tests, e2e, run_data_parallel_test,
                            run_pd_disaggregate_test, run_hpu_perf_tests

Self-hosted runner notes:
  - harden-runner installs a small monitoring agent on the runner host.
    Requires sudo (already available on pr-ci / hourly-ci pools).
  - With --network=host containers (current setup), the host-level
    eBPF filter sees all container traffic.
  - --privileged containers can theoretically tamper with host filters
    from inside the container; this remains a residual risk for
    audit-only mode but is reduced when we move to 'block' mode in a
    follow-up PR (kernel-level enforcement, harder to bypass).

Follow-up PRs planned:
  1. After ~1 week of audit data: assemble allow-list per job
  2. Flip egress-policy from 'audit' to 'block' with allow-list

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Copilot AI review requested due to automatic review settings May 21, 2026 11:35
@adobrzyn adobrzyn requested a deployment to pre-merge-approval May 21, 2026 11:35 — with GitHub Actions Waiting
@adobrzyn adobrzyn closed this May 21, 2026
@adobrzyn adobrzyn changed the title from "ci: add step-security/harden-runner in audit mode to secret-using jobs" to "ci: enforce egress allow-list on jobs that consume HF_TOKEN" May 21, 2026
@adobrzyn adobrzyn review requested due to automatic review settings May 21, 2026 11:57