ci: enforce egress allow-list on jobs that consume HF_TOKEN by adobrzyn · Pull Request #1474 · vllm-project/vllm-gaudi

adobrzyn · 2026-05-21T11:35:48Z

Summary

Adds step-security/harden-runner@v2.19.3 (SHA-pinned) as the first step of every CI job that consumes secrets.HF_TOKEN, configured with egress-policy: block and a curated allow-list of endpoints that the current build + test pipeline actually needs.

Allow-list (derived from code, not collected from runs)

Walked through .github/Dockerfile.ci, the three workflow YAMLs, and tests/full_tests/ci_e2e_discoverable_tests.sh:

| Purpose | Endpoints |
| --- | --- |
| GitHub Actions infra | `api.github.com`, `github.com`, `codeload.github.com`, `objects.githubusercontent.com`, `raw.githubusercontent.com`, `release-assets.githubusercontent.com`, `*.actions.githubusercontent.com`, `results-receiver.actions.githubusercontent.com`, `ghcr.io`, `pkg-containers.githubusercontent.com`, `*.blob.core.windows.net` |
| Docker base image (build phase) | `vault.habana.ai` |
| Python packages (build + test) | `pypi.org`, `files.pythonhosted.org`, `download.pytorch.org` |
| Model weights (test phase) | `huggingface.co`, `cdn-lfs.huggingface.co`, `cdn-lfs.hf.co`, `cdn-lfs-us-1.hf.co`, `cas-bridge.xethub.hf.co`, `xet-lfs-us-1.hf.co` |

If something legitimate gets blocked, the harden-runner check-run identifies the denied host and we add it in a follow-up PR.
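
When a denied host shows up in the check-run, it can help to confirm locally whether an existing pattern was expected to cover it. The sketch below is illustrative only: it assumes the wildcard entries behave like shell-style globs, which is an assumption about harden-runner's matching and not something this PR states, and the `is_allowed` helper is hypothetical.

```python
from fnmatch import fnmatchcase

# Host patterns copied from this PR's allow-list; every entry is on port 443.
ALLOWED_HOSTS = [
    "api.github.com", "github.com", "codeload.github.com",
    "objects.githubusercontent.com", "raw.githubusercontent.com",
    "release-assets.githubusercontent.com", "*.actions.githubusercontent.com",
    "results-receiver.actions.githubusercontent.com", "ghcr.io",
    "pkg-containers.githubusercontent.com", "*.blob.core.windows.net",
    "vault.habana.ai", "pypi.org", "files.pythonhosted.org",
    "download.pytorch.org", "huggingface.co", "cdn-lfs.huggingface.co",
    "cdn-lfs.hf.co", "cdn-lfs-us-1.hf.co", "cas-bridge.xethub.hf.co",
    "xet-lfs-us-1.hf.co",
]


def is_allowed(host: str) -> bool:
    """Return True if `host` matches any pattern (glob semantics are assumed)."""
    normalized = host.lower().rstrip(".")
    return any(fnmatchcase(normalized, pattern) for pattern in ALLOWED_HOSTS)


if __name__ == "__main__":
    for candidate in ("huggingface.co", "evil.example.com"):
        print(candidate, "allowed" if is_allowed(candidate) else "blocked")
```

Patterns are matched against the whole hostname, so a lookalike such as `huggingface.co.evil.example.com` is not covered by the `huggingface.co` entry.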

Why it covers the docker containers

Every test container is launched with --network=host, so the eBPF filter installed by harden-runner on the runner host sees and enforces on the container's outbound traffic — no per-container instrumentation needed.

Layered defense

Together with two prior changes, a planted payload in a PR cannot:

1. Run at all without maintainer approval → Add pre-merge-approval for execute_pre_merge #1471 (merged)
2. Receive HF_TOKEN without environment approval → ci: route HF_TOKEN-using jobs through approved-workflow environment #1473 (open)
3. Exfiltrate to an attacker-controlled host → this PR

Affected jobs (15 — same set as #1473)

| Workflow | Jobs |
| --- | --- |
| `pre-merge.yaml` | `hpu_unit_tests`, `hpu_pd_tests`, `hpu_perf_tests`, `hpu_dp_tests`, `e2e`, `calibration_tests` |
| `hourly-ci.yaml` | `run_unit_tests`, `e2e`, `run_data_parallel_test`, `run_pd_disaggregate_test` |
| `create-release-branch.yaml` | `run_unit_tests`, `e2e`, `run_data_parallel_test`, `run_pd_disaggregate_test`, `run_hpu_perf_tests` |
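
The invariant behind this list (every job that references `secrets.HF_TOKEN` starts with harden-runner) could be checked mechanically. The following is a minimal sketch, not part of this PR: `unhardened_jobs` and the `SAMPLE` workflow are hypothetical, and it assumes the workflow has already been parsed into a dict (for example with a YAML parser such as PyYAML's `yaml.safe_load`).

```python
import json

HARDEN_PREFIX = "step-security/harden-runner@"


def unhardened_jobs(workflow: dict) -> list[str]:
    """Names of jobs that mention secrets.HF_TOKEN but do not start with harden-runner."""
    missing = []
    for name, job in workflow.get("jobs", {}).items():
        # Serialize the job so the secret reference is found wherever it appears.
        if "secrets.HF_TOKEN" not in json.dumps(job):
            continue
        steps = job.get("steps") or []
        first_uses = steps[0].get("uses", "") if steps else ""
        if not first_uses.startswith(HARDEN_PREFIX):
            missing.append(name)
    return missing


# Hypothetical, heavily trimmed workflow used only to exercise the check.
SAMPLE = {
    "jobs": {
        "e2e": {
            "steps": [
                {"uses": "step-security/harden-runner@ab7a9404c0f3da075243ca237b5fac12c98deaa5"},
                {"run": "echo test", "env": {"HF_TOKEN": "${{ secrets.HF_TOKEN }}"}},
            ]
        },
        "hpu_unit_tests": {
            "steps": [
                {"uses": "actions/checkout@v4"},
                {"run": "echo test", "env": {"HF_TOKEN": "${{ secrets.HF_TOKEN }}"}},
            ]
        },
        "lint": {"steps": [{"run": "echo lint"}]},
    }
}

if __name__ == "__main__":
    print(unhardened_jobs(SAMPLE))
```

In the sample, `lint` is skipped because it never references the secret, while `hpu_unit_tests` is reported because its first step is a checkout and not harden-runner.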

Snippet inserted (identical in every job)

      - name: Harden runner (egress block)
        uses: step-security/harden-runner@ab7a9404c0f3da075243ca237b5fac12c98deaa5 # v2.19.3
        with:
          egress-policy: block
          disable-sudo: false
          allowed-endpoints: >
            api.github.com:443
            github.com:443
            codeload.github.com:443
            objects.githubusercontent.com:443
            raw.githubusercontent.com:443
            release-assets.githubusercontent.com:443
            *.actions.githubusercontent.com:443
            results-receiver.actions.githubusercontent.com:443
            ghcr.io:443
            pkg-containers.githubusercontent.com:443
            *.blob.core.windows.net:443
            vault.habana.ai:443
            pypi.org:443
            files.pythonhosted.org:443
            download.pytorch.org:443
            huggingface.co:443
            cdn-lfs.huggingface.co:443
            cdn-lfs.hf.co:443
            cdn-lfs-us-1.hf.co:443
            cas-bridge.xethub.hf.co:443
            xet-lfs-us-1.hf.co:443
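
Because the same 21-entry block is pasted into every job, a small lint over the `allowed-endpoints` text can catch typos and overlap before review. This is a sketch under stated assumptions: `parse_endpoints` and `redundant_entries` are hypothetical helpers, and wildcard entries are assumed to follow shell-style glob matching, which this PR does not confirm for harden-runner.

```python
from fnmatch import fnmatchcase

# The folded scalar from the snippet above, as whitespace-separated host:port tokens.
ALLOWED_ENDPOINTS = """
api.github.com:443 github.com:443 codeload.github.com:443
objects.githubusercontent.com:443 raw.githubusercontent.com:443
release-assets.githubusercontent.com:443 *.actions.githubusercontent.com:443
results-receiver.actions.githubusercontent.com:443 ghcr.io:443
pkg-containers.githubusercontent.com:443 *.blob.core.windows.net:443
vault.habana.ai:443 pypi.org:443 files.pythonhosted.org:443
download.pytorch.org:443 huggingface.co:443 cdn-lfs.huggingface.co:443
cdn-lfs.hf.co:443 cdn-lfs-us-1.hf.co:443 cas-bridge.xethub.hf.co:443
xet-lfs-us-1.hf.co:443
"""


def parse_endpoints(text: str) -> list[tuple[str, int]]:
    """Split the block into (host, port) pairs; a malformed port raises ValueError."""
    pairs = []
    for token in text.split():
        host, _, port = token.rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def redundant_entries(pairs: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Exact hosts already covered by a wildcard entry on the same port (assumed glob match)."""
    wildcards = [(h, p) for h, p in pairs if "*" in h]
    return [
        (h, p)
        for h, p in pairs
        if "*" not in h and any(p == wp and fnmatchcase(h, wh) for wh, wp in wildcards)
    ]


if __name__ == "__main__":
    endpoints = parse_endpoints(ALLOWED_ENDPOINTS)
    print(len(endpoints), "endpoints;", "covered by a wildcard:", redundant_entries(endpoints))
```

Under the glob assumption, the only exact entry also covered by a wildcard is `results-receiver.actions.githubusercontent.com:443`, which falls under `*.actions.githubusercontent.com:443`. Listing it explicitly is harmless and may be deliberate.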

Self-hosted runner notes

- harden-runner installs a small monitoring agent on the runner host. The agent requires sudo, which is already available on the pr-ci / hourly-ci pools.
- disable-sudo: false is kept because some CI steps need docker via group/sudo.
- The --privileged flag on test containers means a sufficiently sophisticated payload could try to tamper with the host firewall from inside the container. This is a residual risk; closing it would require moving the harden-runner step inside the container or dropping --privileged. Out of scope for this PR.

cc reviewers of #1473

Adds 'step-security/harden-runner@v2.19.3' (SHA-pinned) as the first step of every job that consumes secrets.HF_TOKEN, in audit mode.

Audit mode does not block any traffic; it observes and records every outbound network connection made by the job (host + containers running under --network=host, which is what all these jobs use). A per-job 'Network Insights' report is published as a check-run annotation, making it possible to build an evidence-based egress allow-list before flipping the policy to 'block' in a follow-up PR.

This is defense-in-depth, layered on top of:

- pre-merge-trigger approval gate (vllm-project#1471)
- approved-workflow environment for HF_TOKEN (vllm-project#1473)

Even with both of those in place, a planted payload that activates inside a trusted job today has unrestricted egress; this PR closes the detection gap and prepares the data needed to close the enforcement gap.

Affected jobs (15 - same set as vllm-project#1473):

- pre-merge.yaml: hpu_unit_tests, hpu_pd_tests, hpu_perf_tests, hpu_dp_tests, e2e, calibration_tests
- hourly-ci.yaml: run_unit_tests, e2e, run_data_parallel_test, run_pd_disaggregate_test
- create-release-branch: run_unit_tests, e2e, run_data_parallel_test, run_pd_disaggregate_test, run_hpu_perf_tests

Self-hosted runner notes:

- harden-runner installs a small monitoring agent on the runner host. Requires sudo (already available on pr-ci / hourly-ci pools).
- With --network=host containers (current setup), the host-level eBPF filter sees all container traffic.
- --privileged containers can theoretically tamper with host filters from inside the container; this remains a residual risk for audit-only mode but is reduced when we move to 'block' mode in a follow-up PR (kernel-level enforcement, harder to bypass).

Follow-up PRs planned:

1. After ~1 week of audit data: assemble allow-list per job
2. Flip egress-policy from 'audit' to 'block' with allow-list

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>

Copilot AI review requested due to automatic review settings May 21, 2026 11:35

adobrzyn requested review from PatrykWo, afierka-intel, iboiko-habana, jbyczkow, kamil-kaczor, ksmusz, mgawarkiewicz-intel, michalkuligowski and xuechendi as code owners May 21, 2026 11:35

adobrzyn requested a deployment to pre-merge-approval May 21, 2026 11:35 — with GitHub Actions Waiting

Copilot started reviewing on behalf of adobrzyn May 21, 2026 11:36

adobrzyn closed this May 21, 2026

adobrzyn changed the title ~~ci: add step-security/harden-runner in audit mode to secret-using jobs~~ ci: enforce egress allow-list on jobs that consume HF_TOKEN May 21, 2026

adobrzyn review requested due to automatic review settings May 21, 2026 11:57

adobrzyn mentioned this pull request May 21, 2026

ci: enforce egress allow-list on jobs that consume HF_TOKEN #1475

Closed

adobrzyn wants to merge 1 commit into vllm-project:main from adobrzyn:feat/harden-runner-audit
