ci: enforce egress allow-list on jobs that consume HF_TOKEN #1474
Closed
adobrzyn wants to merge 1 commit into
Adds 'step-security/harden-runner@v2.19.3' (SHA-pinned) as the first step of every job that consumes secrets.HF_TOKEN, in audit mode.

Audit mode does not block any traffic; it observes and records every outbound network connection made by the job (host + containers running under --network=host, which is what all these jobs use). A per-job 'Network Insights' report is published as a check-run annotation, making it possible to build an evidence-based egress allow-list before flipping the policy to 'block' in a follow-up PR.

This is defense-in-depth, layered on top of:
- pre-merge-trigger approval gate (vllm-project#1471)
- approved-workflow environment for HF_TOKEN (vllm-project#1473)

Even with both of those in place, a planted payload that activates inside a trusted job today has unrestricted egress; this PR closes the detection gap and prepares the data needed to close the enforcement gap.

Affected jobs (15 - same set as vllm-project#1473):
pre-merge.yaml: hpu_unit_tests, hpu_pd_tests, hpu_perf_tests, hpu_dp_tests, e2e, calibration_tests
hourly-ci.yaml: run_unit_tests, e2e, run_data_parallel_test, run_pd_disaggregate_test
create-release-branch: run_unit_tests, e2e, run_data_parallel_test, run_pd_disaggregate_test, run_hpu_perf_tests

Self-hosted runner notes:
- harden-runner installs a small monitoring agent on the runner host. Requires sudo (already available on pr-ci / hourly-ci pools).
- With --network=host containers (current setup), the host-level eBPF filter sees all container traffic.
- --privileged containers can theoretically tamper with host filters from inside the container; this remains a residual risk for audit-only mode but is reduced when we move to 'block' mode in a follow-up PR (kernel-level enforcement, harder to bypass).

Follow-up PRs planned:
1. After ~1 week of audit data: assemble allow-list per job
2. Flip egress-policy from 'audit' to 'block' with allow-list

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
Summary
Adds `step-security/harden-runner@v2.19.3` (SHA-pinned) as the first step of every CI job that consumes `secrets.HF_TOKEN`, configured with `egress-policy: block` and a curated allow-list of endpoints that the current build + test pipeline actually needs.

Allow-list (derived from code, not collected from runs)
Walked through `.github/Dockerfile.ci`, the three workflow YAMLs, and `tests/full_tests/ci_e2e_discoverable_tests.sh`:
- `api.github.com`, `github.com`, `codeload.github.com`, `objects.githubusercontent.com`, `raw.githubusercontent.com`, `release-assets.githubusercontent.com`, `*.actions.githubusercontent.com`, `results-receiver.actions.githubusercontent.com`, `ghcr.io`, `pkg-containers.githubusercontent.com`, `*.blob.core.windows.net`
- `vault.habana.ai`
- `pypi.org`, `files.pythonhosted.org`, `download.pytorch.org`
- `huggingface.co`, `cdn-lfs.huggingface.co`, `cdn-lfs.hf.co`, `cdn-lfs-us-1.hf.co`, `cas-bridge.xethub.hf.co`, `xet-lfs-us-1.hf.co`

If something legitimate gets blocked, the harden-runner check-run identifies the denied host and we add it in a follow-up PR.
Why it covers the docker containers
Every test container is launched with `--network=host`, so the eBPF filter installed by harden-runner on the runner host sees and enforces policy on the container's outbound traffic; no per-container instrumentation is needed.

Layered defense
Together with two prior changes, a planted payload in a PR cannot:
- `HF_TOKEN` without environment approval → ci: route HF_TOKEN-using jobs through approved-workflow environment #1473 (open)

Affected jobs (15 - same set as #1473)
- `pre-merge.yaml`: `hpu_unit_tests`, `hpu_pd_tests`, `hpu_perf_tests`, `hpu_dp_tests`, `e2e`, `calibration_tests`
- `hourly-ci.yaml`: `run_unit_tests`, `e2e`, `run_data_parallel_test`, `run_pd_disaggregate_test`
- `create-release-branch.yaml`: `run_unit_tests`, `e2e`, `run_data_parallel_test`, `run_pd_disaggregate_test`, `run_hpu_perf_tests`

Snippet inserted (identical in every job)
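The snippet body is not shown in this text. As a rough sketch only, a step of this kind might look like the fragment below, assuming harden-runner's `egress-policy`, `disable-sudo`, and `allowed-endpoints` inputs. The step name, the `<full-commit-sha>` placeholder, and the `:443` ports are assumptions for illustration; the PR lists hosts only, and the actual pinned SHA is not given here.

```yaml
steps:
  # Must be the first step so the policy is active before any other step runs.
  - name: Harden runner (egress allow-list)   # hypothetical step name
    # Placeholder: replace with the full commit SHA of the v2.19.3 tag.
    uses: step-security/harden-runner@<full-commit-sha> # v2.19.3
    with:
      egress-policy: block
      # Kept false because some CI steps need docker via group/sudo.
      disable-sudo: false
      # Port 443 is assumed; only a subset of the allow-list is shown.
      allowed-endpoints: >
        api.github.com:443
        github.com:443
        ghcr.io:443
        vault.habana.ai:443
        pypi.org:443
        files.pythonhosted.org:443
        huggingface.co:443
        cdn-lfs.huggingface.co:443
  # ... the job's existing steps follow unchanged
```

In a real workflow the remaining hosts from the allow-list would be appended to `allowed-endpoints` in the same `host:port` form.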
Self-hosted runner notes
- `harden-runner` installs a small monitoring agent on the runner host. Requires `sudo` (already available on `pr-ci` / `hourly-ci` pools).
- `disable-sudo: false` is kept because some CI steps need `docker` via group/sudo.
- The `--privileged` flag on test containers means a sufficiently sophisticated payload could try to tamper with the host firewall from inside the container. This is a residual risk; closing it would require moving the harden-runner step inside the container or dropping `--privileged`. Out of scope for this PR.

cc reviewers of #1473