feat: port user-overlay features + exec-arg-wildcards onto unified ContainerProfileCache #37
Closed
entlein wants to merge 17 commits into merge/upstream-cp-cache from
added 17 commits
April 30, 2026 11:13
Upstream PR kubescape#788 (Replace AP and NN cache with CP) collapsed the two legacy workload-keyed caches into a single ContainerProfileCache that reads ONE pod label `kubescape.io/user-defined-profile=<name>` and uses <name> as the lookup key for BOTH the user ApplicationProfile and the user NetworkNeighborhood. The fork's earlier two-label scheme (`user-defined-profile` for AP + separate `user-defined-network` for NN, with potentially different names) is no longer honored: the second label is silently ignored.
Port:
- tests/resources/nginx-user-defined-deployment.yaml: drop the `user-defined-network` label, point the surviving label at one shared name `curl-28-overlay`.
- tests/component_test.go Test_28_UserDefinedNetworkNeighborhood: create both AP and NN under that single shared name. Assertions unchanged: the test still verifies that the user NN's egress restriction (only fusioncore.ai allowed on TCP/80) is enforced once the pod is running.
Verified locally: go vet -tags=component ./tests/... clean; go test -tags=component -run='^$' ./tests/... compiles cleanly.
PR #35's wildcard-aware exec arg matching needs reapplication on top of the new ContainerProfileCache (upstream kubescape#788) baseline. The original PR sat on the legacy applicationprofilecache, which has been deleted; the call site now reads cp.Spec.Execs from a ContainerProfile.
Same semantic change as PR #35:
'⋯' (DynamicIdentifier): matches exactly one argument position.
'*' (WildcardIdentifier): matches zero or more consecutive args.
Wiring:
- pkg/rulemanager/cel/libraries/applicationprofile/exec.go: drop slices.Compare exact-equality on the cp.Spec.Execs loop; route through dynamicpathdetector.CompareExecArgs.
- go.mod: bump fork storage replace to feat/exec-arg-wildcards tip (3fc287210729) which carries the matcher.
- exec_test.go: re-add TestExecWithArgsWildcardInProfile (13 subtests across curl --user ⋯, sh -c *, ls -l ⋯, echo hello *, plus negative literal-anchor / under-consumed-⋯ / mid-profile-* cases). Mirrors the test set that lived on PR #35 before the upstream merge.
Verified: full applicationprofile package green (`go test ./pkg/rulemanager/cel/libraries/applicationprofile/`).
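To make the two token semantics concrete, here is a minimal sketch of an argument matcher. It is not the code behind `dynamicpathdetector.CompareExecArgs`; the function name `matchArgs` and the recursive strategy are assumptions chosen for readability, and only the token meanings come from the commit message.

```go
package main

import "fmt"

// Token spellings come from the commit message; everything else in this file
// is a hypothetical stand-in, not the storage matcher itself.
const (
	dynamicIdentifier  = "⋯" // exactly one argument position
	wildcardIdentifier = "*" // zero or more consecutive arguments
)

// matchArgs reports whether the runtime argument vector satisfies the
// profile pattern.
func matchArgs(pattern, args []string) bool {
	if len(pattern) == 0 {
		return len(args) == 0
	}
	switch pattern[0] {
	case wildcardIdentifier:
		// Try every number of arguments the wildcard could consume.
		for n := 0; n <= len(args); n++ {
			if matchArgs(pattern[1:], args[n:]) {
				return true
			}
		}
		return false
	case dynamicIdentifier:
		// Must consume one argument, never zero.
		return len(args) > 0 && matchArgs(pattern[1:], args[1:])
	default:
		// Literal anchor: exact string equality at this position.
		return len(args) > 0 && pattern[0] == args[0] && matchArgs(pattern[1:], args[1:])
	}
}

func main() {
	fmt.Println(matchArgs([]string{"sh", "-c", "*"}, []string{"sh", "-c", "echo hi"}))  // true
	fmt.Println(matchArgs([]string{"sh", "-c", "*"}, []string{"sh", "-x", "echo hi"}))  // false
	fmt.Println(matchArgs([]string{"curl", "--user", "⋯"}, []string{"curl", "--user"})) // false
}
```

The third call is the "under-consumed ⋯" case from the test list: the pattern still expects one argument when the runtime vector has run out.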
R0040 is an additive companion to R0001. It evaluates:
ap.was_executed(...) && !ap.was_executed_with_args(..., event.args)
so it ONLY fires when the exec'd path IS in the user-defined profile
(R0001 stays silent) but the runtime arg vector does not match any
profile entry's pattern. With wildcard tokens supported by
dynamicpathdetector.CompareExecArgs:
'⋯' (DynamicIdentifier) — exactly one argument position.
'*' (WildcardIdentifier) — zero or more consecutive args.
Use case: profile entry {Path: /bin/sh, Args: [sh, -c, *]} flags
'sh -x ...' as drift while permitting 'sh -c <anything>'.
Wiring:
- tests/chart/templates/node-agent/default-rules.yaml: new R0040
CEL rule definition immediately after R0001, same MITRE tagging
(TA0002/T1059) and same applicationprofile-anomaly tag set.
- tests/chart/templates/node-agent/default-rule-binding.yaml:
R0040 added to the all-rules-all-pods binding next to R0001.
- tests/resources/curl-exec-arg-wildcards-deployment.yaml: new
fixture, curl pod labelled with the unified
kubescape.io/user-defined-profile=curl-32-overlay label.
- tests/component_test.go: Test_32_UnexpectedProcessArguments with
4 subtests:
32a sh_dash_c_matches_wildcard_trailing — sh -c <cmd> matches
profile [sh, -c, *] — R0040 silent.
32b sh_dash_x_mismatches_R0040 — sh -x <cmd> mismatches the
literal -c anchor — R0040 fires.
32c echo_hello_matches_wildcard_trailing — echo hello world
matches [echo, hello, *] — R0040 silent.
32d echo_goodbye_mismatches_R0040 — echo goodbye world
mismatches the literal hello anchor — R0040 fires.
Verified locally: go vet -tags=component ./tests/... clean;
go test -tags=component -run='^$' ./tests/... compiles cleanly.
End-to-end alert assertions run in CI.
After rebasing storage feat/exec-arg-wildcards onto storage main, the matcher branch now sits on top of the latest fork main commit (352395a3 — Internal-field merge fix). Bump the node-agent storage replace to that new pseudo-version so this branch's tests run against storage main + matcher in one consistent baseline. Verified locally: 47/47 non-eBPF unit packages green; vet clean; the applicationprofile CEL package's TestExecWithArgsWildcardInProfile is 13/13 green; component-tests compile under the component tag. The two failing packages (pkg/containerwatcher/v2/tracers and pkg/validator) fail with the same pre-existing /sys/fs/bpf mount-permission error they have on every recent run — env, not code.
The component-tests workflow uses a hardcoded matrix list, not a dynamic discovery from the test source. Test_32 (added in a613cf6) must be listed explicitly to be picked up — without this entry the test is silently skipped.
Upstream PR kubescape#788 (Replace AP and NN cache with CP) deleted the legacy applicationprofilecache where the fork's emitTamperAlert (commit c2d681e 'Feat/tamperalert' #22) lived. After the merge, R1016 alerts no longer fired for tampered user-defined profiles, breaking Test_31_TamperDetectionAlert (passed 3/3 on main, fails on the merged branch; confirmed regression introduced by PR #36).
This restores the contract: every cache load of a user-supplied ApplicationProfile or NetworkNeighborhood overlay re-verifies the signature, and emits an R1016 'Signed profile tampered' alert through the rule-alert exporter when the signature is present but no longer valid. Alert shape preserved from the legacy cache so dashboards and component tests keep matching.
Implementation:
- new file pkg/objectcache/containerprofilecache/tamper_alert.go: verifyUserApplicationProfile / verifyUserNetworkNeighborhood / emitTamperAlert / extractWlidFromContainerID. Self-contained; keeps containerprofilecache.go diff small.
- containerprofilecache.go: new tamperAlertExporter field + SetTamperAlertExporter setter + verify hooks immediately after GetApplicationProfile / GetNetworkNeighborhood succeed in the user-overlay branch of addContainer.
- cmd/main.go: wire the alert exporter via SetTamperAlertExporter after the cache constructor (kept the constructor signature unchanged to avoid blast radius on tests).
The setter is nil-safe: when no exporter is wired, verification still runs and is logged but no alert is emitted, which matches the legacy behavior for unit-tests-with-no-exporter.
Test_31 expanded from one scenario to four subtests, each in its own namespace to avoid alert cross-contamination:
31a tampered_user_defined_AP_fires_R1016: original regression case
31b untampered_signed_AP_no_R1016: negative, clean signature
31c unsigned_AP_no_R1016: signing is opt-in
31d tampered_user_defined_NN_fires_R1016: parallel NN code path
Verified locally:
- go build ./... clean
- go test ./pkg/objectcache/containerprofilecache/... green
- go test ./pkg/signature/... green
- go vet -tags=component ./tests/... clean
- go test -tags=component -run='^$' ./tests/... compiles
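The verify-on-load contract and the nil-safe exporter can be sketched roughly as follows. Only `SetTamperAlertExporter` is named in the commit; the `alertExporter` interface, the `profile` struct, and `verifyOnLoad` are hypothetical simplifications, and the real signature check is replaced by a precomputed error field.

```go
package main

import (
	"errors"
	"fmt"
)

// errTampered stands in for whatever error a real signature verifier returns.
var errTampered = errors.New("content hash does not match signature")

// alertExporter is a hypothetical, minimal view of the rule-alert exporter.
type alertExporter interface {
	SendTamperAlert(profileName string, cause error)
}

// profile is a simplified overlay: Signed says whether a signature is present,
// VerifyErr carries the outcome a real verifier would compute.
type profile struct {
	Name      string
	Signed    bool
	VerifyErr error
}

type cache struct {
	exporter alertExporter // nil until SetTamperAlertExporter is called
}

func (c *cache) SetTamperAlertExporter(e alertExporter) { c.exporter = e }

// verifyOnLoad runs on every cache load and reports whether an alert was emitted.
func (c *cache) verifyOnLoad(p profile) bool {
	if !p.Signed || p.VerifyErr == nil {
		return false // unsigned (opt-in) or signature still valid
	}
	fmt.Printf("signature verification failed for %s: %v\n", p.Name, p.VerifyErr)
	if c.exporter == nil {
		return false // nil-safe: logged above, but nothing to emit through
	}
	c.exporter.SendTamperAlert(p.Name, p.VerifyErr)
	return true
}

// recordingExporter collects alerts so the behaviour can be inspected.
type recordingExporter struct{ alerts []string }

func (r *recordingExporter) SendTamperAlert(name string, _ error) {
	r.alerts = append(r.alerts, name)
}

func main() {
	c := &cache{}
	tampered := profile{Name: "overlay", Signed: true, VerifyErr: errTampered}
	fmt.Println("alert without exporter:", c.verifyOnLoad(tampered))

	rec := &recordingExporter{}
	c.SetTamperAlertExporter(rec)
	fmt.Println("alert with exporter:", c.verifyOnLoad(tampered), rec.alerts)
}
```

Keeping the exporter behind a setter rather than a constructor argument is the same trade-off the commit describes: existing constructor call sites in tests stay untouched.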
Empirical finding from CI run 25178930763 — Test_32's positive
subtests (32a sh_dash_c_matches, 32c echo_hello_matches) fired
R0040 when they should not. Cause: at runtime, the eBPF tracer
captures argv[0] as the FULL exec path (e.g. "/bin/sh") rather
than the basename ("sh"). My profile entries used basenames, so
the matcher's first-position literal compare missed and the cache
fell through to 'no exec entry matches' — R0040 fires.
Aligns Test_32's profile with the convention already used by
Test_27's wildcard_yaml_profile_allowed_opens fixture
(known-application-profile-wildcards.yaml predecessor): argv[0]
is the full path, subsequent positions are flags/values.
Subtest expectations after this fix:
32a sh -c <cmd> → matches [/bin/sh, -c, *] → R0040 silent
32b sh -x <cmd> → -c anchor mismatch → R0040 fires
32c echo hello <…> → matches [/bin/echo, hello, *]→ R0040 silent
32d echo goodbye <…> → hello anchor mismatch → R0040 fires
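The R0040 condition and the argv[0] trap can be sketched together. This is an illustration only: `execEntry`, `argsMatch`, and `r0040Fires` are hypothetical names, and `argsMatch` handles just literals plus a trailing `*`, which is enough for the Test_32 profiles but far less than the real matcher.

```go
package main

import "fmt"

// execEntry is a simplified profile entry (hypothetical type).
type execEntry struct {
	Path string
	Args []string
}

// argsMatch supports literal positions and a trailing "*" only.
func argsMatch(pattern, args []string) bool {
	for i, p := range pattern {
		if p == "*" && i == len(pattern)-1 {
			return true // trailing wildcard absorbs whatever is left, including nothing
		}
		if i >= len(args) || args[i] != p {
			return false
		}
	}
	return len(args) == len(pattern)
}

// r0040Fires mirrors the rule shape
// ap.was_executed(...) && !ap.was_executed_with_args(..., event.args).
func r0040Fires(profile []execEntry, path string, args []string) bool {
	wasExecuted := false
	for _, e := range profile {
		if e.Path != path {
			continue
		}
		wasExecuted = true
		if argsMatch(e.Args, args) {
			return false // some entry accepts this argument vector
		}
	}
	return wasExecuted // path known, but no entry accepted the args
}

func main() {
	fixed := []execEntry{{Path: "/bin/sh", Args: []string{"/bin/sh", "-c", "*"}}}
	basename := []execEntry{{Path: "/bin/sh", Args: []string{"sh", "-c", "*"}}}

	// The tracer reports argv[0] as the full path.
	runtimeArgs := []string{"/bin/sh", "-c", "echo hi"}

	fmt.Println("full-path profile fires:", r0040Fires(fixed, "/bin/sh", runtimeArgs))
	fmt.Println("basename profile fires: ", r0040Fires(basename, "/bin/sh", runtimeArgs))
	fmt.Println("sh -x fires:            ", r0040Fires(fixed, "/bin/sh", []string{"/bin/sh", "-x", "echo hi"}))
}
```

With the basename profile, the first-position literal compare fails before the `-c` anchor is even reached, which is the false positive the CI run exposed.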
Catches the class of bug Test_32 hit on its first CI run (PR #37 run 25178930763): profile entries used basename argv[0] ("sh") while the eBPF tracer captures the full path ("/bin/sh"), so the matcher silently misses and the rule fires when it shouldn't. Without a linter, this kind of fixture drift only surfaces in a 15-minute component-test run on a kind cluster: too late, too expensive.
The linter (LintApplicationProfile / LintApplicationProfileYAML in tests/resources/aplint_test.go) is intentionally written as a pure function returning []Violation. Zero testing-package coupling on the hot path so it can be lifted into a future bobctl subcommand `bobctl lint <ap.yaml>` without rewrite; see backlog at ~/biz/sbob-business-plan/state.yaml.
Rules:
R-AP-01: kind must be ApplicationProfile
R-AP-02: at least one container
R-AP-03: container.name non-empty
R-AP-10: exec.path absolute (catches relative paths)
R-AP-11: exec.path no wildcards (binary identity is exact)
R-AP-12: exec.args[0] equals exec.path or wildcard token (Test_32-style argv[0] basename trap)
R-AP-13: exec.args wildcard tokens are whole-word (no embedding)
R-AP-20: open.path non-empty + absolute
R-AP-21: open.flags non-empty (real auto-recorded opens always have ≥1)
R-AP-22: open.flags from known O_* set (catches typos)
Each rule has a dedicated self-test that constructs a minimal-bad YAML and asserts the rule fires (5 negative tests). One positive test (TestLinter_canonical_AP_passes) parses the fork's reference known-application-profile.yaml, extracted from a real auto-recorded AP for curlimages/curl:8.5.0 in fea3b06, and asserts zero violations. The reference YAML is restored to tests/resources/ so the canonical shape is in-tree and visible to humans + CI.
Why a Go test rather than a shell linter: keeps the rule set in the same language as the storage matcher (`dynamicpathdetector`), so extending CompareExecArgs and the linter together stays cheap.
Local-cluster organic learning was the original plan but k3s on OrbStack is currently flapping (LXC-related boot loop). The fea3b06 profile was extracted from real auto-learning at an earlier moment of stability, which is the next-best ground truth.
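As an illustration of the pure-function shape, here is a sketch covering only the three exec rules R-AP-10, R-AP-11 and R-AP-12. The `Violation` and `Exec` field names and the `lintExec` function are assumptions; the real linter works on parsed ApplicationProfile YAML and its types may look different.

```go
package main

import (
	"fmt"
	"strings"
)

// Violation is a hypothetical result type: one entry per broken rule.
type Violation struct {
	Rule    string
	Message string
}

// Exec is a simplified profile exec entry (hypothetical type).
type Exec struct {
	Path string
	Args []string
}

func isWildcardToken(s string) bool { return s == "*" || s == "⋯" }

// lintExec is a pure function: no testing package, no I/O, just data in and
// violations out, so it could be reused from a CLI as well as from a test.
func lintExec(e Exec) []Violation {
	var vs []Violation
	if !strings.HasPrefix(e.Path, "/") {
		vs = append(vs, Violation{Rule: "R-AP-10", Message: "exec.path must be absolute: " + e.Path})
	}
	if strings.ContainsAny(e.Path, "*⋯") {
		vs = append(vs, Violation{Rule: "R-AP-11", Message: "exec.path must not contain wildcards: " + e.Path})
	}
	if len(e.Args) > 0 && e.Args[0] != e.Path && !isWildcardToken(e.Args[0]) {
		vs = append(vs, Violation{Rule: "R-AP-12", Message: "exec.args[0] should equal exec.path, got " + e.Args[0]})
	}
	return vs
}

func main() {
	trap := Exec{Path: "/bin/sh", Args: []string{"sh", "-c", "*"}}
	for _, v := range lintExec(trap) {
		fmt.Printf("%s: %s\n", v.Rule, v.Message)
	}
	ok := Exec{Path: "/bin/sh", Args: []string{"/bin/sh", "-c", "*"}}
	fmt.Println("violations on fixed entry:", len(lintExec(ok)))
}
```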
Switch verifyUser{ApplicationProfile,NetworkNeighborhood} from strict
VerifyObject to VerifyObjectAllowUntrusted. The strict variant requires
a Sigstore Fulcio trust chain and rejects locally-signed profiles even
when the signature against the embedded cert is valid. That made
Test_31b 'untampered_signed_AP_no_R1016' fire R1016 against an
untampered AP, and broke Test_30's 'tampered_profile_loaded_without_
enforcement' subtest the same way.
The intent is: tamper detection, not trust-chain enforcement. Matches
cmd/sign-object/main.go's default verifier.
…Eventually
The single-shot wget exec before Eventually was racy: if the eBPF event landed before the CP cache projected the user-defined AP, the rule manager evaluated against an empty baseline and R0001 never fired within the 60s polling window. Same race Test_29 already documents.
Drive the wget exec inside the Eventually loop (10s tick, 120s deadline) so cache-load latency is absorbed by retries. Filter R0001 to comm=wget to make the assertion specific instead of catching any R0001. Drops the blind 15s pre-sleep and the redundant settle-then-recount block.
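The difference between "act once, then poll" and "act inside the poll" can be sketched with a small standard-library helper. This is not the test's actual polling code; `eventually` and `runDemo` are hypothetical, and the durations are shrunk so the example finishes quickly.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errDeadline = errors.New("condition not met before deadline")

// eventually calls attempt every tick until it returns true or the deadline passes.
func eventually(deadline, tick time.Duration, attempt func() bool) error {
	stop := time.Now().Add(deadline)
	for {
		if attempt() {
			return nil
		}
		if time.Now().Add(tick).After(stop) {
			return errDeadline
		}
		time.Sleep(tick)
	}
}

// runDemo simulates a cache that only becomes ready on the readyAt-th exec.
// Each attempt re-triggers the action and then checks for the alert, so an
// exec that lands before the cache is ready is simply retried.
func runDemo(readyAt int) (int, error) {
	execs := 0
	err := eventually(200*time.Millisecond, 5*time.Millisecond, func() bool {
		execs++                 // stands in for the exec inside the pod
		return execs >= readyAt // stands in for "the expected alert is present"
	})
	return execs, err
}

func main() {
	execs, err := runDemo(3)
	fmt.Println(execs, err) // 3 <nil>
}
```

A single exec before the loop would correspond to `execs` being fixed at 1, in which case the condition could never become true if the first event arrived too early.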
Picks up the upstream-PR-kubescape#316 review fix: trailing WildcardIdentifier now requires at least one regular-path segment, matching standard glob semantics. Closes the R0002 blind spot where '/etc/*' would silently match the bare '/etc' directory.
Pulls in the full PR-kubescape#316 review fix set that just landed on storage main: proper splitPath-based trailing-* anchoring, DefaultCollapseConfigs() defensive-copy accessor, FindConfigForPath value-return, splitEndpoint defensive guard, plus the BenchmarkCompareDynamic baseline.
End-to-end pin of the storage-side CompareDynamic contract through R0002. Each subtest deploys a fresh nginx pod with a user-defined AP carrying ONE Opens entry, then `cat`s a target path that probes a boundary case from the storage analyzer fixes (kubescape/storage kubescape#316 review by matthyx + entlein):
- Anchored trailing `*` matches one OR MORE remaining segments, never zero. So /etc/* matches /etc/passwd but NOT bare /etc.
- DynamicIdentifier (⋯) consumes EXACTLY ONE segment.
- Mid-path `*` is zero-or-more, so /etc/*/* matches /etc/ssh (inner * consumes zero, trailing * consumes one).
- Mixed ⋯/* combinations: ⋯ pins one, * consumes the rest.
- splitPath normalises trailing slashes on both sides.
11 subtests covering:
trailing_star_matches_immediate_child: basic /etc/* match
trailing_star_matches_deep_child: multi-segment under prefix
trailing_star_does_not_match_bare_parent: the security fix
deep_prefix_trailing_star_does_not_match_parent: same rule, deeper
ellipsis_pin_one_segment_then_literal: ⋯ rejects zero
ellipsis_then_trailing_star_matches_two_*: ⋯/* combo, 2 levels
ellipsis_then_trailing_star_matches_three_*: ⋯/* combo, 3 levels
double_trailing_matches_one_child: /*/* mid-zero
double_trailing_matches_deep_child: /*/* mid-one
double_trailing_does_not_match_parent: /*/* needs ≥1 child
trailing_slash_in_profile_normalises_to_literal: splitPath on profile
Pinned at component level on TOP of the unit suite in storage/pkg/registry/file/dynamicpathdetector/tests/coverage_test.go. Both layers must agree: a drift in either lights up R0002 with a false positive or false negative.
Matrix entry added to component-tests.yaml so the test runs in CI.
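The boundary rules listed above can be written down as a small recursive matcher. This is a sketch of the stated contract, not the storage implementation: `matchPath`, `matchSegments`, and this `splitPath` are hypothetical, and splitting with `strings.FieldsFunc` also collapses repeated slashes, which the source does not claim.

```go
package main

import (
	"fmt"
	"strings"
)

// splitPath drops empty segments, so "/etc/ssl/" and "/etc/ssl" split identically.
func splitPath(p string) []string {
	return strings.FieldsFunc(p, func(r rune) bool { return r == '/' })
}

func matchSegments(pattern, path []string) bool {
	if len(pattern) == 0 {
		return len(path) == 0
	}
	switch pattern[0] {
	case "*":
		if len(pattern) == 1 {
			return len(path) >= 1 // trailing *: one or more segments, never zero
		}
		for n := 0; n <= len(path); n++ { // mid-path *: zero or more segments
			if matchSegments(pattern[1:], path[n:]) {
				return true
			}
		}
		return false
	case "⋯":
		return len(path) > 0 && matchSegments(pattern[1:], path[1:]) // exactly one
	default:
		return len(path) > 0 && pattern[0] == path[0] && matchSegments(pattern[1:], path[1:])
	}
}

// matchPath compares a profile entry against an opened path.
func matchPath(profileEntry, opened string) bool {
	return matchSegments(splitPath(profileEntry), splitPath(opened))
}

func main() {
	fmt.Println(matchPath("/etc/*", "/etc/passwd"))    // true
	fmt.Println(matchPath("/etc/*", "/etc"))           // false: bare parent
	fmt.Println(matchPath("/etc/*/*", "/etc/ssh"))     // true: inner * consumes zero
	fmt.Println(matchPath("/etc/passwd/⋯", "/etc/passwd")) // false: ⋯ needs a segment
}
```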
… monitored prefix
R0002's CEL ruleExpression has a strict path-prefix filter:
event.path.startsWith('/etc/') || event.path.startsWith('/var/log/') || ...
All with trailing slash. Bare /etc and /var/log don't pass the filter,
so R0002 never evaluates on those events — the matcher's bare-parent
anchoring contract stays invisible at runtime even though the storage
unit tests pin it.
Probe one level deeper instead: /etc/ssl IS under the /etc/ monitored
prefix, so the rule CAN see whether a /etc/ssl/* profile entry matches
the bare /etc/ssl parent. Same security guarantee, observable layer.
Reworked subtests:
- trailing_star_does_not_match_bare_parent_under_monitored_prefix:
profile /etc/ssl/*, cat /etc/ssl → R0002 fires
- deep_prefix_trailing_star_does_not_match_parent:
profile /etc/ssl/certs/*, cat /etc/ssl/certs → R0002 fires
- ellipsis_requires_one_segment_not_zero:
profile /etc/passwd/⋯, cat /etc/passwd → ⋯ requires one more segment
- double_trailing_does_not_match_parent_under_monitored_prefix:
profile /etc/ssl/*/*, cat /etc/ssl → R0002 fires
The 7 positive subtests that already passed are untouched. Added a
comment block documenting why we probe at /etc/ssl rather than /etc.
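The visibility problem comes down to plain string prefix checks. A small sketch, assuming only the two prefixes the commit spells out (the rest of the rule's list is elided in the source and left out here); `ruleEvaluates` is a hypothetical helper, not part of the rule engine.

```go
package main

import (
	"fmt"
	"strings"
)

// Only the prefixes quoted in the commit message; the real rule has more.
var monitoredPrefixes = []string{"/etc/", "/var/log/"}

// ruleEvaluates reports whether an opened path passes the prefix filter,
// i.e. whether the profile matcher is consulted for it at all.
func ruleEvaluates(path string) bool {
	for _, prefix := range monitoredPrefixes {
		if strings.HasPrefix(path, prefix) {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"/etc", "/etc/ssl", "/var/log", "/var/log/syslog"} {
		fmt.Printf("%-16s evaluated=%v\n", p, ruleEvaluates(p))
	}
}
```

Because "/etc" is shorter than "/etc/", it can never pass the filter, so the bare-parent case is only observable one level down at paths such as /etc/ssl.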
Two distinct fixes for what looked like the same intermittent failures across PR #37 runs:
Test_31 31b 'untampered_signed_AP_no_R1016', root cause: storage's PreSave runs DeflateSortString on Syscalls (and Capabilities, Architectures), which sorts + dedupes. The signSignedAP helper signed the AP BEFORE pushing, against unsorted syscalls {socket, connect, read, write, close, openat}. After PreSave the stored AP had sorted {close, connect, openat, read, socket, write}, so the content hash differed from the signature → server-side verify correctly failed → R1016 fired even though the profile was untampered. Test_29 + Test_30 30b had the same fixture but didn't observe the bug because they only assert R0001 counts, never R1016. Pre-sort the syscalls in all three test fixtures so storage's normalization is a no-op on round-trip.
Test_28 28a 'allowed_fusioncore_no_alert', root cause: 15s post-deploy sleep wasn't always enough for the upstream ContainerProfileCache to project the user-defined NN. Failure mode is alert payload `profileMetadata.errorMessage:"waiting for profile update"`: the rule manager evaluated against an unloaded NN and fired R0005/R0011 spuriously. Bumped to 30s with a comment documenting why. A real fix would poll a cache-loaded signal but no such signal is exposed from outside the node-agent today.
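The hash drift in the first fix is easy to reproduce in isolation. In this sketch `normalise` is a hypothetical stand-in for the sort + dedupe step and `contentHash` is a plain SHA-256 over the joined list, not the real signing digest.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"slices"
	"strings"
)

// contentHash is a simplified digest over the syscall list.
func contentHash(syscalls []string) [32]byte {
	return sha256.Sum256([]byte(strings.Join(syscalls, ",")))
}

// normalise sorts and dedupes a copy of the input, leaving the original untouched.
func normalise(in []string) []string {
	out := slices.Clone(in)
	slices.Sort(out)
	return slices.Compact(out)
}

func main() {
	signedOver := []string{"socket", "connect", "read", "write", "close", "openat"}
	stored := normalise(signedOver)

	fmt.Println("stored order:", stored)
	fmt.Println("hash survives round-trip:", contentHash(signedOver) == contentHash(stored))

	// Pre-sorted input makes the normalisation a no-op, so the hash is stable.
	presorted := normalise(signedOver)
	fmt.Println("pre-sorted survives:", contentHash(presorted) == contentHash(normalise(presorted)))
}
```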
…ift R1016 false positive
Test_31 31b 'untampered_signed_AP_no_R1016' kept flaking because the
AP's content hash drifted between client-side sign and server-side
verify across the K8s/storage roundtrip. Sources of drift include
storage's PreSave normalisation (DeflateSortString, DeflateStringer,
DeflateRulePolicies), signature/profiles GetContent's nil→empty-map
mutation on PolicyByRuleId, and any K8s server-side defaulting of
spec/metadata fields. Pre-sorting Syscalls in the previous fix only
covered one of these.
Sign-after-roundtrip closes the whole class:
1. Push the AP UNSIGNED to storage. PreSave runs, normalises content.
2. Read it back — this is what node-agent will see at verify time.
3. Sign THAT normalised content.
4. Delete the unsigned in-storage copy so deployAndWait can Create
the signed version without an AlreadyExists conflict.
5. Strip server-managed metadata (resourceVersion / uid / etc.) from
the returned AP so the second Create succeeds cleanly.
Second push goes through deflate again. Idempotent on already-normalised
content → stored bytes identical to signed bytes → content hash matches
→ verify succeeds → no R1016 false positive.
Tampered subtests (31a, 31d) keep working: signSignedAP returns a
known-good signed AP, the test mutates it post-helper, deployAndWait
Creates the mutated version, storage round-trip preserves the mutation,
and verify correctly detects the divergence.
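The steps above can be sketched against an in-memory store. Everything here is a simplified assumption: `fakeStore` normalises on Create the way the commit says PreSave does, a SHA-256 digest stands in for the real signature, and step 5 has nothing to strip because the fake store adds no server-managed metadata.

```go
package main

import (
	"crypto/sha256"
	"errors"
	"fmt"
	"slices"
	"strings"
)

var errAlreadyExists = errors.New("already exists")

// profile is a toy AP; Sig == nil means unsigned.
type profile struct {
	Name     string
	Syscalls []string
	Sig      *[32]byte
}

func digest(p profile) [32]byte {
	return sha256.Sum256([]byte(p.Name + "|" + strings.Join(p.Syscalls, ",")))
}

func sign(p profile) profile { d := digest(p); p.Sig = &d; return p }

func verify(p profile) bool { return p.Sig != nil && *p.Sig == digest(p) }

// fakeStore sorts and dedupes syscalls on every Create.
type fakeStore struct{ objects map[string]profile }

func newFakeStore() *fakeStore { return &fakeStore{objects: map[string]profile{}} }

func (s *fakeStore) Create(p profile) error {
	if _, exists := s.objects[p.Name]; exists {
		return errAlreadyExists
	}
	sorted := slices.Clone(p.Syscalls)
	slices.Sort(sorted)
	p.Syscalls = slices.Compact(sorted)
	s.objects[p.Name] = p
	return nil
}

func (s *fakeStore) Get(name string) profile { return s.objects[name] }
func (s *fakeStore) Delete(name string)      { delete(s.objects, name) }

// signAfterRoundTrip returns a signed profile whose content is already in the
// store's normalised form, so a later Create leaves it byte-identical.
func signAfterRoundTrip(s *fakeStore, unsigned profile) (profile, error) {
	if err := s.Create(unsigned); err != nil { // 1. push unsigned
		return profile{}, err
	}
	stored := s.Get(unsigned.Name) // 2. read back the normalised content
	signed := sign(stored)         // 3. sign what the verifier will see
	s.Delete(unsigned.Name)        // 4. free the name for the signed Create
	return signed, nil             // 5. no server-managed metadata in this fake
}

func main() {
	raw := profile{Name: "demo", Syscalls: []string{"socket", "connect", "read", "write", "close", "openat"}}

	naive := newFakeStore()
	_ = naive.Create(sign(raw))
	fmt.Println("sign-then-push verifies:      ", verify(naive.Get("demo")))

	store := newFakeStore()
	signed, err := signAfterRoundTrip(store, raw)
	if err != nil {
		panic(err)
	}
	_ = store.Create(signed)
	fmt.Println("sign-after-roundtrip verifies:", verify(store.Get("demo")))
}
```

A tampered subtest corresponds to mutating `signed` before the final Create: the mutation survives normalisation, the digest no longer matches, and verification fails as intended.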
…re tolerance
Test_33 deploys 11 fresh pods sequentially, one per subtest. Later subtests race against an increasingly loaded kind cluster: CP cache reconciler, alertmanager, prometheus all chew CPU at boot. 80s WaitForReady deadline timed out on the post-23ea224 run with 'workload not ready in ns ...' for early subtests once the cluster got busy. 180s gives headroom without changing total runtime regime.
Summary
Stacked on top of #36 (the upstream-cache-rewrite merge). Restores the
fork-only features that depended on the deleted legacy AP/NN caches and
re-applies the exec-arg-wildcard work onto the new
ContainerProfileCache baseline.
What is in this PR
- e2286031: … `kubescape.io/user-defined-profile` and uses its value as the lookup key for BOTH the user AP and the user NN. Test_28's two-label scheme (separate `user-defined-network`) is silently ignored on the new cache, so the test had to be reframed: AP and NN now share one name (`curl-28-overlay`); the obsolete second label is dropped from the deployment fixture.
- f8df60f1: … `cp.Spec.Execs` call site in `pkg/rulemanager/cel/libraries/applicationprofile/exec.go`, bumps the storage `replace` to the matcher branch (3fc287210729), and re-adds `TestExecWithArgsWildcardInProfile` (13 subtests).
- a613cf64: … `sh -c` / `sh -x` / `echo hello` / `echo goodbye` against a profile with mixed `*`, `⋯`, and literal anchors.
Design call locked in this PR
Upstream's CP cache deliberately reads a SINGLE label and treats AP+NN
as a paired-name overlay. The fork's earlier two-label flexibility
(separate `user-defined-network` for NN-only overlays) is dropped.
Rationale: tracks upstream consensus, minimises forever-divergence,
and the rare "different name for AP vs NN" case is solvable by
naming the bundle deliberately.
Verified locally
- `go vet -tags=component ./tests/...` clean
- `go test -tags=component -run='^$' ./tests/...` compiles cleanly
- `go test -count=1 ./pkg/...`: 47 packages pass; the 2 failing are `pkg/containerwatcher/v2/tracers` and `pkg/validator`, both pre-existing eBPF/BPF-mount env issues unrelated to this PR.
- `TestExecWithArgsWildcardInProfile` 13/13 green.
Out of scope (separate follow-ups)
- `UserDefinedNetwork` field + `UserDefinedNetworkMetadataKey` const on `shared_container_data.go` (set by label-detect, read by no one after upstream's CP-cache rewire). Cleanup in a janitorial PR.
- `replace` does NOT yet include the Internal-field merge fix (storage main 352395a3); the matcher branch is a sibling. Either rebase the matcher branch on storage main or wait for storage PR "we lost files during rebase and cherry pick" #23 to merge, then bump.
Stack
merge/upstream-cp-cache)
feat/port-user-overlays-to-cp