Skip to content

refactor: prod-readiness + clean-code wave (no narrative comments, hardened defaults)#110

Merged
mghabin merged 1 commit intomainfrom
refactor/prod-readiness-clean-code
May 2, 2026
Merged

refactor: prod-readiness + clean-code wave (no narrative comments, hardened defaults)#110
mghabin merged 1 commit intomainfrom
refactor/prod-readiness-clean-code

Conversation

@mghabin
Copy link
Copy Markdown
Owner

@mghabin mghabin commented May 2, 2026

Audit-driven cleanup wave addressing the user's three complaints: (1) too much narrative commentary, (2) not prod-ready, (3) not following Microsoft prod-code conventions / .NET design patterns / clean code.

What changed (3 axes)

1. Comment cull (user's #1 complaint)

Centralized doctrine, removed duplicated narrative essays, kept only the comments that prevent real bugs (security rationale, non-obvious gotchas — max 1-2 lines each).

  • Bicep: cd-bootstrap.bicep 16-line role-assignment recovery essay → 2-line pointer; aca-stack.bicep 11-line BCP138 paragraph → succinct doc + structured @metadata; app-registrations.bicep 12-line FIC essay → 5-line block.
  • Bash scripts: bootstrap-env.sh 25-line + provision-apps.sh 38-line headers → 5-9 line orientation + pointers to docs/.
  • Authorization doctrine: 5+ duplicated narrative blocks across Orders.Api and Restaurants.Api source files → centralized in DOCTRINE.md new section Authorization-policy doctrine (canonical) covering the two-named-policy rule, the [Authorize(Roles=...)] trap with MapInboundClaims=false, and the azp allow-list rationale. Source files now reference DOCTRINE.md in 1-line summaries.

2. DevOps fixes

  • pinact job rename: verifypinact to match the required-status-check name in the ruleset. Real bug — currently dormant because all merges use --admin bypass.
  • secrets: inherit → explicit pass-through: declared secrets: block in deploy-env.yml's workflow_call, pass AZURE_TENANT_ID: explicitly from cd.yml. Defense in depth.
  • timeout-minutes on every job (5-45 min depending on workload).
  • Bash hardening: IFS=$'\\n\\t', trap ERR, split SC2155 declare/assign in provision-apps.sh.
  • Bicep API bumps (3 of 14 resources were stale): Microsoft.App/{containerApps,managedEnvironments} preview → 2025-07-01 GA; Microsoft.KeyVault/vaults preview → 2024-11-01 GA. The other 11 resources were already on the latest GA.

3. Prod-readiness

  • Kestrel hardening: ConfigureEntraAuthKestrel() pins MaxRequestBodySize=10 MB, RequestHeadersTimeout=30s, KeepAliveTimeout=65s, AddServerHeader=false. Wired into all 3 web services. Closes a DoS gap where defaults are effectively unlimited.
  • HTTPS redirect at app level: AddEntraAuthHttpsRedirection() registers a 308 permanent redirect; UseHttpsRedirection() in middleware. Defense in depth alongside ACA's allowInsecure: false edge enforcement.
  • EntraAuthRateLimiterOptions validator: bounds checks on PermitLimit/AppPermitLimit/Window/SegmentsPerWindow. AddEntraAuthRateLimiter now fails fast on misconfig instead of silently breaking.
  • Test coverage gap closed: DownstreamProbeServiceTests (success-path log + exception-path ExitCode=1 + StopApplication) — the worker startup probe was untested. EntraAuthRateLimiterOptionsValidatorTests covers fail-fast behaviour.
  • nuget.config with packageSourceMapping pinning all packages to nuget.org (defense against transitive package-source injection).

What I deliberately did NOT change

  • Comments that prevent real bugs: security rationale on MixedClaimsPolicies, RequireClientApp, EntraAuthSecurityHeadersExtensions. The audit's first-pass agent argued these are "all justified" — I disagreed for the doctrine duplication but agreed for the security-policy rationale (1-2 lines each, prevents subtle bugs).
  • CAF resource naming (cosmetic).
  • Diagnostic settings on every PaaS (KV + LAW already have them; App Insights doesn't really need them for this sample).
  • TimeProvider injection (premature — no time-sensitive business logic).

Verification

  • dotnet test EntraAuthPatterns.slnx72/72 (was 66; +6 from this PR).
  • az bicep build on every .bicep + .bicepparam — no errors.
  • actionlint .github/workflows/*.yml — clean.
  • shellcheck scripts/*.sh — clean.

Out of scope (still tracked separately)

…rdened defaults)

Three-pronged audit response:

## Comment cull (user complaint: 'comments more than code, smell')

* infra/bicep/modules/cd-bootstrap.bicep — 16-line role-assignment recovery essay → 2-line pointer to docs/operations.md § 'CD identity (UAMI) recovery'.
* infra/bicep/modules/aca-stack.bicep — 11-line BCP138/positional-keying paragraph + 5-line @description prose → succinct doc + structured @metadata.
* infra/bicep/modules/app-registrations.bicep — 12-line FIC essay → 5-line block pointing at the same runbook.
* scripts/{bootstrap-env,provision-apps}.sh — 25-line + 38-line headers → 5-9 line orientation, pointers to docs/.
* DOCTRINE.md — new 'Authorization-policy doctrine (canonical)' section with the two-named-policy rule, [Authorize(Roles=...)] trap, and azp allow-list rationale.
* src/Ftgo.{Orders,Restaurants}.Api/{*AuthorizationPolicies,Program}.cs — 5+ duplicated narrative XML/inline blocks → 1-line summary 'see DOCTRINE.md § ...'.

## DevOps fixes

* .github/workflows/pinact.yml — job renamed 'verify' → 'pinact' to match the required-status-check name in the ruleset (currently dormant: only --admin merges hide the mismatch).
* .github/workflows/{cd,deploy-env}.yml — replaced 'secrets: inherit' with explicit 'AZURE_TENANT_ID:' pass-through (declared in deploy-env's workflow_call.secrets); defense in depth even for first-party reusable workflow.
* .github/workflows/*.yml — every job gains 'timeout-minutes' (5-45 m depending on workload). Default 6h is too lax for hangs to alert.
* scripts/*.sh — added 'IFS=$'\n\t'' + 'trap ERR' + split SC2155 declare/assign so shellcheck/CI catch shell-script regressions cleanly.
* infra/bicep/modules/{container-app,container-apps-environment,key-vault,key-vault-rbac}.bicep — bumped 3 stale APIs to the latest GA: Microsoft.App/{containerApps,managedEnvironments} preview → 2025-07-01; Microsoft.KeyVault/vaults preview → 2024-11-01.

## Prod-readiness

* src/Ftgo.Auth/EntraAuthHostingExtensions.cs (new) — ConfigureEntraAuthKestrel() pins MaxRequestBodySize=10MB, RequestHeadersTimeout=30s, KeepAliveTimeout=65s, AddServerHeader=false. AddEntraAuthHttpsRedirection() registers a 308 permanent redirect.
* src/Ftgo.{ApiGateway,Orders.Api,Restaurants.Api}/Program.cs — wire Kestrel hardening + UseHttpsRedirection (defense in depth alongside ACA's allowInsecure=false edge enforcement).
* src/Ftgo.Auth/EntraAuthRateLimiterExtensions.cs — added EntraAuthRateLimiterOptionsValidator (bounds checks for PermitLimit, AppPermitLimit, Window, SegmentsPerWindow); AddEntraAuthRateLimiter throws ArgumentException early on misconfig.
* tests/Ftgo.Auth.Tests/DownstreamProbeServiceTests.cs (new) — covers success-path log, exception-path ExitCode=1 + StopApplication for the worker probe.
* tests/Ftgo.Auth.Tests/EntraAuthRateLimiterOptionsValidatorTests.cs (new) — covers fail-fast on bad bounds + happy-path defaults.
* nuget.config (new) — packageSourceMapping pinning all packages to nuget.org (defense against transitive package-source injection).

## Verification

* dotnet test EntraAuthPatterns.slnx → 72/72 (was 66; 6 new from this PR)
* az bicep build on every .bicep + .bicepparam (no errors)
* actionlint on every workflow (no errors)
* shellcheck on both scripts (clean)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mghabin mghabin merged commit 4f8176b into main May 2, 2026
12 checks passed
@mghabin mghabin deleted the refactor/prod-readiness-clean-code branch May 2, 2026 01:30
mghabin added a commit that referenced this pull request May 2, 2026
Audit-driven cleanup folding in **all 6 open Dependabot PRs** + closing
the 4 high-impact findings from this session's parallel audit run +
addressing all 3 open issues.

## Headline change: 3 docs prescribed a **forbidden** pattern

`docs/validation.md:54,57,156` and `docs/decision-trees.md:76,82` showed
`[Authorize(Roles = "...")]` — explicitly forbidden by `DOCTRINE.md`
(added in PR #110): silently no-ops on Microsoft.Identity.Web JWT
schemes (`MapInboundClaims=false` looks for `ClaimTypes.Role`, Entra
emits short-name `roles`). Readers copying these snippets would deploy a
privilege-escalation bug.

## Six Dependabot PRs absorbed

Each as part of this single squash, with their breakage transitively
fixed:

- **#104** (actions group) — `markdownlint-cli2-action v20→v23`
introduced new MD060 enforcement. Realigned 5 doc tables + disabled
MD060 in `.markdownlint.yaml` (emoji-width breaks 'aligned' detection).
- **#105** Microsoft.Identity.Web 4.8 → 4.9
- **#106** Azure.Monitor.OpenTelemetry.Exporter 1.7 → 1.8
- **#107** Meziantou.Analyzer + SonarAnalyzer.CSharp
- **#108** Microsoft.NET.Test.Sdk 18.4.0 → 18.5.1
- **#109** Scalar.AspNetCore 2.14.4 → 2.14.9

## All 4 audit waves

- **A — Doc doctrine**: validation.md + decision-trees.md fixed;
operations.md env-rename runbook marked historical.
- **B — Test coverage** (+21 new tests, 72 → 93):
- new `EntraAuthForwardedHeadersExtensionsTests` (X-Forwarded-* +
ForwardLimit cap)
- rate-limiter long-form claim Theory
(`http://schemas.microsoft.com/identity/claims/{tid,oid,appid}`) +
Retry-After RFC 6585 + 2-user partition isolation
- claims-challenge RFC 7235 edge-case Theory (case-insensitive params,
comma-in-quoted, empty Bearer, multi-Bearer first-wins, non-Bearer
scheme)
- **bug fix surfaced by Retry-After test**: `MetadataName.RetryAfter`
isn't always populated; falls back to `opts.Window` so 429 responses
always carry the header.
- **C — Infra**:
- `log-analytics.bicep` `retentionInDays` parameterized (prod=90d audit
floor, others=30d)
- new `modules/budget.bicep` (Microsoft.Consumption/budgets,
Forecasted+Actual at 80/100%, prod $50/mo, others $10/mo)
- new `drift-detection.yml` workflow (weekly `az deployment group
what-if`, matrix [ci, ppe], `::warning::` on drift)
  - container-app `transport='https'` for prod
- **D — Polish**:
- `DownstreamApiClient.ProbeAsync` `ConfigureAwait(false)` (library
code)
- `cd-cleanup.yml` PPE auth failure fixed via pre-check before
`azure/login`

## Issues closed

- **#80** — re-spiked twice (xunit.v3 3.2.2 + M.T.E.* still
incompatible). Closing as `wontfix`. Added `Microsoft.Testing.*` to
dependabot test-group so future bumps land grouped.
- **#83** — keeping open as a focused follow-up (WAF<Program> harness
migration is independent doctrine work; existing `IHostBuilder +
TestServer` test covers the same matrix).
- **#85** — keeping open as a focused follow-up (kitchen-worker → ACA
Job is real architectural change; new module + permission-grant rewiring
+ CD pipeline change deserves its own PR).

## Verification

| Gate | Result |
|---|---|
| `dotnet test EntraAuthPatterns.slnx` | **93/93** |
| `az bicep build` (every `.bicep` + `.bicepparam`) | clean |
| `actionlint` | clean |
| `shellcheck` | clean |
| `markdownlint-cli2` (CI version) | clean |

Co-authored-by: mghabin <81494213+MohammadGhabin@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant