@@ -0,0 +1,331 @@
---
id: B-0109
priority: P0
status: open
title: Dependency status tracking surface — outages and issues affecting us (Aaron 2026-04-30, urgent)
tier: design + implementation
effort: M
ask: Aaron 2026-04-30 (autonomous-loop channel input — verbatim "we need somewhere that list the status of our dependinces and issues that could affect us" + 6 source URLs + urgency clarification "github can erase stuff from master when we use the merge queue sometimes")
created: 2026-04-30
last_updated: 2026-04-30
composes_with: [B-0086, B-0096]
tags: [dependency-status, outages, github-incidents, supply-chain, observability, factory-resilience, urgent]
---

# Dependency status tracking surface — outages and issues affecting us

> **First-class factory surface** (Aaron 2026-04-30):
> *"looking at github status should be first class for us we
> live on git and github for now until we get a 2nd host in
> the future."* This is not a "design something later" row —
> it's the visibility layer the factory's hot path runs
> through. Until a second git host exists, GitHub status IS
> factory status, and the surface should reflect that
> operational reality.

Aaron sent input on 2026-04-30 via the autonomous-loop maintainer
channel asking the factory to land a surface that lists the
status of the dependencies the factory relies on and any issues
in those dependencies that could affect us. The framing came
with 6 source URLs covering a GitHub-availability incident class
(merge queue bug + general availability degradation), a
follow-up urgency clarification (*"github can erase stuff from
master when we use the merge queue sometimes"*), and a
first-class-priority elevation (*"looking at github status
should be first class for us we live on git and github for
now until we get a 2nd host in the future"*). The first-class
framing composes with the existing 3-tier multi-remote design
work (Amara packet 3 2026-04-29, task #341): the
status-tracking surface IS the tier-0 visibility layer that
the multi-remote design assumes is in place.

This is a **P0** row (escalated from P1 on Aaron's urgency
clarification) because:

1. **Live evidence found while filing this row.** The GitHub
status API at write-time shows an ACTIVE incident:
*"Incomplete pull request results in repositories"* —
started 2026-04-30T03:49:37Z, status `investigating`,
component `Pull Requests` flagged `degraded_performance`,
ongoing for 7+ hours. The factory has PR #911 in flight
during this incident; our 0-unresolved-threads count and
our auto-merge readiness signals could both be based on
incomplete API results. **Auto-merge on PR #911 was
disabled at ~2026-04-30T11:14Z while filing this row, as
the conservative response to live degradation evidence.**
2. **The dead-air polling loop earlier this session may have
been exacerbated by this incident.** It ran ~2.5 hours
(08:19Z merge → 10:50Z catch); the incident has been
active since 03:49Z. Without a dependency-status surface,
future-Otto can't disambiguate own-discipline-failure
from external-dependency-degradation when both compose.
Both happened simultaneously this session.
3. **The factory's hot path runs through GitHub.** Any GitHub
degradation IS a factory degradation, with no current
visibility surface naming it as such.

## Repo merge-mechanism verification (2026-04-30)

Aaron's urgency clarification mentioned "the merge queue."
Verified at write-time via `gh api`:

- **`merge_queue: None`** in `branches/main/protection` — we
do NOT use GitHub's merge-queue feature. The Trunk.io
post about merge-queue-builds-on-wrong-commit is a
different bug class than what currently affects us
directly.
- **Auto-merge with squash** (`auto_merge: true`,
`squash: true`, `merge_commit: false`, `rebase: false`)
is what we use.
- **`allow_update_branch: true`** is the relevant safety
setting here: auto-merge auto-rebases stale branches before
merging, reducing (but not eliminating) the
stale-base-merge risk.
- **Required status checks**: lint (semgrep, shellcheck,
actionlint, markdownlint) + build-and-test (macos-26,
ubuntu-24.04, ubuntu-24.04-arm). 7 required contexts.

So: the *specific* merge-queue bug Aaron worried about
doesn't apply to our setup directly. The *broader* concern
(GitHub backend bugs producing wrong-state results, of
which the live PR-degradation incident is a current
example) absolutely does. The status-tracking surface is
P0 against the broader concern, not the specific
merge-queue concern.
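
The repo-level half of this verification can be made repeatable. The following is a minimal sketch, not the check that was actually run. It assumes the JSON comes from `gh api repos/OWNER/REPO` and that the shorthand names above correspond to the REST fields `allow_auto_merge`, `allow_squash_merge`, `allow_merge_commit`, `allow_rebase_merge`, and `allow_update_branch`. The function name `merge_mechanism_report` is hypothetical.

```python
# Expected repo-level merge settings. The mapping from the shorthand in the
# list above to these REST field names is an assumption of this sketch.
EXPECTED = {
    "allow_auto_merge": True,
    "allow_squash_merge": True,
    "allow_merge_commit": False,
    "allow_rebase_merge": False,
    "allow_update_branch": True,
}


def merge_mechanism_report(repo: dict) -> list[str]:
    """Return one line per setting that differs from the expected value.

    `repo` is the decoded JSON object for the repository. A field that is
    absent from the payload is reported with an actual value of None.
    """
    drift = []
    for field, expected in EXPECTED.items():
        actual = repo.get(field)
        if actual != expected:
            drift.append(f"{field}: expected {expected}, got {actual}")
    return drift
```

An empty list means the settings still match what was verified on 2026-04-30. Any returned line is a prompt to re-read this section, since the risk analysis here depends on those settings.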

## Aaron's verbatim input (channel preservation per Otto-363)

> we need somewhere that list the status of our dependinces
> and issues that could affect us
> https://github.com/orgs/community/discussions/193645
> https://www.githubstatus.com/
> https://news.ycombinator.com/item?id=47881672
> https://github.blog/news-insights/company-news/an-update-on-github-availability/
> https://www.youtube.com/watch?v=b13m-iuu4XU&t=288s
> https://trunk.io/blog/what-happens-if-a-merge-queue-builds-on-the-wrong-commit
> this can affect us

The "this can affect us" closing is Aaron-as-second-person
framing the relevance: not abstract dependency-management, but
*specifically* the merge-queue / GitHub-availability class of
issue that hits the factory's PR-driven workflow directly.

## Why this matters

- **The factory's hot path runs through GitHub.** Auto-merge,
the every-minute autonomous-loop cron, the Scorecard rolling
window, CodeQL analyses, the AceHack→LFG forward-sync flow,
Copilot/Codex PR reviews — all are GitHub-mediated. A GitHub
outage is a factory outage; a GitHub merge-queue bug is a
potential commit-corruption surface (per the Trunk.io post,
merge queues can build on wrong commits under specific
edge-case conditions).
- **Silent dependency degradation is the worst kind.** When
GitHub Actions runners are slow but functional, a polling
loop watching CI looks indistinguishable from a real wait.
Without a surface naming "GitHub Actions runner queue is
currently degraded," future-Otto can't disambiguate
honest-wait from external-incident.
- **Quantum-resistant crypto and supply-chain discipline both
assume we know what's running.** The
`feedback_all_cryptography_quantum_resistant_even_one_gap_is_attack_vector_2026_04_23.md`
rule and the absorb-and-contribute community-dependency
discipline both presume the factory knows its dependency
surface — but currently we only know what's in
`Directory.Packages.props` / `package.json` /
`tools/setup/*.sh`, not what's *currently failing or
flagged* in those dependencies.
- **The 6 source URLs Aaron sent are a worked example of the
class.** Each describes either a current GitHub incident,
the GitHub availability surface, an HN discussion of the
fallout, or a Trunk.io technical post on merge-queue
edge cases. The factory needs a place where a future tick
sees "GitHub merge-queue had a bug yesterday — check if
your auto-merge fired on the right commit."

## Scope (design + implementation row)

This row produces, in order:

1. **Design pass** — what shape does this surface take?
Candidate shapes (each has tradeoffs):
   - **Static markdown file** at
     `docs/dependency-status.md`: cheap, version-controlled,
     manually-updated. Good for "known-watched dependencies"
     list; bad for "current incident state."
   - **Cron-driven scraper** that polls
     `https://www.githubstatus.com/api/v2/summary.json`
     (and equivalent for other dependencies) and writes a
     `docs/dependency-status/current.json`. Self-updating,
     surface-able to agents and humans. Adds a workflow
     and a script.
   - **Issue-tracker integration**: open a tracking issue
     in the LFG repo per dependency we monitor; status
     updates flow to the issue. Discoverable via
     `gh issue list` filters. Adds GitHub-issue dependency.
   - **Hybrid**: static markdown for the watched-list +
     cron scraper for current state + per-incident issues
     for active investigation. Most coverage; most surface
     to maintain.
2. **Watched-dependencies enumeration** — what do we depend
on operationally? Initial set: GitHub (Actions, Copilot
review, Codex review, hosting, merge-queue, Scorecard);
Anthropic (Claude Code harness, model availability);
OpenAI (Codex, ChatGPT for Aaron's substrate channel);
Google (Gemini for substrate channel); npm registry; bun
registry; mise; rustup; .NET runtime; PostgreSQL (if
used); Linear (if used). Cross-reference with
`tools/setup/*.sh` install paths.
3. **Status-source enumeration** — where does each
dependency publish status? GitHub:
`https://www.githubstatus.com/api/v2/`. Anthropic:
`https://status.anthropic.com/`. OpenAI:
`https://status.openai.com/`. Google: per-product
pages. The status-source-list itself is data the
surface must capture.
4. **Implementation**: start with the chosen shape. If the
static markdown turns out to be enough, stay there;
expand only if it is not.
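
To make the cron-driven scraper candidate concrete, here is a minimal sketch of the fetch-reduce-write step. It assumes the summary payload follows the usual Statuspage v2 shape (`status.indicator`, a `components` list with `name` and `status`, and an `incidents` list of unresolved incidents). The function names and the snapshot field names (`fetched_at`, `degraded_components`, `active_incidents`) are hypothetical and not a decided schema.

```python
import json
import urllib.request
from datetime import datetime, timezone

SUMMARY_URL = "https://www.githubstatus.com/api/v2/summary.json"


def fetch_summary(url: str = SUMMARY_URL, timeout: float = 10.0) -> dict:
    """Download and decode the status summary document."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def build_snapshot(summary: dict, fetched_at: str) -> dict:
    """Reduce a summary payload to the fields the surface needs."""
    components = {c["name"]: c["status"] for c in summary.get("components", [])}
    incidents = [
        {
            "name": i["name"],
            "status": i["status"],
            "impact": i.get("impact"),
            "created_at": i.get("created_at"),
        }
        for i in summary.get("incidents", [])
    ]
    return {
        "fetched_at": fetched_at,
        "indicator": summary.get("status", {}).get("indicator"),
        # Keep only components that are not fully operational.
        "degraded_components": {
            name: status
            for name, status in components.items()
            if status != "operational"
        },
        "active_incidents": incidents,
    }


def write_snapshot(path: str) -> dict:
    """Fetch the live summary and write the reduced snapshot to `path`."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    snapshot = build_snapshot(fetch_summary(), now)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh, indent=2, sort_keys=True)
        fh.write("\n")
    return snapshot
```

A scheduled workflow could call `write_snapshot("docs/dependency-status/current.json")` and commit the result. Keeping `build_snapshot` free of network access means it can be tested against a recorded payload.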

## Adjacent merge-risk classes in scope

Aaron's named concern was the merge-queue-builds-on-wrong-commit
class (which we don't trigger directly because we don't use
merge-queue). The broader class — *GitHub backend producing
wrong-state results that auto-merge can fire against* — IS in
scope. Specific failure modes the surface should help future-Otto
notice:

- **Auto-merge against stale base.** With the
`allow_update_branch: true` setting, our auto-merge
auto-rebases stale branches before merging. If the rebase
itself fires during a degraded-API window, the result might
not be what the diff preview showed.
- **`allow_update_branch` auto-rebase producing unexpected
merge content.** When auto-merge updates a stale branch,
the resulting tree is whatever the rebase produces. If the
rebase happens during incomplete-API-state, the branch state
observed by reviewers can differ from the state actually
merged.
- **Force-push race with auto-merge firing.** If a force-push
and auto-merge fire near-simultaneously, the merged commit
may be whichever the GitHub backend resolved first — not
necessarily the head observed during review.
- **Incomplete API results during merge decision.** This round's
active incident ("Incomplete pull request results in
repositories") is exactly this class. A 0-unresolved-threads
count from a degraded API can satisfy auto-merge's
`required_conversation_resolution` gate while threads exist
unseen.

The status-tracking surface flags these conditions; it does not
mitigate them. Mitigation rules (e.g., "when GitHub Pull Requests
component is degraded, do not arm auto-merge") belong in
follow-up rows.

## Sharpening points (Claude.ai 2026-04-30 review)

Three operational details to settle in the design pass:

1. **Polling cadence cost-vs-freshness tradeoff.** Polling every
minute would be noisy and might hit GitHub's rate limits;
polling every hour might miss short incidents that fall
wholly within the gap. Reasonable shape: poll on
freshness-pass triggers (before mutating actions like merge,
force-push, auto-merge arming), poll opportunistically when
ticks are otherwise idle, and treat any non-operational status
as a freshness gap that propagates to dependent decisions.
2. **Distinguish factory-relevant components from unrelated
incidents.** A GitHub Pages outage doesn't affect the
factory's PR pipeline; a Pull Requests degradation does.
Without that distinction, every minor unrelated incident
becomes noise and the surface trains future-Otto to ignore
it. Initial factory-relevant component allowlist for GitHub:
Pull Requests, Actions, API Requests, Webhooks. Other
dependencies (Anthropic, OpenAI, Google) get their own
allowlist when their status sources are wired in.
3. **Historical record for retrospective correlation.** Log
incidents to a durable file (e.g.,
`docs/dependency-status/incident-log.jsonl`) so future-Otto
can correlate "session-time anomalies" against
"session-time incidents." Without this, the diagnostic
question Deepseek's framing introduced ("if I do nothing,
will the signal change on its own?") can't be answered
retrospectively — the substrate gains nothing from past
incidents.
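
Points 2 and 3 above can be sketched together: filter the component list through the allowlist, expose the result as a freshness-gap signal, and append what was seen to the JSONL log. This is a minimal sketch under stated assumptions. The component dicts are assumed to carry `name` and `status` keys as in the Statuspage summary payload, and the function names and record fields (`observed_at`, `component`, `status`) are hypothetical.

```python
import json

# Initial factory-relevant allowlist for GitHub, taken from point 2 above.
FACTORY_RELEVANT = {"Pull Requests", "Actions", "API Requests", "Webhooks"}


def relevant_degradations(components: list[dict]) -> dict[str, str]:
    """Map allowlisted component names to their non-operational status."""
    return {
        c["name"]: c["status"]
        for c in components
        if c["name"] in FACTORY_RELEVANT and c["status"] != "operational"
    }


def has_freshness_gap(components: list[dict]) -> bool:
    """True when any allowlisted component is non-operational.

    This only flags the condition; what to do about it is left to the caller.
    """
    return bool(relevant_degradations(components))


def append_incident_log(path: str, observed_at: str, components: list[dict]) -> int:
    """Append one JSONL record per relevant degradation; return the count."""
    degraded = relevant_degradations(components)
    with open(path, "a", encoding="utf-8") as fh:
        for name, status in sorted(degraded.items()):
            record = {
                "observed_at": observed_at,
                "component": name,
                "status": status,
            }
            fh.write(json.dumps(record, sort_keys=True) + "\n")
    return len(degraded)
```

With this shape, a GitHub Pages outage produces no flag and no log record, while a Pull Requests degradation produces both. Appending to `docs/dependency-status/incident-log.jsonl` rather than overwriting it is what keeps the history available for later correlation.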

## Out of scope for this row

- Building a full incident-management system. The factory
needs *visibility*, not PagerDuty.
- Real-time alerting / paging / on-call rotation. If
dependencies fail, the factory pauses, files an incident
note, and waits for restoration. No auto-paging.
- Per-dependency mitigation plans. Those go in separate
rows when concrete (e.g., "if GitHub merge-queue is
flagged, switch from auto-merge to manual-merge for the
duration").
- Replacing or vendoring degraded dependencies preemptively.
Vendoring discussions belong in B-0086 (TS+Bun migration)
for the dependencies that ARE in-scope for vendoring.

## When this is "done"

Done = a surface exists that any future-Otto (or human
contributor) can query in under 30 seconds to answer:

1. *What does the factory depend on?* (watched list)
2. *Are any of those dependencies currently flagged or
degraded?* (current state)
3. *Is there a known issue affecting our merge / CI /
review pipeline right now?* (active incidents)

The surface must be discoverable from CLAUDE.md and AGENTS.md
(at minimum a pointer line) so cold-start sessions find it.

## Composes with

- **B-0086** (TS+Bun migration) — dependency reduction is
itself a dependency-status mitigation strategy. The fewer
external runtimes, the smaller the status-tracking
surface.
- **B-0096** (Forbidden Pattern Quarantine) — a category of
issue worth tracking is "patterns we have used that
external sources later flagged." Composes naturally if
both surfaces share a vocabulary.
- `memory/feedback_all_cryptography_quantum_resistant_even_one_gap_is_attack_vector_2026_04_23.md`
— quantum-resistant crypto policy presumes we know the
current state of our crypto primitives. Same shape:
presume-known-state requires a state-knowing-surface.
- `memory/feedback_absorb_and_contribute_community_dependency_discipline_2026_04_22.md`
— the absorb-and-contribute discipline presumes we know
what we depend on; this row makes the dependency list
legible.
- `memory/feedback_amara_poll_gate_not_ending_holding_is_not_status_2026_04_30.md`
(landing in PR #911 alongside this row) — the
poll-the-gate rule says "watch the gate, not the
ending." Knowing whether the gate (CI, merge queue,
reviewer presence) itself is dependency-degraded is part
of the gate-state. A degraded GitHub Actions queue makes
"in-progress" mean something different than usual.
- `docs/AUTONOMOUS-LOOP.md` — autonomous-loop runs on
GitHub-mediated state. Loop-tick-history rows could
cross-reference the dependency-status surface when
external incidents shape the tick.

## Source links (verbatim from Aaron's channel, 2026-04-30)

- [GitHub Community discussion 193645](https://github.com/orgs/community/discussions/193645)
- [GitHub Status page](https://www.githubstatus.com/)
- [Hacker News discussion 47881672](https://news.ycombinator.com/item?id=47881672)
- [GitHub Blog — An update on GitHub availability](https://github.blog/news-insights/company-news/an-update-on-github-availability/)
- [YouTube video b13m-iuu4XU (segment at 4:48)](https://www.youtube.com/watch?v=b13m-iuu4XU&t=288s)
- [Trunk.io — What happens if a merge queue builds on the wrong commit](https://trunk.io/blog/what-happens-if-a-merge-queue-builds-on-the-wrong-commit)

The Trunk.io post on merge-queue-builds-on-wrong-commit is
the most operationally-load-bearing of the six — it
describes a class of bug that, if present in our path,
would silently produce wrong commits while our auto-merge
plus CI gates report green. The "wrong commit" failure
mode is exactly the silent-failure shape the factory has
rules against. Worth a careful read on first absorb pass.
1 change: 1 addition & 0 deletions memory/MEMORY.md
@@ -2,6 +2,7 @@

**📌 Fast path: read `CURRENT-aaron.md` and `CURRENT-amara.md` first.** <!-- latest-paired-edit: fork-audit R/C/T diff-filter coverage + plumbing-vs-porcelain note (2026-04-29 round-10 Amara). NOTE: this comment is a single-slot "latest paired edit" marker (not a paired-edit log). Per the round-10 Amara framing the slot semantics are now explicit. -->

- [**GitHub status — first-class dependency reference (Aaron 2026-04-30)**](reference_github_status_first_class_aaron_2026_04_30.md) — Aaron 2026-04-30: GitHub is our only host; status URL is first-class repo-and-loop substrate. Pins canonical URLs (status page + summary.json API), names factory-relevant component allowlist (Pull Requests / Actions / API Requests / Webhooks / Git Operations / Issues), defines freshness-check rule on three triggers: cadence (every 10-15 min when in-flight, less when idle — *"every loop tick might be excessive but on some cadence"*), on-suspicion (anomaly investigation asks "is GitHub degraded?" before "is my logic wrong?"), and pre-mutation (strictest gate). Aaron 2026-04-30 reinforcement *"all our assumptions are based on them being healthy today which is not always true as we can see todya"*. Origin: live "Incomplete pull request results" GitHub PR-degradation incident discovered while filing B-0109 (PR #912).
- [**Kernel-pipe vs JS-space stream ordering — TS+Bun port pattern (Otto, 2026-04-30)**](feedback_kernel_pipe_vs_js_space_stream_ordering_ts_bun_port_pattern_2026_04_30.md) — TS+Bun port discipline: when porting bash `$(... 2>&1)` to `spawnSync`, merge stdout+stderr via shell-side `bash -c "<cmd> 2>&1"` (preserves chronological ordering at the kernel pipe boundary), NOT `result.stdout + result.stderr` concat in JS-space (loses ordering when child interleaves writes). Origin: PR #901 slice-18 Copilot P1 round 2. Composes with `classifySpawnFailure` 4-case helper + Otto-363 substrate-or-it-didn't-happen.
- [**DST + code coverage are universal best practices for every Zeta language (Aaron 2026-04-30)**](feedback_dst_and_coverage_universal_every_language_aaron_2026_04_30.md) — Generalises Otto-272 / Otto-281 / Otto-273 to all languages. SQLSharp is the named TS+Bun reference. Pin seeds, fake clocks, no test retries; tests cover public API surface, CI surfaces coverage, reductions fail. Per-language tooling lives in the runtime layer (`docs/best-practices/`).
- [**Host mutation receipt — ruleset 15256879 code_quality rule removed (Aaron-authorized 2026-04-29)**](feedback_host_mutation_receipt_2026_04_29_ruleset_15256879_code_quality_removed.md) — Receipt for a live host (GitHub) mutation made before executable-host-settings tooling exists. PUT /repos/Lucent-Financial-Group/Zeta/rulesets/15256879 removed `code_quality severity=all` rule (host-side / non-git-declared CodeQL owner injecting `event=dynamic` "Code Quality" runs that bypassed the source-presence gate from PR #857). Made the git-visible advanced workflow `.github/workflows/codeql.yml` the sole CodeQL owner; resolved multi-master conflict that blocked PR #849. Aaron auth: *"if the org-recommended are legacy we can remove, declarative is better."* Per Amara *"Clickops used to restore declarative ownership must become a receipt, or it becomes the next drift"* — this receipt makes the live mutation visible to future executable-host-settings reconciler. NOT precedent for casual ruleset mutations; hook denial during episode was healthy; future apply path is host-reconciler-mediated with WorkClaim + policy + receipt; do NOT broaden `gh api ... rulesets/PUT` permission. Composes with executable-host-settings design packet, Otto-363, task #342 (completed) + #343.