fix: close monitoring + deploy gaps from post-incident audit #775
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| -- migrate:up | ||
|
|
||
| -- Soft-delete feeds whose (connector_key, organization_id) has no active | ||
| -- connector_definition row. | ||
| -- | ||
| -- Why: the 2026-05-16 audit found feeds 117-155 (and others) referencing | ||
| -- connector_key='website' in orgs that have no active definition for it | ||
| -- (only one org has `website` active; one definition is archived). Every | ||
| -- CheckDueFeeds tick (every minute) tried to materialize a sync run for | ||
| -- these feeds and threw "No active connector definition found for X." — | ||
| -- producing ~380 error logs / minute that masked real signal in stdout. | ||
| -- | ||
| -- The app-side code path now warns + skips (no throw) for the same case | ||
| -- so future orphans don't spam logs either. This migration is the one-time | ||
| -- cleanup of the existing data. | ||
| -- | ||
| -- Conservative criteria — match exactly the set CheckDueFeeds processes | ||
| -- (so we only soft-delete feeds that actually produce the error stream). | ||
| -- - feed has no pinned_version (= would have looked up connector_definitions) | ||
| -- - feed.deleted_at IS NULL (still considered active) | ||
| -- - feed.status = 'active' (CheckDueFeeds filters on this — see | ||
| -- packages/server/src/scheduled/check-due-feeds.ts:36-43) | ||
| -- - connection.deleted_at IS NULL AND connection.status = 'active' (same) | ||
| -- - NO active connector_definition exists for that (key, organization) pair | ||
| -- | ||
| -- Feeds in paused / pending_auth / error / revoked states are left alone | ||
| -- — operators may be mid-recovery on them and they don't contribute to | ||
| -- the error spam (CheckDueFeeds skips them anyway). | ||
| -- | ||
| -- The same feed remains recoverable: clearing `deleted_at` + reinstalling | ||
| -- the connector definition for the org restores it. | ||
|
|
||
| UPDATE public.feeds f | ||
| SET deleted_at = now() | ||
| FROM public.connections c | ||
| WHERE f.connection_id = c.id | ||
| AND f.deleted_at IS NULL | ||
| AND f.pinned_version IS NULL | ||
| AND f.status = 'active' | ||
| AND c.deleted_at IS NULL | ||
| AND c.status = 'active' | ||
| AND NOT EXISTS ( | ||
| SELECT 1 | ||
| FROM public.connector_definitions cd | ||
| WHERE cd.key = c.connector_key | ||
| AND cd.organization_id = f.organization_id | ||
| AND cd.status = 'active' | ||
| ); | ||
|
|
||
| -- migrate:down | ||
|
|
||
| -- No-op: re-attaching the orphan feeds would require knowing which were | ||
| -- soft-deleted by this migration vs. by an operator action. The original | ||
| -- error condition is fixed in code; this migration is a one-shot data | ||
| -- cleanup. To recover specific feeds in prod, clear `deleted_at` on the | ||
| -- targeted rows manually and re-install the connector definition. |
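Reviewer aid: the selection criteria in the `UPDATE` above can be restated as an in-memory predicate, which makes it easier to reason about which rows the migration touches. This is only a sketch under assumptions: the `Feed`, `Connection`, and `ConnectorDefinition` shapes and the `findOrphanFeeds` helper are hypothetical names for illustration and are not part of this PR. Only the columns that the `WHERE` clause reads are modelled.

```typescript
// Hypothetical row shapes: only the columns the migration's WHERE clause reads.
interface Feed {
  id: number;
  connectionId: number;
  organizationId: string;
  pinnedVersion: string | null;
  status: string;
  deletedAt: Date | null;
}

interface Connection {
  id: number;
  connectorKey: string;
  status: string;
  deletedAt: Date | null;
}

interface ConnectorDefinition {
  key: string;
  organizationId: string;
  status: string;
}

// Returns the feeds the migration would soft-delete, mirroring its criteria.
function findOrphanFeeds(
  feeds: Feed[],
  connections: Connection[],
  definitions: ConnectorDefinition[],
): Feed[] {
  const connectionsById = new Map<number, Connection>();
  for (const c of connections) connectionsById.set(c.id, c);

  // Set of (key, organization) pairs that have an ACTIVE definition.
  const activePairs = new Set<string>();
  for (const d of definitions) {
    if (d.status === 'active') activePairs.add(JSON.stringify([d.key, d.organizationId]));
  }

  return feeds.filter((f) => {
    const c = connectionsById.get(f.connectionId);
    if (c === undefined) return false; // the join drops feeds with no connection row
    return (
      f.deletedAt === null &&
      f.pinnedVersion === null &&
      f.status === 'active' &&
      c.deletedAt === null &&
      c.status === 'active' &&
      !activePairs.has(JSON.stringify([c.connectorKey, f.organizationId]))
    );
  });
}
```

Note that an archived definition does not count as a match, so a feed whose only definition is archived is still treated as an orphan, which is the case the migration comment describes.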
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,3 +1,4 @@ | ||
| import * as Sentry from '@sentry/node'; | ||
| import pino from 'pino'; | ||
|
|
||
| /** | ||
|
|
@@ -22,9 +23,6 @@ const getLogLevel = (): pino.Level => { | |
| return 'debug'; | ||
| }; | ||
|
|
||
| /** | ||
| * Create a Pino logger instance | ||
| */ | ||
| // pino's default error serializer only fires for the `err` key, so | ||
| // `logger.error({ error }, '...')` silently logs `error: {}` (Error's own | ||
| // fields are non-enumerable). Register the same serializer on the `error` | ||
|
|
@@ -33,20 +31,121 @@ const getLogLevel = (): pino.Level => { | |
| // and hid `column "events.search_tsv" does not exist`. | ||
| const errSerializer = pino.stdSerializers.err; | ||
|
|
||
| const logger = pino({ | ||
| level: getLogLevel(), | ||
| browser: { | ||
| asObject: false, | ||
| /** | ||
| * Sentry forwarding for logger.error() and logger.fatal(). | ||
| * | ||
| * Prior to this hook, `logger.error(...)` only wrote to stdout. The Sentry | ||
| * capture middleware in server.ts:85-113 only fires on HTTP 500 responses, | ||
| * so error-logged failures inside background jobs (CheckDueFeeds, runs | ||
| * queue, scheduled tasks) were invisible to monitoring. The 2026-05-16 | ||
| * audit found ~1914 errors / 5 min in stdout with zero Sentry issues. | ||
| * | ||
| * In-process dedupe: repeating errors are common (e.g. an orphan feed | ||
| * fails every 1-min CheckDueFeeds tick). We fingerprint by | ||
| * (msg, err.type, top stack frame) and only forward once per | ||
| * SENTRY_DEDUPE_WINDOW_MS per fingerprint. Sentry has its own grouping | ||
| * but every captureException still incurs an HTTP call + cost; this | ||
| * cuts the load without losing signal. | ||
| */ | ||
| const SENTRY_DEDUPE_WINDOW_MS = 60_000; | ||
| const SENTRY_DEDUPE_MAX_ENTRIES = 1000; | ||
| const sentryDedupe: Map<string, number> = new Map(); | ||
|
|
||
| function fingerprintAndCapture(parsed: Record<string, unknown>): void { | ||
| const level = parsed.level; | ||
| if (level !== 'error' && level !== 'fatal') return; | ||
|
|
||
| // Caller already captured this to Sentry (see server.ts onError + | ||
| // 500-response middleware). Skip to avoid duplicate events. | ||
| if (parsed.sentryReported === true) return; | ||
|
|
||
| const msg = typeof parsed.msg === 'string' ? parsed.msg : 'logger.error'; | ||
| // pino.stdSerializers.err normalises both `err` and `error` (see serializers | ||
| // config below) to objects with `type` / `message` / `stack`. | ||
| const errObj = | ||
| (parsed.err as { type?: string; message?: string; stack?: string } | undefined) ?? | ||
| (parsed.error as { type?: string; message?: string; stack?: string } | undefined); | ||
|
|
||
| // Include err.message in the fingerprint — pre-fix, "(msg, err.type, | ||
| // top stack frame)" grouped distinct errors raised from the same | ||
| // catch site (same Error type, same wrapping log line). One legit | ||
| // incident could be masked by a noisy unrelated one within the 60s | ||
| // window. err.message disambiguates them. | ||
| const errType = errObj?.type ?? ''; | ||
| const errMessage = errObj?.message ?? ''; | ||
| const stackTop = (errObj?.stack ?? '').split('\n')[1]?.trim() ?? ''; | ||
| const fingerprint = `${msg}|${errType}|${errMessage}|${stackTop}`; | ||
|
|
||
| const now = Date.now(); | ||
| const last = sentryDedupe.get(fingerprint); | ||
| if (last !== undefined && now - last < SENTRY_DEDUPE_WINDOW_MS) return; | ||
| sentryDedupe.set(fingerprint, now); | ||
|
|
||
| // Bound the dedupe map so a long-running pod doesn't grow it without limit. | ||
| if (sentryDedupe.size > SENTRY_DEDUPE_MAX_ENTRIES) { | ||
| const oldest = sentryDedupe.keys().next().value; | ||
| if (oldest !== undefined) sentryDedupe.delete(oldest); | ||
| } | ||
|
|
||
| try { | ||
| if (errObj?.message) { | ||
| // Reconstruct an Error so Sentry's grouping works on the stack. | ||
| const reconstructed = new Error(errObj.message); | ||
| if (errObj.stack) reconstructed.stack = errObj.stack; | ||
| Sentry.captureException(reconstructed, { | ||
| extra: parsed, | ||
|
When an error log includes application payloads, this forwards the entire parsed log line to Sentry as […] |
||
| tags: { source: 'pino', level: String(level) }, | ||
| }); | ||
| } else { | ||
| Sentry.captureMessage(msg, { | ||
| level: level === 'fatal' ? 'fatal' : 'error', | ||
| extra: parsed, | ||
| tags: { source: 'pino' }, | ||
| }); | ||
| } | ||
| } catch { | ||
| // Sentry not initialised (test envs) or transient SDK failure — never | ||
| // crash the logger over telemetry. | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * pino destination that mirrors lines to stdout AND inspects each line for | ||
| * Sentry forwarding. Sync write is intentional: pino's default stdout path | ||
| * is sync too, and the JSON.parse + dedupe lookup is sub-microsecond. | ||
| */ | ||
| const sentryAwareStream: pino.DestinationStream = { | ||
| write(line: string): void { | ||
| process.stdout.write(line); | ||
| let parsed: unknown; | ||
| try { | ||
| parsed = JSON.parse(line); | ||
| } catch { | ||
| return; | ||
| } | ||
| if (parsed && typeof parsed === 'object') { | ||
| fingerprintAndCapture(parsed as Record<string, unknown>); | ||
| } | ||
| }, | ||
| formatters: { | ||
| level: (label) => { | ||
| return { level: label }; | ||
| }; | ||
|
|
||
| const logger = pino( | ||
| { | ||
| level: getLogLevel(), | ||
| browser: { | ||
| asObject: false, | ||
| }, | ||
| formatters: { | ||
| level: (label) => { | ||
| return { level: label }; | ||
| }, | ||
| }, | ||
| serializers: { | ||
| err: errSerializer, | ||
| error: errSerializer, | ||
| }, | ||
| }, | ||
| serializers: { | ||
| err: errSerializer, | ||
| error: errSerializer, | ||
| }, | ||
| }); | ||
| sentryAwareStream | ||
| ); | ||
|
|
||
| export default logger; | ||
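For reviewers who want to reason about the dedupe behaviour in isolation, the fingerprint and window logic from `fingerprintAndCapture` can be sketched as a small standalone unit. This is an illustration under assumptions, not the PR's code: `DedupeWindow`, `shouldForward`, `buildFingerprint`, and the injected `now` argument are hypothetical names introduced here so the logic can run without Sentry or pino. One deliberate difference is labelled in the comments: the sketch deletes a key before re-setting it, so eviction removes the least recently forwarded fingerprint rather than the earliest-inserted one.

```typescript
// Hypothetical standalone model of the dedupe step; not the PR's implementation.
interface SerializedError {
  type?: string;
  message?: string;
  stack?: string;
}

// Mirrors the (msg, err.type, err.message, top stack frame) fingerprint.
function buildFingerprint(msg: string, err?: SerializedError): string {
  // Line 0 of a stack is "Type: message"; line 1 is the top frame.
  const stackTop = (err?.stack ?? '').split('\n')[1]?.trim() ?? '';
  return [msg, err?.type ?? '', err?.message ?? '', stackTop].join('|');
}

class DedupeWindow {
  private readonly lastSeen: Map<string, number> = new Map();
  private readonly windowMs: number;
  private readonly maxEntries: number;

  constructor(windowMs: number, maxEntries: number) {
    this.windowMs = windowMs;
    this.maxEntries = maxEntries;
  }

  // Returns true when the event should be forwarded, false when suppressed.
  shouldForward(fingerprint: string, now: number): boolean {
    const last = this.lastSeen.get(fingerprint);
    if (last !== undefined && now - last < this.windowMs) return false;

    // Assumption that differs from the PR: delete before set so the refreshed
    // key moves to the end of the Map's insertion order.
    this.lastSeen.delete(fingerprint);
    this.lastSeen.set(fingerprint, now);

    // Bound memory: drop the entry at the front of the insertion order.
    if (this.lastSeen.size > this.maxEntries) {
      const oldest = this.lastSeen.keys().next().value;
      if (oldest !== undefined) this.lastSeen.delete(oldest);
    }
    return true;
  }
}
```

Passing `now` explicitly instead of calling `Date.now()` inside the method keeps the window logic deterministic under test; the production code would presumably keep reading the clock itself.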
🧩 Analysis chain

🌐 Web query: Does Helm's `default` function treat numeric `0` as empty (therefore returning the fallback value)?

💡 Result: Yes. Helm's `default` function treats the numeric value `0` as "empty", so it returns the fallback value instead of `0`. This comes from Sprig's `default` semantics: the definition of "empty" includes numeric `0`; a non-empty given value is used, otherwise the default is returned [1]. Helm documents `default` as `default DEFAULT_VALUE GIVEN_VALUE` and states that it returns the default "if [the given value] is empty" [2]. Helm issues report that when a value is set to `0` (integer), `default` behaves as if it were unset or empty and the fallback is used [3][4]. Practical implication: if `0` is a valid value in your chart, don't use `default` directly; use an explicit check that distinguishes "not set" from "set to 0", as suggested in Helm community discussions [3][4].

Citations:
- `default` template function treats 0 value as unspecified helm/helm#13036

🏁 Script executed: `fd -type f -name 'values.yaml' charts/`
Repository: lobu-ai/lobu
Length of output: 227
`default` here overrides an explicit `0` and breaks configurability.

On Line 50 and Line 122, Helm `default` will treat `0` as empty and fall back to `45`/`15`. That means operators cannot intentionally set `preStopDelaySeconds: 0` (disable delay) or `terminationGracePeriodSeconds: 0`.
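To make the pitfall concrete, the difference between "fall back when empty" and "fall back only when unset" can be modelled in a few lines of TypeScript. This is a rough model of the semantics described in the query result above, not Helm or Sprig code and not the reviewer's suggested change; `isEmptyForDefault`, `helmStyleDefault`, and `explicitDefault` are hypothetical names introduced for illustration.

```typescript
// Rough model of the "empty" check described above: nil, 0, "", false,
// empty lists, and empty maps all count as empty. Illustrative only.
function isEmptyForDefault(value: unknown): boolean {
  if (value === undefined || value === null) return true;
  if (value === 0 || value === '' || value === false) return true;
  if (Array.isArray(value)) return value.length === 0;
  if (typeof value === 'object') return Object.keys(value as object).length === 0;
  return false;
}

// Models `default FALLBACK GIVEN`: an explicit 0 is replaced by the fallback.
function helmStyleDefault<T>(fallback: T, given: T | null | undefined): T {
  return isEmptyForDefault(given) ? fallback : (given as T);
}

// Models an explicit "is it set?" check: only null/undefined use the fallback.
function explicitDefault<T>(fallback: T, given: T | null | undefined): T {
  return given ?? fallback;
}
```

With these definitions, `helmStyleDefault(45, 0)` yields the fallback while `explicitDefault(45, 0)` preserves the operator's `0`, which is the behaviour the comment asks for with `preStopDelaySeconds` and `terminationGracePeriodSeconds`.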