Skip to content

fix(core): add AWS env vars to subprocess allowlist for Bedrock auth#1060

Closed
liorfranko wants to merge 1 commit intocoleam00:devfrom
liorfranko:fix/bedrock-env-allowlist
Closed

fix(core): add AWS env vars to subprocess allowlist for Bedrock auth#1060
liorfranko wants to merge 1 commit intocoleam00:devfrom
liorfranko:fix/bedrock-env-allowlist

Conversation

@liorfranko
Copy link
Copy Markdown
Contributor

@liorfranko liorfranko commented Apr 10, 2026

Summary

  • AWS credentials (AWS_PROFILE, AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_DEFAULT_REGION) were missing from the subprocess env allowlist
  • This caused 401 authentication failures when using Claude Code via AWS Bedrock, because the Claude subprocess couldn't see the AWS credentials it needs
  • The allowlist already included CLAUDE_CODE_USE_BEDROCK but not the AWS vars it depends on

Test plan

  • Verified fix locally: server chat with Bedrock auth works after adding AWS vars to allowlist
  • Verify no regression for direct Anthropic API users (no AWS vars present = no change)
  • Verify Vertex users are unaffected

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • AWS environment variables are now available during code execution.

AWS credentials (AWS_PROFILE, AWS_REGION, AWS_ACCESS_KEY_ID, etc.) were
stripped from Claude Code subprocesses by the env allowlist, causing 401
auth failures when using Bedrock. The allowlist already included
CLAUDE_CODE_USE_BEDROCK but not the AWS vars it depends on.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 10, 2026

📝 Walkthrough

Walkthrough

The subprocess environment allowlist was extended to include six AWS-related environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_PROFILE, AWS_REGION, AWS_DEFAULT_REGION) to enable their propagation into Claude Code subprocesses.

Changes

Cohort / File(s) Summary
Environment Allowlist Configuration
packages/core/src/utils/env-allowlist.ts
Added six AWS environment variables to SUBPROCESS_ENV_ALLOWLIST to permit AWS credentials and configuration propagation to subprocesses.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Poem

🐰 AWS keys hop into the list,
No Claude Code command shall be missed!
Credentials flow swift, secure, and bright,
AWS magic takes its flight! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description addresses the core problem and solution but lacks several required sections from the template including architecture diagrams, validation evidence, security impact analysis, compatibility assessment, human verification details, and rollback plan. Complete the required template sections: provide validation evidence (test results), document security impact analysis for credential handling, confirm backward compatibility, detail human verification steps performed, describe rollback plan, and include architecture diagrams.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and specifically describes the main change: adding AWS environment variables to the subprocess allowlist for Bedrock authentication.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
packages/core/src/utils/env-allowlist.ts (1)

37-43: Add a regression test for required Bedrock AWS keys in the allowlist.

Given this was a production auth break, a small unit test asserting these keys are preserved by buildCleanSubprocessEnv() would prevent accidental regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/core/src/utils/env-allowlist.ts` around lines 37 - 43, Add a unit
test that asserts buildCleanSubprocessEnv() preserves the required Bedrock AWS
keys in the allowlist: 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY',
'AWS_SESSION_TOKEN', 'AWS_PROFILE', 'AWS_REGION', and 'AWS_DEFAULT_REGION'. Mock
or supply an env object containing these keys with sample values, call
buildCleanSubprocessEnv(env) and assert each of the listed AWS keys exists and
keeps its value in the returned object; place the test alongside existing
env-related tests and use the same test helpers/assertion style as the repo.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@packages/core/src/utils/env-allowlist.ts`:
- Around line 37-43: Add a unit test that asserts buildCleanSubprocessEnv()
preserves the required Bedrock AWS keys in the allowlist: 'AWS_ACCESS_KEY_ID',
'AWS_SECRET_ACCESS_KEY', 'AWS_SESSION_TOKEN', 'AWS_PROFILE', 'AWS_REGION', and
'AWS_DEFAULT_REGION'. Mock or supply an env object containing these keys with
sample values, call buildCleanSubprocessEnv(env) and assert each of the listed
AWS keys exists and keeps its value in the returned object; place the test
alongside existing env-related tests and use the same test helpers/assertion
style as the repo.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b8fe2a73-d23f-40bc-95b1-4af04559d69f

📥 Commits

Reviewing files that changed from the base of the PR and between e8334b3 and 40aca38.

📒 Files selected for processing (1)
  • packages/core/src/utils/env-allowlist.ts

Wirasm added a commit that referenced this pull request Apr 12, 2026
…only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see #1097)
- It attracted a constant stream of PRs adding keys (#1060, #1093, #1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.
@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented Apr 12, 2026

Addressed by the allowlist removal in #1092SUBPROCESS_ENV_ALLOWLIST has been deleted. All user-configured env vars (including AWS Bedrock vars like AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, etc.) now pass through to the subprocess automatically via process.env. Once #1092 merges, this PR can be closed.

Wirasm added a commit that referenced this pull request Apr 12, 2026
…t timeout (#1067, #1030, #1098, #1070)

* fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout (#1067)

Three bugs fixed: (1) Bun auto-loads CWD .env files before user code, leaking
non-overlapping keys into the Archon process — new stripCwdEnv() boot import
removes them before any module reads env. (2) archon serve hardcoded
skipPlatformAdapters:true, preventing Slack/Telegram/Discord from starting.
(3) Claude SDK query had no first-event timeout, causing silent 30-min hangs
when the subprocess wedges — new withFirstMessageTimeout wrapper races the
first event against a configurable deadline (default 60s).

Changes:
- Add @archon/paths/strip-cwd-env and strip-cwd-env-boot modules
- Import boot module as first import in CLI entry point
- Remove skipPlatformAdapters: true from serve.ts
- Add withFirstMessageTimeout + diagnostics to ClaudeClient
- Add CLAUDECODE=1 nested-session warning to CLI
- Add 9 unit tests (6 strip-cwd-env + 3 timeout)

Fixes #1067

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings for PR #1092

Fixed:
- Clear setTimeout timer in withFirstMessageTimeout finally block (HIGH-1)
- Add strip-cwd-env-boot to server/src/index.ts for direct dev:server path (MEDIUM-1)
- Warn to stderr on non-ENOENT errors in stripCwdEnv (MEDIUM-2)
- Update stale configuration.md docs for new env-loading mechanism (HIGH-2)
- Add ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING env vars to docs
- Add nested Claude Code hang troubleshooting entry
- Fix boot module JSDoc: "CLI and server" → "CLI" only
- Fix stripCwdEnv JSDoc: remove stale "override: true" reference
- Update .claude/rules/cli.md startup behavior section
- Update CLAUDE.md @archon/paths description with new exports

Tests added:
- Assert controller.signal.aborted on timeout
- Handle generator that completes immediately without yielding
- Strip distinct keys from different .env files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* simplify: replace string sentinel with typed error class in withFirstMessageTimeout

Replace the '__timeout__' string sentinel used to identify timeout rejections
with a dedicated FirstEventTimeoutError class. instanceof checks are more
explicit and robust than string comparison on error messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review findings — dotenv version, docs, server warning, marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate #1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per #1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).

* refactor: remove SUBPROCESS_ENV_ALLOWLIST — trust user config, strip only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see #1097)
- It attracted a constant stream of PRs adding keys (#1060, #1093, #1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.

* fix: restore override:true for archon env, add integration tests

The integration tests caught a real issue: without override:true, the
~/.archon/.env load doesn't win over shell-inherited env vars. If the
user's shell profile exports PORT=9999 and ~/.archon/.env has PORT=3000,
the user expects Archon to use 3000.

stripCwdEnv() handles CWD .env files (untrusted). override:true handles
shell-inherited vars (trusted but less specific than ~/.archon/.env).
Different concerns, both needed.

Also adds 6 integration tests covering the full entry-point flow:
1. Global auth user with ANTHROPIC_API_KEY in CWD .env — stripped
2. OAuth token in archon env + random key in CWD — CWD stripped, archon kept
3. General leak test — nothing from CWD reaches subprocess
4. Same key in both CWD and archon — archon value wins
5. CLAUDECODE markers stripped even when not from CWD .env
6. CLAUDE_CODE_OAUTH_TOKEN survives marker strip

* test: add DATABASE_URL leak scenarios to env integration tests

* fix: move CLAUDECODE warning into stripCwdEnv, remove dead useGlobalAuth logic

Review findings addressed:

1. CLAUDECODE warning was dead code — the boot import deleted CLAUDECODE
   from process.env before the warning check in cli.ts/server/index.ts
   could fire. Moved the warning into stripCwdEnv() itself, emitted
   BEFORE the deletion. Removed duplicate warning code from both entry
   points.

2. useGlobalAuth token stripping removed (intentional, not regression) —
   the old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY when
   useGlobalAuth=true. Per design discussion: the user controls
   ~/.archon/.env and all keys they set are intentional. If they want
   global auth, they just don't set tokens. Simplified buildSubprocessEnv
   to log auth mode for diagnostics only, no filtering.

3. Docs "no override needed" corrected — cli.md and configuration.md
   now reflect the actual code (override: true).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Rasmus Widing <rasmus.widing@gmail.com>
@Wirasm
Copy link
Copy Markdown
Collaborator

Wirasm commented Apr 12, 2026

Resolved by #1092 (allowlist removed — all AWS Bedrock vars pass through automatically).

@Wirasm Wirasm closed this Apr 12, 2026
prospapledge88 pushed a commit to prospapledge88/Archon that referenced this pull request Apr 14, 2026
…t timeout (coleam00#1067, coleam00#1030, coleam00#1098, coleam00#1070)

* fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout (coleam00#1067)

Three bugs fixed: (1) Bun auto-loads CWD .env files before user code, leaking
non-overlapping keys into the Archon process — new stripCwdEnv() boot import
removes them before any module reads env. (2) archon serve hardcoded
skipPlatformAdapters:true, preventing Slack/Telegram/Discord from starting.
(3) Claude SDK query had no first-event timeout, causing silent 30-min hangs
when the subprocess wedges — new withFirstMessageTimeout wrapper races the
first event against a configurable deadline (default 60s).

Changes:
- Add @archon/paths/strip-cwd-env and strip-cwd-env-boot modules
- Import boot module as first import in CLI entry point
- Remove skipPlatformAdapters: true from serve.ts
- Add withFirstMessageTimeout + diagnostics to ClaudeClient
- Add CLAUDECODE=1 nested-session warning to CLI
- Add 9 unit tests (6 strip-cwd-env + 3 timeout)

Fixes coleam00#1067

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings for PR coleam00#1092

Fixed:
- Clear setTimeout timer in withFirstMessageTimeout finally block (HIGH-1)
- Add strip-cwd-env-boot to server/src/index.ts for direct dev:server path (MEDIUM-1)
- Warn to stderr on non-ENOENT errors in stripCwdEnv (MEDIUM-2)
- Update stale configuration.md docs for new env-loading mechanism (HIGH-2)
- Add ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING env vars to docs
- Add nested Claude Code hang troubleshooting entry
- Fix boot module JSDoc: "CLI and server" → "CLI" only
- Fix stripCwdEnv JSDoc: remove stale "override: true" reference
- Update .claude/rules/cli.md startup behavior section
- Update CLAUDE.md @archon/paths description with new exports

Tests added:
- Assert controller.signal.aborted on timeout
- Handle generator that completes immediately without yielding
- Strip distinct keys from different .env files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* simplify: replace string sentinel with typed error class in withFirstMessageTimeout

Replace the '__timeout__' string sentinel used to identify timeout rejections
with a dedicated FirstEventTimeoutError class. instanceof checks are more
explicit and robust than string comparison on error messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review findings — dotenv version, docs, server warning, marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (coleam00#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate coleam00#1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per coleam00#1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).

* refactor: remove SUBPROCESS_ENV_ALLOWLIST — trust user config, strip only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see coleam00#1097)
- It attracted a constant stream of PRs adding keys (coleam00#1060, coleam00#1093, coleam00#1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.

* fix: restore override:true for archon env, add integration tests

The integration tests caught a real issue: without override:true, the
~/.archon/.env load doesn't win over shell-inherited env vars. If the
user's shell profile exports PORT=9999 and ~/.archon/.env has PORT=3000,
the user expects Archon to use 3000.

stripCwdEnv() handles CWD .env files (untrusted). override:true handles
shell-inherited vars (trusted but less specific than ~/.archon/.env).
Different concerns, both needed.

Also adds 6 integration tests covering the full entry-point flow:
1. Global auth user with ANTHROPIC_API_KEY in CWD .env — stripped
2. OAuth token in archon env + random key in CWD — CWD stripped, archon kept
3. General leak test — nothing from CWD reaches subprocess
4. Same key in both CWD and archon — archon value wins
5. CLAUDECODE markers stripped even when not from CWD .env
6. CLAUDE_CODE_OAUTH_TOKEN survives marker strip

* test: add DATABASE_URL leak scenarios to env integration tests

* fix: move CLAUDECODE warning into stripCwdEnv, remove dead useGlobalAuth logic

Review findings addressed:

1. CLAUDECODE warning was dead code — the boot import deleted CLAUDECODE
   from process.env before the warning check in cli.ts/server/index.ts
   could fire. Moved the warning into stripCwdEnv() itself, emitted
   BEFORE the deletion. Removed duplicate warning code from both entry
   points.

2. useGlobalAuth token stripping removed (intentional, not regression) —
   the old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY when
   useGlobalAuth=true. Per design discussion: the user controls
   ~/.archon/.env and all keys they set are intentional. If they want
   global auth, they just don't set tokens. Simplified buildSubprocessEnv
   to log auth mode for diagnostics only, no filtering.

3. Docs "no override needed" corrected — cli.md and configuration.md
   now reflect the actual code (override: true).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Rasmus Widing <rasmus.widing@gmail.com>
Tyone88 pushed a commit to Tyone88/Archon that referenced this pull request Apr 16, 2026
…t timeout (coleam00#1067, coleam00#1030, coleam00#1098, coleam00#1070)

* fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout (coleam00#1067)

Three bugs fixed: (1) Bun auto-loads CWD .env files before user code, leaking
non-overlapping keys into the Archon process — new stripCwdEnv() boot import
removes them before any module reads env. (2) archon serve hardcoded
skipPlatformAdapters:true, preventing Slack/Telegram/Discord from starting.
(3) Claude SDK query had no first-event timeout, causing silent 30-min hangs
when the subprocess wedges — new withFirstMessageTimeout wrapper races the
first event against a configurable deadline (default 60s).

Changes:
- Add @archon/paths/strip-cwd-env and strip-cwd-env-boot modules
- Import boot module as first import in CLI entry point
- Remove skipPlatformAdapters: true from serve.ts
- Add withFirstMessageTimeout + diagnostics to ClaudeClient
- Add CLAUDECODE=1 nested-session warning to CLI
- Add 9 unit tests (6 strip-cwd-env + 3 timeout)

Fixes coleam00#1067

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings for PR coleam00#1092

Fixed:
- Clear setTimeout timer in withFirstMessageTimeout finally block (HIGH-1)
- Add strip-cwd-env-boot to server/src/index.ts for direct dev:server path (MEDIUM-1)
- Warn to stderr on non-ENOENT errors in stripCwdEnv (MEDIUM-2)
- Update stale configuration.md docs for new env-loading mechanism (HIGH-2)
- Add ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING env vars to docs
- Add nested Claude Code hang troubleshooting entry
- Fix boot module JSDoc: "CLI and server" → "CLI" only
- Fix stripCwdEnv JSDoc: remove stale "override: true" reference
- Update .claude/rules/cli.md startup behavior section
- Update CLAUDE.md @archon/paths description with new exports

Tests added:
- Assert controller.signal.aborted on timeout
- Handle generator that completes immediately without yielding
- Strip distinct keys from different .env files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* simplify: replace string sentinel with typed error class in withFirstMessageTimeout

Replace the '__timeout__' string sentinel used to identify timeout rejections
with a dedicated FirstEventTimeoutError class. instanceof checks are more
explicit and robust than string comparison on error messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review findings — dotenv version, docs, server warning, marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (coleam00#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate coleam00#1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per coleam00#1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).

* refactor: remove SUBPROCESS_ENV_ALLOWLIST — trust user config, strip only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see coleam00#1097)
- It attracted a constant stream of PRs adding keys (coleam00#1060, coleam00#1093, coleam00#1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.

* fix: restore override:true for archon env, add integration tests

The integration tests caught a real issue: without override:true, the
~/.archon/.env load doesn't win over shell-inherited env vars. If the
user's shell profile exports PORT=9999 and ~/.archon/.env has PORT=3000,
the user expects Archon to use 3000.

stripCwdEnv() handles CWD .env files (untrusted). override:true handles
shell-inherited vars (trusted but less specific than ~/.archon/.env).
Different concerns, both needed.

Also adds 6 integration tests covering the full entry-point flow:
1. Global auth user with ANTHROPIC_API_KEY in CWD .env — stripped
2. OAuth token in archon env + random key in CWD — CWD stripped, archon kept
3. General leak test — nothing from CWD reaches subprocess
4. Same key in both CWD and archon — archon value wins
5. CLAUDECODE markers stripped even when not from CWD .env
6. CLAUDE_CODE_OAUTH_TOKEN survives marker strip

* test: add DATABASE_URL leak scenarios to env integration tests

* fix: move CLAUDECODE warning into stripCwdEnv, remove dead useGlobalAuth logic

Review findings addressed:

1. CLAUDECODE warning was dead code — the boot import deleted CLAUDECODE
   from process.env before the warning check in cli.ts/server/index.ts
   could fire. Moved the warning into stripCwdEnv() itself, emitted
   BEFORE the deletion. Removed duplicate warning code from both entry
   points.

2. useGlobalAuth token stripping removed (intentional, not regression) —
   the old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY when
   useGlobalAuth=true. Per design discussion: the user controls
   ~/.archon/.env and all keys they set are intentional. If they want
   global auth, they just don't set tokens. Simplified buildSubprocessEnv
   to log auth mode for diagnostics only, no filtering.

3. Docs "no override needed" corrected — cli.md and configuration.md
   now reflect the actual code (override: true).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Rasmus Widing <rasmus.widing@gmail.com>
joaobmonteiro pushed a commit to joaobmonteiro/Archon that referenced this pull request Apr 26, 2026
…t timeout (coleam00#1067, coleam00#1030, coleam00#1098, coleam00#1070)

* fix: strip CWD .env leak, enable platform adapters in serve, add first-event timeout (coleam00#1067)

Three bugs fixed: (1) Bun auto-loads CWD .env files before user code, leaking
non-overlapping keys into the Archon process — new stripCwdEnv() boot import
removes them before any module reads env. (2) archon serve hardcoded
skipPlatformAdapters:true, preventing Slack/Telegram/Discord from starting.
(3) Claude SDK query had no first-event timeout, causing silent 30-min hangs
when the subprocess wedges — new withFirstMessageTimeout wrapper races the
first event against a configurable deadline (default 60s).

Changes:
- Add @archon/paths/strip-cwd-env and strip-cwd-env-boot modules
- Import boot module as first import in CLI entry point
- Remove skipPlatformAdapters: true from serve.ts
- Add withFirstMessageTimeout + diagnostics to ClaudeClient
- Add CLAUDECODE=1 nested-session warning to CLI
- Add 9 unit tests (6 strip-cwd-env + 3 timeout)

Fixes coleam00#1067

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address review findings for PR coleam00#1092

Fixed:
- Clear setTimeout timer in withFirstMessageTimeout finally block (HIGH-1)
- Add strip-cwd-env-boot to server/src/index.ts for direct dev:server path (MEDIUM-1)
- Warn to stderr on non-ENOENT errors in stripCwdEnv (MEDIUM-2)
- Update stale configuration.md docs for new env-loading mechanism (HIGH-2)
- Add ARCHON_CLAUDE_FIRST_EVENT_TIMEOUT_MS and ARCHON_SUPPRESS_NESTED_CLAUDE_WARNING env vars to docs
- Add nested Claude Code hang troubleshooting entry
- Fix boot module JSDoc: "CLI and server" → "CLI" only
- Fix stripCwdEnv JSDoc: remove stale "override: true" reference
- Update .claude/rules/cli.md startup behavior section
- Update CLAUDE.md @archon/paths description with new exports

Tests added:
- Assert controller.signal.aborted on timeout
- Handle generator that completes immediately without yielding
- Strip distinct keys from different .env files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* simplify: replace string sentinel with typed error class in withFirstMessageTimeout

Replace the '__timeout__' string sentinel used to identify timeout rejections
with a dedicated FirstEventTimeoutError class. instanceof checks are more
explicit and robust than string comparison on error messages.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address review findings — dotenv version, docs, server warning, marker strip, tests

1. Align dotenv to ^17 (was ^16, rest of monorepo uses ^17.2.3)
2. Remove incorrect SUBPROCESS_ENV_ALLOWLIST claim from docs — the SDK
   bypasses the env option and uses process.env directly (coleam00#1097)
3. Add CLAUDECODE=1 warning to server entry point (was only in CLI)
4. Add diagnostic payload content test for withFirstMessageTimeout
5. Integrate coleam00#1097's finding: strip CLAUDECODE + CLAUDE_CODE_* session
   markers (except auth vars) + NODE_OPTIONS + VSCODE_INSPECTOR_OPTIONS
   from process.env at entry point. Pattern-matched on CLAUDE_CODE_*
   prefix rather than hardcoding 6 names, so future Claude Code markers
   are handled automatically. Auth vars (CLAUDE_CODE_OAUTH_TOKEN,
   CLAUDE_CODE_USE_BEDROCK, CLAUDE_CODE_USE_VERTEX) are preserved.

   Root cause per coleam00#1097: the Claude Agent SDK leaks process.env into the
   spawned child regardless of the explicit env option, so the only way
   to prevent the nested-session deadlock is to delete the markers from
   process.env at the entry point.

Validation: bun run validate passes, 125 paths tests (6 new marker
tests), 60 claude tests (1 new diagnostic test), DATABASE_URL leak
verified stripped (target repo .env DATABASE_URL does not affect Archon
DB selection).

* refactor: remove SUBPROCESS_ENV_ALLOWLIST — trust user config, strip only CWD

The allowlist was wrong for a single-developer tool:
- It blocked keys the user intentionally set in ~/.archon/.env
  (ANTHROPIC_API_KEY, AWS_*, CLAUDE_CONFIG_DIR, MiniMax vars, etc.)
- It was bypassed by the SDK anyway (process.env leaks to subprocess
  regardless of the env option — see coleam00#1097)
- It attracted a constant stream of PRs adding keys (coleam00#1060, coleam00#1093, coleam00#1099)

New model: CWD .env keys are the only untrusted source. stripCwdEnv()
at entry point handles that. Everything in ~/.archon/.env + shell env
passes through to the subprocess. No filtering, no second-guessing.

Changes:
- Delete env-allowlist.ts and env-allowlist.test.ts
- Simplify buildSubprocessEnv() to return { ...process.env } with
  auth-mode logging (no token stripping — user controls their config)
- Replace 4 allowlist-based tests with 1 pass-through test
- Remove env-allowlist.test.ts from core test batch
- Update security.md and cli.md docs to reflect the new model

The CLAUDECODE + CLAUDE_CODE_* marker strip and NODE_OPTIONS strip
remain in stripCwdEnv() at entry point — those are process-level
safety (not per-subprocess filtering) and are needed regardless.

* fix: restore override:true for archon env, add integration tests

The integration tests caught a real issue: without override:true, the
~/.archon/.env load doesn't win over shell-inherited env vars. If the
user's shell profile exports PORT=9999 and ~/.archon/.env has PORT=3000,
the user expects Archon to use 3000.

stripCwdEnv() handles CWD .env files (untrusted). override:true handles
shell-inherited vars (trusted but less specific than ~/.archon/.env).
Different concerns, both needed.

Also adds 6 integration tests covering the full entry-point flow:
1. Global auth user with ANTHROPIC_API_KEY in CWD .env — stripped
2. OAuth token in archon env + random key in CWD — CWD stripped, archon kept
3. General leak test — nothing from CWD reaches subprocess
4. Same key in both CWD and archon — archon value wins
5. CLAUDECODE markers stripped even when not from CWD .env
6. CLAUDE_CODE_OAUTH_TOKEN survives marker strip

* test: add DATABASE_URL leak scenarios to env integration tests

* fix: move CLAUDECODE warning into stripCwdEnv, remove dead useGlobalAuth logic

Review findings addressed:

1. CLAUDECODE warning was dead code — the boot import deleted CLAUDECODE
   from process.env before the warning check in cli.ts/server/index.ts
   could fire. Moved the warning into stripCwdEnv() itself, emitted
   BEFORE the deletion. Removed duplicate warning code from both entry
   points.

2. useGlobalAuth token stripping removed (intentional, not regression) —
   the old code stripped CLAUDE_CODE_OAUTH_TOKEN and CLAUDE_API_KEY when
   useGlobalAuth=true. Per design discussion: the user controls
   ~/.archon/.env and all keys they set are intentional. If they want
   global auth, they just don't set tokens. Simplified buildSubprocessEnv
   to log auth mode for diagnostics only, no filtering.

3. Docs "no override needed" corrected — cli.md and configuration.md
   now reflect the actual code (override: true).

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Rasmus Widing <rasmus.widing@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants