From 297e34adf70dea8aa31ccfbf99c1f43d9def80c0 Mon Sep 17 00:00:00 2001 From: myelinated-wackerow <263208946+myelinated-wackerow@users.noreply.github.com> Date: Sun, 19 Apr 2026 10:38:54 -0700 Subject: [PATCH 1/4] docs(intl-pipeline): add orchestration section Document pending-as-baseline behavior, temp branch lifecycle, base-branch-moved-during-run handling, non-English file edit policy with stamp-only escape hatch, and the orchestration contract. Rename "staging branch" to "pending branch" throughout for consistency with intl/pending- naming. Co-Authored-By: Claude Opus 4.7 Co-Authored-By: wackerow <54227730+wackerow@users.noreply.github.com> --- tests/specs/PIPELINE-SPEC.md | 66 ++++++++++++++++++++++++++++++++++++ 1 file changed, 66 insertions(+) diff --git a/tests/specs/PIPELINE-SPEC.md b/tests/specs/PIPELINE-SPEC.md index fba2ec00a4d..c329f876da6 100644 --- a/tests/specs/PIPELINE-SPEC.md +++ b/tests/specs/PIPELINE-SPEC.md @@ -302,6 +302,72 @@ Same pattern for JSON fixtures (per language). --- +## Orchestration + +The per-file pipeline (Phases 1-6) is a pure function. The orchestration layer wraps it to coordinate multiple pipeline runs over time against a shared base branch (e.g., `dev`). + +### Pending branch as durable cursor + +Each base branch has a corresponding pending branch: `intl/pending-` (for example, `intl/pending-dev`). The pending branch is the durable accumulator of all translations against that base -- it advances forward across pipeline runs until its contents are merged back into the base. + +**Lifecycle:** + +1. **First run (pending does not exist):** + - Create `intl/pending-` from `` HEAD + - Translate, stamp manifests, merge into pending + - Open PR: `intl/pending-` → `` + +2. **Subsequent run (pending exists):** + - Merge `` into pending first. This brings in any new English that landed on base since the previous run. **Fail fast** if the merge conflicts -- do not do any translation work. 
+ - Use pending's state as the baseline: the pipeline's local working tree and temp branch both derive from pending (not base). Drift detection compares current English against the manifests on pending (which are stamped to the previous run's commit), not against base. + - Translate only what changed since the last stamp. + - Merge the run's temp branch back into pending. The existing PR gets updated. + +3. **After pending PR is merged:** + - The pending branch is deleted (by the normal PR merge flow or manually). The next pipeline run starts fresh, creating a new pending branch. + +### Why pending-as-baseline + +Without this, a second run against the same base would re-translate English that the first run already handled. Non-deterministic LLM output means Run 2's translations would differ from Run 1's for the same sections, producing merge conflicts on the pending branch after expensive translation work. + +With pending-as-baseline: +- The manifests on pending are authoritative. "What changed" is measured from the last stamp, not from base. +- Sections already translated in Run 1 are unchanged for Run 2 (same English → same stamped hash → no drift). +- Run 2 translates only the delta introduced since Run 1 (new PRs merged to base between runs). +- Merges back into pending are always fast-forward or clean, never conflicting. + +### Temp branch lifecycle + +Each pipeline run creates an ephemeral temp branch (`tmp-intl/run-`) to accumulate its commits before merging into pending. + +- **Created from:** pending's HEAD (if pending exists), otherwise base's HEAD. +- **Deleted:** after successful merge into pending. Temp branches are not audit artifacts -- once their commits are on pending, they serve no purpose. +- **Preserved:** only when the pipeline fails partway through translation or when the final merge into pending fails. This is a debug aid for manual recovery. 
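The temp branch rules above can be sketched as two small pure functions. This is only an illustration under stated assumptions: `RunOutcome`, `BranchHeads`, `tempBranchSourceSha`, and `shouldDeleteTempBranch` are hypothetical names introduced here, not identifiers from the pipeline, and the SHAs are assumed to have been looked up already.

```typescript
// Hypothetical sketch of the temp branch lifecycle rules.
// None of these names come from the pipeline itself.

type RunOutcome =
  | "merged" // temp branch merged cleanly into pending
  | "translation-failed" // pipeline failed partway through translation
  | "merge-failed" // final merge into pending failed

interface BranchHeads {
  baseSha: string
  /** Undefined when the pending branch does not exist yet. */
  pendingSha?: string
}

/** Created from pending's HEAD if pending exists, otherwise from base's HEAD. */
function tempBranchSourceSha(heads: BranchHeads): string {
  return heads.pendingSha ?? heads.baseSha
}

/** Deleted after a successful merge; preserved on failure as a debug aid. */
function shouldDeleteTempBranch(outcome: RunOutcome): boolean {
  return outcome === "merged"
}
```

Keeping both decisions as pure functions of already-fetched state would make them easy to unit test without touching the GitHub API.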
+ +### Base-branch-moved-during-run + +If the base branch advances while the pipeline is running (a new PR merges to `` between `start` and `end` of a run), the run's output is based on slightly stale English content. This is acceptable: the next pipeline run will see the new English state and translate the delta. No special handling is required -- the orchestration naturally catches up. + +### Non-English file edit policy + +Non-English translation files should not be manually edited once the pipeline is in production. The pipeline is the single writer for locale files. Manual edits break the manifest's "source of truth" model and risk conflicts with pipeline output. + +**Escape hatch for rare cases:** +When a manual non-English edit is genuinely needed (fixing a translation error, emergency patch): +1. Only do this when the pending branch for the relevant base does not exist (i.e., no pending PR against that base). If one exists, merge or close it first. +2. Make the edit directly to ``. +3. Run the pipeline in `--stamp-only` mode to update manifests to reflect the current file state without calling the LLM. This tells the next incremental run that the current state is the canonical state. + +### Summary: orchestration contract + +Given a sequence of pipeline runs against the same base: +- Each run's scope (which sections it translates) is deterministic given its inputs (current English + pending manifests); the translated text itself is not, because LLM output is non-deterministic. +- The pending branch is the sole accumulator. Each run advances it forward. +- Merge conflicts (base-into-pending, tmp-into-pending) abort the run with a clear error. They never corrupt existing translations or silently drop work. +- Successful runs leave the repository in a state where the next run is idempotent: if nothing changed in base, a rerun produces zero drift and zero LLM calls. + +--- + ## Open questions - **Structural mismatch handling:** When a locale file has fewer inline elements than English (e.g., Urdu drops 2 of 4 links in a sentence), this is a structural integrity violation. 
The pipeline should flag this for human review rather than silently skipping. How this flag is surfaced (PR comment, log warning, separate report) is TBD. From 3b794ee4bd2b8aad79b166a5f2579bd980477f16 Mon Sep 17 00:00:00 2001 From: myelinated-wackerow <263208946+myelinated-wackerow@users.noreply.github.com> Date: Sun, 19 Apr 2026 10:40:26 -0700 Subject: [PATCH 2/4] fix(intl-pipeline): use pending as baseline on re-run When the pending branch already exists for the base (prior run), merge base into pending first and fail fast on conflict, then use pending as the baseline for drift detection and the temp branch source. This prevents non-deterministic LLM output from re-translating unchanged sections and causing merge conflicts on subsequent runs against the same base. Rename ensureStagingBranch to ensurePendingBranch for consistency with intl/pending- naming. Add deleteBranch helper to clean up tmp-intl/ branches after successful merge into pending. Drop "v5" from the pipeline log header. Co-Authored-By: Claude Opus 4.7 Co-Authored-By: wackerow <54227730+wackerow@users.noreply.github.com> --- .../intl-pipeline/lib/github/branches.ts | 23 ++++++- src/scripts/intl-pipeline/main.ts | 65 +++++++++++++++++-- 2 files changed, 79 insertions(+), 9 deletions(-) diff --git a/src/scripts/intl-pipeline/lib/github/branches.ts b/src/scripts/intl-pipeline/lib/github/branches.ts index add67815122..d0f88cc09dd 100644 --- a/src/scripts/intl-pipeline/lib/github/branches.ts +++ b/src/scripts/intl-pipeline/lib/github/branches.ts @@ -85,9 +85,26 @@ export const branchExists = async (branchName: string): Promise => { return res.ok } +/** + * Delete a branch on GitHub. Returns true if deleted or already absent. + * Returns false with a warning on API failure. Never throws. 
+ */ +export const deleteBranch = async (branchName: string): Promise => { + const url = `https://api.github.com/repos/${config.ghOrganization}/${config.ghRepo}/git/refs/heads/${branchName}` + const res = await fetchWithRetry(url, { + method: "DELETE", + headers: gitHubBearerHeaders, + }) + // 204: deleted, 422: ref does not exist + if (res.ok || res.status === 422) return true + const body = await res.text().catch(() => "") + console.warn(`[branch] Delete ${branchName} failed (${res.status}): ${body}`) + return false +} + /** * Merge a base branch into a head branch via the GitHub API. - * Used to keep the staging branch up-to-date with dev. + * Used to keep the pending branch up-to-date with dev. * Returns true if merge succeeded (or was already up-to-date). */ export const mergeBranchInto = async ( @@ -131,11 +148,11 @@ export const mergeBranchInto = async ( } /** - * Ensure a staging branch exists and is up-to-date with its base. + * Ensure a pending branch exists and is up-to-date with its base. * Creates the branch if it doesn't exist; merges base into it if it does. * Returns the branch name. 
*/ -export const ensureStagingBranch = async ( +export const ensurePendingBranch = async ( branchName: string, baseBranch: string ): Promise => { diff --git a/src/scripts/intl-pipeline/main.ts b/src/scripts/intl-pipeline/main.ts index 7fd59a0b289..0adabb03d64 100644 --- a/src/scripts/intl-pipeline/main.ts +++ b/src/scripts/intl-pipeline/main.ts @@ -23,7 +23,10 @@ import { import i18nConfig from "../../../i18n.config.json" import { - ensureStagingBranch, + branchExists, + createBranchFromSha, + deleteBranch, + ensurePendingBranch, getBranchObject, mergeBranchInto, } from "./lib/github/branches" @@ -541,7 +544,7 @@ async function runIncremental( async function main() { const startTime = Date.now() - logSection("Incremental Translation Pipeline v5") + logSection("Incremental Translation Pipeline") if (!config.targetPaths.length) { console.error("[ERROR] TARGET_PATH is required") @@ -558,10 +561,52 @@ async function main() { log(`Mode: ${config.mode}`) log(`Concurrency: ${config.concurrency}`) - // Create temp working branch for crash safety + // If the pending branch already exists (prior run against the same base), + // use it as the baseline: merge current base into it first (fail-fast on + // conflict), sync local working tree from it so drift detection reads the + // latest stamped manifests, and branch the temp branch off of it. + const pendingExists = await branchExists(targetBranch) + let tempBranchSourceSha: string + + if (pendingExists) { + log(`Pending branch exists: ${targetBranch}`) + log(`Merging ${baseBranch} into ${targetBranch}...`) + const merged = await mergeBranchInto(baseBranch, targetBranch) + if (!merged) { + throw new Error( + `Cannot merge ${baseBranch} into ${targetBranch}. ` + + `Either resolve conflicts on ${targetBranch} manually, or delete the branch and retry. 
` + + `Aborting before any translation work.` + ) + } + tempBranchSourceSha = (await getBranchObject(targetBranch)).sha + + log(`Syncing local working tree from ${targetBranch}...`) + execFileSync( + "git", + ["fetch", "origin", `+${targetBranch}:${targetBranch}`], + { stdio: "inherit" } + ) + execFileSync( + "git", + [ + "checkout", + targetBranch, + "--", + ".manifests", + "public/content", + "src/intl", + ], + { stdio: "inherit" } + ) + } else { + tempBranchSourceSha = (await getBranchObject(baseBranch)).sha + } + + // Create temp working branch for crash safety (from pending if it exists, otherwise base) const tempBranch = generateTempBranchName() log(`Temp branch: ${tempBranch}`) - await ensureStagingBranch(tempBranch, baseBranch) + await createBranchFromSha(tempBranch, tempBranchSourceSha) const baseBranchSha = (await getBranchObject(baseBranch)).sha const committer = new SharedCommitter(tempBranch) await committer.init() @@ -711,10 +756,13 @@ async function main() { } } - // Merge temp branch into target branch + // Merge temp branch into pending, then clean up the temp. + // If pending didn't exist at the start, create it from base now. 
if (committedFiles.length > 0 || hasCommits) { log(`Merging ${tempBranch} -> ${targetBranch}`) - await ensureStagingBranch(targetBranch, baseBranch) + if (!pendingExists) { + await ensurePendingBranch(targetBranch, baseBranch) + } const merged = await mergeBranchInto(tempBranch, targetBranch) if (!merged) { throw new Error( @@ -722,8 +770,13 @@ async function main() { ) } log(`Merged successfully`) + + // Clean up temp branch -- its work is now on pending + await deleteBranch(tempBranch) } else { log(`No changes to merge`) + // Nothing landed on the temp branch -- clean it up + await deleteBranch(tempBranch) } // Create or update PR unless skipped From 4d99f256ef00a50e990e254f859a9cb4afa25285 Mon Sep 17 00:00:00 2001 From: myelinated-wackerow <263208946+myelinated-wackerow@users.noreply.github.com> Date: Mon, 20 Apr 2026 10:07:31 -0700 Subject: [PATCH 3/4] docs(intl): refresh pipeline and review docs - PIPELINE-SPEC: add forward reference to Orchestration under Inputs; drop stale 'Git operations' and 'PR creation' entries from "What this spec does NOT cover"; reframe non-English edit policy from flat "do not hand-edit" to the manifest-mapping rule, with explicit Allowed / Not allowed examples; route the escape hatch through intl-pipeline.yml stamp_only input. - AGENTS.md: consolidate Internationalization section; capture the pipeline-is-single-propagator rule, the stamp_only workflow escape hatch, the allowance for translation-error fixes, and a pointer to PIPELINE-SPEC. - main.ts: comment the local-tree reset block as CI-only. - review-translations: migrate from the deprecated ethereum.org glossary endpoint to ETHGlossary; resolve the base URL at runtime from GLOSSARY_API_URL env var or config.ts default so the skill stays correct when the host changes; fetch llms.txt as the canonical endpoint reference; prefer POST /filter (per-file scope) over full-language fetches for efficiency; tighten the Phase 3 mandatory block and bottom notes. 
Co-Authored-By: Claude Opus 4.7 Co-Authored-By: wackerow <54227730+wackerow@users.noreply.github.com> --- .claude/commands/review-translations.md | 66 +++++++++++-------------- AGENTS.md | 8 +-- src/scripts/intl-pipeline/main.ts | 6 +++ tests/specs/PIPELINE-SPEC.md | 17 ++++--- 4 files changed, 48 insertions(+), 49 deletions(-) diff --git a/.claude/commands/review-translations.md b/.claude/commands/review-translations.md index 55d10e8fef6..09346228c78 100644 --- a/.claude/commands/review-translations.md +++ b/.claude/commands/review-translations.md @@ -224,38 +224,39 @@ Read `.claude/translation-review/known-patterns.md` — this contains all issue ### Translation Glossary (AUTHORITATIVE SOURCE) -The EthGlossary API (`https://ethereum.org/api/glossary`) is the **authoritative source** for all Ethereum term translations across the entire pipeline. Community-voted glossary terms are not suggestions — they are the required translations. +**ETHGlossary** is the authoritative source for Ethereum term translations. Deviations are critical issues, not warnings. -**Fetch live from the API first, fall back to cache only if the API is unreachable:** +Resolve the base URL from the pipeline config (env var wins; default lives in `src/scripts/intl-pipeline/config.ts` under `GLOSSARY_API_URL`): ```bash -# Fetch live glossary -GLOSSARY_CACHE="$HOME/.claude/translation-review/fetch-translation-glossary.json" -GLOSSARY_URL="https://ethereum.org/api/glossary" - -# Try live fetch first -if curl -sf "$GLOSSARY_URL" -o "$TMPDIR/glossary-live.json" 2>/dev/null; then - # Update cache with fresh data - cp "$TMPDIR/glossary-live.json" "$GLOSSARY_CACHE" - echo "Glossary fetched live from API and cache updated." -else - echo "WARNING: API unreachable, using cached glossary." 
-fi +GLOSSARY_API_URL="${GLOSSARY_API_URL:-$(grep -oE 'https://[^"]+/api/v[0-9]+' "$WORKTREE_PATH/src/scripts/intl-pipeline/config.ts" | head -1)}" +GLOSSARY_HOST="${GLOSSARY_API_URL%/api/*}" ``` -Schema: `Array<{ string_term, translation_text, language_code, total_votes }>`. +Fetch `llms.txt` first as the canonical reference for endpoints and languages; if examples below disagree, llms.txt wins: -For each language being reviewed, extract relevant glossary terms: +```bash +curl -sf "$GLOSSARY_HOST/llms.txt" \ + -o "$TMPDIR/ethglossary-llms.txt" \ + && cp "$TMPDIR/ethglossary-llms.txt" "$HOME/.claude/translation-review/ethglossary-llms.txt" ``` -Filter entries where language_code matches the target locale. -Sort by total_votes descending. -Include ALL terms for the language (not just top 50) — these are authoritative. + +**Preferred — per-file filter** (`POST /filter`): returns only the glossary terms that appear in the English source, with translations sorted by occurrence. Avoids pulling hundreds of irrelevant terms into agent context. + +```bash +ENGLISH_SOURCE=$(cat "$WORKTREE_PATH/public/content/{path}.md") +curl -sf -X POST "$GLOSSARY_API_URL/filter" \ + -H "Content-Type: application/json" \ + -d "$(jq -n --arg text "$ENGLISH_SOURCE" --arg lang "{LANGUAGE_CODE}" '{text: $text, language: $lang}')" +``` + +**Fallback — full language** when filtering per file is impractical or the endpoint is unreachable: + +```bash +curl -sf "$GLOSSARY_API_URL/translations/{LANGUAGE_CODE}" ``` -**The glossary is used in every subsequent phase:** -- **Phase 3 (Review):** Agents treat glossary deviations as CRITICAL, not warnings -- **Phase 5 (Auto-Fix):** Glossary deviations are auto-corrected to the top-voted translation -- **Phase 8 (Knowledge Base):** New deviations discovered are logged for future reviews +Used in Phase 3 (review — deviations are CRITICAL), Phase 5 (auto-fix corrects to ETHGlossary translation), Phase 8 (new deviations logged). 
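For callers inside the TypeScript pipeline, the same per-file filter request could be built as follows. This is a minimal sketch that assumes only what the curl example above shows (a `POST` to `/filter` with a JSON body carrying `text` and `language`). `FilterRequest` and `buildFilterRequest` are hypothetical names, and the response shape is deliberately not modelled because it is not described here.

```typescript
// Hypothetical helper: builds the per-file ETHGlossary filter request.
// The endpoint path and body fields mirror the curl example; everything else is illustrative.

interface FilterRequest {
  url: string
  method: string
  headers: Record<string, string>
  body: string
}

function buildFilterRequest(
  glossaryApiUrl: string,
  englishSource: string,
  languageCode: string
): FilterRequest {
  // Drop trailing slashes so the joined URL has exactly one separator
  const base = glossaryApiUrl.replace(/\/+$/, "")
  return {
    url: `${base}/filter`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    // Same payload the jq expression builds: { text, language }
    body: JSON.stringify({ text: englishSource, language: languageCode }),
  }
}
```

The result can be handed to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`. If the endpoint contract differs from this sketch, `llms.txt` remains the authority.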
### Per-Language Prior Findings Check if `.claude/translation-review/per-language/{LANGUAGE_CODE}.md` exists. If so, read it and inject relevant prior findings into the agent prompt. @@ -331,22 +332,11 @@ The community has voted on these translations for key Ethereum terms. Use these - Review the entire current content of each file - Compare against English source files from the worktree -## MANDATORY: Fetch Ethereum Glossary FIRST - -**Before reviewing ANY translation, you MUST fetch the official Ethereum glossary for the language(s) being reviewed.** This is non-negotiable. The glossary contains community-approved translations for key terms. - -```bash -# Fetch full glossary (all languages): -curl -s "https://ethereum.org/api/glossary/" - -# Fetch glossary for a specific language (optional lang param, one at a time): -curl -s "https://ethereum.org/api/glossary/?lang=fr" -curl -s "https://ethereum.org/api/glossary/?lang=ja" -``` +## MANDATORY: Use ETHGlossary for the target language -The glossary returns approved translations per language. Use these as the authority for how technical terms SHOULD be translated. Flag any deviations as warnings with "Glossary mismatch" in the issue column. +Use the ETHGlossary terms fetched in Phase 2 as the authority for technical term translations. Report deviations as **critical** issues (not warnings), with the current (wrong) translation and the expected (ETHGlossary) translation so Phase 5 can auto-fix them. 
-**If you skip the glossary, the entire review is invalid.** +**If you skip ETHGlossary, the entire review is invalid.** ## Review Checklist @@ -733,6 +723,6 @@ ETH, Wei, Gwei, Gas - Use `--model=sonnet` or `--model=haiku` for faster reviews - Build verification is opt-in: `--build-local` for local scoped builds, `--netlify-check` for Netlify deploy preview checks - If an agent exceeds context limits with Opus, fall back to Sonnet with Grep-based file inspection -- **EthGlossary API** (`https://ethereum.org/api/glossary`) is fetched live in Phase 2 and is the authoritative source for term translations across the entire pipeline — review (Phase 3), auto-fix (Phase 5), and knowledge base (Phase 8). The local cache at `~/.claude/translation-review/fetch-translation-glossary.json` is a fallback only. +- **ETHGlossary** is the authoritative source for term translations across review (Phase 3), auto-fix (Phase 5), and knowledge base (Phase 8). See Phase 2 for usage; `llms.txt` is the canonical endpoint reference. 
- Knowledge base at `.claude/translation-review/` accumulates findings across reviews (committed to repo) - `gh` CLI commands require `dangerouslyDisableSandbox: true` due to TLS certificate verification issues in sandbox mode diff --git a/AGENTS.md b/AGENTS.md index 170aa511c3d..db3ad436a6b 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -119,10 +119,10 @@ pnpm events-import # Import community events ### Internationalization -- **25 languages** supported via Crowdin (canonical list: `i18n.config.json`) -- **RTL support** for Arabic, Urdu -- Translation files (JSON format) in `src/intl/[locale]/` -- Content translations managed through Crowdin platform +- **25 languages** supported (canonical list: `i18n.config.json`); **RTL support** for Arabic, Urdu +- JSON UI strings in `src/intl/[locale]/`; translated markdown content in `public/content/translations/[locale]/` +- Non-English markdown is propagated by the **intl-pipeline** (`src/scripts/intl-pipeline/`, entry `main.ts`). **Do not hand-propagate English changes into non-English files** -- let the pipeline run, or trigger `intl-pipeline.yml` with `stamp_only: true` if manifests must catch up urgently (e.g. unblocking a build). Hand-fixing a translation error is fine when the English side hasn't moved, since the manifest mapping stays valid. Spec: `tests/specs/PIPELINE-SPEC.md`. +- Glossary: base URL from `GLOSSARY_API_URL` env var; default in `src/scripts/intl-pipeline/config.ts`. ETHGlossary is authoritative for Ethereum term translations. ### Markdown Content diff --git a/src/scripts/intl-pipeline/main.ts b/src/scripts/intl-pipeline/main.ts index 0adabb03d64..d43e55bdc7a 100644 --- a/src/scripts/intl-pipeline/main.ts +++ b/src/scripts/intl-pipeline/main.ts @@ -581,6 +581,12 @@ async function main() { } tempBranchSourceSha = (await getBranchObject(targetBranch)).sha + // Force-update the local ref and check out pending's versions of the + // manifest and content paths. 
This is destructive to any local edits in + // those paths and is intended to run in CI (GitHub Actions) only, where + // the working tree is ephemeral. The pipeline requires GEMINI_API_KEY + // which is loaded from GH Secrets, so accidental local invocation is + // unlikely, but edits in the listed paths will be clobbered if it happens. log(`Syncing local working tree from ${targetBranch}...`) execFileSync( "git", diff --git a/tests/specs/PIPELINE-SPEC.md b/tests/specs/PIPELINE-SPEC.md index c329f876da6..a3a4c3b5b5e 100644 --- a/tests/specs/PIPELINE-SPEC.md +++ b/tests/specs/PIPELINE-SPEC.md @@ -28,6 +28,8 @@ Given an English content change (A -> B), update all locale translations with mi - **source manifest**: Merkle tree hashes of the English content at the time of last pipeline run (used for quick "did anything change?" check via rootHash comparison; `sourceCommitSha` enables retrieval of english-A) - **translation manifest**: Merkle tree of the locale content, mirroring the English tree structure. Tracks per-section hashes so the pipeline knows which sections are up to date in each locale. +> **Note on baseline selection:** across multiple runs against the same base branch, these inputs are drawn from a **pending branch** rather than directly from base. The pending branch accumulates translations and stamped manifests from prior runs, and serves as the baseline for drift detection on subsequent runs. See the [Orchestration](#orchestration) section for details. + ## Output - **locale-B**: the updated translation reflecting all changes from A -> B @@ -350,13 +352,16 @@ If the base branch advances while the pipeline is running (a new PR merges to `< ### Non-English file edit policy -Non-English translation files should not be manually edited once the pipeline is in production. The pipeline is the single writer for locale files. Manual edits break the manifest's "source of truth" model and risk conflicts with pipeline output. 
+The pipeline is the single propagator of English changes into non-English files. The rule is not "never hand-edit locales" -- it is "do not hand-propagate English updates." The manifest maps each locale section to a specific English state; edits that preserve that mapping are fine, edits that break it are not. + +**Allowed:** Fixing a translation error when the English side has not moved (e.g. a correction made during `/review-translations` on a pipeline-generated PR). The manifest's English -> locale mapping remains accurate, so the next incremental run treats the corrected locale content as canonical. + +**Not allowed:** Hand-editing a locale file to reflect an English change. This desynchronises the manifest from reality; the next run will either re-translate over your edit or produce merge conflicts. -**Escape hatch for rare cases:** -When a manual non-English edit is genuinely needed (fixing a translation error, emergency patch): -1. Only do this when the pending branch for the relevant base does not exist (i.e., no pending PR against that base). If one exists, merge or close it first. +**If an English-to-locale sync is genuinely needed** (e.g. a structural change that would break the build if not propagated immediately): +1. Only do this when the pending branch for the base does not exist. If one exists, merge or close it first. 2. Make the edit directly to ``. -3. Run the pipeline in `--stamp-only` mode to update manifests to reflect the current file state without calling the LLM. This tells the next incremental run that the current state is the canonical state. +3. Trigger `intl-pipeline.yml` with `stamp_only: true`. This updates the manifests to reflect the current file state without calling the LLM, telling the next incremental run that the current state is canonical. 
### Summary: orchestration contract @@ -381,9 +386,7 @@ Given a sequence of pipeline runs against the same base: - Gemini API integration (mocked in tests) - GitHub Actions workflow (tested separately) -- Git operations (file retrieval via sha, committing results) - Multi-file batching (test is per-file) - Chunking for large files - Post-import sanitization -- PR creation - Image alt text translation (known gap; alt text in markdown images is not currently classified as translatable by the parser) From b1be2dc28d6e39bd8633ad9bbc45f5703b8ae3c0 Mon Sep 17 00:00:00 2001 From: myelinated-wackerow <263208946+myelinated-wackerow@users.noreply.github.com> Date: Mon, 20 Apr 2026 10:39:55 -0700 Subject: [PATCH 4/4] chore(intl-pipeline): prune FUTURE.md Remove items that have shipped: comment-restoration concatenation fix (gemini.ts passes strippedCode into restoreComments); glossary enforcement (covered by ETHGlossary migration plus the updated review-translations skill); transliteration banks (now in language-groups.ts and prompt-builder.ts). Relabel split-PRs as nice-to-have and note that the revival must adapt to today's per-base pending-branch orchestration rather than cherry-pick the old SPLIT_PRS implementation. Co-Authored-By: Claude Opus 4.7 Co-Authored-By: wackerow <54227730+wackerow@users.noreply.github.com> --- src/scripts/intl-pipeline/FUTURE.md | 44 ++++++----------------------- 1 file changed, 9 insertions(+), 35 deletions(-) diff --git a/src/scripts/intl-pipeline/FUTURE.md b/src/scripts/intl-pipeline/FUTURE.md index 661c3966333..f97c51124b7 100644 --- a/src/scripts/intl-pipeline/FUTURE.md +++ b/src/scripts/intl-pipeline/FUTURE.md @@ -6,35 +6,9 @@ ## Pipeline Quality -### 1. Fix Comment Restoration Concatenation Bug +### 1. Deep JSON Validation -**Problem:** Translated code comments are concatenated with the original instead of replacing them. Example: `// **** REMOVE LIQUIDITY **** // **** ...Arabic... 
****` - -**Root cause:** `restoreComments()` in `lib/llm/code-block-extractor.ts` appends the translated comment to the existing line content instead of replacing. `translateCodeComments()` should use `strippedCode` (comments removed) as the base for restoration, not the original `block.content`. - -**Complexity:** Low. ~5 line change. - -### 2. Stronger Glossary Enforcement - -**Problem:** High-frequency glossary terms like "mint" are translated inconsistently. The glossary is sent in the prompt but Gemini doesn't always adhere strictly. - -**Proposed solution:** -- Post-translation pass that scans output for known English glossary terms that should have been translated, and flags or auto-corrects them -- Consider a validation step that compares glossary term frequency in source vs translation -- May overlap with existing sanitizer `fixKnownBrandGarbles` pattern -- extend to glossary terms - -### 3. Transliteration During Translation - -**Problem:** Gemini regresses on transliterations (author names, brand names like "Proto-danksharding") that the sanitizer then has to catch. - -**Proposed solution:** -- Include transliteration banks directly in the translation prompt for non-Latin locales -- Add language-group-specific transliteration rules to `lib/llm/prompt-builder.ts` -- Ensure the translation prompt and sanitizer are aligned on the same transliteration bank - -### 4. Deep JSON Validation - -**Problem:** Current validation only checks top-level JSON keys. Nested namespaces can have dropped or renamed keys at depth > 1 without detection. +**Problem:** Current validation only checks top-level JSON keys (`validateTranslatedJson` in `lib/llm/output-validation.ts`). Nested namespaces can have dropped or renamed keys at depth > 1 without detection. **Proposed solution:** Recursive key comparison that walks the full object tree, reporting missing/added/renamed keys at any depth. @@ -42,17 +16,17 @@ ## Pipeline Features -### 5. Split PRs (one PR per language) +### 2. 
Restore Split PRs (one PR per language) -- nice-to-have -**Problem:** Large multi-language runs produce a single massive PR that's hard to review. +**Problem:** Large multi-language runs produce a single massive PR that's hard to review. This previously worked via a `SPLIT_PRS` workflow input (commit `a52be9ddd9`) but was removed during the pipeline rewrite. -**Proposed solution:** A workflow input `split_prs` (boolean, default false) that creates a separate branch and PR per language. +**Nuance:** Today's orchestration assumes one `intl/pending-` per base. Restoring split PRs means applying the full orchestration contract (base-into-pending merge, fail-fast, local-tree sync, temp branch, PR) independently per language -- e.g. `intl/pending--`. The old implementation predates this model, so it needs adaptation rather than cherry-pick. --- ## Automation -### 6. Auto-trigger Translations on Content Merge +### 3. Auto-trigger Translations on Content Merge **Problem:** Content changes merged to dev currently require manual triggering of the translation pipeline. @@ -61,7 +35,7 @@ - Automatically triggers the translation workflow for changed files - Should respect a cooldown/batch window to avoid triggering on every small merge -### 7. Full-language Retroactive Cleanup +### 4. Full-language Retroactive Cleanup **Problem:** Many languages were translated before current pipeline improvements. Those translations have the same class of issues found in Arabic (brand garbles, wrong compounds, etc.). @@ -74,7 +48,7 @@ ## Image Translation -### 8. Translate Text in Diagrams and Infographics +### 5. Translate Text in Diagrams and Infographics **Problem:** Educational diagrams and infographics contain English text that remains untranslated, creating a jarring experience on otherwise fully translated pages. @@ -89,7 +63,7 @@ ## Package Extraction -### 9. Extract i18n Tooling into Standalone Packages +### 6. 
Extract i18n Tooling into Standalone Packages **Problem:** Glossary, translation pipeline, and (future) image pipeline are embedded in the repo. Creates bloat and prevents reuse.