Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
16 commits
Select commit Hold shift + click to select a range
62cf334
add: intl-pipeline SKILL.md entry point
myelinated-wackerow May 15, 2026
ffc3416
add: intl-review SKILL.md entry point
myelinated-wackerow May 15, 2026
691eabc
add: intl-pipeline architecture + recovery references
myelinated-wackerow May 15, 2026
02a0789
add: intl-pipeline ethglossary + non-english-edits references
myelinated-wackerow May 15, 2026
d987ac0
add: intl-pipeline orchestration + gotchas references
myelinated-wackerow May 15, 2026
5e1a012
add: intl-pipeline manifests + sanitizer + fix-sanitizer-bug runbook
myelinated-wackerow May 15, 2026
3c2acd7
add: intl-review known-patterns + language-rules references
myelinated-wackerow May 15, 2026
e75fe72
add: intl-review scoring + ethglossary-usage + per-language-tracking …
myelinated-wackerow May 15, 2026
2bc4b31
add: intl-review critical-vs-warning + agent-roles + gotchas refs
myelinated-wackerow May 15, 2026
1767958
chore(intl): point AGENTS.md at skills, trim Crowdin-era docs
myelinated-wackerow May 15, 2026
4478973
add: trigger evals for intl-pipeline + intl-review skills
myelinated-wackerow May 15, 2026
2211641
revise: ethglossary references -- lead with llms.txt, drop cross-repo PR
myelinated-wackerow May 18, 2026
201f5d0
fix: code-fence language identifiers on three references
myelinated-wackerow May 18, 2026
7bc67fb
refactor(evals): per-skill evals.json matching skill-creator conventions
myelinated-wackerow May 18, 2026
beffe2f
Merge branch 'dev' into intl-knowledge-base
myelinated-wackerow May 18, 2026
8d830ce
refactor(sanitizer): drop transliteration bank, ETHGlossary canonical
myelinated-wackerow May 18, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
132 changes: 132 additions & 0 deletions .claude/skills/intl-pipeline/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,132 @@
---
name: intl-pipeline
description: Use when working on the translation pipeline (`src/scripts/intl-pipeline/`), the post-import sanitizer (`intl-sanitizer.ts`), the `intl/pending-*` branches, the `intl-pipeline.yml` GitHub Actions workflow, ETHGlossary integration, or any translation-related issue (broken translations, missing locales, manifest drift, sanitizer false positives, ETHGlossary term lookups, language-group transliteration questions). Provides the manifest-driven incremental pipeline mental model, the "don't hand-propagate" rule, recovery procedures, the ETHGlossary-as-canonical-source policy, and the `intl/pending-{base}` orchestration contract.
---

# intl-pipeline

LLM-based, manifest-driven, incremental translation pipeline for ethereum.org. Translates English content (markdown in `public/content/`, JSON UI strings in `src/intl/en/`) into 24 target languages via Gemini, with a post-import sanitizer normalizing common artifacts. ETHGlossary (https://ethglossary.visual-20-hoists.workers.dev) is the authoritative source for all Ethereum-ecosystem term translations consumed by the pipeline. Read this file fully on activation; pull from `references/` only when the listed trigger applies.

## The Core Rule: Don't Hand-Propagate English Changes

The single highest-leverage habit for keeping translations correct: **never hand-edit translated content to reflect an English change.** The pipeline tracks state via manifests (`.manifest-source.json` + `.manifest-translation.json` per file+locale); hand-edits that follow an English change desynchronize the manifest from reality, and the next pipeline run either re-translates over your edit or produces merge conflicts.

The rule is NOT "never edit locales." It IS "don't hand-propagate English updates":

- **Allowed:** Fixing a translation error when the English side hasn't moved (review-time corrections, sanitizer follow-ups). The manifest's English-to-locale map stays valid.
- **Not allowed:** Editing a locale to reflect a new English value (URL change, attribute change, restructured paragraph). The manifest map becomes wrong.
- **If English-to-locale sync is genuinely urgent** (build-breaking structural change with no pending PR open): make the English edit, then trigger `intl-pipeline.yml` with `stamp_only: true` to refresh manifests without translation. Safe only when no `intl/pending-{base}` branch exists for that base.

## Top Rules

1. **ETHGlossary is canonical for term translations.** Brand names, people's names, programming languages, OS/platform names, concept terms — all live in ETHGlossary. Don't maintain parallel term banks. The pipeline queries the API via `GLOSSARY_API_URL` (default in `src/scripts/intl-pipeline/config.ts`).
2. **One translation PR at a time per base branch.** The pipeline commits to `intl/pending-{base}` (e.g., `intl/pending-dev`). Subsequent runs merge `{base}` into pending first, then translate the delta. Parallel translation PRs against the same base will conflict.
3. **The pipeline only targets `dev` in production.** Hot fixes to `staging` / `master` go out English-only and catch up via prepare-release. Don't translate against `staging` / `master` unless you have a specific reason and use a custom `translation_branch`.
4. **Manifests are inseparable from their locale file.** Each translated file has two manifests next to it: `.manifest-source.json` and `.manifest-translation.json`. Delete one, you must regenerate both — easiest via the pipeline in `full` mode for that file+locale. Never hand-edit a manifest.
5. **The sanitizer runs post-translation, not pre.** Its job is to fix Gemini-introduced artifacts (BiDi mistakes, code-fence drift, brand-name mistranslations). It receives translation outputs, never English source.
6. **Don't add transliteration data here.** All term/brand/person transliteration policy lives in ETHGlossary's `docs/translation-policy.md` and per-language entries. The intl-pipeline consumes; it does not author.
7. **Sanitizer fixes must split on code blocks first.** Every text transformation in `intl-sanitizer.ts` MUST start with the code-block split pattern. Modifying code-fence contents breaks Solidity / Python / TypeScript examples in tutorials.
8. **Pipeline failures are not always pipeline bugs.** A "translation looks wrong" report may be: bad Gemini output (file upstream), missing ETHGlossary term (add there), correct per language-group policy (read translation-policy.md), or an actual pipeline bug. Triage before patching.

## Highest-Value Gotchas

These are landmines where the obvious-looking action is wrong. The full set is in `references/gotchas.md`; these come up most often.

### Naming legacy

The pipeline was historically called the "Gemini translation pipeline" and the "i18n pipeline." Old docs and external references may still use those names. Canonical names today:

- Module path: `src/scripts/intl-pipeline/` (was `src/scripts/i18n/`)
- Workflow file: `.github/workflows/intl-pipeline.yml` (was `gemini-translations.yml`)
- Pipeline entry: `src/scripts/intl-pipeline/main.ts` (was `main-incremental.ts` / `main-gemini.ts`)
- Module name in prose: `intl-pipeline`. Avoid `i18n-pipeline` and `gemini-translations` in new content.

If you see one of the old names in a doc or comment, update it.

### Manifests live next to the locale file, not centralized

For `public/content/translations/ja/some-page/index.md`, the manifests are:

- `public/content/translations/ja/some-page/index.md.manifest-source.json`
- `public/content/translations/ja/some-page/index.md.manifest-translation.json`

Same pattern for JSON locale files in `src/intl/{locale}/`. Don't move a locale file without moving its manifests; renaming a content file means renaming six files per locale (file + 2 manifests, times the en source).

### `intl/pending-{base}` branch lifecycle

The pipeline creates `intl/pending-{base}` on first run, commits translations to it, opens a PR against `{base}`. Subsequent runs MERGE `{base}` into pending first, then translate the delta. After the pending PR is merged, the branch is deleted; the next run creates a fresh one.

Do NOT rebase, squash, or force-push `intl/pending-{base}`. The pipeline depends on its history.

### `stamp_only: true` is the override hatch

If a structural change in English would break the build for translated locales (a removed component the locale still references, an MDX-incompatible change), trigger `intl-pipeline.yml` with `stamp_only: true` AFTER fixing the English side. This regenerates manifests without translation. Safe only when no pending branch exists for that base.

### SOV-language inline element reordering

For Korean, Urdu, and other SOV (subject-object-verb) languages, inline elements (links, inline code, JSX components) often appear in REVERSE order vs. English. The pipeline's inert-propagation pass matches elements by **value**, not by position. Any logic that assumes positional order will silently corrupt those languages.

### JSX attribute translation is a separate pass

JSX attribute values are NOT translated in the main Phase 4 LLM call. Phase 4b is a dedicated pass with an allow-list of translatable attribute names (`title`, `description`, `alt`, `label`, `aria-label`, `placeholder`, etc.) defined in `src/scripts/intl-pipeline/lib/shared-patterns.ts`. Touching attribute translation means touching that pass, not Phase 4.

### Sanitizer test scope is per-file, never per-language sweep

NEVER run the sanitizer against an entire language. It processes thousands of files and hangs for 30+ minutes. Always scope to specific files via `TARGET_FILES` env var. The slash command `/fix-sanitizer-bug` enforces this; if you script around it, preserve the constraint.

## Quick "Where Do I Look?" Cheatsheet

| I need... | Path |
|---|---|
| Pipeline entry | `src/scripts/intl-pipeline/main.ts` |
| Sanitizer | `src/scripts/intl-pipeline/intl-sanitizer.ts` |
| Gemini adapter | `src/scripts/intl-pipeline/lib/llm/gemini.ts` |
| Prompt builder | `src/scripts/intl-pipeline/lib/llm/prompt-builder.ts` |
| Content normalizer | `src/scripts/intl-pipeline/lib/llm/content-normalizer.ts` |
| Shared patterns (JSX attrs allow-list) | `src/scripts/intl-pipeline/lib/shared-patterns.ts` |
| Glossary config | `src/scripts/intl-pipeline/config.ts` |
| Workflow file | `.github/workflows/intl-pipeline.yml` |
| Per-file pipeline spec (canonical) | `tests/specs/PIPELINE-SPEC.md` |
| Concurrency / chunking spec | `tests/specs/CONCURRENCY-SPEC.md` |
| Test fixture mutation table | `tests/specs/SPEC.md` |
| Sanitizer test suite | `tests/unit/intl-pipeline/sanitizer/` |
| Pipeline test suite | `tests/unit/intl-pipeline/` |
| Future-work backlog | `src/scripts/intl-pipeline/FUTURE.md` |
| Language config (canonical list) | `i18n.config.json` |
| ETHGlossary repo | https://github.com/wackerow/ethglossary |
| ETHGlossary API root | https://ethglossary.visual-20-hoists.workers.dev |
| Slash command: fix sanitizer bug | `/fix-sanitizer-bug` (`.claude/commands/fix-sanitizer-bug.md`) |
| Slash command: review translations | `/review-translations` (`.claude/commands/review-translations.md`) |

## When to Load Each Reference

Pull these in only when the trigger applies. Don't read them all upfront.

- **`references/architecture.md`** — Load when debugging pipeline behavior or wanting the phase-by-phase walkthrough (change detection, routing, deterministic propagation, LLM translation, JSX attribute pass, assembly, manifest update).
- **`references/manifests.md`** — Load when manifests misbehave (mismatched hashes, missing sections, drift on supposedly-unchanged files), or when changing what the pipeline tracks.
- **`references/orchestration.md`** — Load when working with `intl/pending-{base}` branches: base-into-pending merges, temp-branch lifecycle, multi-run coordination, what happens when base moves during a run.
- **`references/recovery.md`** — Load when translations are broken and you need to recover (bad LLM output, corrupted manifests, accidental hand-edits, build failure on a locale). The first place to go when triaging "the pipeline did something wrong."
- **`references/sanitizer.md`** — Load when investigating sanitizer behavior, false positives / negatives, or adding fix functions. Pattern catalog lives in `docs/solutions/integration-issues/sanitizer-test-research.md`.
- **`references/runbooks/fix-sanitizer-bug.md`** — Load when you have a confirmed sanitizer bug and need the test-first workflow (triage / write failing test / implement / verify).
- **`references/ethglossary.md`** — Load when ETHGlossary is the question: how the pipeline queries terms, when to add to ETHGlossary, how `script_rule` and `term_role` map to pipeline behavior, what to do when a term is missing.
- **`references/non-english-edits.md`** — Load when about to edit a translated file by hand. Tells you when it's safe (English unchanged) and when it isn't (sync after English change).
- **`references/gotchas.md`** — Load when something feels off and you can't find it inline above. The long tail of landmines.

## Other Project Skills That May Apply

- **`intl-review`** — For translation quality review (scoring rubric, language-group rules, brand-name policies, per-language findings). The review side of the same coin; the pipeline produces, intl-review evaluates.
- **`data-layer`** — For data fetching, if pipeline changes touch externally-sourced content.

## Pre-Merge Smoke Test

Before opening a PR that touches the pipeline:

- [ ] Sanitizer test suite passes (`npx playwright test --project=unit tests/unit/intl-pipeline/sanitizer/`)
- [ ] Pipeline test suite passes (`npx playwright test --project=unit tests/unit/intl-pipeline/`)
- [ ] No hand-edits to `public/content/translations/` or `src/intl/{non-en}/` files
- [ ] No manual changes to `*.manifest-*.json` files
- [ ] If renaming workflow / config paths, all doc references updated (search for old names)
- [ ] If adding pipeline-affecting code, `src/scripts/intl-pipeline/FUTURE.md` updated (or item removed if completed)
- [ ] If new sanitizer fix function, the code-block-split pattern is the first operation inside it
- [ ] If touching term policy or transliteration, change goes in ETHGlossary first — never duplicate term data here
- [ ] If touching the orchestration model (pending branch, temp branch, stamp_only), `references/orchestration.md` updated alongside
76 changes: 76 additions & 0 deletions .claude/skills/intl-pipeline/evals/evals.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,76 @@
{
"skill_name": "intl-pipeline",
"notes": "Output evals for the intl-pipeline skill. Each prompt exercises a distinct high-leverage rule: don't-hand-propagate on English change, sanitizer pattern-gap workflow, manifest corruption recovery, missing ETHGlossary term protocol, hot-fix translation policy. Run with the skill-creator workflow (`python -m scripts.run_loop` against the eval JSON) or manually by reading SKILL.md and answering each prompt, then grading against the assertions.",
"evals": [
{
"id": 1,
"name": "dont-hand-propagate-on-english-update",
"prompt": "I just updated the description in the frontmatter of public/content/staking/index.md (English) -- it used to say 'staking ETH' but now says 'staking your ETH for rewards'. There are 24 translated versions of this page. Should I just go update each translated frontmatter to match, or is there a special process? Deadline is Monday.",
"expected_output": "Skill should advise against hand-editing the 24 translated frontmatter files. Should recommend triggering the pipeline (intl-pipeline.yml) on the changed English file -- the pipeline will incremental-translate just the changed `description` field across all locales. Should explain that hand-edits desync the manifest and cause future runs to misclassify. Should NOT suggest stamp_only as the primary fix (that's for build-breaking emergencies, not routine content updates).",
"files": [],
"assertions": [
{"text": "Output explicitly advises against hand-editing the 24 translated frontmatter files to match"},
{"text": "Output recommends triggering the pipeline (intl-pipeline.yml or workflow_dispatch) for the changed English file"},
{"text": "Output references manifest tracking (or the manifest map between English and locale) as the reason hand-edits are problematic"},
{"text": "Output does NOT recommend `stamp_only: true` as the primary fix for this scenario"},
{"text": "Output mentions or implies the `references/non-english-edits.md` material (allowed vs not-allowed edits) OR the don't-hand-propagate rule from SKILL.md"}
]
},
{
"id": 2,
"name": "sanitizer-pattern-gap-workflow",
"prompt": "running into a weird thing with the Japanese translation of dapps/index.md -- when the sanitizer runs, it's leaving `\\</strong>` (with a backslash before the slash) and it's breaking the build for the Japanese locale. seems like this happens on multiple files. is this a sanitizer bug? how do I add a fix?",
"expected_output": "Skill should recommend the `/fix-sanitizer-bug` slash command (test-first workflow). Should mention that fix functions MUST start with the code-block split pattern. Should point to sanitizer-test-research.md as the pattern catalog. Should NOT recommend hand-fixing individual files as the primary fix when the pattern is systemic.",
"files": [],
"assertions": [
{"text": "Output recommends the `/fix-sanitizer-bug` slash command (or its test-first workflow if not by name)"},
{"text": "Output mentions writing a failing test before implementing the fix"},
{"text": "Output mentions the code-block-split pattern as the required first operation in any fix function"},
{"text": "Output points to `docs/solutions/integration-issues/sanitizer-test-research.md` OR the runbook at `references/runbooks/fix-sanitizer-bug.md`"},
{"text": "Output does NOT recommend hand-fixing the affected files as the primary fix for a systemic pattern"}
]
},
{
"id": 3,
"name": "manifest-corruption-recovery",
"prompt": "i think one of the manifest files got corrupted somehow. The build is failing on the Korean version of beginner-intro-to-ethereum/index.md and when I look at .manifest-source.json it has weird hashes that don't match the English. what's the recovery procedure? can I just delete the bad one?",
"expected_output": "Skill should explain that both manifests (source + translation) must be handled together -- delete one, must delete the other. Then re-run the pipeline (full mode) for that file+locale to regenerate. Should NOT recommend hand-editing the manifest JSON. References recovery.md.",
"files": [],
"assertions": [
{"text": "Output explains that both manifest files (.manifest-source.json AND .manifest-translation.json) must be deleted together, not just one"},
{"text": "Output recommends running the pipeline (mode: full or equivalent) for that file+locale to regenerate manifests"},
{"text": "Output does NOT recommend hand-editing the manifest JSON content"},
{"text": "Output mentions or references `references/recovery.md` material (the recovery procedures)"},
{"text": "Output explains why the manifest matters (e.g., pipeline uses it for drift detection / change classification)"}
]
},
{
"id": 4,
"name": "missing-ethglossary-term-protocol",
"prompt": "reviewing some translations and noticed that 'Tornado Cash' isn't in ETHGlossary -- it's just rendering whatever Gemini produced for each language, which varies. should I just add the translations to the local file at .claude/translation-review/transliterations/? or is there a better way?",
"expected_output": "Skill should say DON'T add a local override. Flag the missing term in the review report. Optionally note in per-language findings file. ETHGlossary is the canonical source -- coordinate upstream rather than maintaining a parallel local bank. Should NOT recommend committing a PR against the ETHGlossary repo as part of the review workflow (separate maintainer task).",
"files": [],
"assertions": [
{"text": "Output explicitly says NOT to add a local override for the term in `.claude/translation-review/transliterations/` (or anywhere local)"},
{"text": "Output recommends flagging the missing term in the review report (and optionally per-language findings)"},
{"text": "Output mentions ETHGlossary as the canonical / authoritative source for term translations"},
{"text": "Output does NOT direct the user to immediately open a PR or issue against `wackerow/ethglossary` as part of the review workflow"},
{"text": "Output does NOT suggest editing the deprecated `.claude/translation-review/transliterations/*.json` files"}
]
},
{
"id": 5,
"name": "hot-fix-staging-translation-policy",
"prompt": "we have a hotfix going to staging tomorrow that updates a fee disclosure on the staking page. it's a copy change -- adds two sentences. do we need to translate this for the staging deploy or how does that work?",
"expected_output": "Skill should say hot fixes to staging or master go out English-only. Translations catch up on the next release cycle when dev (with translations) flows through prepare-release. Should NOT recommend dispatching the pipeline against staging as the standard answer (it's possible but non-standard).",
"files": [],
"assertions": [
{"text": "Output states that hot fixes to staging or master go out English-only (not translated for the immediate deploy)"},
{"text": "Output explains that translations catch up on the next release cycle when dev flows through prepare-release"},
{"text": "Output does NOT recommend dispatching `intl-pipeline.yml` against staging or master with a custom branch as the standard answer"},
{"text": "Output mentions or implies that the pipeline only targets `dev` in production by design"},
{"text": "Output does NOT block the hotfix on translations"}
]
}
]
}
Loading
Loading