i18n: automation mvp by wackerow · Pull Request #16954 · ethereum/ethereum-org-website

wackerow · 2025-12-18T16:23:43Z

Description

Refactored Crowdin translation import pipeline with automated syntax fixing and modular architecture.

Post-Import Sanitization (post_import_sanitize.ts)

Auto-fix broken markdown links
Fix translated internal hrefs (paragraph-scoped matching)
Normalize ButtonLink/Button formatting
Restore line breaks between collapsed MDX components
Protected brand name warnings
Header structure validation

JSX Attribute Translation

Extract/translate/reinsert JSX attributes (alt, title, label)
Standalone workflow option

Syntax Validation

MDX syntax tree validation before PR creation
PR comments with file/line errors

Architecture Refactor

main.ts broken into workflow modules
Separated Crowdin/GitHub API utilities

add QA polish pass via Crowdin ai_prompt override; workflow inputs for pre_translate and qa_check prompts

Switch QA to AI Prompt Completions (qa_check)
Resolve user id via GET /api/v2/user (no secret needed)
Remove CROWDIN_USER_ID from workflow env
Read PRE_TRANSLATE_PROMPT_ID from env
Add QA summary to PR body
Tidy main.ts by removing unused env references

Adds backup for existing Crowdin glossary/TM; syncs updates from EthGlossary supabase db with Crowdin

Major refactor to simplify the translation automation workflow:

REMOVED:
- Supabase glossary integration and all related files
- Multi-tier QA system with trust matrix and language scoring
- OpenAI trust matrix generation
- Post-translation Crowdin re-sync logic
- Multiple PR workflow (high/medium/low trust tiers)
- 13 files deleted total

NEW FEATURES:
- Unified target_path input (auto-detects file vs directory)
- Smart translation modes: single file, directory, or full translation
- Intelligent timeout handling (waits for file/dir, exits for full)
- Pre-translation artifact with job metadata for resuming
- Verbose logging flag for cleaner output
- Single PR workflow for all languages

UPDATES:
- GitHub workflow: simplified inputs, removed unused env vars
- config.ts: replaced fileLimit/startOffset with targetPath
- main.ts: complete rewrite (667 → 422 lines)
- lib/github/files.ts: smart path detection with excluded-paths.json
- lib/crowdin/files.ts: verbose logging support
- .gitignore: added artifacts/ directory

The pipeline now focuses on Spanish MVP with ability to expand to 25+ languages defined in canonical-llm-language-list.json. Post-import sanitization ensures build stability without additional QA overhead.

Implements upstream fixes to prevent MDX parser errors in AI-translated content:

- Update existing Crowdin files before translation to ensure latest English source
  - Add PUT /files/{fileId} request when file exists instead of skipping update
  - Add 10s parsing delay after file updates
- Add defensive sanitizer for block component line breaks
  - New fixBlockComponentLineBreaks() function catches inline tags
  - Fixes 12 component types (Card, ExpandableCard, Alert, etc.)
  - Reports fix count in sanitizer issues
- Scope sanitizer to current translation job languages
  - Change from all configured languages to languagePairs from job
- Fetch AI model name dynamically from Crowdin API
  - New getPromptInfo() function with PromptResource type
  - Replace hard-coded "Gemini 2.5 Pro" in PR body with actual model

Fixes critical MDX parser errors like "Expected the closing tag </ExpandableCard> either after the end of paragraph" caused by Crowdin AI using stale English files and outputting inline block component tags.
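The defensive sanitizer described above could look roughly like the following. This is a minimal sketch, not the PR's actual fixBlockComponentLineBreaks(): the component list is a hypothetical three-item subset of the 12 types, and the regular expressions and return shape are assumptions made for illustration.

```typescript
// Hypothetical subset; the commit mentions 12 component types.
const BLOCK_COMPONENTS: string[] = ["Card", "ExpandableCard", "Alert"]

interface LineBreakFix {
  content: string
  fixCount: number
}

function fixBlockComponentLineBreaks(content: string): LineBreakFix {
  const names = BLOCK_COMPONENTS.join("|")
  let fixCount = 0

  // Opening (non-self-closing) tag followed by content on the same line.
  const openingInline = new RegExp(
    `(<(?:${names})(?:\\s[^>]*)?(?<!/)>)[ \\t]*(?=\\S)`,
    "g"
  )
  // Closing tag preceded by content on the same line.
  const closingInline = new RegExp(`(?<=\\S)[ \\t]*(</(?:${names})>)`, "g")

  const fixed = content
    .replace(openingInline, (_match: string, tag: string) => {
      fixCount++
      return `${tag}\n` // break the line right after the opening tag
    })
    .replace(closingInline, (_match: string, tag: string) => {
      fixCount++
      return `\n${tag}` // break the line right before the closing tag
    })

  return { content: fixed, fixCount }
}
```

Returning the fix count alongside the content mirrors the commit's note that the sanitizer reports how many fixes it applied.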

Changed sanitizer from processing all files in target languages to only processing the specific files that were just committed in the current job.

Changes:
- Track committed file paths during translation loop in committedFilePaths array
- Pass specific file paths to runSanitizer() instead of language codes
- Update runSanitizer() to accept specificFiles parameter
- When specificFiles provided, only process those exact files
- Falls back to language-based scanning when specificFiles not provided

This prevents sanitizer from touching hundreds of unrelated translation files when only translating a single file or directory.
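The specificFiles fallback can be sketched as a small target-resolution step. The option shape, the function name resolveSanitizerTargets, and the injected listFilesForLanguage helper are hypothetical; only the specificFiles parameter and the fallback behavior come from the commit message.

```typescript
interface SanitizerOptions {
  languages: string[]
  specificFiles?: string[]
}

// Decide which files the sanitizer should touch.
// `listFilesForLanguage` is a hypothetical helper that scans a language directory.
function resolveSanitizerTargets(
  options: SanitizerOptions,
  listFilesForLanguage: (lang: string) => string[]
): string[] {
  // Prefer the exact files committed in the current job.
  if (options.specificFiles && options.specificFiles.length > 0) {
    return options.specificFiles.slice()
  }
  // Fallback: language-based scanning across every configured language.
  const all: string[] = []
  for (const lang of options.languages) {
    for (const file of listFilesForLanguage(lang)) all.push(file)
  }
  return all
}
```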

Removed protectNames() function that was changing translated content strings:
- Was capitalizing 'ethereum.org' to 'Ethereum.org' in URLs
- Was replacing translated terms like 'Etéreo' with 'Ethereum'
- Was changing brand name capitalization in prose content

The sanitizer should ONLY fix code syntax issues that break the build:
- Block component line breaks (MDX parser requirement)
- Block HTML tag line breaks
- Header ID ASCII normalization
- Validation reporting (broken links, malformed markdown)

Content terminology and brand name handling should be done by the LLM or a separate content sanitizer with more nuanced rules (e.g., 'Ethereum' vs 'ethereum.org', never touch URLs/hrefs).

Sanitizer was reading stale files from disk instead of just-committed translated content, causing it to overwrite translations with English.

Changes:
- Pass in-memory content from committed files to sanitizer
- Sanitizer operates on provided content, only reads disk as fallback
- Support both .md and .json files with in-memory content
- Fix header ID sync to match by structure/position, not text
  - Extract header structure (level + position) from both files
  - Match headers by index: 1st H2 → 1st H2, etc.
  - Copy English IDs to corresponding translated headers
  - Warn on structure mismatches
- Add JSON sanitization: BOM removal, smart quote normalization

Previous flow (broken):
1. Commit translated file to branch
2. Sanitizer reads same path from LOCAL DISK (gets old English file)
3. Sanitizer processes English, commits back → translation overwritten

New flow (fixed):
1. Commit translated file, keep content in memory
2. Pass in-memory content to sanitizer
3. Sanitizer processes actual translation
4. Commits sanitized translation → translation preserved
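The position-based header ID sync might be sketched as follows. The function names, the regular expressions, and the simplification of pairing the i-th header overall (rather than tracking each level separately) are assumptions; the sketch also treats any line starting with # as a header, so it would need extra care around fenced code blocks.

```typescript
interface HeaderInfo {
  level: number
  lineIndex: number
}

const HEADER_RE = /^(#{1,6})[ \t]+\S/
const ID_RE = /[ \t]*\{#([^}]+)\}[ \t]*$/

function extractHeaders(lines: string[]): HeaderInfo[] {
  const headers: HeaderInfo[] = []
  lines.forEach((line, lineIndex) => {
    const match = HEADER_RE.exec(line)
    if (match) headers.push({ level: match[1].length, lineIndex })
  })
  return headers
}

function syncHeaderIds(
  english: string,
  translated: string
): { content: string; warnings: string[] } {
  const enLines = english.split("\n")
  const trLines = translated.split("\n")
  const enHeaders = extractHeaders(enLines)
  const trHeaders = extractHeaders(trLines)
  const warnings: string[] = []

  if (enHeaders.length !== trHeaders.length) {
    warnings.push(
      `Header count mismatch: ${enHeaders.length} English vs ${trHeaders.length} translated`
    )
  }

  const count = Math.min(enHeaders.length, trHeaders.length)
  for (let i = 0; i < count; i++) {
    const en = enHeaders[i]
    const tr = trHeaders[i]
    if (en.level !== tr.level) {
      warnings.push(`Header ${i + 1}: level ${en.level} vs ${tr.level}`)
      continue // structure mismatch: warn, do not guess
    }
    const idMatch = ID_RE.exec(enLines[en.lineIndex])
    if (!idMatch) continue // no English ID to copy
    // Drop any (possibly translated) ID, then append the English one.
    const text = trLines[tr.lineIndex].replace(ID_RE, "")
    trLines[tr.lineIndex] = `${text} {#${idMatch[1]}}`
  }

  return { content: trLines.join("\n"), warnings }
}
```

Matching by position rather than by heading text is what lets the English ID survive even though the translated heading shares no words with the source.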

Fixes two formatting issues in sanitized translations and improves the PR body:

1. Opening tags with inline content (especially other tags)
   - Changed regex from [^\n<] to [^\n] to match ANY character after tag
   - Now catches: <AlertDescription><strong>text</strong>
   - Previously missed when content started with < character
2. Missing blank lines after headers and block components
   - Added restoreBlankLinesFromEnglish() function
   - Compares translation structure with English source
   - Adds blank lines where English has them for readability
   - Preserves proper Markdown/MDX formatting conventions
3. Improved PR body formatting to list translated files

These fixes prevent MDX parser errors like: 'Expected a closing tag for <AlertDescription> before the end of paragraph'
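The effect of the character-class change can be checked in isolation. The surrounding pattern here is an assumption; only the switch from [^\n<] to [^\n] comes from the commit message.

```typescript
const tag = "<AlertDescription>"

// Old class: any character except newline or "<", so nested tags slipped through.
const oldPattern = new RegExp(`(${tag})([^\\n<])`)
// New class: any character except newline.
const newPattern = new RegExp(`(${tag})([^\\n])`)

const nestedTag = "<AlertDescription><strong>text</strong>"
const plainText = "<AlertDescription>text"

const oldCatchesNested = oldPattern.test(nestedTag) // false
const newCatchesNested = newPattern.test(nestedTag) // true
const oldCatchesPlain = oldPattern.test(plainText) // true
const newCatchesPlain = newPattern.test(plainText) // true
```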

Add .join('\n') to prevent array-to-string conversion from inserting commas between bullet points in the PR description.
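A small illustration of the bug, using hypothetical file names: interpolating an array into a template literal calls its toString(), which joins the elements with commas.

```typescript
const files = ["a.md", "b.md"] // hypothetical file names
const bullets = files.map((file) => `- ${file}`)

// Implicit array-to-string conversion inserts commas between the bullets.
const broken = `Files:\n${bullets}`
// Explicit join keeps one bullet per line.
const fixed = `Files:\n${bullets.join("\n")}`
```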

… strategy

- Add translateAttributes parser config for markdown files to enable translation of component attributes (title, description, alt, etc.)
- Add UPDATE_OPTION workflow input with radio choices:
  - keep_translations_and_approvals (default, preserves existing work)
  - keep_translations (preserves translations only)
  - clear_translations_and_approvals (full reset)
- Configure both file creation and update operations to use translateAttributes
- Whitelist 12 human-readable attributes while excluding technical properties like emoji, eventCategory, href, etc.

Revert recent attempts to configure Crowdin parser via API.

- Remove workflow input and env plumbing for update_option
- Drop PATCH /files/{id} for /parserOptions/translateAttributes
- Simplify PUT updates to only set storageId
- Rely on Crowdin UI-managed parser settings for now

Context: PATCH path required 'parserOptions' which may not exist and 'translateAttributes' expects a boolean. Failed attempts indicate this is best managed in Crowdin UI or a different endpoint. We’ll coordinate with our Crowdin liaison and restore a minimal, stable flow in code.

Sanitizer improvements:
- Brand name detection: warns when protected brands (Solidity, Alchemy, MetaMask, etc.) appear in English but are missing from translation
- Duplicated headings: fixes "## Text? Text? {#id}" → "## Text? {#id}"
- Broken markdown links: fixes "] (https://" → "](https://"

Updated pre-translate prompt to clarify that brand names should not be translated even when they have common translations in target languages.
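The two text fixes above can be sketched with plain regular expressions. fixBrokenMarkdownLinks is named later in the PR; fixDuplicatedHeadings is a hypothetical name, and both patterns are assumptions rather than the PR's actual code.

```typescript
// "] (https://" → "](https://": remove whitespace between the link text and URL.
function fixBrokenMarkdownLinks(content: string): string {
  return content.replace(/\][ \t]+\((?=https?:\/\/)/g, "](")
}

// "## Text? Text? {#id}" → "## Text? {#id}": collapse heading text repeated twice.
// Hypothetical function name; operates line by line via the "m" flag.
function fixDuplicatedHeadings(content: string): string {
  const duplicated = /^(#{1,6}[ \t]+)(.+?)[ \t]+\2([ \t]*\{#[^}]+\})?[ \t]*$/gm
  return content.replace(duplicated, "$1$2$3")
}
```

The backreference `\2` only matches when the whole heading text is repeated verbatim, so ordinary headings are left alone.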

Add fixTranslatedHrefs() to detect and auto-fix incorrectly translated internal hrefs using set comparison. Only fixes unambiguous cases (1 wrong + 1 missing); warns otherwise.

Add fixCollapsedComponentLineBreaks() to restore line breaks between consecutive MDX components when translators collapse them onto single lines.

Extract BLOCK_MDX_COMPONENTS constant to DRY up component lists. Add escapeRegex() and isInternalHref() helpers.
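A rough sketch of two of the helpers named above. The contents of BLOCK_MDX_COMPONENTS, the definition of "internal" (a single leading slash), and the regular expression are assumptions for illustration only.

```typescript
// Hypothetical contents; the real constant lists the project's block components.
const BLOCK_MDX_COMPONENTS: string[] = ["Card", "ExpandableCard", "Alert"]

// Assumption: an internal href is a root-relative path such as "/wallets/".
function isInternalHref(href: string): boolean {
  return href.startsWith("/") && !href.startsWith("//")
}

// Put each consecutive block component back on its own line.
function fixCollapsedComponentLineBreaks(content: string): string {
  const names = BLOCK_MDX_COMPONENTS.join("|")
  // A self-closing or closing block tag, followed on the same line by another block tag.
  const collapsed = new RegExp(
    `(<(?:${names})\\b[^>]*/>|</(?:${names})>)[ \\t]*(?=<(?:${names})\\b)`,
    "g"
  )
  return content.replace(collapsed, "$1\n")
}
```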

- Use escapeRegex() in checkProtectedBrandNames to handle special characters in brand names
- Simplify fixBrokenMarkdownLinks using replace callback
- Change type HeaderInfo to interface per TS conventions
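Why escaping matters can be shown with a brand name that contains a regex metacharacter. The escapeRegex() body below is the widely used escaping idiom, and the signature and return value of checkProtectedBrandNames are assumptions; "Web3.js" is a hypothetical entry chosen because of its dot.

```typescript
// Escape characters that have special meaning inside a regular expression.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
}

// Return the brands that appear in the English source but not in the translation.
function checkProtectedBrandNames(
  english: string,
  translated: string,
  brands: string[]
): string[] {
  return brands.filter((brand) => {
    const pattern = new RegExp(escapeRegex(brand))
    return pattern.test(english) && !pattern.test(translated)
  })
}
```

Without escaping, the dot in "Web3.js" would match any character, so a mangled "Web3xjs" in the translation would be wrongly accepted.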

Rewrite fixTranslatedHrefs() to compare hrefs within paragraph blocks instead of globally. Handles grammatical reordering in non-English languages. Only auto-fixes 1:1 mismatches within blocks; warns otherwise.
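Paragraph-scoped matching might be sketched like this. It is a simplified illustration under several assumptions: blocks are split on blank lines, only markdown-style links with root-relative hrefs are considered, and the output rejoins blocks with a single blank line.

```typescript
function extractInternalHrefs(block: string): string[] {
  const hrefs: string[] = []
  const linkRe = /\]\((\/(?!\/)[^)\s]*)\)/g
  let match: RegExpExecArray | null
  while ((match = linkRe.exec(block)) !== null) hrefs.push(match[1])
  return hrefs
}

function fixTranslatedHrefs(
  english: string,
  translated: string
): { content: string; warnings: string[] } {
  const enBlocks = english.split(/\n{2,}/)
  const trBlocks = translated.split(/\n{2,}/)
  if (enBlocks.length !== trBlocks.length) {
    return { content: translated, warnings: ["Paragraph count differs; skipping href fixes"] }
  }

  const warnings: string[] = []
  const fixedBlocks = trBlocks.map((block, i) => {
    const enHrefs = extractInternalHrefs(enBlocks[i])
    const trHrefs = extractInternalHrefs(block)
    // Order-insensitive comparison, so reordered links within a block still match.
    const missing = enHrefs.filter((href) => trHrefs.indexOf(href) === -1)
    const wrong = trHrefs.filter((href) => enHrefs.indexOf(href) === -1)

    if (missing.length === 0 && wrong.length === 0) return block
    if (missing.length === 1 && wrong.length === 1) {
      // Unambiguous 1:1 mismatch: restore the English href.
      return block.split(`](${wrong[0]})`).join(`](${missing[0]})`)
    }
    warnings.push(
      `Block ${i + 1}: ambiguous href mismatch (${wrong.length} unexpected, ${missing.length} missing)`
    )
    return block
  })

  return { content: fixedBlocks.join("\n\n"), warnings }
}
```

Scoping the comparison to a paragraph keeps the candidate set small, which is what makes the 1:1 rule safe to auto-apply.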

- Remove Spanish default from target_languages input
- Remove exposed timeout/poll workflow inputs (use code defaults)
- Delete unused scripts: check-translation-status, unhide-strings
- Delete dead code: prompt-model.ts, pr-review-comments.ts
- Delete standalone translate-jsx-attributes.yml workflow
- Clean up verbose logging and stale workflow references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Crowdin returns "created" when a job is queued behind other jobs. Previously this would cause polling to fail. Now we continue polling for both "created" and "in_progress" states.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
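The status handling can be sketched as a small decision function plus a polling loop. The "finished" status, the function names, the injected getStatus callback, and the default interval and attempt limit are all assumptions; only "created" and "in_progress" come from the commit message.

```typescript
type PollAction = "keep_polling" | "done" | "fail"

function nextPollAction(status: string): PollAction {
  switch (status) {
    case "created": // queued behind other jobs
    case "in_progress":
      return "keep_polling"
    case "finished": // assumed terminal success status
      return "done"
    default:
      return "fail"
  }
}

// `getStatus` is a hypothetical callback that fetches the job status.
async function pollUntilDone(
  getStatus: () => Promise<string>,
  intervalMs = 5000,
  maxAttempts = 120
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getStatus()
    const action = nextPollAction(status)
    if (action === "done") return
    if (action === "fail") {
      throw new Error(`Pre-translation ended with status "${status}"`)
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs))
  }
  throw new Error("Timed out waiting for pre-translation")
}
```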

- Add LanguageJobInfo type for tracking per-language jobs
- Move prompt creation from file-preparation to pre-translation phase
- Create one ephemeral prompt per language with language-specific glossary
- Poll all jobs in parallel with continue-on-error
- Log comma-separated job IDs for easy resume copy-paste

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

PRETRANSLATION_ID now accepts comma-separated values (e.g., "abc123,def456") for resuming multiple per-language jobs. Resume polls all jobs in parallel with continue-on-error.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
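Parsing the comma-separated value and polling with continue-on-error could be sketched as below. The function names, the JobOutcome shape, and the injected pollJob callback are hypothetical.

```typescript
// Split "abc123,def456" into individual job IDs, tolerating spaces and stray commas.
function parsePretranslationIds(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0)
}

interface JobOutcome {
  id: string
  ok: boolean
  error?: string
}

// Poll every job in parallel; a failed job is recorded instead of rejecting the batch.
function pollAllJobs(
  ids: string[],
  pollJob: (id: string) => Promise<void>
): Promise<JobOutcome[]> {
  return Promise.all(
    ids.map((id) =>
      pollJob(id).then(
        (): JobOutcome => ({ id, ok: true }),
        (err: unknown): JobOutcome => ({ id, ok: false, error: String(err) })
      )
    )
  )
}
```

In a workflow, the input would typically come from something like `parsePretranslationIds(process.env.PRETRANSLATION_ID)`.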

Collect languageIds from all responses instead of single response.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

When SPLIT_PRS=true, creates one PR per language instead of a single combined PR. Useful for large translation batches where individual PRs are easier to review.

- Each language gets its own branch: i18n/import/{ts}-{langCode}
- Continue-on-error: failed languages don't block others
- Summary printed at end with PR URLs and failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
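The per-language branch naming can be illustrated with a tiny helper. The function name is hypothetical, and the format of the {ts} timestamp is not stated in the commit, so it is passed in as an opaque string.

```typescript
// Build one branch name per language, following i18n/import/{ts}-{langCode}.
function buildLanguageBranches(timestamp: string, langCodes: string[]): string[] {
  return langCodes.map((langCode) => `i18n/import/${timestamp}-${langCode}`)
}
```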

Replace per-file commits with batch commits using GitHub's Git Data API. Each workflow phase now creates a single commit instead of one per file, reducing commit noise from hundreds to ~3 commits per language.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
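A batch commit through the Git Data API starts with one tree that contains every changed file. The sketch below only builds the request body for the create-tree endpoint; the type and function names are hypothetical, and the PR's own implementation may structure this differently.

```typescript
interface TranslatedFile {
  path: string
  content: string
}

interface TreeEntry {
  path: string
  mode: "100644" // regular (non-executable) file
  type: "blob"
  content: string
}

interface CreateTreeRequest {
  base_tree: string
  tree: TreeEntry[]
}

// Body for POST /repos/{owner}/{repo}/git/trees covering all files in one tree.
function buildTreeRequest(baseTreeSha: string, files: TranslatedFile[]): CreateTreeRequest {
  return {
    base_tree: baseTreeSha,
    tree: files.map(
      (file): TreeEntry => ({
        path: file.path,
        mode: "100644",
        type: "blob",
        content: file.content,
      })
    ),
  }
}

// Remaining steps, not shown here:
// 1. POST /repos/{owner}/{repo}/git/commits with { message, tree, parents }
// 2. PATCH /repos/{owner}/{repo}/git/refs/heads/{branch} with { sha }
```

Because every file rides in a single tree, the branch gains one commit per phase regardless of how many files were translated.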

- Deprecates canonical-llm-language-list.json

wackerow · 2026-02-04T22:33:02Z

@pettinarip Pulling this in since we've already been using it for all the jobs so far. Can iterate as needed from here. Please feel free to note any suggestions for the next iteration.

netlify · 2026-02-04T23:57:09Z

❌ Deploy Preview for ethereumorg failed.

Name	Link
🔨 Latest commit	`f982de6`
🔍 Latest deploy log	https://app.netlify.com/projects/ethereumorg/deploys/6983c8dfbd704d00081acbd0

wackerow added 30 commits November 26, 2025 16:02

feat: add start offset and post_import_sanitize

25c3452

feat: update to using gemini prompt

85a6784

feat: adaptive pre-translate polling + timeouts

888eb37

feat(i18n): add qa_check, workflow inputs

7918f6b

patch: revert QA prompt to pre_translate endpoint

ed59f54

debug: QA check endpoint

be02aa7

debug: completions check

837d6fd

fix: chunk requests by file, 500 at a time

f54fc54

refactor: modularize script

e6b3fa0

feat: initialize trust tiers for LLM-language quality

6235601

feat: chunk trust-tiers into separate PRs

c859ca5

refactor: accept internal lang codes

754c364

fix: json punctuation sanitization

5aba5dc

feat: implement supabase glossary/TM sync

6071764

feat: commit initial pre-translate prompt

09a370c

update(i18n): pre-translate prompt

1351ba6

feat: use in-repo prompt as canonical

68d4e39

Merge branch 'dev' into i18n-offset-sanitizer

135c8eb

fix(i18n): remove commas from PR file list

ffe02a4

patch: translateAttributes parser options

153f8a4

fix(i18n): list all translated files & diagnose AI model

b8841c4

wackerow and others added 22 commits December 22, 2025 13:53

refactor: do-not-translate path list

b19a1a6

update: excluded-paths to include terms-and-conditions

811a43f

refactor(i18n): improve regex safety and code style

5be8450

feat(i18n): use paragraph-scoped href matching for safer fixes

5ce97dd

Merge branch 'dev' into i18n-automation-mvp

d3432f3

fix: unused arg

c836325

merge: integrate i18n-glossary-integration branch

6849700

patch: logs and pr body details

e6d4518

fix(i18n): handle "created" status for queued jobs

c004ee1

feat(i18n): support comma-separated resume IDs

c498f8a

refactor(i18n): adapt download for multi-response results

938c750

fix: add SKIP_PRS boolean as action input

c374cbe

update(i18n): canonical llm language list

23c325f

fix(i18n): add rate limiting to prevent GitHub API abuse

fdd3c96

Merge branch 'dev' into i18n-automation-mvp

c3746b3

refactor: use i18n.config.json as canonical language list

03dce0c

wackerow mentioned this pull request Jan 24, 2026

deprecate(i18n): legacy crowdin infrastructure #17165

Merged

wackerow added 2 commits February 4, 2026 17:27

Merge branch 'dev' into i18n-automation-mvp

a742514

revert: .gitignore artifacts addition

f982de6

wackerow merged commit 9d8da24 into dev Feb 4, 2026
2 checks passed

wackerow deleted the i18n-automation-mvp branch February 4, 2026 22:33

wackerow mentioned this pull request Feb 5, 2026

Deploy v10.23.0 #17252

Merged

wackerow merged 105 commits into dev from i18n-automation-mvp
