i18n: automation mvp #16954

Merged
wackerow merged 105 commits into dev from i18n-automation-mvp on Feb 4, 2026

@wackerow wackerow commented Dec 18, 2025

Description

Refactored Crowdin translation import pipeline with automated syntax fixing and modular architecture.

Post-Import Sanitization (post_import_sanitize.ts)

  • Auto-fix broken markdown links
  • Fix translated internal hrefs (paragraph-scoped matching)
  • Normalize ButtonLink/Button formatting
  • Restore line breaks between collapsed MDX components
  • Protected brand name warnings
  • Header structure validation

JSX Attribute Translation

  • Extract/translate/reinsert JSX attributes (alt, title, label)
  • Standalone workflow option

Syntax Validation

  • MDX syntax tree validation before PR creation
  • PR comments with file/line errors

Architecture Refactor

  • main.ts broken into workflow modules
  • Separated Crowdin/GitHub API utilities

Add QA polish pass via Crowdin ai_prompt override; add workflow inputs for pre_translate and qa_check prompts
Switch QA to AI Prompt Completions (qa_check)
Resolve user id via GET /api/v2/user (no secret needed)
Remove CROWDIN_USER_ID from workflow env
Read PRE_TRANSLATE_PROMPT_ID from env
Add QA summary to PR body
Tidy main.ts by removing unused env references
Adds a backup of the existing Crowdin glossary/TM; syncs updates from the EthGlossary Supabase DB with Crowdin
Major refactor to simplify the translation automation workflow:

REMOVED:
- Supabase glossary integration and all related files
- Multi-tier QA system with trust matrix and language scoring
- OpenAI trust matrix generation
- Post-translation Crowdin re-sync logic
- Multiple PR workflow (high/medium/low trust tiers)
- 13 files deleted total

NEW FEATURES:
- Unified target_path input (auto-detects file vs directory)
- Smart translation modes: single file, directory, or full translation
- Mode-dependent timeout handling (waits for file/directory jobs, exits for full-translation jobs)
- Pre-translation artifact with job metadata for resuming
- Verbose logging flag for cleaner output
- Single PR workflow for all languages

UPDATES:
- GitHub workflow: simplified inputs, removed unused env vars
- config.ts: replaced fileLimit/startOffset with targetPath
- main.ts: complete rewrite (667 → 422 lines)
- lib/github/files.ts: smart path detection with excluded-paths.json
- lib/crowdin/files.ts: verbose logging support
- .gitignore: added artifacts/ directory

The pipeline now focuses on Spanish MVP with ability to expand to
25+ languages defined in canonical-llm-language-list.json. Post-import
sanitization ensures build stability without additional QA overhead.
Implements upstream fixes to prevent MDX parser errors in AI-translated content:

- Update existing Crowdin files before translation to ensure latest English source
  - Add PUT /files/{fileId} request when file exists instead of skipping update
  - Add 10s parsing delay after file updates

- Add defensive sanitizer for block component line breaks
  - New fixBlockComponentLineBreaks() function catches inline tags
  - Fixes 12 component types (Card, ExpandableCard, Alert, etc.)
  - Reports fix count in sanitizer issues

- Scope sanitizer to current translation job languages
  - Change from all configured languages to languagePairs from job

- Fetch AI model name dynamically from Crowdin API
  - New getPromptInfo() function with PromptResource type
  - Replace hard-coded "Gemini 2.5 Pro" in PR body with actual model

Fixes critical MDX parser errors like "Expected the closing tag </ExpandableCard>
either after the end of paragraph" caused by Crowdin AI using stale English files
and outputting inline block component tags.
Changed sanitizer from processing all files in target languages to only
processing the specific files that were just committed in the current job.

Changes:
- Track committed file paths during translation loop in committedFilePaths array
- Pass specific file paths to runSanitizer() instead of language codes
- Update runSanitizer() to accept specificFiles parameter
- When specificFiles provided, only process those exact files
- Falls back to language-based scanning when specificFiles not provided

This prevents sanitizer from touching hundreds of unrelated translation files
when only translating a single file or directory.
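
A minimal sketch of the scoping rule described above, assuming a hypothetical helper name and a `/translations/<lang>/` path layout (neither is confirmed by this PR): committed paths win when present, and language-based scanning is only the fallback.

```typescript
// Hypothetical helper; the function name and path layout are assumptions for illustration.
function selectFilesToSanitize(
  committedFilePaths: string[] | undefined,
  allTranslationFiles: string[],
  languageCodes: string[]
): string[] {
  // When the job reports exactly which files it committed, touch only those.
  if (committedFilePaths && committedFilePaths.length > 0) {
    return committedFilePaths.slice()
  }
  // Fallback: scan by language, assuming paths contain "/translations/<lang>/".
  return allTranslationFiles.filter((path) =>
    languageCodes.some((lang) => path.includes(`/translations/${lang}/`))
  )
}
```
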
Removed protectNames() function that was changing translated content strings:
- Was capitalizing 'ethereum.org' to 'Ethereum.org' in URLs
- Was replacing translated terms like 'Etéreo' with 'Ethereum'
- Was changing brand name capitalization in prose content

The sanitizer should ONLY fix code syntax issues that break the build:
- Block component line breaks (MDX parser requirement)
- Block HTML tag line breaks
- Header ID ASCII normalization
- Validation reporting (broken links, malformed markdown)

Content terminology and brand name handling should be done by the LLM or
a separate content sanitizer with more nuanced rules (e.g., 'Ethereum' vs
'ethereum.org', never touch URLs/hrefs).
Sanitizer was reading stale files from disk instead of just-committed
translated content, causing it to overwrite translations with English.

Changes:
- Pass in-memory content from committed files to sanitizer
- Sanitizer operates on provided content, only reads disk as fallback
- Support both .md and .json files with in-memory content
- Fix header ID sync to match by structure/position, not text
  - Extract header structure (level + position) from both files
  - Match headers by index: 1st H2 → 1st H2, etc.
  - Copy English IDs to corresponding translated headers
  - Warn on structure mismatches
- Add JSON sanitization: BOM removal, smart quote normalization

Previous flow (broken):
1. Commit translated file to branch
2. Sanitizer reads same path from LOCAL DISK (gets old English file)
3. Sanitizer processes English, commits back → translation overwritten

New flow (fixed):
1. Commit translated file, keep content in memory
2. Pass in-memory content to sanitizer
3. Sanitizer processes actual translation
4. Commits sanitized translation → translation preserved
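
The position-based header ID sync could look roughly like the sketch below. `extractHeaders` and `syncHeaderIds` are illustrative names, not the PR's actual functions, and the sketch does not skip `#` lines inside fenced code blocks.

```typescript
interface HeaderInfo {
  level: number
  lineIndex: number
  id: string | null
}

// Collects ATX headers ("## Title {#id}") in document order.
// Simplification: "#" lines inside fenced code blocks are not excluded.
function extractHeaders(markdown: string): HeaderInfo[] {
  const headers: HeaderInfo[] = []
  markdown.split("\n").forEach((line, lineIndex) => {
    const match = /^(#{1,6})\s+.*?(?:\{#([^}]+)\})?\s*$/.exec(line)
    if (match) {
      headers.push({ level: match[1].length, lineIndex, id: match[2] ?? null })
    }
  })
  return headers
}

// Copies English header IDs onto translated headers by index (1st H2 -> 1st H2),
// never by comparing header text. Level or count mismatches only produce warnings.
function syncHeaderIds(
  english: string,
  translated: string
): { content: string; warnings: string[] } {
  const source = extractHeaders(english)
  const target = extractHeaders(translated)
  const lines = translated.split("\n")
  const warnings: string[] = []
  if (source.length !== target.length) {
    warnings.push(
      `Header count mismatch: ${source.length} English vs ${target.length} translated`
    )
  }
  const pairs = Math.min(source.length, target.length)
  for (let i = 0; i < pairs; i++) {
    const en = source[i]
    const tr = target[i]
    if (en.level !== tr.level) {
      warnings.push(`Header ${i + 1}: H${en.level} in English vs H${tr.level} in translation`)
      continue
    }
    if (en.id === null) continue
    // Drop any existing (possibly translated) ID, then append the English one.
    const withoutId = lines[tr.lineIndex].replace(/\s*\{#[^}]*\}\s*$/, "")
    lines[tr.lineIndex] = `${withoutId} {#${en.id}}`
  }
  return { content: lines.join("\n"), warnings }
}
```

Matching by index keeps anchors stable even when the translated header text shares nothing with the English text.
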
Fixes two formatting issues in sanitized translations and improves the PR body:

1. Opening tags with inline content (especially other tags)
   - Changed regex from [^\n<] to [^\n] to match ANY character after tag
   - Now catches: <AlertDescription><strong>text</strong>
   - Previously missed when content started with < character

2. Missing blank lines after headers and block components
   - Added restoreBlankLinesFromEnglish() function
   - Compares translation structure with English source
   - Adds blank lines where English has them for readability
   - Preserves proper Markdown/MDX formatting conventions

3. Improved PR body formatting to list translated files

These fixes prevent MDX parser errors like:
'Expected a closing tag for <AlertDescription> before the end of paragraph'
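
To illustrate the regex change in item 1, here is a sketch under stated assumptions: the component list is an illustrative subset, `breakAfterOpeningTags` is a hypothetical name, and the real sanitizer's pattern may differ. The `matchAnyChar` flag switches between the old `[^\n<]` and the new `[^\n]` character class.

```typescript
// Illustrative subset; the PR mentions 12 block component types.
const BLOCK_COMPONENTS = ["Alert", "AlertDescription", "Card", "ExpandableCard"]

// Inserts a newline after a block component's opening tag when content
// follows it on the same line.
function breakAfterOpeningTags(content: string, matchAnyChar: boolean): string {
  const names = BLOCK_COMPONENTS.join("|")
  // Old class excluded "<", so a tag directly followed by another tag was missed.
  const nextChar = matchAnyChar ? "[^\\n]" : "[^\\n<]"
  const pattern = new RegExp(`(<(?:${names})(?:\\s[^>]*)?>)(?=${nextChar})`, "g")
  return content.replace(pattern, "$1\n")
}
```

With the old class, `<AlertDescription><strong>text</strong>` passes through unchanged; with the new one, the opening tag is moved onto its own line.
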
Add .join('\n') to prevent array-to-string conversion from inserting commas
between bullet points in the PR description.
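
The comma issue comes from interpolating an array into a template string, which calls `Array.prototype.toString()`. A small illustration (file names are made up):

```typescript
const translatedFiles = ["es/index.md", "es/wallets/index.md"]
const bullets = translatedFiles.map((file) => `- ${file}`)

// Interpolating the array directly joins its items with ",".
const withCommas = `Translated files:\n${bullets}`

// Joining explicitly keeps one bullet per line.
const withNewlines = `Translated files:\n${bullets.join("\n")}`
```
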
… strategy

- Add translateAttributes parser config for markdown files to enable translation
  of component attributes (title, description, alt, etc.)
- Add UPDATE_OPTION workflow input with radio choices:
  - keep_translations_and_approvals (default, preserves existing work)
  - keep_translations (preserves translations only)
  - clear_translations_and_approvals (full reset)
- Configure both file creation and update operations to use translateAttributes
- Whitelist 12 human-readable attributes while excluding technical properties
  like emoji, eventCategory, href, etc.
Revert recent attempts to configure Crowdin parser via API.

- Remove workflow input and env plumbing for update_option
- Drop PATCH /files/{id} for /parserOptions/translateAttributes
- Simplify PUT updates to only set storageId
- Rely on Crowdin UI-managed parser settings for now

Context: PATCH path required 'parserOptions' which may not exist and
'translateAttributes' expects a boolean. Failed attempts indicate this
is best managed in Crowdin UI or a different endpoint. We’ll coordinate
with our Crowdin liaison and restore a minimal, stable flow in code.
wackerow and others added 22 commits December 22, 2025 13:53
Sanitizer improvements:
- Brand name detection: warns when protected brands (Solidity, Alchemy, MetaMask, etc.) appear in English but are missing from translation
- Duplicated headings: fixes "## Text? Text? {#id}" → "## Text? {#id}"
- Broken markdown links: fixes "] (https://" → "](https://"
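
A rough sketch of the two auto-fixes above. The regexes are assumptions written for illustration, not the sanitizer's actual patterns; in particular, the heading fix would also collapse an intentionally repeated heading such as "## Bye Bye".

```typescript
// Removes whitespace between "]" and "(" when an http(s) URL follows.
function fixBrokenMarkdownLinks(content: string): { content: string; fixCount: number } {
  let fixCount = 0
  const fixed = content.replace(/\][ \t]+\((?=https?:\/\/)/g, () => {
    fixCount++
    return "]("
  })
  return { content: fixed, fixCount }
}

// Collapses headings whose entire text is repeated twice,
// keeping an optional trailing {#id}.
function fixDuplicatedHeadings(content: string): string {
  return content.replace(
    /^(#{1,6})[ \t]+(.+?)[ \t]+\2([ \t]*\{#[^}\n]+\})?[ \t]*$/gm,
    "$1 $2$3"
  )
}
```
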

Updated pre-translate prompt to clarify that brand names should not be translated even when they have common translations in target languages.
Add fixTranslatedHrefs() to detect and auto-fix incorrectly translated internal hrefs using set comparison. Only fixes unambiguous cases (1 wrong + 1 missing); warns otherwise.

Add fixCollapsedComponentLineBreaks() to restore line breaks between consecutive MDX components when translators collapse them onto single lines.

Extract BLOCK_MDX_COMPONENTS constant to DRY up component lists. Add escapeRegex() and isInternalHref() helpers.
- Use escapeRegex() in checkProtectedBrandNames to handle special characters in brand names
- Simplify fixBrokenMarkdownLinks using replace callback
- Change type HeaderInfo to interface per TS conventions
Rewrite fixTranslatedHrefs() to compare hrefs within paragraph blocks instead of globally. Handles grammatical reordering in non-English languages. Only auto-fixes 1:1 mismatches within blocks; warns otherwise.
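
The paragraph-scoped comparison might be sketched as follows. This is an assumption-heavy illustration: it only looks at markdown links whose href starts with `/`, treats blank-line-separated blocks as paragraphs, and normalizes block separators to a single blank line on output.

```typescript
// Returns hrefs of markdown links that start with "/" (treated as internal here).
function extractInternalHrefs(block: string): string[] {
  const pattern = /\]\((\/[^)\s]*)\)/g
  const hrefs: string[] = []
  let match: RegExpExecArray | null
  while ((match = pattern.exec(block)) !== null) {
    hrefs.push(match[1])
  }
  return hrefs
}

function fixTranslatedHrefs(
  english: string,
  translated: string
): { content: string; warnings: string[] } {
  const enBlocks = english.split(/\n{2,}/)
  const trBlocks = translated.split(/\n{2,}/)
  const warnings: string[] = []
  if (enBlocks.length !== trBlocks.length) {
    warnings.push("Paragraph count differs; skipping href comparison")
    return { content: translated, warnings }
  }
  const fixedBlocks = trBlocks.map((block, i) => {
    const enHrefs = new Set(extractInternalHrefs(enBlocks[i]))
    const trHrefs = new Set(extractInternalHrefs(block))
    const missing = Array.from(enHrefs).filter((href) => !trHrefs.has(href))
    const wrong = Array.from(trHrefs).filter((href) => !enHrefs.has(href))
    if (missing.length === 0 && wrong.length === 0) return block
    // Only the unambiguous 1:1 case is auto-fixed.
    if (missing.length === 1 && wrong.length === 1) {
      return block.split(`](${wrong[0]})`).join(`](${missing[0]})`)
    }
    warnings.push(`Block ${i + 1}: ${wrong.length} unexpected / ${missing.length} missing hrefs`)
    return block
  })
  return { content: fixedBlocks.join("\n\n"), warnings }
}
```

Because hrefs are compared as sets within each block, link order inside a paragraph does not matter, which is what tolerates grammatical reordering.
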
- Remove Spanish default from target_languages input
- Remove exposed timeout/poll workflow inputs (use code defaults)
- Delete unused scripts: check-translation-status, unhide-strings
- Delete dead code: prompt-model.ts, pr-review-comments.ts
- Delete standalone translate-jsx-attributes.yml workflow
- Clean up verbose logging and stale workflow references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Crowdin returns "created" when a job is queued behind other jobs.
Previously this would cause polling to fail. Now we continue
polling for both "created" and "in_progress" states.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
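
A sketch of the polling rule from the commit above, assuming the status is fetched through an injected `getStatus` callback (hypothetical; the actual Crowdin client code is not shown in this PR). Only "created" and "in_progress" come from the commit message; any other status is treated as terminal here.

```typescript
// Statuses that mean "keep polling": queued behind other jobs, or running.
const PENDING_STATUSES = ["created", "in_progress"]

function isPendingStatus(status: string): boolean {
  return PENDING_STATUSES.indexOf(status) !== -1
}

// Polls until the job leaves the pending states or attempts run out.
// intervalMs and maxAttempts defaults are illustrative, not the workflow's values.
async function pollUntilDone(
  getStatus: () => Promise<string>,
  intervalMs = 5000,
  maxAttempts = 120
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getStatus()
    if (!isPendingStatus(status)) return status
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs))
  }
  throw new Error(`Job still pending after ${maxAttempts} attempts`)
}
```
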
- Add LanguageJobInfo type for tracking per-language jobs
- Move prompt creation from file-preparation to pre-translation phase
- Create one ephemeral prompt per language with language-specific glossary
- Poll all jobs in parallel with continue-on-error
- Log comma-separated job IDs for easy resume copy-paste

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PRETRANSLATION_ID now accepts comma-separated values (e.g.,
"abc123,def456") for resuming multiple per-language jobs.
Resume polls all jobs in parallel with continue-on-error.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
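
Parsing the comma-separated `PRETRANSLATION_ID` value and polling with continue-on-error could be sketched like this. `pollJob` is a hypothetical callback standing in for the real per-job polling, and the result shape is an assumption.

```typescript
// Splits "abc123,def456" into IDs, tolerating spaces and stray commas.
function parsePretranslationIds(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0)
}

type JobResult =
  | { id: string; ok: true; status: string }
  | { id: string; ok: false; error: string }

// Polls every job in parallel; a failing job is recorded instead of
// rejecting the whole batch (continue-on-error).
function pollAllJobs(
  ids: string[],
  pollJob: (id: string) => Promise<string>
): Promise<JobResult[]> {
  return Promise.all(
    ids.map((id) =>
      pollJob(id).then(
        (status): JobResult => ({ id, ok: true, status }),
        (err: unknown): JobResult => ({ id, ok: false, error: String(err) })
      )
    )
  )
}
```
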
Collect languageIds from all responses instead of single response.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When SPLIT_PRS=true, creates one PR per language instead of a single combined PR. Useful for large translation batches where individual PRs are easier to review.

- Each language gets its own branch: i18n/import/{ts}-{langCode}
- Continue-on-error: failed languages don't block others
- Summary printed at end with PR URLs and failures

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
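
A small sketch of the branch planning implied above. The per-language pattern `i18n/import/{ts}-{langCode}` is from the commit message; the combined-PR branch name and the timestamp format are assumptions.

```typescript
// Maps each branch name to the languages it will carry.
function planImportBranches(
  timestamp: string,
  langCodes: string[],
  splitPrs: boolean
): Record<string, string[]> {
  const base = `i18n/import/${timestamp}`
  if (!splitPrs) {
    // Assumed name for the single combined branch.
    return { [base]: langCodes.slice() }
  }
  const plan: Record<string, string[]> = {}
  for (const code of langCodes) {
    plan[`${base}-${code}`] = [code]
  }
  return plan
}
```

In the workflow, `splitPrs` would presumably be derived from the `SPLIT_PRS` environment variable being set to "true".
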
Replace per-file commits with batch commits using GitHub's Git Data API.
Each workflow phase now creates a single commit instead of one per file, reducing commit noise from hundreds to ~3 commits per language.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
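
For context, a batch commit through GitHub's Git Data API generally means: read the branch ref, read the parent commit's tree, create one tree containing all changed files, create one commit, then move the ref. The sketch below assumes an injected `request` function (hypothetical; the PR's actual GitHub client is not shown) and passes file contents inline in the tree entries.

```typescript
interface TranslatedFile {
  path: string
  content: string
}

interface TreeEntry {
  path: string
  mode: "100644"
  type: "blob"
  content: string
}

// One tree entry per file; "100644" is the mode for a regular file blob.
function buildTreeEntries(files: TranslatedFile[]): TreeEntry[] {
  return files.map(
    (file): TreeEntry => ({
      path: file.path,
      mode: "100644",
      type: "blob",
      content: file.content,
    })
  )
}

// Hypothetical transport: performs an authenticated REST call and returns parsed JSON.
type GitRequest = (
  method: "GET" | "POST" | "PATCH",
  path: string,
  body?: unknown
) => Promise<any>

// Creates a single commit containing all files and advances the branch to it.
async function commitBatch(
  request: GitRequest,
  repo: string, // "owner/name"
  branch: string,
  message: string,
  files: TranslatedFile[]
): Promise<string> {
  const git = `/repos/${repo}/git`
  const ref = await request("GET", `${git}/ref/heads/${branch}`)
  const parentSha: string = ref.object.sha
  const parent = await request("GET", `${git}/commits/${parentSha}`)
  const tree = await request("POST", `${git}/trees`, {
    base_tree: parent.tree.sha,
    tree: buildTreeEntries(files),
  })
  const commit = await request("POST", `${git}/commits`, {
    message,
    tree: tree.sha,
    parents: [parentSha],
  })
  await request("PATCH", `${git}/refs/heads/${branch}`, { sha: commit.sha })
  return commit.sha
}
```
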
- Deprecates canonical-llm-language-list.json

wackerow commented Feb 4, 2026

@pettinarip Pulling this in since we've already been using it for all the jobs so far. Can iterate as needed from here. Please feel free to note any suggestions for the next iteration.

@wackerow wackerow merged commit 9d8da24 into dev Feb 4, 2026
2 checks passed
@wackerow wackerow deleted the i18n-automation-mvp branch February 4, 2026 22:33

netlify Bot commented Feb 4, 2026

Deploy Preview for ethereumorg failed.

Name Link
🔨 Latest commit f982de6
🔍 Latest deploy log https://app.netlify.com/projects/ethereumorg/deploys/6983c8dfbd704d00081acbd0

@wackerow wackerow mentioned this pull request Feb 5, 2026

Labels

dependencies 📦 (Changes related to project dependencies)
tooling 🔧 (Changes related to tooling of the project)
