i18n: automation mvp #16954
Merged
add QA polish pass via Crowdin ai_prompt override; workflow inputs for pre_translate and qa_check prompts
Switch QA to AI Prompt Completions (qa_check)
- Resolve user id via GET /api/v2/user (no secret needed)
- Remove CROWDIN_USER_ID from workflow env
- Read PRE_TRANSLATE_PROMPT_ID from env
- Add QA summary to PR body
- Tidy main.ts by removing unused env references
Adds backup for existing Crowdin glossary/TM; syncs updates from EthGlossary supabase db with Crowdin
Major refactor to simplify the translation automation workflow:
REMOVED:
- Supabase glossary integration and all related files
- Multi-tier QA system with trust matrix and language scoring
- OpenAI trust matrix generation
- Post-translation Crowdin re-sync logic
- Multiple PR workflow (high/medium/low trust tiers)
- 13 files deleted total
NEW FEATURES:
- Unified target_path input (auto-detects file vs directory)
- Smart translation modes: single file, directory, or full translation
- Intelligent timeout handling (waits for file/dir, exits for full)
- Pre-translation artifact with job metadata for resuming
- Verbose logging flag for cleaner output
- Single PR workflow for all languages
UPDATES:
- GitHub workflow: simplified inputs, removed unused env vars
- config.ts: replaced fileLimit/startOffset with targetPath
- main.ts: complete rewrite (667 → 422 lines)
- lib/github/files.ts: smart path detection with excluded-paths.json
- lib/crowdin/files.ts: verbose logging support
- .gitignore: added artifacts/ directory
The pipeline now focuses on Spanish MVP with ability to expand to 25+ languages defined in canonical-llm-language-list.json. Post-import sanitization ensures build stability without additional QA overhead.
Implements upstream fixes to prevent MDX parser errors in AI-translated content:
- Update existing Crowdin files before translation to ensure latest English source
  - Add PUT /files/{fileId} request when file exists instead of skipping update
  - Add 10s parsing delay after file updates
- Add defensive sanitizer for block component line breaks
  - New fixBlockComponentLineBreaks() function catches inline tags
  - Fixes 12 component types (Card, ExpandableCard, Alert, etc.)
  - Reports fix count in sanitizer issues
- Scope sanitizer to current translation job languages
  - Change from all configured languages to languagePairs from job
- Fetch AI model name dynamically from Crowdin API
  - New getPromptInfo() function with PromptResource type
  - Replace hard-coded "Gemini 2.5 Pro" in PR body with actual model
Fixes critical MDX parser errors like "Expected the closing tag </ExpandableCard>
either after the end of paragraph" caused by Crowdin AI using stale English files
and outputting inline block component tags.
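A minimal sketch of what a line-break fixer for block components could look like. The component list here is a hypothetical three-item subset (the commit mentions 12 types but names only a few), and the `FixResult` shape is an assumption, not the PR's actual return type. It also assumes attribute values never contain a `>` character.

```typescript
// Hypothetical subset of the block components the sanitizer handles.
const BLOCK_COMPONENTS = ["Card", "ExpandableCard", "Alert"];

interface FixResult {
  content: string;
  fixCount: number;
}

function fixBlockComponentLineBreaks(content: string): FixResult {
  let fixCount = 0;
  let result = content;
  for (const name of BLOCK_COMPONENTS) {
    // Opening tag immediately followed by any non-newline character.
    const openInline = new RegExp(`(<${name}(?:\\s[^>]*)?>)(?=[^\\n])`, "g");
    result = result.replace(openInline, (_match: string, tag: string) => {
      fixCount++;
      return `${tag}\n`;
    });
    // Closing tag immediately preceded by any non-newline character.
    const closeInline = new RegExp(`(?<=[^\\n])(</${name}>)`, "g");
    result = result.replace(closeInline, (_match: string, tag: string) => {
      fixCount++;
      return `\n${tag}`;
    });
  }
  return { content: result, fixCount };
}

const example = fixBlockComponentLineBreaks(
  '<ExpandableCard title="x">Hello</ExpandableCard>'
);
console.log(example.content);
console.log(`fixes applied: ${example.fixCount}`);
```

Input that already places each tag on its own line passes through unchanged with a fix count of zero, so the function is safe to run on every committed file.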
Changed sanitizer from processing all files in target languages to only processing the specific files that were just committed in the current job.
Changes:
- Track committed file paths during translation loop in committedFilePaths array
- Pass specific file paths to runSanitizer() instead of language codes
- Update runSanitizer() to accept specificFiles parameter
- When specificFiles provided, only process those exact files
- Falls back to language-based scanning when specificFiles not provided
This prevents sanitizer from touching hundreds of unrelated translation files when only translating a single file or directory.
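The selection rule described above can be sketched as a pure function. The `SanitizerOptions` shape, the `selectFilesToSanitize` name, and the `public/content/translations/<lang>/` layout are all assumptions made for illustration; only the `specificFiles` parameter and the language-based fallback come from the commit message.

```typescript
interface SanitizerOptions {
  languages?: string[];
  specificFiles?: string[];
}

// Hypothetical layout: translations live under "<root>/<lang>/...".
const TRANSLATIONS_ROOT = "public/content/translations";

function selectFilesToSanitize(
  allFiles: string[],
  options: SanitizerOptions
): string[] {
  const { specificFiles, languages = [] } = options;
  // Prefer the exact files committed in this job when they are provided.
  if (specificFiles && specificFiles.length > 0) {
    return specificFiles.slice();
  }
  // Fallback: every file that belongs to one of the job's languages.
  return allFiles.filter((file) =>
    languages.some(
      (lang) => file.indexOf(`${TRANSLATIONS_ROOT}/${lang}/`) === 0
    )
  );
}
```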
Removed protectNames() function that was changing translated content strings:
- Was capitalizing 'ethereum.org' to 'Ethereum.org' in URLs
- Was replacing translated terms like 'Etéreo' with 'Ethereum'
- Was changing brand name capitalization in prose content
The sanitizer should ONLY fix code syntax issues that break the build:
- Block component line breaks (MDX parser requirement)
- Block HTML tag line breaks
- Header ID ASCII normalization
- Validation reporting (broken links, malformed markdown)
Content terminology and brand name handling should be done by the LLM or a separate content sanitizer with more nuanced rules (e.g., 'Ethereum' vs 'ethereum.org', never touch URLs/hrefs).
Sanitizer was reading stale files from disk instead of just-committed translated content, causing it to overwrite translations with English.
Changes:
- Pass in-memory content from committed files to sanitizer
- Sanitizer operates on provided content, only reads disk as fallback
- Support both .md and .json files with in-memory content
- Fix header ID sync to match by structure/position, not text
  - Extract header structure (level + position) from both files
  - Match headers by index: 1st H2 → 1st H2, etc.
  - Copy English IDs to corresponding translated headers
  - Warn on structure mismatches
- Add JSON sanitization: BOM removal, smart quote normalization
Previous flow (broken):
1. Commit translated file to branch
2. Sanitizer reads same path from LOCAL DISK (gets old English file)
3. Sanitizer processes English, commits back → translation overwritten
New flow (fixed):
1. Commit translated file, keep content in memory
2. Pass in-memory content to sanitizer
3. Sanitizer processes actual translation
4. Commits sanitized translation → translation preserved
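The position-based header ID sync could look roughly like the sketch below. It is one reading of "match by structure/position": headers are paired by their overall index and the pair is skipped with a warning when the levels differ. The function names and the `HeaderInfo` fields are assumptions, and the sketch does not skip `#` lines inside fenced code blocks.

```typescript
interface HeaderInfo {
  level: number; // number of leading '#' characters
  lineIndex: number; // position of the header line in the file
  id: string | null; // custom ID from a trailing {#id}, if any
}

function extractHeaders(markdown: string): HeaderInfo[] {
  const headers: HeaderInfo[] = [];
  markdown.split("\n").forEach((line, lineIndex) => {
    const m = /^(#{1,6})\s+.*?(?:\s*\{#([^}]+)\})?\s*$/.exec(line);
    if (m) headers.push({ level: m[1].length, lineIndex, id: m[2] ?? null });
  });
  return headers;
}

function syncHeaderIds(
  english: string,
  translated: string
): { content: string; warnings: string[] } {
  const en = extractHeaders(english);
  const tr = extractHeaders(translated);
  const warnings: string[] = [];
  if (en.length !== tr.length) {
    warnings.push(`Header count mismatch: ${en.length} vs ${tr.length}`);
    return { content: translated, warnings };
  }
  const lines = translated.split("\n");
  en.forEach((enHeader, i) => {
    const trHeader = tr[i];
    if (enHeader.level !== trHeader.level) {
      warnings.push(`Header ${i + 1}: level mismatch`);
      return;
    }
    if (enHeader.id === null) return;
    // Drop whatever ID the translation carries and copy the English one.
    const text = lines[trHeader.lineIndex].replace(/\s*\{#[^}]+\}\s*$/, "");
    lines[trHeader.lineIndex] = `${text} {#${enHeader.id}}`;
  });
  return { content: lines.join("\n"), warnings };
}
```

Matching by position rather than by text is what makes this work at all: the translated heading text never equals the English text, but the heading outline normally does.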
Fixes two formatting issues in sanitized translations and improves PR body formatting:
1. Opening tags with inline content (especially other tags)
   - Changed regex from [^\n<] to [^\n] to match ANY character after tag
   - Now catches: <AlertDescription><strong>text</strong>
   - Previously missed when content started with < character
2. Missing blank lines after headers and block components
   - Added restoreBlankLinesFromEnglish() function
   - Compares translation structure with English source
   - Adds blank lines where English has them for readability
   - Preserves proper Markdown/MDX formatting conventions
3. Improved PR body formatting to list translated files
These fixes prevent MDX parser errors like: 'Expected a closing tag for <AlertDescription> before the end of paragraph'
Add .join('\n') to prevent array-to-string conversion from inserting commas
between bullet points in the PR description.
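For illustration, the comma problem comes from JavaScript's default array-to-string conversion, which joins elements with `,`. The file names and variable names below are made up for the example.

```typescript
const translatedFiles = ["es/index.md", "es/wallets.md"];
const bullets = translatedFiles.map((file) => `- ${file}`);

// Interpolating the array directly calls Array.prototype.toString(),
// which is equivalent to bullets.join(",").
const brokenBody = `Translated files:\n${bullets}`;

// Joining explicitly keeps one bullet per line.
const fixedBody = `Translated files:\n${bullets.join("\n")}`;

console.log(brokenBody);
console.log(fixedBody);
```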
… strategy
- Add translateAttributes parser config for markdown files to enable translation of component attributes (title, description, alt, etc.)
- Add UPDATE_OPTION workflow input with radio choices:
  - keep_translations_and_approvals (default, preserves existing work)
  - keep_translations (preserves translations only)
  - clear_translations_and_approvals (full reset)
- Configure both file creation and update operations to use translateAttributes
- Whitelist 12 human-readable attributes while excluding technical properties like emoji, eventCategory, href, etc.
Revert recent attempts to configure Crowdin parser via API.
- Remove workflow input and env plumbing for update_option
- Drop PATCH /files/{id} for /parserOptions/translateAttributes
- Simplify PUT updates to only set storageId
- Rely on Crowdin UI-managed parser settings for now
Context: PATCH path required 'parserOptions' which may not exist and
'translateAttributes' expects a boolean. Failed attempts indicate this
is best managed in Crowdin UI or a different endpoint. We’ll coordinate
with our Crowdin liaison and restore a minimal, stable flow in code.
Sanitizer improvements:
- Brand name detection: warns when protected brands (Solidity, Alchemy, MetaMask, etc.) appear in English but are missing from translation
- Duplicated headings: fixes "## Text? Text? {#id}" → "## Text? {#id}"
- Broken markdown links: fixes "] (https://" → "](https://"
Updated pre-translate prompt to clarify that brand names should not be translated even when they have common translations in target languages.
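The two text fixes listed above can be sketched as follows. `fixBrokenMarkdownLinks` is named elsewhere in this PR, but its body here is a guess; `fixDuplicatedHeading` is a hypothetical name. The heading fix only handles the exact pattern from the example, where the text is repeated once with a single space between the copies.

```typescript
// "## Text? Text? {#id}" -> "## Text? {#id}"
function fixDuplicatedHeading(line: string): string {
  const m = /^(#{1,6})\s+(.+?)\s+\{#([^}]+)\}\s*$/.exec(line);
  if (!m) return line;
  const [, hashes, text, id] = m;
  if (text.length % 2 === 0) return line; // "X X" always has odd length
  const half = (text.length - 1) / 2;
  const first = text.slice(0, half);
  const second = text.slice(half + 1);
  if (text.charAt(half) !== " " || first !== second) return line;
  return `${hashes} ${first} {#${id}}`;
}

// "] (https://" -> "](https://"
function fixBrokenMarkdownLinks(content: string): {
  content: string;
  fixCount: number;
} {
  let fixCount = 0;
  const fixed = content.replace(
    /\][ \t]+\((https?:\/\/)/g,
    (_match: string, scheme: string) => {
      fixCount++;
      return `](${scheme}`;
    }
  );
  return { content: fixed, fixCount };
}
```

The link pattern will also rewrite prose such as `[1] (https://…)` that was never meant to be a link, so reporting the fix count for review is a reasonable safeguard.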
Add fixTranslatedHrefs() to detect and auto-fix incorrectly translated internal hrefs using set comparison. Only fixes unambiguous cases (1 wrong + 1 missing); warns otherwise. Add fixCollapsedComponentLineBreaks() to restore line breaks between consecutive MDX components when translators collapse them onto single lines. Extract BLOCK_MDX_COMPONENTS constant to DRY up component lists. Add escapeRegex() and isInternalHref() helpers.
- Use escapeRegex() in checkProtectedBrandNames to handle special characters in brand names
- Simplify fixBrokenMarkdownLinks using replace callback
- Change type HeaderInfo to interface per TS conventions
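A sketch of why the escaping matters, assuming `checkProtectedBrandNames` builds a regular expression per brand (the real signature is not shown in this PR). The `ethereum.org` entry is a hypothetical brand chosen because its `.` would otherwise match any character.

```typescript
// Escape every character that has special meaning in a regular expression.
function escapeRegex(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the brands that appear in the English source but not in the translation.
function checkProtectedBrandNames(
  english: string,
  translated: string,
  brands: string[]
): string[] {
  return brands.filter((brand) => {
    const pattern = new RegExp(escapeRegex(brand));
    return pattern.test(english) && !pattern.test(translated);
  });
}

const missingBrands = checkProtectedBrandNames(
  "Use MetaMask and Solidity on ethereum.org",
  "Usa MetaMask y Solidez en ethereumXorg",
  ["MetaMask", "Solidity", "ethereum.org"]
);
console.log(missingBrands);
```

Without the escape, the pattern `ethereum.org` would match `ethereumXorg` and the missing brand would go unreported.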
Rewrite fixTranslatedHrefs() to compare hrefs within paragraph blocks instead of globally. Handles grammatical reordering in non-English languages. Only auto-fixes 1:1 mismatches within blocks; warns otherwise.
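A rough sketch of the per-block comparison, under several assumptions: paragraph blocks are separated by blank lines, only Markdown links of the form `](href)` are inspected, and `isInternalHref` treats paths starting with `/` or `#` as internal. None of these details are confirmed by the PR.

```typescript
// Hypothetical rule: site-relative paths and same-page anchors are internal.
function isInternalHref(href: string): boolean {
  return href.charAt(0) === "/" || href.charAt(0) === "#";
}

function extractInternalHrefs(block: string): string[] {
  const linkPattern = /\]\(([^)\s]+)\)/g;
  const hrefs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = linkPattern.exec(block)) !== null) {
    if (isInternalHref(match[1])) hrefs.push(match[1]);
  }
  return hrefs;
}

function fixTranslatedHrefs(
  english: string,
  translated: string
): { content: string; warnings: string[] } {
  const enBlocks = english.split(/\n{2,}/);
  // The capture group keeps separators: even indexes are blocks, odd are gaps.
  const trParts = translated.split(/(\n{2,})/);
  const warnings: string[] = [];
  if (enBlocks.length !== (trParts.length + 1) / 2) {
    warnings.push("Block count differs from English; hrefs not checked");
    return { content: translated, warnings };
  }
  enBlocks.forEach((enBlock, i) => {
    const enHrefs = new Set(extractInternalHrefs(enBlock));
    const trHrefs = new Set(extractInternalHrefs(trParts[2 * i]));
    const missing = Array.from(enHrefs).filter((h) => !trHrefs.has(h));
    const wrong = Array.from(trHrefs).filter((h) => !enHrefs.has(h));
    if (missing.length === 0 && wrong.length === 0) return;
    if (missing.length === 1 && wrong.length === 1) {
      // Unambiguous 1:1 mismatch: swap the wrong href for the English one.
      trParts[2 * i] = trParts[2 * i]
        .split(`](${wrong[0]})`)
        .join(`](${missing[0]})`);
    } else {
      warnings.push(
        `Block ${i + 1}: ${wrong.length} unexpected and ${missing.length} missing internal hrefs`
      );
    }
  });
  return { content: trParts.join(""), warnings };
}
```

Comparing sets inside each block means link order within a paragraph does not matter, which is what allows for grammatical reordering in the target language.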
- Remove Spanish default from target_languages input
- Remove exposed timeout/poll workflow inputs (use code defaults)
- Delete unused scripts: check-translation-status, unhide-strings
- Delete dead code: prompt-model.ts, pr-review-comments.ts
- Delete standalone translate-jsx-attributes.yml workflow
- Clean up verbose logging and stale workflow references
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Crowdin returns "created" when a job is queued behind other jobs. Previously this would cause polling to fail. Now we continue polling for both "created" and "in_progress" states.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
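The polling change can be sketched like this. Only "created" and "in_progress" come from the commit message; the other status names, the `pollUntilDone` function, and the injected `getStatus` callback are assumptions made so the example stays independent of any HTTP client.

```typescript
// "created" and "in_progress" are from the commit; the rest are assumed.
type JobStatus =
  | "created"
  | "in_progress"
  | "finished"
  | "failed"
  | "canceled";

// A queued job ("created") is treated the same as a running one.
function shouldKeepPolling(status: JobStatus): boolean {
  return status === "created" || status === "in_progress";
}

async function pollUntilDone(
  getStatus: () => Promise<JobStatus>,
  options: { intervalMs: number; timeoutMs: number }
): Promise<JobStatus> {
  const deadline = Date.now() + options.timeoutMs;
  while (true) {
    const status = await getStatus();
    if (!shouldKeepPolling(status)) return status;
    if (Date.now() >= deadline) {
      throw new Error(`Timed out while job was still "${status}"`);
    }
    await new Promise<void>((resolve) =>
      setTimeout(resolve, options.intervalMs)
    );
  }
}
```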
- Add LanguageJobInfo type for tracking per-language jobs
- Move prompt creation from file-preparation to pre-translation phase
- Create one ephemeral prompt per language with language-specific glossary
- Poll all jobs in parallel with continue-on-error
- Log comma-separated job IDs for easy resume copy-paste
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PRETRANSLATION_ID now accepts comma-separated values (e.g., "abc123,def456") for resuming multiple per-language jobs. Resume polls all jobs in parallel with continue-on-error.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
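A minimal sketch of the resume path, assuming the IDs arrive as a raw string and that each job is polled through some caller-supplied function. `parsePretranslationIds`, `pollAllJobs`, and `PollOutcome` are hypothetical names.

```typescript
// Split "abc123,def456" into individual job IDs, ignoring blanks.
function parsePretranslationIds(raw: string | undefined): string[] {
  return (raw ?? "")
    .split(",")
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
}

type PollOutcome =
  | { id: string; ok: true; status: string }
  | { id: string; ok: false; error: string };

// Poll every job concurrently; a failure is recorded instead of rethrown,
// so one failed language does not abort the others.
async function pollAllJobs(
  ids: string[],
  pollOne: (id: string) => Promise<string>
): Promise<PollOutcome[]> {
  return Promise.all(
    ids.map(async (id): Promise<PollOutcome> => {
      try {
        return { id, ok: true, status: await pollOne(id) };
      } catch (err) {
        return { id, ok: false, error: String(err) };
      }
    })
  );
}
```

Catching inside each mapped promise is what gives the continue-on-error behavior: `Promise.all` never sees a rejection, so it always resolves with one outcome per job.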
Collect languageIds from all responses instead of single response.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When SPLIT_PRS=true, creates one PR per language instead of a single combined PR. Useful for large translation batches where individual PRs are easier to review.
- Each language gets its own branch: i18n/import/{ts}-{langCode}
- Continue-on-error: failed languages don't block others
- Summary printed at end with PR URLs and failures
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace per-file commits with batch commits using GitHub's Git Data API. Each workflow phase now creates a single commit instead of one per file, reducing commit noise from hundreds to ~3 commits per language.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
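A hedged sketch of how a batch commit is usually assembled with the Git Data API: one tree containing every changed file, one commit pointing at that tree, then a ref update. The PR does not show its implementation, so the helper below only builds the request payloads and makes no network calls; `TranslatedFile`, `buildTreeEntries`, and `buildCreateTreeBody` are hypothetical names.

```typescript
interface TranslatedFile {
  path: string;
  content: string;
}

// One entry of the "tree" array accepted by POST /repos/{owner}/{repo}/git/trees.
interface TreeEntry {
  path: string;
  mode: "100644"; // regular, non-executable file
  type: "blob";
  content: string;
}

function buildTreeEntries(files: TranslatedFile[]): TreeEntry[] {
  return files.map(
    (file): TreeEntry => ({
      path: file.path,
      mode: "100644",
      type: "blob",
      content: file.content,
    })
  );
}

// Body for the create-tree request; base_tree keeps all untouched files.
function buildCreateTreeBody(baseTreeSha: string, files: TranslatedFile[]) {
  return { base_tree: baseTreeSha, tree: buildTreeEntries(files) };
}

// Typical sequence once the payload is built:
// 1. POST  /repos/{owner}/{repo}/git/trees    with the body above
// 2. POST  /repos/{owner}/{repo}/git/commits  with { message, tree, parents }
// 3. PATCH /repos/{owner}/{repo}/git/refs/heads/{branch} with { sha }
```

Because all files travel in a single tree, the number of commits depends on the number of workflow phases rather than the number of translated files.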
- Deprecates canonical-llm-language-list.json
@pettinarip Pulling this in since we've already been using it for all the jobs so far. Can iterate as needed from here. Please feel free to note any suggestions for the next iteration.
❌ Deploy Preview for ethereumorg failed.
Description
Refactored Crowdin translation import pipeline with automated syntax fixing and modular architecture.
Post-Import Sanitization (post_import_sanitize.ts)
JSX Attribute Translation
Syntax Validation
Architecture Refactor