Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
The table of contents is too big for display.
Diff view
Diff view
  •  
  •  
  •  
Original file line number Diff line number Diff line change
@@ -0,0 +1,95 @@
---
title: "Sanitizer Bug: CJK Punctuation Not Auto-Fixed in Non-CJK Locales"
date: 2026-03-25
category: logic-errors
component: src/scripts/i18n/post_import_sanitize.ts
tags:
- i18n
- sanitizer
- cross-script
- cjk
- punctuation
- urdu
severity: medium
symptoms:
- "CJK ideographic full stop U+3002 appearing in Urdu (ur) translation files"
- "detectCrossScriptContamination warns but does not fix cross-script punctuation"
- "Detection pattern for non-CJK locales excludes CJK punctuation range U+3000-U+303F"
root_cause: |
Two gaps in cross-script handling:
1. Detection patterns for non-CJK locales (ar, ur, hi, etc.) only covered CJK Unified
Ideographs (U+4E00-U+9FFF) but NOT CJK Symbols and Punctuation (U+3000-U+303F).
The ideographic full stop U+3002 falls in the latter range, so it was not detected.
2. detectCrossScriptContamination() only emits warnings -- it never modifies content.
Even if the character were detected, it would not be fixed automatically.
resolution: |
Added fixCrossScriptPunctuation() that replaces CJK ideographic full stop U+3002
with locale-appropriate equivalents outside code fences:
- ar/ur: U+06D4 (Arabic full stop)
- hi/mr/bn: U+0964 (Devanagari danda)
- Latin/Cyrillic/other non-CJK: ASCII period
- CJK locales (ja, ko, zh, zh-tw): no-op (U+3002 is correct)
Wired into both processMarkdownFile and processJsonFile pipelines, placed before
detectCrossScriptContamination so warnings reflect post-fix state.
11 unit tests added.
related:
- docs/solutions/integration-issues/sanitizer-test-research.md
- docs/solutions/integration-issues/post-import-sanitizer-bugs-found-japanese-review.md
discovered_in: "PR #17854 (Urdu full translation)"
files_affected:
- public/content/translations/ur/developers/docs/standards/tokens/erc-1363/index.md
- src/intl/ur/glossary-tooltip.json
---

# Sanitizer Bug: CJK Punctuation Not Auto-Fixed in Non-CJK Locales

## Problem

During review of PR #17854 (Urdu translation), two instances of CJK ideographic full stop
(`U+3002`) were found in Urdu files where the Urdu full stop (`U+06D4`) was expected:

- `erc-1363/index.md` line 191: inside a Solidity code comment
- `glossary-tooltip.json` line 165: in the zk-proof definition value

These are classic Gemini cross-script leak artifacts -- the model occasionally copies
punctuation from CJK training data into non-CJK output.

## Root Cause

The sanitizer had `detectCrossScriptContamination()` which warned about foreign characters,
but:

1. **Detection gap**: The `ur` detector pattern was `[\u0900-\u097F\u4E00-\u9FFF\u0400-\u04FF]`,
covering CJK ideographs but NOT CJK punctuation (`\u3000-\u303F`). The full stop `U+3002`
is in the punctuation range, so it was invisible to the detector.

2. **No auto-fix**: Even if detected, the function only pushes to the warnings array. It never
modifies content.

## Fix

Added `fixCrossScriptPunctuation(content, locale)` that:

- Processes content line-by-line with fence tracking (same pattern as `escapeMdxAngleBrackets`)
- Skips code fences and inline code spans
- Maps `U+3002` to the locale-appropriate full stop
- Returns `{ content, fixCount }` like all other fix functions

Locale mapping:

| Locale Group | Replacement | Unicode |
|-------------|-------------|---------|
| ar, ur | Arabic full stop | U+06D4 |
| hi, mr, bn | Devanagari danda | U+0964 |
| ja, ko, zh, zh-tw | No-op (correct) | U+3002 |
| All other | ASCII period | U+002E |

## Tests

11 tests in `tests/unit/sanitizer/standalone-fixes.spec.ts`:
- Per-locale replacement (ur, ar, hi, de, ru)
- CJK locale no-op (ja, zh)
- Code fence skipping
- Multiple occurrences
- Empty locale guard
- No-op when clean
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
---
title: Fix <span dir="ltr"> wrapping backtick inline code in RTL translations
date: 2026-03-25
category: logic-errors
component: Translation post-import sanitizer
tags:
- mdx
- rtl
- bidi
- inline-code
- sanitizer
- translation
- span-dir-ltr
severity: high
recurring: true
languages_affected:
- ar
- ur
files_modified:
- src/scripts/i18n/post_import_sanitize.ts
- tests/unit/sanitizer/standalone-fixes.spec.ts
---

# Fix `<span dir="ltr">` wrapping backtick inline code in RTL translations

## Problem Symptom

In RTL translation files (Arabic, Urdu), inline code wrapped in markdown backticks was rendered as broken text instead of properly formatted code. The MDX output contained patterns like:

```
<span dir="ltr">`APPLY(S,TX) -> S'`</span>
```

MDX cannot nest markdown syntax (backticks) inside JSX (`<span>` tags). The backtick content renders as literal broken text instead of an inline code block.

## Root Cause Analysis

Two sources produce this invalid pattern:

1. **Sanitizer's own RTL fixes**: `fixBareRtlDates` and `fixBareRtlEquations` wrap bare LTR content (dates, math equations) in `<span dir="ltr">...</span>` for BiDi correctness. While the `RTL_SKIP_PATTERN` regex is designed to skip inline code, edge cases can result in backtick content being wrapped.

2. **Gemini translations**: The LLM translator sometimes produces `<span dir="ltr">` wrappers around backtick content directly in its output.

In both cases, the `<span dir="ltr">` wrapper is redundant because inline code in backticks is already inherently LTR (rendered in a monospace Latin font by the browser).

## Fix Applied

Added a new sanitizer function `fixSpanWrappedBackticks` that:

1. Splits content on code fences (to avoid modifying fenced code blocks)
2. Finds all instances of `<span dir="ltr">` followed by backtick content followed by `</span>`
3. Replaces with just the backtick content (stripping the span wrapper)
4. Handles optional whitespace between the span tags and the backtick content

The function is placed in the pipeline immediately after `fixBareRtlDates` and `fixBareRtlEquations`, so it cleans up after them. It also handles cases where Gemini produced the pattern directly.

### Regex pattern

```typescript
/<span dir="ltr">\s*(`[^`]+`)\s*<\/span>/g
```

This matches `<span dir="ltr">` + optional whitespace + backtick-delimited content + optional whitespace + `</span>`, and replaces with just the backtick content (capture group 1).

## Verification

- Unit tests added in `tests/unit/sanitizer/standalone-fixes.spec.ts`
- Tests cover: basic unwrapping, multiple occurrences, non-backtick spans left alone, bare backticks left alone, whitespace handling, no-op case, correct fixCount, code fence skipping
Loading
Loading