
fix(windows): force UTF-8 stdin encoding so PowerShell preserves non-ASCII chars #131

Merged
cmeans-claude-dev[bot] merged 1 commit into main from fix/windows-utf8-encoding-129
May 7, 2026

Contributor

@cmeans-claude-dev cmeans-claude-dev Bot commented May 7, 2026

Closes #129.

Summary

  • Force [Console]::InputEncoding = [System.Text.Encoding]::UTF8 on every Windows PowerShell write script that pipes content through [Console]::In.ReadToEnd(). Centralized as a new module-level _WINDOWS_UTF8_PREAMBLE constant so the five affected scripts share one source of truth.
  • Add 6 new unit tests covering the bug surface: each affected script must contain the InputEncoding = UTF8 directive, the directive must precede the stdin read, and the byte payload must be valid UTF-8 even for the canonical em dash repro from the bug report.

Root cause

PowerShell's [Console]::In.ReadToEnd() decodes stdin via [Console]::InputEncoding, which defaults to the OEM/ANSI code page on Windows (commonly CP1252), not UTF-8. content.encode() in Python emits UTF-8 bytes, so multi-byte sequences (em dash U+2014, curly quotes, non-Latin scripts) were misread as separate CP1252 characters before Set-Clipboard ever ran, corrupting the clipboard payload.

For an em dash (UTF-8 0xE2 0x80 0x94) read as CP1252, the bytes become the three-character string â€” (U+00E2, U+20AC, U+201D), which downstream encoding conversions can warp further into the garbage observed in the original report ([Cö C for — C).
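The misread can be reproduced in plain Python, without PowerShell. This sketch only uses the standard `utf-8` and `cp1252` codecs; it illustrates the first decoding step, not the later conversions that produced the exact garbage in the report.

```python
em_dash = "\u2014"
utf8_bytes = em_dash.encode("utf-8")       # what Python writes to stdin
misread = utf8_bytes.decode("cp1252")      # what a CP1252 reader sees

print(utf8_bytes.hex(" "))                      # e2 80 94
print([f"U+{ord(ch):04X}" for ch in misread])   # ['U+00E2', 'U+20AC', 'U+201D']

# Decoding with the matching codec restores the original character.
assert utf8_bytes.decode("utf-8") == em_dash
```

One character goes in and three come out, so the corruption is already present before any clipboard API is called.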

Fix scope

Five scripts touched, all in src/mcp_clipboard/clipboard.py:

| Function | MIME | Before | After |
| --- | --- | --- | --- |
| _windows_write | text/plain | Set-Clipboard ... | preamble + Set-Clipboard ... |
| _windows_write_typed | text/plain | same as above | same as above |
| _windows_write_typed | text/html | reads stdin into SetData(Html, ...) | preamble + same |
| _windows_write_typed | text/rtf | reads stdin into SetData(Rtf, ...) | preamble + same |
| _windows_write_typed | image/svg+xml | reads stdin into SetData('image/svg+xml', ...) | preamble + same |

The text/html branch is particularly worth calling out: the CF_HTML byte offsets in the header (StartHTML, StartFragment, etc.) are computed against UTF-8 bytes in Python. If PowerShell decodes stdin as CP1252 and re-encodes the resulting string as UTF-16 inside SetData, the offsets are off and consumers see fragment markers pointing into garbage. Forcing UTF-8 at read time is what makes those offsets valid end-to-end.
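To make the offset problem concrete, here is a sketch of a CF_HTML payload builder that computes its offsets over UTF-8 bytes. It follows the general CF_HTML header layout (Version, StartHTML, EndHTML, StartFragment, EndFragment); the function names `build_cf_html` and `read_offset` are hypothetical and this is not the code in `clipboard.py`.

```python
import re


def build_cf_html(fragment: str) -> bytes:
    """Hypothetical builder: CF_HTML payload with offsets counted in UTF-8 bytes."""
    header_template = (
        "Version:0.9\r\n"
        "StartHTML:{:010d}\r\n"
        "EndHTML:{:010d}\r\n"
        "StartFragment:{:010d}\r\n"
        "EndFragment:{:010d}\r\n"
    )
    prefix = "<html><body><!--StartFragment-->".encode("utf-8")
    body = fragment.encode("utf-8")
    suffix = "<!--EndFragment--></body></html>".encode("utf-8")
    # Fixed-width fields keep the header length independent of the offset values.
    start_html = len(header_template.format(0, 0, 0, 0).encode("utf-8"))
    start_fragment = start_html + len(prefix)
    end_fragment = start_fragment + len(body)
    end_html = end_fragment + len(suffix)
    header = header_template.format(start_html, end_html, start_fragment, end_fragment)
    return header.encode("utf-8") + prefix + body + suffix


def read_offset(payload: bytes, key: str) -> int:
    """Read one numeric header field back out of the payload."""
    match = re.search(key.encode("ascii") + rb":(\d+)", payload)
    if match is None:
        raise ValueError(f"missing header field: {key}")
    return int(match.group(1).decode("ascii"))


payload = build_cf_html("a \u2014 b")
start = read_offset(payload, "StartFragment")
end = read_offset(payload, "EndFragment")
# The fragment is 5 characters but 7 bytes; the offsets only line up in bytes.
assert payload[start:end].decode("utf-8") == "a \u2014 b"
```

If the reader decodes these bytes with a different code page, the character positions no longer match the byte offsets in the header, which is the failure mode described above.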

Paths intentionally NOT changed

  • _windows_write_multi (multi-format writes via clipboard_copy_markdown) already encodes payloads as base64 and decodes UTF-8 inside PowerShell. The JSON envelope is ASCII-only (keys html/text/rtf, values are base64), so the default CP1252 InputEncoding reads it identically to UTF-8.
  • _windows_write_image pipes base64-encoded image data over stdin. Base64 is ASCII-only.

Both paths are already correct and the fix would be a no-op there. Left untouched to keep the diff focused on the bug.
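A quick check of the ASCII-only argument, as a sketch. The envelope shape below (a single `text` key holding base64) is an assumption modelled on the description of `_windows_write_multi`; the actual envelope may carry more keys.

```python
import base64
import json

text = "caf\u00e9 \u2014 \u65e5\u672c\u8a9e"
# Assumed envelope shape: JSON object whose values are base64 of UTF-8 bytes.
envelope = json.dumps({"text": base64.b64encode(text.encode("utf-8")).decode("ascii")})
wire = envelope.encode("utf-8")

# Every byte on the wire is ASCII, so CP1252 and UTF-8 decode it identically.
assert wire.isascii()
assert wire.decode("cp1252") == wire.decode("utf-8")

# The non-ASCII content survives because it is only decoded after base64.
decoded = json.loads(wire.decode("cp1252"))
assert base64.b64decode(decoded["text"]).decode("utf-8") == text
```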

Verification

  • 605 unit tests pass (599 before + 6 new).
  • ruff check, ruff format --check, mypy clean.
  • scripts/sync-server-json.py --check reports in sync at 2.5.1.
  • The 6 new tests fail without the fix and pass with it (verified by running the full suite both before and after the source change).

Why this wasn't caught sooner

The bug requires (a) Windows, (b) non-ASCII content, and (c) a downstream paste target that surfaces the corruption visibly. The unit-test mocks asserted the PowerShell invocation shape but not the encoding directive, and the integration-x11 CI job runs on Linux. Live Windows testing was the only path to catching it — the QEMU Windows guest where the bug was reported is the first time mcp-clipboard has been exercised on real Windows with non-ASCII content.

Test plan

  • CI passes (lint, typecheck, all 3 Python versions, integration-x11, codecov, version-sync, validate-server-json).
  • Manual verification on Windows: copy a string with em dash, en dash, curly quotes, and a non-Latin character; paste into Notepad and confirm round-trip fidelity.
  • Same manual verification with clipboard_copy_markdown to confirm the text/html and text/rtf paths behave correctly with non-ASCII content.

Notes

…ASCII chars (closes #129)

PowerShell's [Console]::In.ReadToEnd() decodes stdin via
[Console]::InputEncoding, which defaults to the OEM/ANSI code page on
Windows (commonly CP1252), not UTF-8. Python's content.encode() emits
UTF-8 bytes, so multi-byte sequences (em dash U+2014, curly quotes,
non-Latin scripts) were misread as separate CP1252 characters before
Set-Clipboard ever ran, corrupting the clipboard payload.

Each affected PowerShell write script now sets
[Console]::InputEncoding = [System.Text.Encoding]::UTF8 before reading
stdin. Centralized as _WINDOWS_UTF8_PREAMBLE so the four typed-write
branches (text/plain, text/html, text/rtf, image/svg+xml) and the
plain text path share one source of truth.

The base64-on-stdin paths (_windows_write_multi, _windows_write_image)
were already safe because base64 alphabets are ASCII-only and
CP1252/UTF-8 agree on the ASCII range, so neither path was changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Ready for QA Dev work complete — QA can begin review and removed Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels May 7, 2026

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@cmeans cmeans added the QA Active QA is actively reviewing; Dev should not push changes label May 7, 2026
@github-actions github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label May 7, 2026
Owner

@cmeans cmeans left a comment


QA review — PR #131 (round 1)

Verdict: Zero findings. Recommending Ready for QA Signoff.

Scope check vs #129

| Area | Status |
| --- | --- |
| Closes #129 | yes — fix targets the exact root cause described in the issue |
| Fix direction matches #129's "Fix direction" | yes — chose Option 1 (InputEncoding = UTF8 preamble), which is the smallest-viable change |
| Test direction matches #129's "Test plan" | yes — Windows-only-shaped unit test asserting the preamble is present and precedes the stdin read |
| All affected scripts covered | yes — see coverage matrix below |

Coverage matrix — every [Console]::In.ReadToEnd() site

I grepped src/mcp_clipboard/clipboard.py for every call site that pipes content through [Console]::In.ReadToEnd(). All seven sites were considered:

| Site | MIME / payload | Stdin alphabet | Action |
| --- | --- | --- | --- |
| _windows_write (l. 587) | text/plain | UTF-8 text | preamble added ✓ |
| _windows_write_typed (l. 767) | text/plain | UTF-8 text | preamble added ✓ |
| _windows_write_typed (l. 777) | text/html | UTF-8 text | preamble added ✓ |
| _windows_write_typed (l. 791) | text/rtf | UTF-8 text | preamble added ✓ |
| _windows_write_typed (l. 809) | image/svg+xml | UTF-8 text | preamble added ✓ |
| _windows_write_multi (l. 908) | base64 JSON envelope | ASCII-only | not needed (justification in PR body — verified) |
| _windows_write_image (l. 1000) | base64 image bytes | ASCII-only | not needed (justification in PR body — verified) |

The two skipped sites are correctly out of scope: their stdin payloads are pure ASCII (base64 alphabets and JSON keys/quoting), so the default OEM/ANSI code page reads them identically to UTF-8.

Test design

The 6 new unit tests in tests/test_server.py are well-shaped:

  • _assert_utf8_input_encoding checks both presence ([Console]::InputEncoding and UTF8 substrings) and ordering (encoding-set must precede ReadToEnd). Ordering matters: Set-Clipboard ...; [Console]::InputEncoding = UTF8 would silently regress, and the helper would catch that.
  • test_windows_write_passes_utf8_bytes_for_non_ascii asserts the literal byte sequence \xe2\x80\x94 (UTF-8 for U+2014 em dash) is in the piped stdin payload — directly tests the bug from #129's repro.
  • One preamble-assertion test per affected script (5 scripts → 5 tests) plus the byte-payload test = 6 total. Matches the PR body's claim.
  • Tests mock _run_with_stdin, so they're cross-platform — they run in the Linux CI matrix (test 3.11/3.12/3.13 all green), which is the right choice since the fix is structural rather than runtime-Windows-dependent.
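For readers who have not opened the test file, the presence-plus-ordering check can be sketched as follows. This is a hypothetical stand-in written from the description above, not the actual `_assert_utf8_input_encoding` helper in `tests/test_server.py`; the two sample scripts are illustrative.

```python
def assert_utf8_input_encoding(script: str) -> None:
    """Hypothetical stand-in for the helper described in the review."""
    assert "[Console]::InputEncoding" in script, "encoding directive missing"
    assert "UTF8" in script, "directive does not select UTF-8"
    read_at = script.find("ReadToEnd")
    assert read_at != -1, "script never reads stdin"
    # Ordering: the directive only helps if it runs before stdin is consumed.
    assert script.index("[Console]::InputEncoding") < read_at, "encoding set after stdin read"


good = "[Console]::InputEncoding = [System.Text.Encoding]::UTF8; Set-Clipboard ([Console]::In.ReadToEnd())"
bad = "Set-Clipboard ([Console]::In.ReadToEnd()); [Console]::InputEncoding = [System.Text.Encoding]::UTF8"

assert_utf8_input_encoding(good)
try:
    assert_utf8_input_encoding(bad)
except AssertionError as exc:
    print(f"regression caught: {exc}")
```

A presence-only check would accept the `bad` script; the ordering assertion is what rejects it.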

Verification on 5cc0b05

  • pytest: 605 passed, 19 deselected, 5 xfailed (matches PR body's claim of 599 prior + 6 new = 605). The 19 deselected are the integration tests in tests/test_integration.py and tests/test_integration_x11.py (require live X11/Wayland display); they execute in the integration-x11 CI job which is green on this PR.
  • ruff check src tests scripts: All checks passed.
  • ruff format --check src tests scripts: 11 files already formatted.
  • mypy src: Success, no issues.
  • scripts/sync-server-json.py --check: in sync at 2.5.1.
  • CI: 11 actual checks all green (lint, typecheck, test ×3, integration-x11, codecov/patch, version-sync, validate-server-json, on-push, qa-approved, qa-gate-pending).

Test-plan checkboxes

  • CI passes — all 11 checks green on 5cc0b05. Ticked.
  • Manual Windows verification (em dash, en dash, curly quotes, and a non-Latin character in Notepad) — relies on Dev's claim of QEMU verification (CHANGELOG: "Verified on a QEMU Windows guest where the bug originally reproduced"). I cannot independently verify from a Linux session. Leaving unticked; flagging that QA's signoff is conditional on Dev's QEMU verification standing.
  • Manual Windows verification with clipboard_copy_markdown — same situation. Note: clipboard_copy_markdown exercises the _windows_write_multi path, which the PR explicitly does NOT change because its stdin alphabet is base64-only. The text/html and text/rtf payloads inside the JSON envelope are base64-encoded UTF-8 bytes that PowerShell decodes via [System.Text.Encoding]::UTF8.GetString (l. 916, l. 924). So the markdown path was already correct end-to-end pre-PR and shouldn't regress.

Follow-up

I filed #132 for the read-direction encoding concern that the PR notes flagged (Get-Clipboard stdout decoding). Classification: nice-to-have — no current bug report, defensive symmetric fix, can land after #131. Issue includes a concrete repro recipe for the QEMU Windows guest. Does not block this PR.

CHANGELOG

Under [Unreleased] / Fixed — correct category. American-English spelling throughout. No em-dashes in the new content (confirmed via git diff main..fix/windows-utf8-encoding-129 -- CHANGELOG.md | grep '—' — no matches). Closes #129 referenced.

Summary

Clean, well-scoped fix with proportionate test coverage. Comments in the source code (_WINDOWS_UTF8_PREAMBLE definition + the explanatory comment in _windows_write_typed) earn their keep — they document a non-obvious Windows-platform encoding gotcha that future readers would benefit from. Single source of truth via the module-level constant prevents drift across the 5 scripts.

Applying Ready for QA Signoff.

@cmeans cmeans added the Ready for QA Signoff QA passed — ready for maintainer final review and merge label May 7, 2026
@github-actions github-actions Bot removed the QA Active QA is actively reviewing; Dev should not push changes label May 7, 2026
Owner

@cmeans cmeans left a comment


LGTM

@cmeans cmeans added QA Approved Manual QA testing completed and passed and removed Ready for QA Signoff QA passed — ready for maintainer final review and merge labels May 7, 2026
@cmeans-claude-dev cmeans-claude-dev Bot merged commit d4edc89 into main May 7, 2026
31 checks passed
@cmeans-claude-dev cmeans-claude-dev Bot deleted the fix/windows-utf8-encoding-129 branch May 7, 2026 17:23
cmeans-claude-dev Bot added a commit that referenced this pull request May 7, 2026
…three locations

QA Round 2 review surfaced that the project's "Windows is complete but untested" boilerplate is no longer accurate. v2.5.x exercised the Windows code paths end-to-end on a Windows 11 QEMU guest, which surfaced and resolved a real Windows-only encoding bug (#129, the UTF-8 stdin code-page mismatch fixed in #131). Finding and fixing a platform-specific bug via testing on the platform is the strongest possible evidence the platform is being tested; the stale claim contradicted that.

Locations updated:
- README.md `## Setup` → `> **Platform status:**` callout: now reads "Windows has been exercised end-to-end on a Windows 11 guest VM" and references #129 by issue link.
- README.md `## Limitations` → platform-coverage bullet: same correction, with X11 and macOS still honestly held to "complete with unit tests but unverified beyond that".
- CLAUDE.md Conventions section: matching update for the dev-facing platform-status line.

X11 and macOS coverage statements were left honest. Both still have the "complete with unit tests but unverified on live hardware" status; bug reports and PRs for those platforms remain welcome.

CHANGELOG `### Changed` entry added for the doc correction.

Local verification: 610 tests pass, ruff clean, mypy clean, no em/en dashes in added lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cmeans-claude-dev Bot added a commit that referenced this pull request May 7, 2026
…A Round 3 F2)

QA reviewer flagged that the source records (#129 issue body, #131 CHANGELOG, awareness QA notes) describe the Windows test platform as "QEMU Windows guest" without specifying version, while the platform-status doc updates pinned "Windows 11" specifically in three locations + CHANGELOG. The QEMU window title in the verification screenshots was indeed "windows-11", but generalizing the doc copy to match the rest of the repo's history is the safer call: it stays accurate if a future test session uses a different Windows guest version, and it doesn't subtly contradict the existing #129/#131 wording.

Locations:
- README.md `## Setup` `> **Platform status:**` callout: "Windows 11 guest VM" -> "QEMU Windows guest"
- README.md `## Limitations` platform-coverage bullet: "Windows 11 guest VM" -> "QEMU Windows guest"
- CLAUDE.md Conventions section: "Windows 11 QEMU guest" -> "QEMU Windows guest"
- CHANGELOG.md `### Changed` entry for the platform-status correction: "Windows 11 QEMU guest" -> "QEMU Windows guest"

No code change. Local verification: 610 tests pass, ruff clean, mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cmeans-claude-dev Bot added a commit that referenced this pull request May 7, 2026
…p for new users (#134)

Closes #130.

Three CLI flags on the `mcp-clipboard` binary so users can verify their install before any MCP-host wiring:

- `--version`: prints the installed version and exits 0.
- `--help`:    prints usage and points at the README setup section.
- `--check`:   runs platform detection (`_detect_backend()`); prints platform/backend/OK and exits 0; on failure exits 1 with stderr diagnostic, so the flag doubles as a CI smoke check.
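A rough sketch of how flags with these semantics could be wired up with `argparse`. Everything here is illustrative: `run_cli`, its `detect` parameter (standing in for `_detect_backend()`, whose signature is not shown in this thread), the placeholder version string, and the output format are all assumptions.

```python
import argparse
import sys
from typing import Callable, Sequence


def run_cli(argv: Sequence[str], detect: Callable[[], str], version: str = "0.0.0") -> int:
    """Hypothetical CLI entry point; returns the process exit code."""
    parser = argparse.ArgumentParser(prog="mcp-clipboard")
    # argparse prints the version and raises SystemExit(0) for this action.
    parser.add_argument("--version", action="version", version=f"%(prog)s {version}")
    parser.add_argument("--check", action="store_true", help="run platform detection and exit")
    args = parser.parse_args(argv)
    if args.check:
        try:
            backend = detect()
        except Exception as exc:
            # Failure path: diagnostic on stderr, non-zero exit for CI.
            print(f"check failed: {exc}", file=sys.stderr)
            return 1
        print(f"platform={sys.platform} backend={backend} OK")
        return 0
    return 0
```

Passing the detector in as a parameter keeps the sketch testable without a real clipboard backend; a CI job would only need to look at the exit code.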

README `## Setup` rewritten into a five-step quick-start that acknowledges mcp-clipboard is a regular PyPI Python package and that either of the two install runners advertised in this repo's badges (pipx and uv) works. Step 1 links to the official upstream pipx and uv install docs rather than reproducing per-platform commands. Steps 3, 4, and 5 show both `pipx run mcp-clipboard` and `uvx mcp-clipboard` forms. The Windows-specific tip about Claude Desktop's environment caching is preserved and made runner-agnostic.

Platform-status doc correction folded in across three locations (README ## Setup callout, README ## Limitations bullet, CLAUDE.md Conventions): the previous "Windows untested on real hardware" boilerplate is no longer accurate, since v2.5.x exercised the Windows code paths end-to-end on a QEMU Windows guest and surfaced #129 (the UTF-8 stdin encoding bug fixed in #131). X11 and macOS still honestly hold "complete with unit tests but unverified beyond that".

Test count: 605 -> 610 (5 new CLI tests covering --version, --help, --check success, --check failure, and --check dispatched through main()).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cmeans-claude-dev cmeans-claude-dev Bot mentioned this pull request May 7, 2026
5 tasks
cmeans-claude-dev Bot added a commit that referenced this pull request May 7, 2026
Aggregates the changes that landed on `main` since v2.5.1:

- fix(windows): UTF-8 stdin encoding on PowerShell write paths so non-ASCII characters (em dash, curly quotes, non-Latin scripts) are no longer corrupted by clipboard_copy / clipboard_copy_markdown. Reproduced and verified on a QEMU Windows guest. Closes #129 via #131.
- feat(cli): --version, --help, --check flags on the mcp-clipboard binary so users can verify their install before any MCP-host wiring. --check exits 1 with a stderr diagnostic on failure so it doubles as a CI smoke check. Closes #130 via #134.
- ci: github-release job in publish.yml auto-creates the GitHub Release on tag push, with notes pulled from the matching CHANGELOG section. v2.6.0 is the first firing of this workflow on a real release. Closes #126 via #127.
- docs: README ## Setup rewritten into a five-step quick-start that acknowledges both pipx and uv as install runners; platform status corrected to reflect that Windows has been exercised on a QEMU guest as of v2.5.x. Closes #130 via #134.

Bumped MINOR per semver: this release contains backwards-compatible Added entries (CLI flags + github-release job), matching v2.5.0's precedent for the same release shape.

Three-file release commit per repo convention: pyproject.toml + server.json (synced) + CHANGELOG.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cmeans pushed a commit that referenced this pull request May 9, 2026
F2 (observation): _windows_read comment block referenced #129 as the
PR that introduced _WINDOWS_UTF8_PREAMBLE; #129 was the issue, #131
was the PR that closed it. Corrected to "#131's input-side preamble
(which closed #129)" and added an explicit "Also closes #132"
acknowledgment so the source comment matches the PR-body issue
references.

F3 (observation): test_windows_read_sets_utf8_output_encoding
asserted the UTF-8 OutputEncoding preamble was present but didn't
enforce its position in the script. Added an ordering check that
the preamble appears before Get-Clipboard (text/plain) or
Clipboard::GetData (text/html, text/rtf, image/svg+xml). The
implementation guarantees this structurally via the
_WINDOWS_UTF8_OUTPUT_PREAMBLE + ... concatenation pattern, but the
ordering assertion is defense in depth against a future refactor
that keeps the preamble but moves it past the read call -- in which
case the encoding assignment would not take effect before stdout
already has bytes. Per #132's suggested-approach step 2, which
explicitly asked for the preamble to "precede any Get-Clipboard
invocation."

F1 (substantive) is addressed in the PR body, not source: PR now
declares Closes #132 alongside Closes #142.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cmeans added a commit that referenced this pull request May 9, 2026
…es (#144)

## Summary

Closes #142. Closes #132.

PowerShell's `[Console]::OutputEncoding` defaults to the parent's active
console code page (typically CP1252 on US-English Windows). When
`_windows_read` for text/plain, text/html, or text/rtf piped
`Get-Clipboard` output back to Python, non-ASCII codepoints were
transliterated (em dash → hyphen, curly → straight, ellipsis → period)
or `?`-substituted (CJK, Arabic, emoji). The clipboard bytes are stored
correctly in UTF-16; the loss only happens on the read-back trip.

The SVG branch already had `[Console]::OutputEncoding = UTF8` (added in
#138's read fix). Text/plain, text/html, and text/rtf did not. This PR
adds the preamble to all four branches via a shared
`_WINDOWS_UTF8_OUTPUT_PREAMBLE` constant, mirroring the existing
input-side `_WINDOWS_UTF8_PREAMBLE` introduced in #131 to address #129.

#132 was filed during #131's QA review specifically as the
read-direction follow-up tracker. Its suggested approach asked for: (1)
reproduce on QEMU, (2) add the symmetric `OutputEncoding = UTF8`
preamble to `_windows_read` and any sibling read scripts, (3) add unit
tests asserting the preamble is present and precedes any `Get-Clipboard`
invocation. This PR delivers all three.

Bug class is identical in shape to #131; this is the same fix on the
output leg.
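A sketch of the output-side counterpart, under stated assumptions: the constant name comes from this PR, but the full directive text (written here by analogy with the input-side one), the `; ` separator, and the bare `Get-Clipboard` body are assumptions. The second half shows the `?` substitution in plain Python; Python's codecs do not perform the best-fit transliteration (em dash to hyphen) that the description mentions, so that part is not reproduced.

```python
# Assumed directive text, mirroring the input-side preamble from #131.
_WINDOWS_UTF8_OUTPUT_PREAMBLE = "[Console]::OutputEncoding = [System.Text.Encoding]::UTF8; "

# Illustrative read script; the real _windows_read branches may differ.
script = _WINDOWS_UTF8_OUTPUT_PREAMBLE + "Get-Clipboard"
# The encoding must be set before anything is written to stdout.
assert script.index("OutputEncoding") < script.index("Get-Clipboard")

cjk = "\u65e5\u672c\u8a9e"
# A code page without these characters can only substitute them.
lossy = cjk.encode("cp1252", errors="replace")   # b'???'
# UTF-8 on both ends round-trips the same text unchanged.
assert cjk.encode("utf-8").decode("utf-8") == cjk
```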

## Diagnostic chain

The Windows e2e test suite captured this directly. Run-index
`048f8f0a-7cc4-4e50-9748-f17893abf471`
(`mcp-clipboard-windows-e2e-run-claude-code-2026-05-09T17:29:15Z`):

- **mc-103** direct `PowerShell::Clipboard::GetData('image/svg+xml')`
returned 116 bytes ending `73 76 67 3E` (`svg>`, no trailing CRLF) —
clipboard bytes are intact.
- **mc-301 / mc-302** SVG byte-perfect round-trips at 200, 500, and
1000-byte payloads via direct PowerShell read with UTF-8 OutputEncoding.
- **mc-026 / mc-027 / mc-028** PASSED with em dash, curly quotes,
ellipsis, CJK, and Arabic intact when the MCP server's parent process
happened to have a UTF-8-aware console codepage.
- **mc-002 / mc-003** FAILED with the *same* input bytes when the MCP
server's parent had CP1252. Same code path, different parent
environment, opposite outcomes — proof that the storage is correct and
only the read-side encoding varies.
- **mc-102** direct PowerShell with explicit `OutputEncoding = UTF8`
returned `68 65 6c 6c 6f` (`hello`) byte-perfect.

Conclusion: the clipboard write and storage are correct; the only
variable is whether the read script forces UTF-8 stdout. With the
explicit preamble, the parent codepage no longer matters.

## What's in

- `src/mcp_clipboard/clipboard.py` — adds
`_WINDOWS_UTF8_OUTPUT_PREAMBLE` constant, prepends it to all four
`_windows_read` PowerShell-backed branches.
- `tests/test_server.py` — parameterized regression test asserting every
branch (text/plain, text/html, text/rtf, image/svg+xml) emits the
preamble AND that the preamble precedes the `Get-Clipboard` /
`Clipboard::GetData` call (so a future refactor can't keep the preamble
but move it past the read).
- `CHANGELOG.md` — entry under `### Fixed` in `[Unreleased]`.

## Test plan
- [x] `uv run pytest` — full local suite (624 passed locally on this
branch).
- [ ] Re-run the Windows e2e suite on the QEMU guest under v2.6.2
(post-merge): mc-002 and mc-003 should PASS regardless of the parent
process's console codepage; mc-026 / mc-027 / mc-028 / mc-102 / mc-103 /
mc-301 / mc-302 should remain PASS.
- [ ] Confirm em dash + curly quotes + ellipsis round-trip in CD on
Windows (the original user-facing scenario from #129's lineage).

## Related
- Sibling to #131 (`_WINDOWS_UTF8_PREAMBLE` on stdin)
- Diagnostic complement to #138 (which fixed the SVG branch's stdout
encoding while leaving the text branches unchanged)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: cmeans-claude-dev[bot] <272174644+cmeans-claude-dev[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

QA Approved Manual QA testing completed and passed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Windows clipboard_copy corrupts non-ASCII characters (em dash and other UTF-8 multi-byte chars)

1 participant