fix(windows): force UTF-8 stdin encoding so PowerShell preserves non-ASCII chars #131
…ASCII chars (closes #129)

PowerShell's [Console]::In.ReadToEnd() decodes stdin via [Console]::InputEncoding, which defaults to the OEM/ANSI code page on Windows (commonly CP1252), not UTF-8. Python's content.encode() emits UTF-8 bytes, so multi-byte sequences (em dash U+2014, curly quotes, non-Latin scripts) were misread as separate CP1252 characters before Set-Clipboard ever ran, corrupting the clipboard payload.

Each affected PowerShell write script now sets [Console]::InputEncoding = [System.Text.Encoding]::UTF8 before reading stdin. Centralized as _WINDOWS_UTF8_PREAMBLE so the four typed-write branches (text/plain, text/html, text/rtf, image/svg+xml) and the plain text path share one source of truth.

The base64-on-stdin paths (_windows_write_multi, _windows_write_image) were already safe because base64 alphabets are ASCII-only and CP1252/UTF-8 agree on the ASCII range, so neither path was changed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
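The misread described in this commit message can be reproduced in plain Python, without PowerShell. This is a minimal sketch that assumes Python's `cp1252` codec is a fair stand-in for the Windows code page; it is not code from the repository.

```python
# The em dash from the #129 repro, encoded the way str.encode() does by default.
payload = "\u2014".encode("utf-8")
print(payload.hex(" "))  # e2 80 94

# Decoding those UTF-8 bytes as CP1252 yields one character per byte.
misread = payload.decode("cp1252")
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+00E2', 'U+20AC', 'U+201D']

# Decoding as UTF-8 recovers the single original character.
assert payload.decode("utf-8") == "\u2014"
```

One character goes in and three come out, which is the corruption the clipboard consumer ends up seeing.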
Codecov Report ✅ All modified and coverable lines are covered by tests.
cmeans
left a comment
QA review — PR #131 (round 1)
Verdict: Zero findings. Recommending Ready for QA Signoff.
Scope check vs #129
| Area | Status |
|---|---|
| Closes #129 | yes — fix targets the exact root cause described in the issue |
| Fix direction matches #129's "Fix direction" | yes — chose Option 1 (InputEncoding = UTF8 preamble), which is the smallest-viable change |
| Test direction matches #129's "Test plan" | yes — Windows-only-shaped unit test asserting the preamble is present and precedes the stdin read |
| All affected scripts covered | yes — see coverage matrix below |
Coverage matrix — every [Console]::In.ReadToEnd() site
I grepped src/mcp_clipboard/clipboard.py for every call site that pipes content through [Console]::In.ReadToEnd(). All seven sites were considered:
| Site | MIME / payload | Stdin alphabet | Action |
|---|---|---|---|
| `_windows_write` (l. 587) | text/plain | UTF-8 text | preamble added ✓ |
| `_windows_write_typed` (l. 767) | text/plain | UTF-8 text | preamble added ✓ |
| `_windows_write_typed` (l. 777) | text/html | UTF-8 text | preamble added ✓ |
| `_windows_write_typed` (l. 791) | text/rtf | UTF-8 text | preamble added ✓ |
| `_windows_write_typed` (l. 809) | image/svg+xml | UTF-8 text | preamble added ✓ |
| `_windows_write_multi` (l. 908) | base64 JSON envelope | ASCII-only | not needed (justification in PR body, verified) |
| `_windows_write_image` (l. 1000) | base64 image bytes | ASCII-only | not needed (justification in PR body, verified) |
The two skipped sites are correctly out of scope: their stdin payloads are pure ASCII (base64 alphabets and JSON keys/quoting), so the default OEM/ANSI code page reads them identically to UTF-8.
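The ASCII-only argument can be checked directly. This is a small illustrative sketch, not repository code; the sample text is arbitrary and Python's `cp1252` codec stands in for the Windows default code page.

```python
import base64

text = "\u2014 \u65e5"  # em dash and a CJK character
encoded = base64.b64encode(text.encode("utf-8"))

# Base64 output stays inside the ASCII range.
assert encoded.isascii()

# CP1252 and UTF-8 agree on ASCII, so both decodings give the same string.
as_cp1252 = encoded.decode("cp1252")
as_utf8 = encoded.decode("utf-8")
assert as_cp1252 == as_utf8

# Decoding the base64 afterwards recovers the original non-ASCII text.
roundtrip = base64.b64decode(as_cp1252).decode("utf-8")
assert roundtrip == text
```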
Test design
The 6 new unit tests in tests/test_server.py are well-shaped:
- `_assert_utf8_input_encoding` checks both presence (`[Console]::InputEncoding` and `UTF8` substrings) and ordering (encoding-set must precede `ReadToEnd`). Ordering matters: `Set-Clipboard ...; [Console]::InputEncoding = UTF8` would silently regress, and the helper would catch that.
- `test_windows_write_passes_utf8_bytes_for_non_ascii` asserts the literal byte sequence `\xe2\x80\x94` (UTF-8 for U+2014 em dash) is in the piped stdin payload, which directly tests the bug from #129's repro.
- One preamble-assertion test per affected script (5 scripts → 5 tests) plus the byte-payload test = 6 total. Matches the PR body's claim.
- Tests mock `_run_with_stdin`, so they're cross-platform: they run in the Linux CI matrix (test 3.11/3.12/3.13 all green), which is the right choice since the fix is structural rather than runtime-Windows-dependent.
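A presence-plus-ordering check of this kind might look roughly like the sketch below. The constant, the script strings, and the helper name are hypothetical stand-ins; the actual `_WINDOWS_UTF8_PREAMBLE` and `_assert_utf8_input_encoding` in the repository may differ.

```python
# Hypothetical stand-ins; the real constant and scripts live in clipboard.py.
UTF8_PREAMBLE = "[Console]::InputEncoding = [System.Text.Encoding]::UTF8; "
READ_AND_SET = "$t = [Console]::In.ReadToEnd(); Set-Clipboard -Value $t"


def assert_utf8_input_encoding(script: str) -> None:
    """Fail unless the encoding assignment is present and precedes the stdin read."""
    enc = script.find("[Console]::InputEncoding")
    read = script.find("ReadToEnd")
    assert enc != -1, "InputEncoding assignment missing"
    assert "UTF8" in script, "UTF8 not referenced"
    assert read != -1, "stdin read missing"
    assert enc < read, "encoding must be set before stdin is read"


good_script = UTF8_PREAMBLE + READ_AND_SET
regressed_script = READ_AND_SET + "; " + UTF8_PREAMBLE  # preamble kept but moved last

assert_utf8_input_encoding(good_script)
```

Calling the helper on `regressed_script` raises `AssertionError`, which is the regression a presence-only substring check would miss.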
Verification on 5cc0b05
- `pytest`: 605 passed, 19 deselected, 5 xfailed (matches PR body's claim of 599 prior + 6 new = 605). The 19 deselected are the integration tests in `tests/test_integration.py` and `tests/test_integration_x11.py` (require live X11/Wayland display); they execute in the `integration-x11` CI job which is green on this PR.
- `ruff check src tests scripts`: All checks passed.
- `ruff format --check src tests scripts`: 11 files already formatted.
- `mypy src`: Success, no issues.
- `scripts/sync-server-json.py --check`: in sync at 2.5.1.
- CI: 11 actual checks all green (lint, typecheck, test ×3, integration-x11, codecov/patch, version-sync, validate-server-json, on-push, qa-approved, qa-gate-pending).
Test-plan checkboxes
- [x] CI passes: all 11 checks green on `5cc0b05`. Ticked.
- [ ] Manual Windows verification (em dash, en dash, curly quotes, `日` in Notepad): relies on Dev's claim of QEMU verification (CHANGELOG: "Verified on a QEMU Windows guest where the bug originally reproduced"). I cannot independently verify from a Linux session. Leaving unticked; flagging that QA's signoff is conditional on Dev's QEMU verification standing.
- [ ] Manual Windows verification with `clipboard_copy_markdown`: same situation. Note: `clipboard_copy_markdown` exercises the `_windows_write_multi` path, which the PR explicitly does NOT change because its stdin alphabet is base64-only. The text/html and text/rtf payloads inside the JSON envelope are base64-encoded UTF-8 bytes that PowerShell decodes via `[System.Text.Encoding]::UTF8.GetString` (l. 916, l. 924). So the markdown path was already correct end-to-end pre-PR and shouldn't regress.
Follow-up
I filed #132 for the read-direction encoding concern that the PR notes flagged (Get-Clipboard stdout decoding). Classification: nice-to-have — no current bug report, defensive symmetric fix, can land after #131. Issue includes a concrete repro recipe for the QEMU Windows guest. Does not block this PR.
CHANGELOG
Under [Unreleased] / Fixed — correct category. American-English spelling throughout. No em-dashes in the new content (confirmed via git diff main..fix/windows-utf8-encoding-129 -- CHANGELOG.md | grep '—' — no matches). Closes #129 referenced.
Summary
Clean, well-scoped fix with proportionate test coverage. Comments in the source code (_WINDOWS_UTF8_PREAMBLE definition + the explanatory comment in _windows_write_typed) earn their keep — they document a non-obvious Windows-platform encoding gotcha that future readers would benefit from. Single source of truth via the module-level constant prevents drift across the 5 scripts.
Applying Ready for QA Signoff.
…three locations

QA Round 2 review surfaced that the project's "Windows is complete but untested" boilerplate is no longer accurate. v2.5.x exercised the Windows code paths end-to-end on a Windows 11 QEMU guest, which surfaced and resolved a real Windows-only encoding bug (#129, the UTF-8 stdin code-page mismatch fixed in #131). Finding and fixing a platform-specific bug via testing on the platform is the strongest possible evidence the platform is being tested; the stale claim contradicted that.

Locations updated:

- README.md `## Setup` → `> **Platform status:**` callout: now reads "Windows has been exercised end-to-end on a Windows 11 guest VM" and references #129 by issue link.
- README.md `## Limitations` → platform-coverage bullet: same correction, with X11 and macOS still honestly held to "complete with unit tests but unverified beyond that".
- CLAUDE.md Conventions section: matching update for the dev-facing platform-status line.

X11 and macOS coverage statements were left honest. Both still have the "complete with unit tests but unverified on live hardware" status; bug reports and PRs for those platforms remain welcome.

CHANGELOG `### Changed` entry added for the doc correction.

Local verification: 610 tests pass, ruff clean, mypy clean, no em/en dashes in added lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…A Round 3 F2)

QA reviewer flagged that the source records (#129 issue body, #131 CHANGELOG, awareness QA notes) describe the Windows test platform as "QEMU Windows guest" without specifying version, while the platform-status doc updates pinned "Windows 11" specifically in three locations + CHANGELOG. The QEMU window title in the verification screenshots was indeed "windows-11", but generalizing the doc copy to match the rest of the repo's history is the safer call: it stays accurate if a future test session uses a different Windows guest version, and it doesn't subtly contradict the existing #129/#131 wording.

Locations:

- README.md `## Setup` `> **Platform status:**` callout: "Windows 11 guest VM" -> "QEMU Windows guest"
- README.md `## Limitations` platform-coverage bullet: "Windows 11 guest VM" -> "QEMU Windows guest"
- CLAUDE.md Conventions section: "Windows 11 QEMU guest" -> "QEMU Windows guest"
- CHANGELOG.md `### Changed` entry for the platform-status correction: "Windows 11 QEMU guest" -> "QEMU Windows guest"

No code change. Local verification: 610 tests pass, ruff clean, mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p for new users (#134)

Closes #130.

Three CLI flags on the `mcp-clipboard` binary so users can verify their install before any MCP-host wiring:

- `--version`: prints the installed version and exits 0.
- `--help`: prints usage and points at the README setup section.
- `--check`: runs platform detection (`_detect_backend()`); prints platform/backend/OK and exits 0; on failure exits 1 with stderr diagnostic, so the flag doubles as a CI smoke check.

README `## Setup` rewritten into a five-step quick-start that acknowledges mcp-clipboard is a regular PyPI Python package and that either of the two install runners advertised in this repo's badges (pipx and uv) works. Step 1 links to the official upstream pipx and uv install docs rather than reproducing per-platform commands. Steps 3, 4, and 5 show both `pipx run mcp-clipboard` and `uvx mcp-clipboard` forms. The Windows-specific tip about Claude Desktop's environment caching is preserved and made runner-agnostic.

Platform-status doc correction folded in across three locations (README ## Setup callout, README ## Limitations bullet, CLAUDE.md Conventions): the previous "Windows untested on real hardware" boilerplate is no longer accurate, since v2.5.x exercised the Windows code paths end-to-end on a QEMU Windows guest and surfaced #129 (the UTF-8 stdin encoding bug fixed in #131). X11 and macOS still honestly hold "complete with unit tests but unverified beyond that".

Test count: 605 -> 610 (5 new CLI tests covering --version, --help, --check success, --check failure, and --check dispatched through main()).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aggregates the changes that landed on `main` since v2.5.1:

- fix(windows): UTF-8 stdin encoding on PowerShell write paths so non-ASCII characters (em dash, curly quotes, non-Latin scripts) are no longer corrupted by clipboard_copy / clipboard_copy_markdown. Reproduced and verified on a QEMU Windows guest. Closes #129 via #131.
- feat(cli): --version, --help, --check flags on the mcp-clipboard binary so users can verify their install before any MCP-host wiring. --check exits 1 with a stderr diagnostic on failure so it doubles as a CI smoke check. Closes #130 via #134.
- ci: github-release job in publish.yml auto-creates the GitHub Release on tag push, with notes pulled from the matching CHANGELOG section. v2.6.0 is the first firing of this workflow on a real release. Closes #126 via #127.
- docs: README ## Setup rewritten into a five-step quick-start that acknowledges both pipx and uv as install runners; platform status corrected to reflect that Windows has been exercised on a QEMU guest as of v2.5.x. Closes #130 via #134.

Bumped MINOR per semver: this release contains backwards-compatible Added entries (CLI flags + github-release job), matching v2.5.0's precedent for the same release shape.

Three-file release commit per repo convention: pyproject.toml + server.json (synced) + CHANGELOG.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
F2 (observation): _windows_read comment block referenced #129 as the PR that introduced _WINDOWS_UTF8_PREAMBLE; #129 was the issue, #131 was the PR that closed it. Corrected to "#131's input-side preamble (which closed #129)" and added an explicit "Also closes #132" acknowledgment so the source comment matches the PR-body issue references.

F3 (observation): test_windows_read_sets_utf8_output_encoding asserted the UTF-8 OutputEncoding preamble was present but didn't enforce its position in the script. Added an ordering check that the preamble appears before Get-Clipboard (text/plain) or Clipboard::GetData (text/html, text/rtf, image/svg+xml). The implementation guarantees this structurally via the _WINDOWS_UTF8_OUTPUT_PREAMBLE + ... concatenation pattern, but the ordering assertion is defense in depth against a future refactor that keeps the preamble but moves it past the read call -- in which case the encoding assignment would not take effect before stdout already has bytes. Per #132's suggested-approach step 2, which explicitly asked for the preamble to "precede any Get-Clipboard invocation."

F1 (substantive) is addressed in the PR body, not source: PR now declares Closes #132 alongside Closes #142.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…es (#144)

## Summary

Closes #142. Closes #132.

PowerShell's `[Console]::OutputEncoding` defaults to the parent's active console code page (typically CP1252 on US-English Windows). When `_windows_read` for text/plain, text/html, or text/rtf piped `Get-Clipboard` output back to Python, non-ASCII codepoints were transliterated (em dash → hyphen, curly → straight, ellipsis → period) or `?`-substituted (CJK, Arabic, emoji). The clipboard bytes are stored correctly in UTF-16; the loss only happens on the read-back trip.

The SVG branch already had `[Console]::OutputEncoding = UTF8` (added in #138's read fix). Text/plain, text/html, and text/rtf did not. This PR adds the preamble to all four branches via a shared `_WINDOWS_UTF8_OUTPUT_PREAMBLE` constant, mirroring the existing input-side `_WINDOWS_UTF8_PREAMBLE` introduced in #131 to address #129.

#132 was filed during #131's QA review specifically as the read-direction follow-up tracker. Its suggested approach asked for: (1) reproduce on QEMU, (2) add the symmetric `OutputEncoding = UTF8` preamble to `_windows_read` and any sibling read scripts, (3) add unit tests asserting the preamble is present and precedes any `Get-Clipboard` invocation. This PR delivers all three.

Bug class is identical in shape to #131; this is the same fix on the output leg.

## Diagnostic chain

The Windows e2e test suite captured this directly. Run-index `048f8f0a-7cc4-4e50-9748-f17893abf471` (`mcp-clipboard-windows-e2e-run-claude-code-2026-05-09T17:29:15Z`):

- **mc-103** direct `PowerShell::Clipboard::GetData('image/svg+xml')` returned 116 bytes ending `73 76 67 3E` (`svg>`, no trailing CRLF); clipboard bytes are intact.
- **mc-301 / mc-302** SVG byte-perfect round-trips at 200, 500, and 1000-byte payloads via direct PowerShell read with UTF-8 OutputEncoding.
- **mc-026 / mc-027 / mc-028** PASSED with em dash, curly quotes, ellipsis, CJK, and Arabic intact when the MCP server's parent process happened to have a UTF-8-aware console codepage.
- **mc-002 / mc-003** FAILED with the *same* input bytes when the MCP server's parent had CP1252. Same code path, different parent environment, opposite outcomes: proof that the storage is correct and only the read-side encoding varies.
- **mc-102** direct PowerShell with explicit `OutputEncoding = UTF8` returned `68 65 6c 6c 6f` (`hello`) byte-perfect.

Conclusion: the clipboard write and storage are correct; the only variable is whether the read script forces UTF-8 stdout. With the explicit preamble, the parent codepage no longer matters.

## What's in

- `src/mcp_clipboard/clipboard.py`: adds `_WINDOWS_UTF8_OUTPUT_PREAMBLE` constant, prepends it to all four `_windows_read` PowerShell-backed branches.
- `tests/test_server.py`: parameterized regression test asserting every branch (text/plain, text/html, text/rtf, image/svg+xml) emits the preamble AND that the preamble precedes the `Get-Clipboard` / `Clipboard::GetData` call (so a future refactor can't keep the preamble but move it past the read).
- `CHANGELOG.md`: entry under `### Fixed` in `[Unreleased]`.

## Test plan

- [x] `uv run pytest`: full local suite (624 passed locally on this branch).
- [ ] Re-run the Windows e2e suite on the QEMU guest under v2.6.2 (post-merge): mc-002 and mc-003 should PASS regardless of the parent process's console codepage; mc-026 / mc-027 / mc-028 / mc-102 / mc-103 / mc-301 / mc-302 should remain PASS.
- [ ] Confirm em dash + curly quotes + ellipsis round-trip in CD on Windows (the original user-facing scenario from #129's lineage).

## Related

- Sibling to #131 (`_WINDOWS_UTF8_PREAMBLE` on stdin)
- Diagnostic complement to #138 (which fixed the SVG branch's stdout encoding while leaving the text branches unchanged)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: cmeans-claude-dev[bot] <272174644+cmeans-claude-dev[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
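Two details from the diagnostic chain can be checked in plain Python. The hex dumps decode to the ASCII the PR body states, and a code page without CJK coverage cannot carry those characters. This sketch uses Python's `errors="replace"` as an approximation of the `?` substitution; it does not model .NET's best-fit transliteration (em dash to hyphen), which is a separate mechanism.

```python
# Hex dumps quoted in the diagnostic chain decode to the expected ASCII.
assert bytes.fromhex("68 65 6c 6c 6f") == b"hello"
assert bytes.fromhex("73 76 67 3E") == b"svg>"

# CP1252 has no CJK characters, so an encoder in replace mode emits '?'
# for each one, and the original text cannot be recovered afterwards.
lossy = "\u65e5\u672c".encode("cp1252", errors="replace")
assert lossy == b"??"

# UTF-8 on the same leg is lossless.
intact = "\u65e5\u672c".encode("utf-8").decode("utf-8")
assert intact == "\u65e5\u672c"
```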
Closes #129.
Summary
- Sets `[Console]::InputEncoding = [System.Text.Encoding]::UTF8` on every Windows PowerShell write script that pipes content through `[Console]::In.ReadToEnd()`. Centralized as a new module-level `_WINDOWS_UTF8_PREAMBLE` constant so the five affected scripts share one source of truth.
- Tests assert that every affected script contains the `InputEncoding = UTF8` directive, the directive must precede the stdin read, and the byte payload must be valid UTF-8 even for the canonical em dash repro from the bug report.

Root cause
PowerShell's `[Console]::In.ReadToEnd()` decodes stdin via `[Console]::InputEncoding`, which defaults to the OEM/ANSI code page on Windows (commonly CP1252), not UTF-8. `content.encode()` in Python emits UTF-8 bytes, so multi-byte sequences (em dash U+2014, curly quotes, non-Latin scripts) were misread as separate CP1252 characters before `Set-Clipboard` ever ran, corrupting the clipboard payload.

For an em dash (UTF-8 `0xE2 0x80 0x94`) read as CP1252, the bytes become the three-character string `â€”`, which downstream encoding conversions can warp further into the garbage observed in the original report (`[Cö C` for `— C`).

Fix scope
Five scripts touched, all in `src/mcp_clipboard/clipboard.py`:

| Function | MIME type | Clipboard call |
|---|---|---|
| `_windows_write` | text/plain | `Set-Clipboard ...` |
| `_windows_write_typed` | text/plain | `Set-Clipboard ...` |
| `_windows_write_typed` | text/html | `SetData(Html, ...)` |
| `_windows_write_typed` | text/rtf | `SetData(Rtf, ...)` |
| `_windows_write_typed` | image/svg+xml | `SetData('image/svg+xml', ...)` |

The text/html branch is particularly worth calling out: the CF_HTML byte offsets in the header (`StartHTML`, `StartFragment`, etc.) are computed against UTF-8 bytes in Python. If PowerShell decodes stdin as CP1252 and re-encodes the resulting string as UTF-16 inside `SetData`, the offsets are off and consumers see fragment markers pointing into garbage. Forcing UTF-8 at read time is what makes those offsets valid end-to-end.

Paths intentionally NOT changed

- `_windows_write_multi` (multi-format writes via `clipboard_copy_markdown`) already encodes payloads as base64 and decodes UTF-8 inside PowerShell. The JSON envelope is ASCII-only (keys `html`/`text`/`rtf`, values are base64), so the default CP1252 InputEncoding reads it identically to UTF-8.
- `_windows_write_image` pipes base64-encoded image data over stdin. Base64 is ASCII-only.

Both paths are already correct and the fix would be a no-op there. Left untouched to keep the diff focused on the bug.
Verification
- `ruff check`, `ruff format --check`, `mypy` clean.
- `scripts/sync-server-json.py --check` reports in sync at 2.5.1.

Why this wasn't caught sooner
The bug requires (a) Windows, (b) non-ASCII content, and (c) a downstream paste target that surfaces the corruption visibly. The unit-test mocks asserted the PowerShell invocation shape but not the encoding directive, and the `integration-x11` CI job runs on Linux. Live Windows testing was the only path to catching it; the QEMU Windows guest where the bug was reported is the first time mcp-clipboard has been exercised on real Windows with non-ASCII content.

Test plan
- [ ] Manual Windows verification (em dash, en dash, curly quotes, `日`); paste into Notepad and confirm round-trip fidelity.
- [ ] Manual Windows verification with `clipboard_copy_markdown` to confirm the text/html and text/rtf paths behave correctly with non-ASCII content.

Notes

- The read direction (`Get-Clipboard` stdout decoding in `_run`) is a separate potential issue with the same root cause shape but was not part of #129's repro and is left for a follow-up if it manifests.