<format>: Incorrect handling of UTF-8 encoded format strings #1820

Closed
vitaut opened this issue Apr 10, 2021 · 6 comments
Labels
bug Something isn't working fixed Something works now, yay! format C++20/23 format

Comments

@vitaut
Contributor

vitaut commented Apr 10, 2021

Describe the bug

std::format throws on formatting a valid UTF-8 string when the literal (execution) encoding is UTF-8 and the locale encoding is Shift-JIS (and possibly other cases). This is wrong because format strings are almost always literals and therefore the locale encoding shouldn't affect parsing.

Command-line test case

C:\Temp>type test.cc
#include <format>
#include <iostream>

int main() {
  setlocale(LC_ALL, ".932");
  try {
    auto s = std::format("\xe2\x82\xa0{}", 42);
    std::cout << (s == "\xe2\x82\xa0" "42") << "\n";
  } catch (const std::exception& e) {
    std::cout << e.what() << "\n";
  }
}

C:\Temp>cl /I STL/stl/inc /std:c++latest /utf-8 test.cc
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.29917 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

/std:c++latest is provided as a preview of language features from the latest C++
working draft, and we're eager to hear about bugs and suggestions for improvements.
However, note that these features are provided as-is without support, and subject
to changes or removal as the working draft evolves. See
https://go.microsoft.com/fwlink/?linkid=2045807 for details.

test.cc
test.cc(6): warning C4530: C++ exception handler used, but unwind semantics are not enabled. Specify /EHsc
Microsoft (R) Incremental Linker Version 14.29.29917.0
Copyright (C) Microsoft Corporation.  All rights reserved.

/out:test.exe
test.obj

C:\Temp>test
Invalid encoded character in format string.

Expected behavior
The test program should print 1.

STL version
https://github.com/microsoft/STL/commit/4d7d4f1

@barcharcraz
Member

There is no way to detect the execution charset at compile time, so we assume the active code page and the execution charset are the same. This is obviously broken, but it's what we do throughout the STL :(

@vitaut
Contributor Author

vitaut commented Apr 10, 2021

You can easily detect if the literal encoding is UTF-8 at compile time as follows:

constexpr bool is_utf8() {
  const unsigned char micro[] = "\u00B5";
  return sizeof(micro) == 3 && micro[0] == 0xC2 && micro[1] == 0xB5;
}

This has been suggested for P0355 and elsewhere. Shift-JIS can be detected similarly and, in fact, you might only need to detect Shift-JIS because it's the one that collides with UTF-8 and requires special handling at parsing. And of course you don't need to detect single-byte encodings, which are the vast majority.
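To illustrate why Shift-JIS "requires special handling at parsing", here is a minimal sketch (not the STL's implementation). It assumes the code page 932 layout: lead bytes in 0x81-0x9F and 0xE0-0xFC, with trail bytes that may fall in 0x40-0x7E, a range that includes `{` (0x7B) and `}` (0x7D). In UTF-8, every byte of a multibyte sequence is 0x80 or above, so a plain byte search for `{` is safe. The helper names `is_cp932_lead_byte`, `find_brace_sjis`, and `find_brace_utf8` are hypothetical.

```cpp
#include <cstddef>
#include <string_view>

// Assumed lead-byte ranges for code page 932 (Shift-JIS as used on Windows).
constexpr bool is_cp932_lead_byte(unsigned char c) {
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

// Hypothetical helper: index of the first '{' that is a character of its own
// when s is interpreted as Shift-JIS. A 0x7B byte that is the trail byte of a
// double-byte character is skipped.
constexpr std::size_t find_brace_sjis(std::string_view s) {
  for (std::size_t i = 0; i < s.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(s[i]);
    if (is_cp932_lead_byte(c)) {
      ++i;  // step onto the trail byte; the loop increment then moves past it
      continue;
    }
    if (c == '{') return i;
  }
  return std::string_view::npos;
}

// Hypothetical helper: in UTF-8 no byte below 0x80 occurs inside a multibyte
// sequence, so a plain byte search is enough.
constexpr std::size_t find_brace_utf8(std::string_view s) {
  return s.find('{');
}

// Bytes 0x83 0x7B form one double-byte character under the ranges above.
static_assert(find_brace_sjis("\x83\x7B{}") == 2);
// A byte-wise search on the same bytes stops at the trail byte instead.
static_assert(find_brace_utf8("\x83\x7B{}") == 1);
```

The sketch needs C++17 or later for the `constexpr` `std::string_view` operations. The two `static_assert`s show that the same byte sequence yields different brace positions depending on which encoding the parser assumes, which is why the parser has to know the encoding of the format string rather than guess it from the locale.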

@pdimov

pdimov commented Apr 10, 2021

Yeah, I was just going to paste https://godbolt.org/z/YhnnrdT6e.

@barcharcraz
Member

We also need to detect Big5 and some of the ISO-2022 encodings. But if that detection works, we can special-case it and not be broken for UTF-8.

@barcharcraz
Member

Well, we don't need to detect them, but we want to work for them.
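The special-casing discussed here could look roughly like the sketch below. It is only an illustration under stated assumptions, not the change that was eventually merged. It reuses the `is_utf8()` check from earlier in the thread; `utf8_length` and `char_length` are hypothetical names. When the literal encoding is UTF-8, the character length comes from the lead byte alone. Otherwise the sketch falls back to the locale-dependent `std::mblen`, which keeps the existing behavior for Big5, Shift-JIS, and other locale encodings.

```cpp
#include <cstddef>
#include <cstdlib>

// Compile-time check from earlier in the thread: true when the literal
// (execution) encoding is UTF-8.
constexpr bool is_utf8() {
  const unsigned char micro[] = "\u00B5";
  return sizeof(micro) == 3 && micro[0] == 0xC2 && micro[1] == 0xB5;
}

// Hypothetical helper: length in bytes of a UTF-8 sequence, judged from its
// lead byte only (continuation bytes are not validated). Returns -1 for a
// byte that cannot start a sequence.
constexpr int utf8_length(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return -1;
}

// Hypothetical helper: length in bytes of the character starting at s, where
// n is the number of bytes available (n > 0). Returns -1 on error.
inline int char_length(const char* s, std::size_t n) {
  if constexpr (is_utf8()) {
    // Literal encoding is UTF-8: ignore the locale entirely.
    const int len = utf8_length(static_cast<unsigned char>(*s));
    return (len > 0 && static_cast<std::size_t>(len) <= n) ? len : -1;
  } else {
    // Assumed fallback: let the current C locale decide.
    return std::mblen(s, n);
  }
}
```

With this shape, a program built with `/utf-8` would take the first branch no matter what `setlocale` selects at run time, so the test case from the issue would no longer depend on code page 932.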

@StephanTLavavej
Member

@statementreply Closing this as fixed now that your #1824 (and @CaseyCarter's #1834) are merged; it wasn't auto-closed because the Word Of Power "Fixes #NNN" works for the default branch only (as documented in GitHub docs, but which I wasn't aware of until now).
