Feature request: better support for UTF-8 localized number formatting #1861
Comments
On second thought, a different format specifier to indicate preference of |
I can certainly relate to your pain. Proper localization without proper UTF support is still a mirage for the most part. Sooner or later you will get bitten by reality and by implicit assumptions such as a single character equating to a single code unit. I noticed this with the de_CH locale just recently while trying to serve our Swiss customers better. Experiences like these are my main motivation to refuse any meaningful string handling using |
Just using wchar_t for the notion of a thousands separator is still wrong. The type to represent a UTF character is string, as it can span arbitrarily many code points.
Yeah exactly, the real problem is the Standard library facets that return char_type for certain fields rather than string_type, namely std::numpunct and std::moneypunct. That part of the library hasn't aged nearly as well. Perhaps what is needed is for someone to write a proposal to modernize those while maintaining backwards compatibility. I think it should be possible. Getting it approved, well, that's a different story.
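The limitation described in this comment can be checked directly against the facet accessors. The following is a small illustrative sketch, not code from the thread; the alias names and the helper `right_quote_utf8_size` are made up for the example. It shows that the separator accessors yield a single `char`, while U+2019 needs three UTF-8 code units.

```cpp
#include <cstddef>
#include <locale>
#include <string>
#include <type_traits>
#include <utility>

// Deduce the accessor return types without constructing a facet.
// (Alias names are hypothetical, chosen for this example only.)
using num_sep_type =
    decltype(std::declval<const std::numpunct<char>&>().thousands_sep());
using num_point_type =
    decltype(std::declval<const std::numpunct<char>&>().decimal_point());
using money_sep_type =
    decltype(std::declval<const std::moneypunct<char>&>().thousands_sep());
using grouping_type =
    decltype(std::declval<const std::numpunct<char>&>().grouping());

// The separators are a single char_type value, so one code unit at most.
static_assert(std::is_same<num_sep_type, char>::value,
              "numpunct<char>::thousands_sep() returns char");
static_assert(std::is_same<num_point_type, char>::value,
              "numpunct<char>::decimal_point() returns char");
static_assert(std::is_same<money_sep_type, char>::value,
              "moneypunct<char>::thousands_sep() returns char");
// By contrast, grouping() already returns a string.
static_assert(std::is_same<grouping_type, std::string>::value,
              "numpunct<char>::grouping() returns std::string");

// U+2019 (RIGHT SINGLE QUOTATION MARK) as UTF-8 is the bytes E2 80 99,
// which cannot fit in the single char the accessors above return.
inline std::size_t right_quote_utf8_size() {
  return std::string("\xE2\x80\x99").size();
}
```

A proposal along the lines suggested here would presumably need string-returning counterparts of these accessors; what those would look like is not specified in the thread.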
Definitely the case today, but would be nice to get this fixed for the future.
It is. As Jonathan correctly pointed out It might be possible to replace |
{fmt} now supports the UTF-8

```cpp
#include <fmt/format.h>
#include <locale>

int main() {
  std::locale::global(std::locale({}, new fmt::format_facet<std::locale>("’")));
  fmt::print("{:L}\n", 1000);
}
```

prints:
Here |
Hi! Here's the problem. Given that `char8_t` is kind of broken currently and still lacks a good standardized transcoding library, we're treating `char`-based strings as if they were UTF-8. We want to format localized numbers with these strings. The problem is that `std::numpunct<char>::decimal_point()` and `thousands_sep()` only return a `char`, so it's not possible for these to represent UTF-8 characters beyond the ASCII subset. Some locales use non-ASCII characters for these; an example is de-CH, which uses U+2019 for the digit separator. What we want to do is somehow transcode the values from `std::numpunct<wchar_t>` to UTF-8 and have libfmt use these instead. Using a custom formatter specialization and a wrapper type isn't really an option for us, since we want this to be less intrusive and not something users of the library need to be concerned about. Fortunately, we do have a facade class for formatting localized strings, and this class owns the `std::locale` object, so it's possible for us to do some pre-processing or post-processing on the input and output arguments respectively. However, pre-processing doesn't work that well for `vformat`, since there's no way to convert `format_args` to `wformat_args` (and there probably shouldn't be, since that would end up increasing code bloat by creating a link dependency between `char` and `wchar_t` formatters due to the type erasure that is involved). So we're kind of stuck with post-processing.

Ideally, if libfmt (and perhaps eventually the standardized version) had an option to use the `std::num_put<char>` facet instead of `std::numpunct<char>`, we could solve the problem that way by providing a custom `std::num_put<char>` facet that outputs the correct UTF-8 sequence. libfmt would then need to recognize a different localized number format specifier, perhaps uppercase `N` instead of `L`, to indicate that it should use `std::num_put` instead of `std::numpunct`.

Another option would be to eventually support `char8_t`, `char16_t` and `char32_t` string formatting, and automatically transcode the values from `std::numpunct<wchar_t>` to the target character encoding.

It also looks like it would be possible to specialize the internal `detail::int_writer` or `detail::arg_formatter` template classes for each of the integral and floating point types to override the default formatting behavior, but this isn't a solution that would be portable to other implementations of `std::format`. So not really a valid solution for us.

Our current workaround that should work with the standardized `std::format` is to detect when the decimal point or thousands separator are non-ASCII characters and override the `std::locale` object with a custom `std::numpunct<char>` facet. This facet uses ASCII control characters `\x01` and `\x02` for the decimal point and digit separator, since these aren't found in strings in any of our use cases. We then do a post-processing pass on the formatted string to replace `\x01` or `\x02` with the correct UTF-8 octet sequence. It's definitely a hack, but it works.

It looks something like this:
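A minimal sketch of such a placeholder workaround, which is not the original poster's code, might look like the following. It assumes `std::ostringstream` as a stand-in for the formatting library so that the example has no external dependency; the names `placeholder_numpunct`, `expand_placeholders` and `format_localized` are hypothetical, and the grouping of three digits is hard-coded rather than taken from a real locale.

```cpp
#include <iomanip>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: reports control characters as stand-ins for the real
// (possibly multi-byte) decimal point and digit separator.
class placeholder_numpunct : public std::numpunct<char> {
 protected:
  char do_decimal_point() const override { return '\x01'; }
  char do_thousands_sep() const override { return '\x02'; }
  // Assumption: groups of three digits, as most locales use.
  std::string do_grouping() const override { return "\3"; }
};

// Post-processing pass: replace each placeholder byte with the UTF-8
// sequence it stands for. Assumes \x01 and \x02 never occur otherwise.
std::string expand_placeholders(const std::string& text,
                                const std::string& decimal_point,
                                const std::string& thousands_sep) {
  std::string out;
  out.reserve(text.size());
  for (char c : text) {
    if (c == '\x01') {
      out += decimal_point;
    } else if (c == '\x02') {
      out += thousands_sep;
    } else {
      out += c;
    }
  }
  return out;
}

// Format a number through the placeholder facet, then expand the result.
// Floating-point values are written in fixed notation with two decimals.
template <typename T>
std::string format_localized(T value, const std::string& decimal_point,
                             const std::string& thousands_sep) {
  std::ostringstream os;
  // The locale takes ownership of the facet and deletes it when done.
  os.imbue(std::locale(std::locale::classic(), new placeholder_numpunct));
  os << std::fixed << std::setprecision(2) << value;
  return expand_placeholders(os.str(), decimal_point, thousands_sep);
}
```

With this sketch, `format_localized(1234567, ".", "\xE2\x80\x99")` produces `1’234’567`. The same facet could in principle be installed in the `std::locale` passed to a `{:L}` format call, with the expansion applied to the returned string.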
What are your thoughts? Any better ideas? Is there any good way out of this quagmire?
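For the transcoding step mentioned above (turning a `std::numpunct<wchar_t>` value into UTF-8), a hand-rolled encoder is one possibility. This is only a sketch under stated assumptions: `to_utf8` and `separator_to_utf8` are hypothetical helpers, the input is assumed to be a valid Unicode scalar value, and the `wchar_t` value is assumed to hold a complete code point (true for U+2019, but not for characters outside the BMP on platforms where `wchar_t` is 16 bits).

```cpp
#include <string>

// Encode one Unicode scalar value as UTF-8.
// Assumes valid input: no surrogates, nothing above U+10FFFF.
std::string to_utf8(char32_t cp) {
  std::string out;
  if (cp < 0x80) {  // 1 byte: plain ASCII
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {  // 2 bytes
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {  // 3 bytes
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {  // 4 bytes
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

// Convert a separator as returned by std::numpunct<wchar_t> to UTF-8.
// Assumption: the wchar_t holds a whole code point, not half a surrogate pair.
std::string separator_to_utf8(wchar_t sep) {
  return to_utf8(static_cast<char32_t>(sep));
}
```

The resulting string could then serve as the replacement text in a post-processing pass, or as the separator handed to a formatting library that accepts string separators.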