glibc: add patch to use utf-8 when LANG is unset #61202
matthewbauer wants to merge 6 commits into NixOS:staging
|
For reference, this is what musl already does: https://wiki.musl-libc.org/functional-differences-from-glibc.html#Default-locale |
|
@matthewbauer I would love to see this. However, we should also take macOS into account. If we start to use UTF-8 by default on Linux, we should also do so on macOS, or we will accidentally break packages for macOS. I don't have access to macOS, but I roughly described in #54485 (comment) what needs to be done in order to get C.UTF-8 support.
Have you ever seen this problem appear on macOS? It's possible it happens but I can't recall it happening. Usually these hacks are glibc specific. |
peti left a comment
How will glibc provide that UTF-8 locale without referencing the glibcLocale files? My understanding is that those files are necessary at run-time, but glibc does not know their location.
|
Some possible things that may work:
|
I don't think it's necessary for C.UTF-8. It's definitely necessary for en_US.UTF-8, but that is due to the US localizations. UTF-8 and C should be baked into glibc.
|
@matthewbauer you don't want to set an invalid locale on macOS or things start to fail.
pkgs/stdenv/darwin/default.nix
Outdated
It would still be great if we could set LANG on macOS as well. The reason is that $LC_CTYPE has different semantics and priority compared to $LANG. LANG has a lower priority than all LC_* variables, so if an existing package already sets LC_ALL or similar to some value, that value would be preferred.
It doesn't seem to work correctly:
$ env -i LANG=UTF-8 locale
LANG="UTF-8"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
vs.
$ env -i LANG=en_US.UTF-8 locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
vs.
$ env -i LC_CTYPE=UTF-8 locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
This is the same thing described in the FreeBSD proposal:
Introduce C.UTF-8 locale, using the same common CTYPE map as other UTF-8 locales ... and having all other components use C locale
UTF-8 is only valid for LC_CTYPE, which makes sense. It looks like this is a BSDism, though, so documentation on it would be helpful. My thinking on this is that if a package really needs a non-UTF-8 encoding in 2019, it should request it explicitly by setting LC_CTYPE or LC_ALL. Also, just to show that LC_ALL does take precedence:
$ env -i LC_ALL=en_US.UTF-8 LC_CTYPE=UTF-8 locale
LANG=
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
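For anyone following along, the lookup order discussed in this thread (LC_ALL first, then the category variable such as LC_CTYPE, then LANG) can be sketched as a small shell function. `effective_locale` is a hypothetical helper written only for illustration; it models the precedence rule and the "C" fallback mentioned in this PR, and it does not model what happens when a value names a locale the system does not have.

```shell
#!/bin/sh
# effective_locale is a hypothetical helper, not part of glibc or any tool.
# It returns the value that would win for one locale category:
#   $1 = LC_ALL, $2 = the category variable (e.g. LC_CTYPE), $3 = LANG
effective_locale() {
    lc_all=$1
    category=$2
    lang=$3
    if [ -n "$lc_all" ]; then
        printf '%s\n' "$lc_all"      # LC_ALL overrides everything
    elif [ -n "$category" ]; then
        printf '%s\n' "$category"    # then the specific LC_* variable
    elif [ -n "$lang" ]; then
        printf '%s\n' "$lang"        # LANG has the lowest priority
    else
        printf 'C\n'                 # nothing set: fall back to "C"
    fi
}

effective_locale "en_US.UTF-8" "UTF-8" ""    # LC_ALL wins
effective_locale "" "UTF-8" "en_US.UTF-8"    # LC_CTYPE wins over LANG
effective_locale "" "" "en_US.UTF-8"         # LANG is used as the fallback
effective_locale "" "" ""                    # nothing set
```

This is why setting LANG is the least invasive choice: any package that already sets LC_ALL or LC_CTYPE keeps its own value.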
Yeah, I meant by having a C.UTF-8 locale generated on macOS instead of using LC_CTYPE=UTF-8.
The code of glibc does first look into the archive in its own installation prefix before falling back to |
|
I should verify that c65124a3ce0b59dcf71e4ad9c72f6616f67b7079 works correctly for all of the changed files. It's possible that somewhere the locale vars are set and they specifically want one of the glibcLocales included. This might make sense for certain kinds of tests, but also seems like kind of bad practice. |
|
I don't know, it seems scary to patch Glibc in this way. Does upstream have an opinion about this? |
@edolstra Upstream ticket and discussions:
It looks like it's part of the roadmap, just never been done? I can't find any distros that have done the switch for the default, though many provide C.UTF-8 in their glibc package (as we already do). It seems reasonable to me to enable UTF-8 by default, though. |
|
Why do these even use the huge full |
|
Yeah, unfortunately C.UTF-8 only works on Glibc/Musl. So we need to either use something that works on macOS like en_US.UTF-8 or else apply a patch like this.
This allows the default builder to use utf-8 encoding. Previously, when LANG and LC_ALL were left unset, glibc would default to “C”. This is not good in many cases. As a result of this, I think we can get rid of a lot of uses of LC_ALL in Nixpkgs, where utf-8 was needed. These can be found by grepping for: (?:LC_ALL|LANG) *= *["']?(?:en_US|C)\.UTF-8["']?
A couple of places set LANG=“en_US.UTF-8” to get unicode working in the Nix builder. This is very very bad! Not everyone who uses Nix lives in the U.S. and is an English speaker. Instead we should use “C.UTF-8”, which is a UTF-8 capable version of the unlocalized C locale. Please use this one! The previous commit makes Glibc fall back to this locale when LANG is unset. This is in line with the Musl behavior. As a result, the Nix builder is made UTF-8 capable without any LOCALE_ARCHIVE or LANG hacks.
Unfortunately, the default macOS locale does not have UTF-8 support. You need to set this for correct behavior.
977d329 to 0ebb6d2
These break due to default locale not being what gnulib expects.
2b3829b to 9b218d1
})
(fetchurl {
  url = "https://bugs.debian.org/cgi-bin/bugreport.cgi?att=1;bug=874160;filename=0001-Default-to-C.UTF-8-on-setlocale-.-if-no-env-vars-are.patch;msg=5";
  name = "0001-Default-to-C.UTF-8-on-setlocale-.-if-no-env-vars-are.patch";
So Debian switched to C.UTF-8 now?
No, this is just a proposed patch.
|
@matthewbauer This needs a rebase :) |
|
I have a macOS VM now and could actually test the patch I proposed here: #54485 (comment)
|
Alacritty also switched to |
|
Instead of patching |
|
Closing for now. glibc probably shouldn't be patched like this; we should probably just set LANG=C.UTF-8 in stdenv.
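A minimal sketch of what that alternative could look like in a builder's shell setup, assuming we only want to supply a default and never override a LANG the caller already chose. `lang_or_default` is a hypothetical name, and this is not the actual stdenv code.

```shell
#!/bin/sh
# lang_or_default is a hypothetical helper, not taken from stdenv.
# It prints its argument if non-empty, and C.UTF-8 otherwise.
lang_or_default() {
    printf '%s\n' "${1:-C.UTF-8}"
}

# Keep a caller-provided LANG; otherwise fall back to C.UTF-8.
LANG=$(lang_or_default "${LANG:-}")
export LANG
```

Because LANG has the lowest priority of the locale variables, a package that sets LC_ALL or LC_CTYPE itself would still get the value it asked for.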
This allows the default builder to use utf-8 encoding. Previously,
when LANG and LC_ALL were left unset, glibc would default to “C”. This
is not good in many cases.
As a result of this, I think we can get rid of a lot of uses of LC_ALL
in Nixpkgs, where utf-8 was needed. These can be found by grepping
for:
(?:LC_ALL|LANG) *= *["']?(?:en_US|C)\.UTF-8["']?
A lot of times the ones using en_US.UTF-8 also add glibcLocales.
/cc @vaibhavsagar
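The pattern in the description uses non-capturing groups, which plain `grep -E` does not support. Below is a hedged sketch of an equivalent search using ordinary groups. `find_locale_overrides` is a hypothetical helper name, and the sample file contents are invented purely to show which lines match.

```shell
#!/bin/sh
# ERE version of the pattern from the description: "(?:...)" becomes "(...)".
# The '\'' sequences embed a literal single quote inside the bracket expressions.
pattern='(LC_ALL|LANG) *= *["'\'']?(en_US|C)\.UTF-8["'\'']?'

# find_locale_overrides is a hypothetical helper: it lists matching lines
# (with line numbers) in the given file.
find_locale_overrides() {
    grep -nE "$pattern" "$1"
}

# Invented sample input, only for demonstration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
LC_ALL = "en_US.UTF-8";
LANG=C.UTF-8
export LANG='en_US.UTF-8'
LC_ALL=C
LANG = "de_DE.UTF-8";
EOF

find_locale_overrides "$sample"          # lists the first three lines
matches=$(grep -cE "$pattern" "$sample") # number of matching lines
rm -f "$sample"
```

Against a checkout, the same pattern could be run recursively, for example `grep -rnE "$pattern" pkgs/`, assuming the command is run from the top of the Nixpkgs tree.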