Tracking Issue for aligning Unicode version to 15.0 (year 2022) #101840
Comments
I happened to notice this in the UCD spec: https://www.unicode.org/reports/tr44/#Missing_Conventions. It seems that as of 15.0.0 we have to start parsing some of the comments to look for
This is pretty annoying and will likely lead to some (probably minor) bugs. I'm certain I've written code that would be buggy under this. (Hell, I'm not sure I've ever seen code that handled these, and I've read a good number of UCD wrangler scripts -- usually it's just to say "assign the obvious default value to anything that doesn't get specified", as in the

It looks like the only files in UCD that take advantage of this (e.g. use it several times to define a full range as having the obvious default case) are

I suspect the argument is that folks were always doing this wrong if they ignored it though? Oof.

CC @Manishearth (since I suspect you know more details here, and even if not, I believe
This seems like another reason for us to potentially switch over to ICU4X (maintained by Unicode), potentially with unicode-rs (or somewhere) maintaining crates of docs.rs/databake-format data output. Unfortunately it's unlikely

I don't have time to fix up unicode-width but would be happy to accept a patch.
I think rustc doing that is probably very good. It's hard for me to imagine anybody fighting you on it. For the

For the ecosystem at large, I dunno. I suspect it will be a tough sell, since compile time, binary size, and heavyweight dependency trees are already among the most common complaints.
Yeah, for the core Unicode stuff it'd probably be best done as table generation. Worth noting though: you could still use the efficient code point trie impl that ICU4X uses for this, and not take on too many dependencies.

Note that ICU4X's model also involves running a data generation tool, so it might not be an improvement over ucd-parse for us. Eventually, ICU4X will have a copy of various slices of data on a CDN, and then it would be an improvement over ucd-parse. The only difference right now would be that ICU4X is much more likely to update quickly to the latest Unicode stuff than ucd_parse, but that may not be a huge deal since ucd_parse doesn't need much work.

I wasn't really thinking about the ecosystem at large. I'm fine continuing to maintain unicode-rs, I just don't have that much time to devote to it. Overall, rustc and other users tend to make PRs when they need them, and that works out fine.

FWIW, ICU4X is very much optimized for binary size. Perhaps not compile times, but it does tend to produce small binaries (and it's shipping on embedded devices already).

I'm not really pushing on us moving to ICU4X yet (I think @crlf0710 is investigating a bit though!). I think the status quo is manageable, but it's definitely an option on the table!
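For readers unfamiliar with what "table generation" means here, a minimal sketch follows. It assumes the simplest possible layout, a sorted list of inclusive code point ranges searched with binary search. This is not the encoding rustc or ICU4X actually use (ICU4X uses a code point trie), and the `WIDE` table and the `in_ranges` helper are hypothetical names with illustrative entries only.

```rust
use std::cmp::Ordering;

/// Hypothetical generated table: sorted, non-overlapping, inclusive ranges.
/// The entries are illustrative, not a complete property table.
const WIDE: &[(u32, u32)] = &[(0x1100, 0x115F), (0x3400, 0x4DBF), (0x4E00, 0x9FFF)];

/// Returns true if `c` falls inside one of the ranges in `table`.
/// `table` must be sorted by start and must not contain overlapping ranges.
fn in_ranges(table: &[(u32, u32)], c: char) -> bool {
    let cp = c as u32;
    table
        .binary_search_by(|&(start, end)| {
            if end < cp {
                // The whole range lies below the code point.
                Ordering::Less
            } else if start > cp {
                // The whole range lies above the code point.
                Ordering::Greater
            } else {
                // The code point is inside this range.
                Ordering::Equal
            }
        })
        .is_ok()
}
```

A generator script would emit something like `WIDE` from the UCD files once per Unicode release, and the lookup code would stay unchanged between releases.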
I don't think I have any strong opinions. It seems like

Also worth pointing out that the whole
Oh yeah, to be clear, no shade on ucd-parse and ucd-generate! They're great tools and I don't necessarily mean to say ICU4X is much better or anything. Rustc already needs ICU4X for list formatting, and @crlf0710 was investigating using it for more things like the XID and security properties (which rely on those manual Python scripts).
@thomcc I can't make heads or tails of the
So it seems to me like parsers of UCD that ignore The Unicode docs do a really terrible job of even explaining what
Most of the rest of that section is documenting the format. The key thing that's missing from the docs is what
I have some guesses as to what the answer is but I don't want to say something incorrect, so I've just asked these questions to the Unicode people.
I'd like to hear back from @Manishearth's contacts in case I'm mistaken (or missing some detail), but I'm confident I know what they're for, and have thoughts on them in the context of a low level library like

These exist because several of the UCD files do already require special handling for missing values, but it's not usable in a programmatic manner and needs hard-coding. I believe the goal is that for properties that have

Here are most of the interesting cases I've found for this (some of which are more useful than others), which should give you a good idea of why it exists and why
There are some annoying things about this:
That said, overall it's probably a good thing, but the release notes for this definitely ignore many of the reasons why this might be a bit of a pain in the ass to handle.
TLDR: they make the files more maintainable and easier to read comments for
Some more background from Asmus, including some explicit steps on how to handle them. They brought up an interesting point that the order of parsing them matters a lot.
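To make the ordering point concrete, here is a minimal Rust sketch of one way to apply `@missing` lines. It rests on assumptions that should be checked against UAX #44: that a later `@missing` line overrides an earlier one for the range it covers, that an explicit data line overrides any `@missing` default, and that the file describes a single property (so each line is `range; value`). The names `RangeValue`, `Parsed`, `parse_ucd`, and `lookup` are hypothetical, and `SAMPLE` is illustrative input in the style of a UCD file, not a verbatim excerpt.

```rust
/// One `range; value` assignment over an inclusive code point range.
#[derive(Debug, Clone, PartialEq)]
struct RangeValue {
    start: u32,
    end: u32,
    value: String,
}

/// Defaults from `@missing` lines and values from data lines, in file order.
#[derive(Debug, Default)]
struct Parsed {
    defaults: Vec<RangeValue>,
    explicit: Vec<RangeValue>,
}

/// Illustrative input only (hypothetical, not copied from the UCD).
const SAMPLE: &str = "\
# @missing: 0000..10FFFF; N
# @missing: 3400..4DBF; W
0020;Na # SPACE
4E00..9FFF;W # CJK
";

/// Parses `XXXX` or `XXXX..YYYY` (hexadecimal) into an inclusive range.
fn parse_range(s: &str) -> Option<(u32, u32)> {
    let s = s.trim();
    match s.split_once("..") {
        Some((a, b)) => {
            let start = u32::from_str_radix(a.trim(), 16).ok()?;
            let end = u32::from_str_radix(b.trim(), 16).ok()?;
            Some((start, end))
        }
        None => {
            let cp = u32::from_str_radix(s, 16).ok()?;
            Some((cp, cp))
        }
    }
}

/// Parses the `range; value` part shared by data lines and `@missing` lines.
fn parse_fields(s: &str) -> Option<RangeValue> {
    let (range, rest) = s.split_once(';')?;
    let (start, end) = parse_range(range)?;
    // Assumption: the first field after the range is the property value.
    let value = rest.split(';').next()?.trim();
    if value.is_empty() {
        return None;
    }
    Some(RangeValue { start, end, value: value.to_string() })
}

/// Collects `@missing` defaults and explicit data lines separately.
fn parse_ucd(text: &str) -> Parsed {
    let mut parsed = Parsed::default();
    for line in text.lines() {
        let line = line.trim();
        if let Some(rest) = line.strip_prefix("# @missing:") {
            if let Some(rv) = parse_fields(rest) {
                parsed.defaults.push(rv);
            }
        } else {
            // Drop any trailing comment; ordinary comment lines become empty.
            let data = line.split('#').next().unwrap_or("");
            if let Some(rv) = parse_fields(data) {
                parsed.explicit.push(rv);
            }
        }
    }
    parsed
}

/// Explicit data wins; otherwise the last matching `@missing` line wins.
fn lookup(parsed: &Parsed, cp: u32) -> Option<&str> {
    parsed
        .explicit
        .iter()
        .find(|rv| rv.start <= cp && cp <= rv.end)
        .or_else(|| {
            parsed
                .defaults
                .iter()
                .rev()
                .find(|rv| rv.start <= cp && cp <= rv.end)
        })
        .map(|rv| rv.value.as_str())
}
```

With `SAMPLE`, `lookup(&parse_ucd(SAMPLE), 0x3400)` falls through to the second `@missing` line, while a parser that only honoured the first `@missing` line would report the file-wide default instead. The linear scans keep the sketch short; a real tool would flatten the result into a table once.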
…earth Update lexer emoji diagnostics to Unicode 15.0 This replaces the `unic-emoji-char` dep tree (which hasn't been updated for a while) with `unicode-properties` crate which contains Unicode 15.0 data. Improves diagnostics for added emoji characters in recent years. (See tests). cc rust-lang#101840 cc ``@Manishearth``
Unicode is released on a yearly basis, so in the Rust project we update the data files we use after each Unicode release. (Keep in mind that new dependencies might be added over time.)
About tracking issues
Tracking issues are used to record the overall progress of implementation.
They are also used as hubs connecting to other relevant issues, e.g., bugs or open design questions.
A tracking issue is however not meant for large scale discussion, questions, or bug reports about a feature.
Instead, open a dedicated issue for the specific matter and add the relevant feature gate label.
Steps
Goal: Unicode 15.0 (year 2022)
Unicode version dependent crates:
Libraries
- `libcore`: Bump Unicode to version 15.0.0, regenerate tables #101821
Compiler
- `Cargo.lock` `unicode-rs` crates to Unicode 15 #101912
Language integrated:
- `unicode-xid` (Decide whether it's a valid identifier)
  Current: 0.2.2 (Unicode 13)
  Goal: 0.2.4 (Unicode 15)
- `unicode-normalization` (Preprocess identifiers for equality)
  Current: 0.1.13 (Unicode 9)
  Goal: 0.1.22 (Unicode 15)
- `unicode-security` (Decide whether lints against unwanted usages should be triggered)
  Current: 0.0.5 (Unicode 13)
  Goal: 0.1.0 (Unicode 15)
- `unicode-script` (Used by `unicode-security` for script detection)
  Current: 0.5.3 (Unicode 13)
  Goal: 0.5.5 (Unicode 15)
Diagnostics:
- `unicode-width` (used by `rustc-parse`, `rustc-errors` and many others)
  Current: 0.1.8 (Unicode 13)
  Goal: 0.1.10 (Unicode 15)
- `unicode-properties` (used by `rustc-lexer`)
  Current: 0.1.0 (Unicode 15)
  Goal: 0.1.0 (Unicode 15)
- `unic-char-property` (used by `unic-emoji-char`, then `rustc-lexer`)
  Current: 0.9.0 (Unclear, No release in 2 years)
  Goal: Need a new release (Will be replaced by `unicode-properties` in Update lexer emoji diagnostics to Unicode 15.0 #114193)
- `unic-char-range` (used by `unic-emoji-char`, then `rustc-lexer`)
  Current: 0.9.0 (Unclear, No release in 2 years)
  Goal: Need a new release (Will be replaced by `unicode-properties` in Update lexer emoji diagnostics to Unicode 15.0 #114193)
- `unic-common` (used by `unic-emoji-char`, then `rustc-lexer`)
  Current: 0.9.0 (Unclear, No release in 2 years)
  Goal: Need a new release (Will be replaced by `unicode-properties` in Update lexer emoji diagnostics to Unicode 15.0 #114193)
- `unic-ucd-version` (used by `unic-emoji-char`, then `rustc-lexer`)
  Current: 0.9.0 (Unclear, No release in 2 years)
  Goal: Need a new release (Will be replaced by `unicode-properties` in Update lexer emoji diagnostics to Unicode 15.0 #114193)
- `unic-emoji-char` (used by `rustc-lexer`)
  Current: 0.9.0 (Unclear, No release in 2 years)
  Goal: Need a new release (Will be replaced by `unicode-properties` in Update lexer emoji diagnostics to Unicode 15.0 #114193)
Dev-Tools:
- `Cargo.lock` `unicase` crate version. #118229
Dependency crates:
- `unicode-bidi` (used by `idna` then `url` then [`ammonia`, `cargo`, `cargo-test-support`, `clippy_lints`, `crates-io`, `git2`, `git2-curl`, `rustc-workspace-hack`])
  Previously: 0.3.4 (Unicode 10)
  Goal: >=0.3.10 (Unicode 15)
- `unicode-segmentation` (used by `rustfmt`)
  Previously: 1.9.0 (Unicode 14)
  Goal: >=1.10.0 (Unicode 15)
- `unicode-properties` (used by `rustfmt`)
  Mentioned above in compiler diagnostics section
- `unicode_categories` (used by `rustfmt`)
  Current: 0.1.1 (Unclear, No release in 6 years)
  Goal: Need a new release (Will be replaced by `unicode-properties` in Update Unicode data to 15.0 rustfmt#5864)
- `unicase` (used by `pulldown-cmark`, then [`rustdoc`, `clippy-lints`, `mdbook`])
  Current: 2.6.0 (Unclear, No release in 3 years)
  Goal: >=2.7.0 (Unicode 15)
Unicode version independent crates (ignorable for now, just for future reference):
- `unicode-bdd` (in-tree maintenance tool): Unicode version independent
- `ucd-parse`: Unicode version independent (used by `unicode-bdd` tool)
- `ucd-trie`: Unicode version independent (used by `handlebars`, then `mdbook`)
- `unic-langid`, `unic-langid-impl`, `unic-langid-macros`, `unic-langid-macros-impl`: Not really Unicode version independent, but we only use the Unicode version independent part. They're outdated (current: 0.9.1, CLDR 37, Spring 2020, ~= Unicode 13); it would be nice to move to a new release.