Improve std::format's width estimation #3903

Merged: 39 commits, Aug 11, 2023

Contributor

@achabense achabense commented Jul 23, 2023

Resolves #3446

This PR introduces a generator for the new _Width_estimate_intervals table. The new project should include EastAsianWidth.txt (link) somewhere.

Here is its raw output. The "Old table:" part helps confirm that the interval-generating algorithm is correct. The "Was 1, now 2:" and "Was 2, now 1:" parts help confirm that this implementation conforms to the standard (by comparing them with the lists in the annex of the paper).

Output
Old table:
0x1100u, 0x1160u, 0x2329u, 0x232Bu, 0x2E80u, 0x303Fu, 0x3040u, 0xA4D0u, 0xAC00u, 0xD7A4u, 0xF900u, 0xFB00u,
0xFE10u, 0xFE1Au, 0xFE30u, 0xFE70u, 0xFF00u, 0xFF61u, 0xFFE0u, 0xFFE7u, 0x1F300u, 0x1F650u, 0x1F900u, 0x1FA00u,
0x20000u, 0x2FFFEu, 0x30000u, 0x3FFFEu,

New table:
Input path for EastAsianWidth.txt: EastAsianWidth.txt
0x1100u, 0x1160u, 0x231Au, 0x231Cu, 0x2329u, 0x232Bu, 0x23E9u, 0x23EDu, 0x23F0u, 0x23F1u, 0x23F3u, 0x23F4u,
0x25FDu, 0x25FFu, 0x2614u, 0x2616u, 0x2648u, 0x2654u, 0x267Fu, 0x2680u, 0x2693u, 0x2694u, 0x26A1u, 0x26A2u,
0x26AAu, 0x26ACu, 0x26BDu, 0x26BFu, 0x26C4u, 0x26C6u, 0x26CEu, 0x26CFu, 0x26D4u, 0x26D5u, 0x26EAu, 0x26EBu,
0x26F2u, 0x26F4u, 0x26F5u, 0x26F6u, 0x26FAu, 0x26FBu, 0x26FDu, 0x26FEu, 0x2705u, 0x2706u, 0x270Au, 0x270Cu,
0x2728u, 0x2729u, 0x274Cu, 0x274Du, 0x274Eu, 0x274Fu, 0x2753u, 0x2756u, 0x2757u, 0x2758u, 0x2795u, 0x2798u,
0x27B0u, 0x27B1u, 0x27BFu, 0x27C0u, 0x2B1Bu, 0x2B1Du, 0x2B50u, 0x2B51u, 0x2B55u, 0x2B56u, 0x2E80u, 0x2E9Au,
0x2E9Bu, 0x2EF4u, 0x2F00u, 0x2FD6u, 0x2FF0u, 0x2FFCu, 0x3000u, 0x303Fu, 0x3041u, 0x3097u, 0x3099u, 0x3100u,
0x3105u, 0x3130u, 0x3131u, 0x318Fu, 0x3190u, 0x31E4u, 0x31F0u, 0x321Fu, 0x3220u, 0x3248u, 0x3250u, 0xA48Du,
0xA490u, 0xA4C7u, 0xA960u, 0xA97Du, 0xAC00u, 0xD7A4u, 0xF900u, 0xFB00u, 0xFE10u, 0xFE1Au, 0xFE30u, 0xFE53u,
0xFE54u, 0xFE67u, 0xFE68u, 0xFE6Cu, 0xFF01u, 0xFF61u, 0xFFE0u, 0xFFE7u, 0x16FE0u, 0x16FE5u, 0x16FF0u, 0x16FF2u,
0x17000u, 0x187F8u, 0x18800u, 0x18CD6u, 0x18D00u, 0x18D09u, 0x1AFF0u, 0x1AFF4u, 0x1AFF5u, 0x1AFFCu, 0x1AFFDu, 0x1AFFFu,
0x1B000u, 0x1B123u, 0x1B132u, 0x1B133u, 0x1B150u, 0x1B153u, 0x1B155u, 0x1B156u, 0x1B164u, 0x1B168u, 0x1B170u, 0x1B2FCu,
0x1F004u, 0x1F005u, 0x1F0CFu, 0x1F0D0u, 0x1F18Eu, 0x1F18Fu, 0x1F191u, 0x1F19Bu, 0x1F200u, 0x1F203u, 0x1F210u, 0x1F23Cu,
0x1F240u, 0x1F249u, 0x1F250u, 0x1F252u, 0x1F260u, 0x1F266u, 0x1F300u, 0x1F650u, 0x1F680u, 0x1F6C6u, 0x1F6CCu, 0x1F6CDu,
0x1F6D0u, 0x1F6D3u, 0x1F6D5u, 0x1F6D8u, 0x1F6DCu, 0x1F6E0u, 0x1F6EBu, 0x1F6EDu, 0x1F6F4u, 0x1F6FDu, 0x1F7E0u, 0x1F7ECu,
0x1F7F0u, 0x1F7F1u, 0x1F900u, 0x1FA00u, 0x1FA70u, 0x1FA7Du, 0x1FA80u, 0x1FA89u, 0x1FA90u, 0x1FABEu, 0x1FABFu, 0x1FAC6u,
0x1FACEu, 0x1FADCu, 0x1FAE0u, 0x1FAE9u, 0x1FAF0u, 0x1FAF9u, 0x20000u, 0x2FFFEu, 0x30000u, 0x3FFFEu,

Was 1, now 2:
U+231A..U+231B
U+23E9..U+23EC
U+23F0
U+23F3
U+25FD..U+25FE
U+2614..U+2615
U+2648..U+2653
U+267F
U+2693
U+26A1
U+26AA..U+26AB
U+26BD..U+26BE
U+26C4..U+26C5
U+26CE
U+26D4
U+26EA
U+26F2..U+26F3
U+26F5
U+26FA
U+26FD
U+2705
U+270A..U+270B
U+2728
U+274C
U+274E
U+2753..U+2755
U+2757
U+2795..U+2797
U+27B0
U+27BF
U+2B1B..U+2B1C
U+2B50
U+2B55
U+A960..U+A97C
U+16FE0..U+16FE4
U+16FF0..U+16FF1
U+17000..U+187F7
U+18800..U+18CD5
U+18D00..U+18D08
U+1AFF0..U+1AFF3
U+1AFF5..U+1AFFB
U+1AFFD..U+1AFFE
U+1B000..U+1B122
U+1B132
U+1B150..U+1B152
U+1B155
U+1B164..U+1B167
U+1B170..U+1B2FB
U+1F004
U+1F0CF
U+1F18E
U+1F191..U+1F19A
U+1F200..U+1F202
U+1F210..U+1F23B
U+1F240..U+1F248
U+1F250..U+1F251
U+1F260..U+1F265
U+1F680..U+1F6C5
U+1F6CC
U+1F6D0..U+1F6D2
U+1F6D5..U+1F6D7
U+1F6DC..U+1F6DF
U+1F6EB..U+1F6EC
U+1F6F4..U+1F6FC
U+1F7E0..U+1F7EB
U+1F7F0
U+1FA70..U+1FA7C
U+1FA80..U+1FA88
U+1FA90..U+1FABD
U+1FABF..U+1FAC5
U+1FACE..U+1FADB
U+1FAE0..U+1FAE8
U+1FAF0..U+1FAF8

Was 2, now 1:
U+2E9A
U+2EF4..U+2EFF
U+2FD6..U+2FEF
U+2FFC..U+2FFF
U+3040
U+3097..U+3098
U+3100..U+3104
U+3130
U+318F
U+31E4..U+31EF
U+321F
U+3248..U+324F
U+A48D..U+A48F
U+A4C7..U+A4CF
U+FE53
U+FE67
U+FE6C..U+FE6F
U+FF00
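
One way to read the tables above, as a minimal sketch rather than the actual code in `<format>`: each table appears to be a sorted list of boundaries, where the estimated width starts at 1 and flips between 1 and 2 at every boundary. The names `estimate_width`, `changed_ranges`, and `new_table_prefix` are hypothetical and exist only for this illustration. `new_table_prefix` is just the first row of the new table, so it is only meaningful for code points below 0x25FD (the next boundary).

```cpp
#include <cstdint>
#include <utility>
#include <vector>

using boundary_table = std::vector<std::uint32_t>;

// "Old table" copied from the generator output.
const boundary_table old_table = {
    0x1100u, 0x1160u, 0x2329u, 0x232Bu, 0x2E80u, 0x303Fu, 0x3040u, 0xA4D0u, 0xAC00u, 0xD7A4u, 0xF900u, 0xFB00u,
    0xFE10u, 0xFE1Au, 0xFE30u, 0xFE70u, 0xFF00u, 0xFF61u, 0xFFE0u, 0xFFE7u, 0x1F300u, 0x1F650u, 0x1F900u, 0x1FA00u,
    0x20000u, 0x2FFFEu, 0x30000u, 0x3FFFEu,
};

// First row of the "New table"; valid only for code points below 0x25FD.
const boundary_table new_table_prefix = {
    0x1100u, 0x1160u, 0x231Au, 0x231Cu, 0x2329u, 0x232Bu, 0x23E9u, 0x23EDu, 0x23F0u, 0x23F1u, 0x23F3u, 0x23F4u,
};

// Hypothetical helper: walk the boundaries in order. The width is 1 before
// the first boundary and flips (1 <-> 2) each time a boundary is passed.
int estimate_width(const boundary_table& bounds, std::uint32_t cp) {
    int width = 1;
    for (std::uint32_t bound : bounds) {
        if (cp < bound) {
            return width;
        }
        width = 3 - width; // 1 -> 2, 2 -> 1
    }
    return width; // an even-sized table ends back at width 1
}

// Hypothetical helper: collect maximal inclusive ranges in [first, last)
// whose width is `from` in the old table and `to` in the new one.
std::vector<std::pair<std::uint32_t, std::uint32_t>> changed_ranges(const boundary_table& old_bounds,
    const boundary_table& new_bounds, std::uint32_t first, std::uint32_t last, int from, int to) {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> out;
    for (std::uint32_t cp = first; cp < last; ++cp) {
        if (estimate_width(old_bounds, cp) == from && estimate_width(new_bounds, cp) == to) {
            if (!out.empty() && out.back().second + 1 == cp) {
                out.back().second = cp; // extend the current range
            } else {
                out.emplace_back(cp, cp); // start a new range
            }
        }
    }
    return out;
}
```

Under these assumptions, `changed_ranges(old_table, new_table_prefix, 0, 0x2400, 1, 2)` yields U+231A..U+231B, U+23E9..U+23EC, U+23F0, and U+23F3, which matches the first four entries of the "Was 1, now 2:" list.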

Thanks @frederick-vs-ja for pointing out that the old _Width_estimate_intervals table can be overwritten directly 👀

@achabense achabense requested a review from a team as a code owner July 23, 2023 06:39
Contributor

@frederick-vs-ja frederick-vs-ja left a comment

We should no longer cite N4928, since it didn't contain the changes in P2675R1.
The first C++26 working draft is likely to be completed before this PR lands. But I think it still makes sense to cite N4950, since it is the final draft of C++23 and MSVC STL currently cites N4950 throughout the product code.

stl/inc/format (outdated, resolved)
stl/inc/format (outdated, resolved)
Co-authored-by: A. Jiang <[email protected]>

@StephanTLavavej StephanTLavavej added format C++20/23 format cxx23 C++23 feature labels Jul 23, 2023

// Copyright (c) Microsoft Corporation.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

// The following code generates data for `_Width_estimate_intervals` in <format>.
Contributor

FYI, in libc++ I use a Python script that was based on this script.

@StephanTLavavej StephanTLavavej self-assigned this Jul 24, 2023

Contributor Author

@achabense achabense left a comment

This is the full list of the remaining issues I'm concerned about. Aside from these, should I also add a test for all affected ranges (those in the annex of P2675R1) in this PR?

stl/inc/format (resolved)
stl/inc/format (outdated, resolved)
tests/libcxx/expected_results.txt (resolved)
table_u read_from(ifstream& source) {
    table_u table;

    // "The unassigned code points in the following blocks default to "W":"
Contributor Author

@achabense achabense Jul 26, 2023

About the citation: the original comment in "EastAsianWidth.txt" for 20000~2FFFD and 30000~3FFFD is "All undesignated code points in Planes 2 and 3, whether inside or outside of allocated blocks, default to "W":"
How does it differ from this sentence? Don't they mean the same thing?

Also see #3903 (comment) and #3903 (comment)
(My view is that we can safely keep this part: it is harmless and will hopefully not become outdated for a long time.)

Member

I don't understand the question here - where is this comment coming from, and why is it different from what EastAsianWidth.txt says? Can't we just quote the EastAsianWidth.txt comment verbatim?

Contributor Author

Both comments come from EastAsianWidth.txt; I'm confused about why there needs to be a different comment for 20000~2FFFD and 30000~3FFFD here:
image

Contributor

@cpplearner cpplearner Jul 27, 2023

IIUC EastAsianWidth.txt is trying to communicate that not all of Planes 2 & 3 are allocated. In particular, the following ranges are not allocated for any block:

  • 0002A6E0-0002A6FF
  • 0002EE50-0002F7FF
  • 0002FA20-0002FFFD
  • 000323B0-0003FFFD

But I don't think we care about the relationship between planes and blocks. For our purposes, they are just "big areas" and "small areas".
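
To make the "big areas" and "small areas" point concrete, here is a rough sketch of how defaults and explicit entries could be combined into a boundary table. It is not the PR's `read_from`; the names `width_range`, `trim`, `parse_line`, and `make_boundaries` are hypothetical. It assumes data lines of the form `XXXX;P` or `XXXX..YYYY;P` with `#` comments, treats only the "W" and "F" properties as wide (an assumption), and applies only the Plane 2 and Plane 3 defaults discussed in this thread before letting explicit entries override them.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

struct width_range {
    std::uint32_t first;
    std::uint32_t last; // inclusive
    bool wide;
};

// Strip leading and trailing spaces and tabs.
std::string trim(const std::string& s) {
    const auto begin = s.find_first_not_of(" \t");
    if (begin == std::string::npos) {
        return "";
    }
    const auto end = s.find_last_not_of(" \t");
    return s.substr(begin, end - begin + 1);
}

// Parse one line; returns nullopt for blank or comment-only lines.
std::optional<width_range> parse_line(const std::string& line) {
    const std::string data = trim(line.substr(0, line.find('#')));
    const auto semi = data.find(';');
    if (data.empty() || semi == std::string::npos) {
        return std::nullopt;
    }
    const std::string cps  = trim(data.substr(0, semi));
    const std::string prop = trim(data.substr(semi + 1));
    const auto dots = cps.find("..");
    const auto first = static_cast<std::uint32_t>(std::stoul(cps.substr(0, dots), nullptr, 16));
    const auto last  = dots == std::string::npos
                         ? first
                         : static_cast<std::uint32_t>(std::stoul(cps.substr(dots + 2), nullptr, 16));
    return width_range{first, last, prop == "W" || prop == "F"}; // assumption: W/F are wide
}

// Build the flat boundary list: defaults first, then explicit entries.
std::vector<std::uint32_t> make_boundaries(const std::vector<std::string>& lines) {
    constexpr std::uint32_t limit = 0x110000;
    std::vector<bool> wide(limit, false);
    // Defaults for Planes 2 and 3 (the "big areas"), regardless of blocks.
    for (std::uint32_t cp = 0x20000; cp <= 0x2FFFD; ++cp) wide[cp] = true;
    for (std::uint32_t cp = 0x30000; cp <= 0x3FFFD; ++cp) wide[cp] = true;
    for (const std::string& line : lines) {
        if (const auto r = parse_line(line)) {
            for (std::uint32_t cp = r->first; cp <= r->last && cp < limit; ++cp) {
                wide[cp] = r->wide;
            }
        }
    }
    // Emit a boundary wherever the wide/narrow state changes.
    std::vector<std::uint32_t> bounds;
    bool prev = false;
    for (std::uint32_t cp = 0; cp < limit; ++cp) {
        if (wide[cp] != prev) {
            bounds.push_back(cp);
            prev = wide[cp];
        }
    }
    if (prev) {
        bounds.push_back(limit);
    }
    return bounds;
}
```

With this approach the unallocated gaps listed above need no special handling: they simply keep the plane-wide default, which is why the generated table can end with one interval per plane.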

@StephanTLavavej StephanTLavavej self-requested a review July 26, 2023 19:32

add comments for `_Unicode_width_estimate`;
add test cases: all 2->1 cases; and some 1->2 cases
Member

@barcharcraz barcharcraz left a comment

Looks good to me. We should create an issue to port the generator to Python.

@StephanTLavavej
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit 84cc12d into microsoft:main Aug 11, 2023
@StephanTLavavej
Member

Thanks for implementing this C++23 feature and bringing the codebase closer to completeness! ✅ 🚀 😻

@achabense achabense deleted the P2675R1_v2 branch August 11, 2023 02:40
@frederick-vs-ja
Contributor

Thanks for consistently citing N4950 in the whole product code!

Labels
cxx23 C++23 feature format C++20/23 format
Development

Successfully merging this pull request may close these issues.

P2675R1 Improving std::format's Width Estimation
7 participants