Let string splitters respect `East_Asian_Width` property #3445

dahlia · 2022-12-17T07:32:22Z

Description

This patch changes the preview style so that string splitters respect the Unicode East Asian Width property. If you are not familiar with CJK languages, the motivation may not be immediately clear, so let me elaborate with some examples.

Traditionally, East Asian characters (including punctuation) take up twice as much space as European letters and full stops when they are rendered in a monospace typeface. Compare the following characters:

abcdefg.
글、字。

The characters on the first line are half-width, and those on the second line are full-width. (Note that the last character with a small circle, the East Asian period, is also full-width.) Therefore, if we want to prevent full-width characters from exceeding the maximum number of columns per line, we need to count their width rather than the number of characters. Consider again the following characters:

글、字。

These are just 4 characters, but their total width is 8.
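A minimal sketch of that arithmetic, using `unicodedata.east_asian_width()` from Python's standard library. The helper names and the rule "W and F count as 2, everything else as 1" are assumptions made for illustration, not necessarily what the patch implements.

```python
import unicodedata


def char_width(char: str) -> int:
    """Return 2 for wide (W) and fullwidth (F) characters, otherwise 1."""
    return 2 if unicodedata.east_asian_width(char) in ("W", "F") else 1


def str_width(text: str) -> int:
    """Sum the display width of every character in text."""
    return sum(char_width(char) for char in text)


# Character count versus display width for the two sample lines.
print(len("abcdefg."), str_width("abcdefg."))
print(len("글、字。"), str_width("글、字。"))
```

Both lines occupy eight columns, even though the second one has only four characters.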

Suppose we want to maintain up to 4 columns per line with the following text:

abcdefg.
글、字。

How should it be then? We want it to look like:

abcd
efg.
글、
字。
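The desired wrapping can be sketched as a greedy split by display width. This is only an illustration of the idea; `split_by_width` is a hypothetical helper and has nothing to do with how Black's string splitters are actually structured.

```python
import unicodedata


def char_width(char: str) -> int:
    """Return 2 for wide (W) and fullwidth (F) characters, otherwise 1."""
    return 2 if unicodedata.east_asian_width(char) in ("W", "F") else 1


def split_by_width(text: str, max_width: int) -> list[str]:
    """Greedily cut text into chunks whose display width fits max_width."""
    chunks: list[str] = []
    current, current_width = "", 0
    for char in text:
        width = char_width(char)
        # Start a new chunk when this character would overflow the limit.
        if current and current_width + width > max_width:
            chunks.append(current)
            current, current_width = "", 0
        current += char
        current_width += width
    if current:
        chunks.append(current)
    return chunks


for line in ["abcdefg.", "글、字。"]:
    print(split_by_width(line, 4))
```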

However, Black currently turns it into this:

abcd
efg.
글、字。

This is because Black currently counts the number of characters in a line instead of measuring their width. So how can we measure the width? How can we tell whether a character is full- or half-width? What if half-width and full-width characters are mixed in a line? That is why Unicode defines a property named East_Asian_Width, which classifies every character according to its width in fixed-width typesetting.
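To make the property concrete, the standard `unicodedata` module reports the East_Asian_Width value of any character. The sample characters below and the "2 columns for W and F" rule are illustrative choices, not taken from the patch.

```python
import unicodedata

# UAX #11 defines six values: F (fullwidth), W (wide), H (halfwidth),
# Na (narrow), A (ambiguous) and N (neutral).
for char in "aＡ글ｱ":
    print(f"U+{ord(char):04X}", unicodedata.east_asian_width(char))

# A line mixing half-width and full-width characters.
mixed = "abc글자"
mixed_width = sum(
    2 if unicodedata.east_asian_width(char) in ("W", "F") else 1
    for char in mixed
)
print(len(mixed), mixed_width)
```

Here `len(mixed)` is 5 while the measured width is 7, which is exactly the gap that causes lines with CJK text to overflow.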

So this patch changes how Black measures line widths by using the East_Asian_Width property, and I believe it now works well with full-width characters. However, I am not confident that I implemented it in the proper way, so please let me know if I need to adjust anything!

Also, I believe it partially addresses #1197, but I touched only string splitters. Other parts probably need to be fixed as well.

Checklist - did you ...

Add an entry in CHANGES.md if necessary?
Add / update tests if necessary?
Add new / update outdated documentation?

src/black/strings.py

github-actions · 2022-12-17T16:22:22Z

diff-shades results comparing this PR (82e39e8) to main (dba3c26). The full diff is available in the logs under the "Generate HTML diff report" step.

╭─────────────────────── Summary ────────────────────────╮
│ 2 projects & 6 files changed / 130 changes [+102/-28]  │
│                                                        │
│ ... out of 2 418 550 lines, 11 508 files & 23 projects │
╰────────────────────────────────────────────────────────╯

Differences found.

item4 · 2022-12-18T02:21:15Z

How nice!

But I have a question. What about text outside string literals, such as variable names, function names, class names, and comments?

Examples:

# comment

# 이 값을 건드리면 당신은 야근을 면치 못 할 것입니다. 이상하게 보여도 절대 수정하지 마세요 모두를 위해 그냥 둡시다. - 전임자 드림
ANSWER = 42

# function name

def 영어주소2한글주소(country: str, province: str, city: str, street_address1: str, street_address2: str, extra_data: dict[str, any]) -> str:
    pass

dahlia · 2022-12-18T02:38:39Z

@item4 Anything other than string literals is not covered by this patch. I would appreciate it if someone worked on that further.

dahlia · 2022-12-18T02:46:07Z

Here are a few more points we need to address in the future:

There are zero-width characters like U+200B ZERO WIDTH SPACE. They are designated N (neutral) by East_Asian_Width, so they would be counted as a single column. We need to count them as zero columns.
Multiple code points can be combined and rendered as what looks like a single character. These are called combining character sequences or grapheme clusters (e.g., umlaut letters, emoji with skin tones). Such sequences should be counted as a unit.
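A rough sketch of both problems, using only the standard library. Treating the general categories Mn, Me and Cf as zero-width is an assumption made for this example; it is not what this patch does, and it does not cover every grapheme cluster (emoji skin-tone modifiers, for instance, are not in those categories).

```python
import unicodedata


def naive_width(text: str) -> int:
    """East_Asian_Width only: every non-wide code point counts as 1."""
    return sum(
        2 if unicodedata.east_asian_width(char) in ("W", "F") else 1
        for char in text
    )


def adjusted_width(text: str) -> int:
    """Additionally count combining marks and format characters as 0."""
    total = 0
    for char in text:
        if unicodedata.category(char) in ("Mn", "Me", "Cf"):
            continue  # nonspacing/enclosing marks and format controls
        total += 2 if unicodedata.east_asian_width(char) in ("W", "F") else 1
    return total


zwsp = "a\u200bb"   # ZERO WIDTH SPACE between two letters
umlaut = "u\u0308"  # "u" followed by COMBINING DIAERESIS

print(naive_width(zwsp), adjusted_width(zwsp))
print(naive_width(umlaut), adjusted_width(umlaut))
```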

src/black/lines.py

src/black/trans.py

JelleZijlstra · 2022-12-24T04:37:30Z

Also, I'm a little worried about performance here. char_width() is obviously going to be slower than len(), but will this noticeably affect Black's overall performance?

Possible optimizations:

Use a fast path for ascii-only strings (str.isascii method), since many programs will contain mostly ASCII-only strings
Cache the result of the computation per string; this is only helpful if we often re-compute the width for the same string

See also: psf#3445 (comment)

dahlia · 2022-12-25T19:49:53Z

I added a fast path for ASCII-only strings as well!

src/black/strings.py

See also: psf#3445 (comment)

ZeroRin · 2023-01-05T06:15:53Z

Consider using wcwidth? In my experience in other projects this is more accurate for control characters, non-asian characters, etc. (see this)
Also wcwidth uses lru_cache which should be helpful for the performance

dahlia · 2023-01-05T11:37:48Z

Consider using wcwidth? In my experience in other projects this is more accurate for control characters, non-asian characters, etc. (see this) Also wcwidth uses lru_cache which should be helpful for the performance

Sounds good to me. @JelleZijlstra I wonder if it's appropriate to add a new dependency for this specific logic.

ZeroRin · 2023-01-05T11:47:47Z

There are also packages like rich that generate a table with a tool script that depends on wcwidth, so that the package itself does not need it as a dependency. The same strategy could be adopted here as well. I personally prefer using existing packages instead of copying the code, though.

JelleZijlstra · 2023-01-05T15:22:28Z

Looks like wcwidth is a pure-Python package with no further dependencies, so it wouldn't be too bad to add.

However, I am a little concerned about making formatting output depend on a third-party package. Right now if you have the same version of Black installed as someone else, you'll get the same formatting. But with wcwidth, perhaps different versions of the library will count line length differently, which would cause different formatting. That could create a confusing experience for users. We could get around it by pinning wcwidth to a specific version, but that's not great either.

ichard26 · 2023-01-05T19:44:24Z

If we want to use wcwidth, I'd prefer using a mechanism similar to rich to avoid the issues Jelle brings up. Otherwise pinning is "fine," but that's not great either (may cause dependency conflicts for some users if they use another tool that uses wcwidth). I'm kinda surprised rich seems to use a binary search instead of a dictionary lookup, but maybe there's some complexity I'm overlooking?

ZeroRin · 2023-01-05T23:27:14Z

I'm kinda surprised rich seems to use a binary search instead of a dictionary lookup, but maybe there's some complexity I'm overlooking?

Actually, wcwidth itself already uses bisearch. To my understanding, the key reason is that the width information is stored as an interval table instead of a full dictionary. A full dictionary for every Unicode character would have ~a million records, I guess, but the interval table consists of only ~500 records, since a block of characters tends to have the same width.
The rich implementation differs from the original wcwidth in that it merges the bisearch tables for 0-width and 2-width characters into a single table, and in that it caches the width of strings with fewer than 512 characters for reuse.
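The interval-table idea can be sketched as follows. The table here is a tiny hand-picked subset for illustration only; it is not the real wcwidth or rich data, and the names are hypothetical.

```python
from bisect import bisect_right

# Illustrative subset only. Each row is (first code point, last code point,
# width), sorted by first code point. Zero-width and double-width ranges
# share one table; anything not listed defaults to width 1.
WIDTH_TABLE = [
    (0x0300, 0x036F, 0),  # combining diacritical marks
    (0x1100, 0x115F, 2),  # Hangul Jamo initial consonants
    (0x3000, 0x303E, 2),  # CJK symbols and punctuation
    (0x4E00, 0x9FFF, 2),  # CJK unified ideographs
    (0xAC00, 0xD7A3, 2),  # Hangul syllables
    (0xFF01, 0xFF60, 2),  # fullwidth forms
]
STARTS = [start for start, _, _ in WIDTH_TABLE]


def char_width(char: str) -> int:
    """Binary-search the interval that could contain char."""
    code_point = ord(char)
    index = bisect_right(STARTS, code_point) - 1
    if index >= 0:
        start, end, width = WIDTH_TABLE[index]
        if start <= code_point <= end:
            return width
    return 1


print([char_width(char) for char in "a글、\u0308"])
```

A lookup costs O(log n) comparisons over a few hundred rows, instead of holding an entry for every code point in memory.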

dahlia · 2023-01-08T06:04:04Z

If we decide to generate Black's own table to avoid a runtime dependency on wcwidth, should the generated table be version-controlled, or should it be generated at build time (through a PEP-517-style custom build backend or a release script in the CI configuration)?

ichard26 · 2023-01-08T20:20:39Z

While writing a plugin for Hatchling (our build backend) would be super cool, it's probably not the best idea given it still leaves room for install-to-install variability, as Jelle brought up. It would be rare given almost no one installs from sdist, but the possibility is still there. Extending our release CD would be nice, but we haven't tried adding any automated source changes before, and in general it sounds like it could devolve into a debugging nightmare. Doing it the rich way (feel free to add the new logic in a subpackage of the Black package) would probably be best (assuming that running the table generation script from time to time is basically all we need to do).

However, I'd like @JelleZijlstra's input.

JelleZijlstra · 2023-01-08T20:23:09Z

Generating a file and checking it in seems like the best way. That way, we can make sure it stays the same for the whole year and keep the stable style stable.
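A generator for such a checked-in file might look roughly like the sketch below. It derives widths from the standard `unicodedata` module, which is a substitution made to keep the example dependency-free, so its output would vary with the Unicode version bundled with the interpreter that runs it. The function names and output format are hypothetical.

```python
import unicodedata
from typing import Iterable


def make_width_table(code_points: Iterable[int]) -> list[tuple[int, int, int]]:
    """Collapse ascending runs of wide code points into (start, end, width).

    Width-1 characters are omitted because 1 is the lookup default.
    """
    table: list[tuple[int, int, int]] = []
    for code_point in code_points:
        eaw = unicodedata.east_asian_width(chr(code_point))
        width = 2 if eaw in ("W", "F") else 1
        if width == 1:
            continue
        if table and table[-1][1] == code_point - 1 and table[-1][2] == width:
            # Extend the previous interval instead of starting a new one.
            start, _, _ = table.pop()
            table.append((start, code_point, width))
        else:
            table.append((code_point, code_point, width))
    return table


def render(table: list[tuple[int, int, int]]) -> str:
    """Format the table as Python source suitable for committing."""
    rows = "\n".join(f"    ({s:#x}, {e:#x}, {w})," for s, e, w in table)
    return f"WIDTH_TABLE = [\n{rows}\n]\n"


# Only the Hangul syllables block, to keep the demonstration output short.
print(render(make_width_table(range(0xAC00, 0xD7A4))))
```

Because the rendered table is committed, every install of a given Black version measures widths identically, regardless of what is installed alongside it.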

See also: psf#3445 (comment)

dahlia · 2023-01-25T10:14:21Z

@JelleZijlstra I addressed things you pointed out. Could you take a look again?

scripts/make_width_table.py

See also: psf#3445 (comment)

JelleZijlstra · 2023-02-22T07:20:48Z

diff-shades is failing on pandas apparently because there are no files to analyze, cc @ichard26

See also: https://www.unicode.org/reports/tr11/

See also: psf#3445 (comment)

dahlia · 2023-03-16T14:33:01Z

I rebased commits on the latest main and resolved conflicts (mostly from #1879). Please take a look again!

dahlia · 2023-03-19T04:50:27Z

The test suite apparently failed on the CI job. Should I try to fix it?

JelleZijlstra · 2023-03-19T14:44:31Z

@yilei kindly fixed it in #3615. I'll merge in main to hopefully get tests to pass.

ichard26

I didn't check the width table code too closely since I presume that it was borrowed from rich. Everything else looks good though. Many thanks for the PR!

JelleZijlstra reviewed Dec 17, 2022

src/black/strings.py (outdated)

dahlia force-pushed the east-asian-width branch from 6b8c321 to 456f0ad Compare December 17, 2022 17:01

dahlia requested a review from JelleZijlstra December 17, 2022 17:02

dahlia mentioned this pull request Dec 17, 2022

Line length should consider Unicode character width #1197

Open

JelleZijlstra reviewed Dec 24, 2022

src/black/lines.py (outdated)

src/black/trans.py

dahlia force-pushed the east-asian-width branch from ab2a939 to 1c06c45 Compare December 25, 2022 19:40

dahlia added a commit to dahlia/black that referenced this pull request Dec 25, 2022

Fast path to measure width of ASCII strings

63ebbcf

See also: psf#3445 (comment)

dahlia requested a review from JelleZijlstra December 25, 2022 19:50

JelleZijlstra reviewed Dec 29, 2022

src/black/strings.py (outdated)

dahlia added a commit to dahlia/black that referenced this pull request Dec 30, 2022

Fast path to measure width of ASCII strings

7c70980

See also: psf#3445 (comment)

dahlia force-pushed the east-asian-width branch from 63ebbcf to 7c70980 Compare December 30, 2022 09:39

dahlia requested a review from JelleZijlstra December 30, 2022 09:40

dahlia added a commit to dahlia/black that referenced this pull request Jan 9, 2023

Fast path to measure width of ASCII strings

745b2c8

See also: psf#3445 (comment)

dahlia force-pushed the east-asian-width branch from 7c70980 to 71da607 Compare January 9, 2023 16:19

dahlia removed the request for review from JelleZijlstra January 9, 2023 16:28

dahlia added a commit to dahlia/black that referenced this pull request Jan 25, 2023

Fast path to measure width of ASCII strings

4565670

See also: psf#3445 (comment)

dahlia force-pushed the east-asian-width branch from 71da607 to 0cdb8e4 Compare January 25, 2023 10:12

dahlia requested a review from JelleZijlstra January 25, 2023 10:12

JelleZijlstra reviewed Feb 7, 2023

scripts/make_width_table.py (outdated)

Jackenmen mentioned this pull request Feb 9, 2023

Setup "ecosystem CI" to avoid regressions for existing users astral-sh/ruff#2677

Closed

dahlia added a commit to dahlia/black that referenced this pull request Feb 10, 2023

Fast path to measure width of ASCII strings

52e1406

See also: psf#3445 (comment)

dahlia force-pushed the east-asian-width branch from ff88d9c to ed73789 Compare February 10, 2023 13:10

dahlia requested a review from JelleZijlstra February 10, 2023 13:11

JelleZijlstra approved these changes Feb 22, 2023

dahlia added 4 commits March 16, 2023 23:29

Let string splitters respect East Asian Width

cd27157

See also: https://www.unicode.org/reports/tr11/

Fast path to measure width of ASCII strings

0800aee

See also: psf#3445 (comment)

Generate more precise width table using wcwidth

ecb5fdf

Let width table treat 0-width chars as 1-width

0ab121e

dahlia force-pushed the east-asian-width branch from ed73789 to 0ab121e Compare March 16, 2023 14:31

JelleZijlstra approved these changes Mar 18, 2023

JelleZijlstra self-assigned this Mar 18, 2023

Merge branch 'main' into east-asian-width

3593955

Merge branch 'main' into east-asian-width

82e39e8

ichard26 approved these changes Mar 19, 2023

ichard26 merged commit ef6e079 into psf:main Mar 19, 2023

charliermarsh mentioned this pull request Apr 6, 2023

E501 false positive when using east_asian_unicode string astral-sh/ruff#3902

Closed

takanakahiko mentioned this pull request Nov 28, 2023

Setting the 2024 stable style #4042

Closed

JelleZijlstra mentioned this pull request Feb 15, 2024

Black joins lines when it shouldn't - short width characters. #4235

Open

KaiSforza mentioned this pull request Feb 27, 2024

Added test for the Khmer language #4253

Draft

3 tasks

zanieb mentioned this pull request May 13, 2024

Incorrect character counting for non-ASCII characters astral-sh/ruff#11396

Closed
