Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Let string splitters respect East_Asian_Width property #3445

Merged
merged 6 commits into from
Mar 19, 2023

Conversation

dahlia
Copy link
Contributor

@dahlia dahlia commented Dec 17, 2022

Description

This patch changes the preview style so that string splitters respect Unicode East Asian Width property. If you are not familiar to CJK languages it is not clear immediately. Let me elaborate with some examples.

Traditionally, East Asian characters (including punctuations) have taken up space twice than European letters and stops when they are rendered in monospace typeset. Compare the following characters:

abcdefg.
글、字。

The characters at the first line are half-width, and the second line are full-width. (Also note that the last character with a small circle, the East Asian period, is also full-width.) Therefore, if we want to prevent those full-width characters to exceed the maximum columns per line, we need to count their width rather than the number of characters. Again, the following characters:

글、字。

These are just 4 characters, but their total width is 8.

Suppose we want to maintain up to 4 columns per line with the following text:

abcdefg.
글、字。

How should it be then? We want it to look like:

abcd
efg.
글、
字。

However, Black currently turns it into like this:

abcd
efg.
글、字。

It's because Black currently counts the number of characters in the line instead of measuring their width. So, how could we measure the width? How can we tell if a character is full- or half-width? What if half-width characters and full-width ones are mixed in a line? That's why Unicode defined an attribute named East_Asian_Width. Unicode grouped every single character according to their width in fixed-width typeset.

So, this patch tried to change how Black measure line widths using East_Asian_Width attribute, and I believe it works well with full-width characters now. However, I am not confident if I implemented things in a proper way, so please let me know if I need to adjust!

Also, I believe it partially addresses #1197, but I touched only string splitters. Other parts probably need to be fixed as well.

Checklist - did you ...

  • Add an entry in CHANGES.md if necessary?
  • Add / update tests if necessary?
  • Add new / update outdated documentation?

src/black/strings.py Outdated Show resolved Hide resolved
@github-actions
Copy link

github-actions bot commented Dec 17, 2022

diff-shades results comparing this PR (82e39e8) to main (dba3c26). The full diff is available in the logs under the "Generate HTML diff report" step.

╭─────────────────────── Summary ────────────────────────╮
│ 2 projects & 6 files changed / 130 changes [+102/-28]  │
│                                                        │
│ ... out of 2 418 550 lines, 11 508 files & 23 projects │
╰────────────────────────────────────────────────────────╯

Differences found.

What is this? | Workflow run | diff-shades documentation

@item4
Copy link

item4 commented Dec 18, 2022

How nice!

But I have a question. How about handling for non string literal such as variable name, function name, class name, comment?

Examples are below

# comment

# 이 값을 건드리면 당신은 야근을 면치 못 할 것입니다. 이상하게 보여도 절대 수정하지 마세요 모두를 위해 그냥 둡시다. - 전임자 드림
ANSWER = 42
# function name

def 영어주소2한글주소(country: str, province: str, city: str, street_address1: str, street_address2: str, extra_data: dict[str, any]) -> str:
    pass

@dahlia
Copy link
Contributor Author

dahlia commented Dec 18, 2022

@item4 Other than string literals are not covered by this patch. I would appreciate if someone works on it further.

@dahlia
Copy link
Contributor Author

dahlia commented Dec 18, 2022

Here's also few more points we need to address in the future:

  • There are zero-width characters like U+200B ZERO WIDTH SPACE. They are designated N (narrow) according to East_Asian_Width, which would be counted as a single column. We need to count them as zero column.
  • Multiple code points can be combined together and look like a single character when they are rendered. They are called combining characters or grapheme clusters. (E.g., umlaut letters, emoji with skin color.) Those characters should be counted together.

src/black/lines.py Outdated Show resolved Hide resolved
src/black/trans.py Show resolved Hide resolved
@JelleZijlstra
Copy link
Collaborator

Also, I'm a little worried about performance here. char_width() is obviously going to be slower than len(), but will this noticeably affect Black's overall performance?

Possible optimizations:

  • Use a fast path for ascii-only strings (str.isascii method), since many programs will contain mostly ASCII-only strings
  • Cache the result of the computation per string; this is only helpful if we often re-compute the width for the same string

dahlia added a commit to dahlia/black that referenced this pull request Dec 25, 2022
@dahlia
Copy link
Contributor Author

dahlia commented Dec 25, 2022

I added a fast path for ASCII-only strings as well!

@dahlia dahlia requested a review from JelleZijlstra December 25, 2022 19:50
src/black/strings.py Outdated Show resolved Hide resolved
dahlia added a commit to dahlia/black that referenced this pull request Dec 30, 2022
@dahlia dahlia requested a review from JelleZijlstra December 30, 2022 09:40
@ZeroRin
Copy link

ZeroRin commented Jan 5, 2023

Consider using wcwidth? In my experience in other projects this is more accurate for control characters, non-asian characters, etc. (see this)
Also wcwidth uses lru_cache which should be helpful for the performance

@dahlia
Copy link
Contributor Author

dahlia commented Jan 5, 2023

Consider using wcwidth? In my experience in other projects this is more accurate for control characters, non-asian characters, etc. (see this) Also wcwidth uses lru_cache which should be helpful for the performance

Sounds good to me. @JelleZijlstra I wonder if it's appropriate to add a new dependency for this specific logic.

@ZeroRin
Copy link

ZeroRin commented Jan 5, 2023

There also exist packages like rich who generates a table with a tool script depend on wcwidth so that the package itself does not need it as dependency. Same strategy can be adopted here as well. I personally prefer using existing packages instead of copying the code though

@JelleZijlstra
Copy link
Collaborator

Looks like wcwidth is a pure-Python package with no further dependencies, so it wouldn't be too bad to add.

However, I am a little concerned about making formatting output depend on a third-party package. Right now if you have the same version of Black installed as someone else, you'll get the same formatting. But with wcwidth, perhaps different versions of the library will count line length differently, which would cause different formatting. That could create a confusing experience for users. We could get around it by pinning wcwidth to a specific version, but that's not great either.

@ichard26
Copy link
Collaborator

ichard26 commented Jan 5, 2023

If we want to use wcwidth, I'd prefer using a mechanism similar to rich to avoid the issues Jelle brings up. Otherwise pinning is "fine," but that's not great either (may cause dependency conflicts for some users if they use another tool that uses wcwidth). I'm kinda surprised rich seems to use a binary search instead of a dictionary lookup, but maybe there's some complexity I'm overlooking?

@ZeroRin
Copy link

ZeroRin commented Jan 5, 2023

I'm kinda surprised rich seems to use a binary search instead of a dictionary lookup, but maybe there's some complexity I'm overlooking?

Actually wcwidth itself is already using bisearch. To my understanding the key reason is that the width information is stored as a interval table instead of a full dictionary. A full dictionary for every unicode character would have ~a million records i guess, but the interval table consists of only ~500 records, as a block of characters tend to have same width.
The rich implementation differs from the original wcwidth in that it merges bisearch table for 0-width and 2-width into a single, and that it caches width for strings with less then 512 characters for reuse.

@dahlia
Copy link
Contributor Author

dahlia commented Jan 8, 2023

If we decide to generate Black's own table to avoid a runtime dependency on wcwidth, should the generated table be version-controlled, or should it be generated at build time (through a PEP-517-style custom build backend or a release script in the CI configuration)?

@ichard26
Copy link
Collaborator

ichard26 commented Jan 8, 2023

While writing a plugin for Hatchling (our build backend) would be super cool, it's probably not the best idea given it still gives room for install-to-install variability as Jelle brought up. It would be rare given almost no one installs from sdist, but the possibility is still there. Extending our release CD would be nice, but we haven't tried adding any automated source changes before and in general it sounds like it could devolve into a debugging nightmare. Doing it the rich way (feel free to add the new logic in subpackage of the Black package) would probably be best (assuming running the generate table script from time to time is basically we all we need to do).

However, I'd like @JelleZijlstra's input.

@JelleZijlstra
Copy link
Collaborator

Generating a file and checking it in seems like the best way. That way, we can make sure it stays the same for the whole year and keep the stable style stable.

dahlia added a commit to dahlia/black that referenced this pull request Jan 9, 2023
@dahlia dahlia removed the request for review from JelleZijlstra January 9, 2023 16:28
dahlia added a commit to dahlia/black that referenced this pull request Jan 25, 2023
@dahlia dahlia requested a review from JelleZijlstra January 25, 2023 10:12
@dahlia
Copy link
Contributor Author

dahlia commented Jan 25, 2023

@JelleZijlstra I addressed things you pointed out. Could you take a look again?

@JelleZijlstra
Copy link
Collaborator

diff-shades is failing on pandas apparently because there are no files to analyze, cc @ichard26

@dahlia dahlia force-pushed the east-asian-width branch from ed73789 to 0ab121e Compare March 16, 2023 14:31
@dahlia
Copy link
Contributor Author

dahlia commented Mar 16, 2023

I rebased commits on the latest main and resolved conflicts (mostly from #1879). Please take a look again!

@JelleZijlstra JelleZijlstra self-assigned this Mar 18, 2023
@dahlia
Copy link
Contributor Author

dahlia commented Mar 19, 2023

The test suite apparently failed on the CI job. Should I try to fix it?

@JelleZijlstra
Copy link
Collaborator

@yilei kindly fixed it in #3615. I'll merge in main to hopefully get tests to pass.

Copy link
Collaborator

@ichard26 ichard26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't check the width table code too closely since I presume that they were borrowed from rich. Everything else looks good though. Many thanks for the PR!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants