-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Let string splitters respect East_Asian_Width
property
#3445
Conversation
diff-shades results comparing this PR (82e39e8) to main (dba3c26). The full diff is available in the logs under the "Generate HTML diff report" step.
|
6b8c321
to
456f0ad
Compare
How nice! But I have a question. How about handling for non string literal such as variable name, function name, class name, comment? Examples are below # comment
# 이 값을 건드리면 당신은 야근을 면치 못 할 것입니다. 이상하게 보여도 절대 수정하지 마세요 모두를 위해 그냥 둡시다. - 전임자 드림
ANSWER = 42 # function name
def 영어주소2한글주소(country: str, province: str, city: str, street_address1: str, street_address2: str, extra_data: dict[str, any]) -> str:
pass |
@item4 Other than string literals are not covered by this patch. I would appreciate if someone works on it further. |
Here's also few more points we need to address in the future:
|
Also, I'm a little worried about performance here. Possible optimizations:
|
ab2a939
to
1c06c45
Compare
I added a fast path for ASCII-only strings as well! |
63ebbcf
to
7c70980
Compare
Sounds good to me. @JelleZijlstra I wonder if it's appropriate to add a new dependency for this specific logic. |
There also exist packages like rich who generates a table with a tool script depend on |
Looks like However, I am a little concerned about making formatting output depend on a third-party package. Right now if you have the same version of Black installed as someone else, you'll get the same formatting. But with wcwidth, perhaps different versions of the library will count line length differently, which would cause different formatting. That could create a confusing experience for users. We could get around it by pinning wcwidth to a specific version, but that's not great either. |
If we want to use wcwidth, I'd prefer using a mechanism similar to rich to avoid the issues Jelle brings up. Otherwise pinning is "fine," but that's not great either (may cause dependency conflicts for some users if they use another tool that uses wcwidth). I'm kinda surprised rich seems to use a binary search instead of a dictionary lookup, but maybe there's some complexity I'm overlooking? |
Actually wcwidth itself is already using bisearch. To my understanding the key reason is that the width information is stored as a interval table instead of a full dictionary. A full dictionary for every unicode character would have ~a million records i guess, but the interval table consists of only ~500 records, as a block of characters tend to have same width. |
If we decide to generate Black's own table to avoid a runtime dependency on wcwidth, should the generated table be version-controlled, or should it be generated at build time (through a PEP-517-style custom build backend or a release script in the CI configuration)? |
While writing a plugin for Hatchling (our build backend) would be super cool, it's probably not the best idea given it still gives room for install-to-install variability as Jelle brought up. It would be rare given almost no one installs from sdist, but the possibility is still there. Extending our release CD would be nice, but we haven't tried adding any automated source changes before and in general it sounds like it could devolve into a debugging nightmare. Doing it the rich way (feel free to add the new logic in subpackage of the Black package) would probably be best (assuming running the generate table script from time to time is basically we all we need to do). However, I'd like @JelleZijlstra's input. |
Generating a file and checking it in seems like the best way. That way, we can make sure it stays the same for the whole year and keep the stable style stable. |
7c70980
to
71da607
Compare
71da607
to
0cdb8e4
Compare
@JelleZijlstra I addressed things you pointed out. Could you take a look again? |
ff88d9c
to
ed73789
Compare
diff-shades is failing on pandas apparently because there are no files to analyze, cc @ichard26 |
ed73789
to
0ab121e
Compare
I rebased commits on the latest main and resolved conflicts (mostly from #1879). Please take a look again! |
The test suite apparently failed on the CI job. Should I try to fix it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't check the width table code too closely since I presume that they were borrowed from rich. Everything else looks good though. Many thanks for the PR!
Description
This patch changes the preview style so that string splitters respect Unicode East Asian Width property. If you are not familiar to CJK languages it is not clear immediately. Let me elaborate with some examples.
Traditionally, East Asian characters (including punctuations) have taken up space twice than European letters and stops when they are rendered in monospace typeset. Compare the following characters:
The characters at the first line are half-width, and the second line are full-width. (Also note that the last character with a small circle, the East Asian period, is also full-width.) Therefore, if we want to prevent those full-width characters to exceed the maximum columns per line, we need to count their width rather than the number of characters. Again, the following characters:
These are just 4 characters, but their total width is 8.
Suppose we want to maintain up to 4 columns per line with the following text:
How should it be then? We want it to look like:
However, Black currently turns it into like this:
It's because Black currently counts the number of characters in the line instead of measuring their width. So, how could we measure the width? How can we tell if a character is full- or half-width? What if half-width characters and full-width ones are mixed in a line? That's why Unicode defined an attribute named
East_Asian_Width
. Unicode grouped every single character according to their width in fixed-width typeset.So, this patch tried to change how Black measure line widths using
East_Asian_Width
attribute, and I believe it works well with full-width characters now. However, I am not confident if I implemented things in a proper way, so please let me know if I need to adjust!Also, I believe it partially addresses #1197, but I touched only string splitters. Other parts probably need to be fixed as well.
Checklist - did you ...
CHANGES.md
if necessary?