Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parse-namestring doesn't accept unicode #1595

Closed
fosskers opened this issue Jun 17, 2024 · 6 comments · Fixed by #1603
Closed

parse-namestring doesn't accept unicode #1595

fosskers opened this issue Jun 17, 2024 · 6 comments · Fixed by #1603
Labels

Comments

@fosskers
Copy link

fosskers commented Jun 17, 2024

Describe the bug

If we inspect the string "/foo/bar/ゆびわ", we see:

Dimensions: (12)
Element type: CHARACTER
Total size: 12
Adjustable: NIL
Fill pointer: NIL
Contents:
0: #\/
1: #\f
2: #\o
3: #\o
4: #\/
5: #\b
6: #\a
7: #\r
8: #\/
9: #\HIRAGANA_LETTER_YU
10: #\HIRAGANA_LETTER_BI
11: #\HIRAGANA_LETTER_WA

Wonderful. However, if we attempt to include a unicode character in a pathname, like (parse-namestring "/foo/bar/ゆびわ"), the debugger opens and we're told:

Cannot coerce string "/foo/bar/�s�" to a base-string

Somewhat cryptic. However, base-string is a hint, and the Clasp docs also mention:

Clasp supports Unicode by default. code-char and char-code work with Unicode codepoints.
...
Type base-char includes only single byte characters, i.e. Basic Latin and Latin-1 Supplement.

So perhaps somewhere in the depths of parsing the path, characters are assumed to be non-Unicode and a conversion to base-string (probably an array of base-char?) is attempted.

Expected behavior
It should be possible to contain Unicode characters with pathnames, as people in non-English-speaking countries often have Unicode characters in filepaths on their computers.

Actual behavior
(shown above)

Note also that this occurs for #p literals as well (probably powered by parse-namestring underneath).

@fosskers fosskers added the bug label Jun 17, 2024
@Bike
Copy link
Member

Bike commented Jun 17, 2024

Yes, a base-string is an array of base-chars, and base-chars are one byte, Latin-1 or something. The coercion is done here:

clasp/src/core/pathname.cc

Lines 1349 to 1351 in 9585462

#ifdef CLASP_UNICODE
thing = coerce::coerce_to_base_string(thing);
#endif
This code is pretty old, dating back to when Unicode support was new and Clasp was called BRCL.

The underlying parser in clasp_parseNamestring doesn't seem to actually be limited to byte characters on first glance, so maybe this will be easy to fix.

@Bike
Copy link
Member

Bike commented Jun 17, 2024

Nope, not that simple. namestring improperly treats components as base strings for reasons I don't follow, and the actual OS functions do not seem to work with your ring pathname. Let me try to sort all this out, I hate having secret English-centric assumptions littered around.

Bike added a commit that referenced this issue Jun 17, 2024
Note that this is NOT a complete solution to #1595, as the actual
OS interface (calls to open(3), etc.) is not really prepared for
non-base-strings. It will probably have to do some encoding.
@fosskers
Copy link
Author

Let me try to sort all this out, I hate having secret English-centric assumptions littered around.

Perhaps an innocent "ASCII will do for now" kind of thought, the lingering effects of which we're seeing here. Thanks for tackling this.

@Bike Bike closed this as completed in 932bb4c Jun 24, 2024
Bike added a commit that referenced this issue Jun 24, 2024
Support extended characters in pathnames. Fixes #1595.
@fosskers
Copy link
Author

I will test your fix locally as well.

@Bike
Copy link
Member

Bike commented Jun 25, 2024

Please do. I think I got it working, but I might have missed some of the more obscure filesystem access functions.

@fosskers
Copy link
Author

fosskers commented Jun 25, 2024

As far as my tests are concerned, your fixes work. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants