-
Notifications
You must be signed in to change notification settings - Fork 20
Do not filter keys that use codepoints outside the BMP when listing #986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
pepijnve
wants to merge
3
commits into
scality:development/8.1
Choose a base branch
from
datadobi:dev/BF/unicode_max
base: development/8.1
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, me again. This seems very similar to what localeCompare probably does, have you tried using that? https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/localeCompare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand it correctly localeCompare does locale sensitive collation. That's not what you want here; it needs to be locale independent UTF-8 binary order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had another look at the description on MDN and the two pointers to the relevant Ecma specifications. There's a lot of room for 'implementation dependent behaviour' which makes localeCompare unsuitable for this usage. Amazon is fairly vague on this, but https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysUsingAPIs.html does state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like the optional arguments will have this type of comparison but I can not find the exact one as I don't fully understand the problem.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're going to go down a rabbit hole here 😄
Looking at V8 you end up at https://github.com/v8/v8/blob/4b9b23521e6fd42373ebbcb20ebe03bf445494f9/src/builtins/builtins-string.cc#L135
which, assuming ICU support has been compiled in, delegates to
https://github.com/v8/v8/blob/4b9b23521e6fd42373ebbcb20ebe03bf445494f9/src/objects/intl-objects.cc#L947
which in turn creates an ICU collator and then compares using that.
Making some assumption here, but you're most likely going get a Unicode collator that uses the system's default locale if you do not specify any options. You can find the details of what that is exactly at https://www.unicode.org/reports/tr10/#Main_Algorithm. In a nutshell this is sorting for humans taking language/culture specific details into account. That's not what you want here.
I had hoped to be able to find a locale code or an option value that says 'I want codepoint order', but as far as I can tell that does not exist.
UTF-8 binary order means encoding both strings as UTF-8 and comparing byte-wise. Thanks to the way UTF-8 is constructed this is identical to the ordering you get when you take the codepoint values of the string and compare those integer-wise.
You might be wondering why the default UTF-16 comparison isn't good enough. Best way I can explain that is with an example.
Take codepoints 48, 65104 and 129648. That's 0,﹐and 🩰 respectively.
Sorting in codepoint order you get them in exactly that order.
If you encode in UTF-16 you get
respectively. UTF-16 binary sorting compares each 16-bit value one by one so that would return the string in the order [0, 🩰, ﹐].
Encode in UTF-8 and you get
respectively. UTF-8 binary sorting compares each 8-bit value one by one and then you get [0,﹐,🩰] which is what you want.
This is by design in UTF-8. Binary sort order for UTF-8 is identical to codepoint order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea that's fair, I had assumed it was something similar to that affect but was hoping there was something built in as this seems like it would be easy and common enough. I am new to scality so I won't be able to merge this but its got my approval if that means anything.