Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

path: fix unicode path problems in path.relative #27644

Closed

Conversation

monoblaine
Copy link

This commit changes the way two paths are compared in path.relative:
Instead of comparing each char code in path strings one by one, which
causes problems when the number of char codes in lowercased path string
does not match the original one (e.g. path contains certain Unicode
characters like 'İ'), it now splits the path string by backslash and
compares the parts instead.

Fixes: #27534

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • commit message follows commit guidelines

@nodejs-github-bot nodejs-github-bot added the path Issues and PRs related to the path subsystem. label May 10, 2019
Copy link
Contributor

@mscdex mscdex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benchmark results:

path/relative-win32.js n=100000 paths='C:\\\\|D:\\\\'                                                    ***    -76.13 %       ±3.75% ±5.05% ±6.71%
path/relative-win32.js n=100000 paths='C:\\\\foo\\\\bar\\\\baz|C:\\\\foo\\\\bar\\\\baz'                           4.74 %       ±5.08% ±6.76% ±8.80%
path/relative-win32.js n=100000 paths='C:\\\\foo\\\\BAR\\\\BAZ|C:\\\\foo\\\\bar\\\\baz'                           0.01 %       ±2.56% ±3.41% ±4.44%
path/relative-win32.js n=100000 paths='C:\\\\foo\\\\bar\\\\baz\\\\quux|C:\\\\'                           ***    -70.69 %       ±1.98% ±2.67% ±3.54%
path/relative-win32.js n=100000 paths='C:\\\\orandea\\\\test\\\\aaa|C:\\\\orandea\\\\impl\\\\bbb'        ***    -64.37 %       ±1.53% ±2.04% ±2.68%

@monoblaine
Copy link
Author

Hey @mscdex I've removed regex replace statements and made sure splits and joins are executed as lazily as possible. I've also executed benchmarks in my machine (node benchmark/path/relative-win32.js) but they look kinda weird and inconsistent. Hope it's better now! 😐

@mscdex
Copy link
Contributor

mscdex commented May 11, 2019

Benchmark results:

path/relative-win32.js n=100000 paths='C:\\\\|D:\\\\'                                                    ***    -55.30 %       ±2.87% ±3.83% ±5.00%
path/relative-win32.js n=100000 paths='C:\\\\foo\\\\bar\\\\baz|C:\\\\foo\\\\bar\\\\baz'                          -0.35 %       ±4.71% ±6.28% ±8.18%
path/relative-win32.js n=100000 paths='C:\\\\foo\\\\BAR\\\\BAZ|C:\\\\foo\\\\bar\\\\baz'                          -0.59 %       ±2.66% ±3.54% ±4.61%
path/relative-win32.js n=100000 paths='C:\\\\foo\\\\bar\\\\baz\\\\quux|C:\\\\'                           ***    -59.94 %       ±1.59% ±2.13% ±2.81%
path/relative-win32.js n=100000 paths='C:\\\\orandea\\\\test\\\\aaa|C:\\\\orandea\\\\impl\\\\bbb'        ***    -55.13 %       ±2.06% ±2.76% ±3.64%

@monoblaine
Copy link
Author

monoblaine commented May 11, 2019

I've kept the two necessary String splits and removed all the other Array slices, concats and a call to fill. Perf seems to be better, but still worse than the original code. I don't know, maybe this wasn't a good idea in the first place. 😞

@addaleax
Copy link
Member

Hm, here’s a new benchmark CI run: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/369/

I think some performance impact is okay if it’s a correctness issue, but of course it’s nice to keep that minimal…

@monoblaine
Copy link
Author

Hey @addaleax thank you for the benchmark. Results look better and I was able to remove another slice call (and I promise this is my last attempt). Can you restart that job?

@addaleax
Copy link
Member

@Trott
Copy link
Member

Trott commented May 14, 2019

16:38:11                                                                                                    confidence improvement accuracy (*)   (**)  (***)
16:38:11  path/relative-win32.js n=100000 paths='C:\\\\|D:\\\\'                                                            -1.82 %       ±4.66% ±6.20% ±8.07%
16:38:11  path/relative-win32.js n=100000 paths='C:\\\\foo\\\\bar\\\\baz|C:\\\\foo\\\\bar\\\\baz'                           1.82 %       ±4.78% ±6.37% ±8.29%
16:38:11  path/relative-win32.js n=100000 paths='C:\\\\foo\\\\BAR\\\\BAZ|C:\\\\foo\\\\bar\\\\baz'                    *     -2.73 %       ±2.50% ±3.33% ±4.33%
16:38:11  path/relative-win32.js n=100000 paths='C:\\\\foo\\\\bar\\\\baz\\\\quux|C:\\\\'                           ***    -41.19 %       ±1.45% ±1.93% ±2.51%
16:38:11  path/relative-win32.js n=100000 paths='C:\\\\orandea\\\\test\\\\aaa|C:\\\\orandea\\\\impl\\\\bbb'        ***    -37.46 %       ±1.53% ±2.05% ±2.69%
16:38:11 
16:38:11 Be aware that when doing many comparisons the risk of a false-positive
16:38:11 result increases. In this case there are 5 comparisons, you can thus
16:38:11 expect the following amount of false-positive results:
16:38:11   0.25 false positives, when considering a   5% risk acceptance (*, **, ***),
16:38:11   0.05 false positives, when considering a   1% risk acceptance (**, ***),
16:38:11   0.01 false positives, when considering a 0.1% risk acceptance (***)

@nodejs-github-bot
Copy link
Collaborator

@Trott
Copy link
Member

Trott commented May 14, 2019

I'm OK with a performance hit on edge cases if it's for correctness. We can always improve performance subsequently.

@gireeshpunathil
Copy link
Member

can we have one or more reviews on this please?

@jasnell
Copy link
Member

jasnell commented Jul 7, 2020

This will need to be rebased in order to move forward

@jasnell jasnell added the stalled Issues and PRs that are stalled. label Jul 7, 2020
@monoblaine
Copy link
Author

@jasnell Done!

This commit changes the way two paths are compared in path.relative:
Instead of comparing each char code in path strings one by one, which
causes problems when the number of char codes in lowercased path string
does not match the original one (e.g. path contains certain Unicode
characters like 'İ'), it now splits the path string by backslash and
compares the parts instead.

Fixes: nodejs#27534
@mscdex
Copy link
Contributor

mscdex commented Jul 12, 2020

Current benchmark results still show significant regressions:

path/relative-win32.js n=100000 paths='C:\\\\foo\\\\bar\\\\baz\\\\quux|C:\\\\'                           ***    -40.87 %       ±2.27%  ±3.02%  ±3.94%
path/relative-win32.js n=100000 paths='C:\\\\orandea\\\\test\\\\aaa|C:\\\\orandea\\\\impl\\\\bbb'        ***    -38.16 %       ±2.47%  ±3.28%  ±4.28%

@monoblaine
Copy link
Author

Thank you all guys for your kindness and respect towards a first-time contributor! I'm afraid I currently don't have time to investigate any possible performance improvements (if there are any left). It's been more than a year and there's already a better fix in #27662 by @mscdex. So you should probably go with that.

@lundibundi
Copy link
Member

Closed as per comment above (#27644 (comment)). Feel free to reopen.

@lundibundi lundibundi closed this Aug 25, 2020
@monoblaine monoblaine deleted the utf-chars-in-path-relative branch July 20, 2024 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
path Issues and PRs related to the path subsystem. stalled Issues and PRs that are stalled.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UTF Characters for relative path
8 participants