Add fast path skipping UTF8 length counting by gaearon · Pull Request #2819 · bluesky-social/atproto

gaearon · 2024-09-14T17:29:07Z

What

Similar to #2817, I'm trying to avoid calling into new TextEncoder().encode(str).byteLength for every string. After this change, I basically don't hit it in the app at all: the fast path always lets me out early.

The fast path itself is pretty general. The idea is that .length counts UTF-16 code units, and each UTF-16 code unit corresponds to at most 3 bytes in UTF-8 encoding. So we can safely use value.length * 3 as an upper bound on what utf8Len(value) could possibly be. If this upper bound is below minLength, so is utf8Len, and the string can be rejected right away. If this upper bound is within maxLength, so is utf8Len, and the maxLength check passes without encoding anything.
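To make the two shortcuts concrete, here is a minimal sketch of how such a check could look. The names checkUtf8Length and LengthLimits are hypothetical and chosen for illustration; this is not the code from the PR, only the same idea under the assumption that utf8Len is implemented with TextEncoder.

```typescript
// Hypothetical sketch of the fast path; names are illustrative only.
interface LengthLimits {
  minLength?: number
  maxLength?: number
}

// Slow path: actually encode the string and count the bytes.
function utf8Len(value: string): number {
  return new TextEncoder().encode(value).byteLength
}

function checkUtf8Length(value: string, limits: LengthLimits): boolean {
  const { minLength, maxLength } = limits
  // Each UTF-16 code unit becomes at most 3 bytes in UTF-8.
  const upperBound = value.length * 3

  // Even the worst case is shorter than minLength, so the real length is too.
  if (minLength !== undefined && upperBound < minLength) {
    return false
  }

  // No minLength to verify, and even the worst case fits within maxLength.
  if (
    minLength === undefined &&
    (maxLength === undefined || upperBound <= maxLength)
  ) {
    return true
  }

  // The bound was inconclusive: fall back to the exact byte count.
  const len = utf8Len(value)
  if (minLength !== undefined && len < minLength) return false
  if (maxLength !== undefined && len > maxLength) return false
  return true
}
```

With this shape, a short string checked against a generous maxLength never reaches the encoder; only strings whose worst-case size straddles a limit pay for the exact count.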

Why * 3?

Codepoints that fit into a single UTF-16 code unit become 1 to 3 bytes in UTF-8. (Worst case is 3x.)
Codepoints that need two UTF-16 code units become 4 bytes in UTF-8. (Worst case is 2x.)

So .length * 3 should always give us a valid upper bound. But this needs a look from an expert.
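A quick way to sanity-check the bound is to encode one sample from each size class and compare. The sketch below is illustrative only (the samples table and withinBound helper are made up for this example). The lone surrogate case is an addition not discussed above: TextEncoder replaces an unpaired surrogate with U+FFFD, which encodes to 3 bytes, so it still fits the 3x bound.

```typescript
const encoder = new TextEncoder()

// [sample string, expected UTF-8 byte length]
const samples: Array<[string, number]> = [
  ['a', 1], // U+0061: 1 code unit -> 1 byte
  ['\u00e9', 2], // U+00E9 (e with acute): 1 code unit -> 2 bytes
  ['\u20ac', 3], // U+20AC (euro sign): 1 code unit -> 3 bytes (worst case, 3x)
  ['\ud83d\ude00', 4], // U+1F600: 2 code units -> 4 bytes (2x)
  ['\ud800', 3], // lone surrogate: replaced by U+FFFD -> 3 bytes
]

// True when the real UTF-8 length respects the .length * 3 upper bound.
function withinBound(s: string): boolean {
  return encoder.encode(s).byteLength <= s.length * 3
}

for (const [s, expectedBytes] of samples) {
  const actual = encoder.encode(s).byteLength
  console.log(s.length, actual, actual === expectedBytes, withinBound(s))
}
```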

I've added some test cases.

bnewbold · 2024-09-16T19:18:33Z

this seems reasonable, though I should probably re-read more carefully and maybe cook up more corner-cases. I kind of suspect that it won't be as much of a win as the earlier grapheme cluster and utf8 caching patch though? I guess UTF-16 to UTF-8 does cost something though, and this probably does help with the happy path, and we do a lot of these, hrm.

devinivy

Good thinkin! Re: the factor of 3 in here, I am quite sure that checks out.

gaearon · 2024-12-10T19:44:46Z

Alright, had to rebase to account for stylistic changes in #2817, but let's land this. Re: perf impact, in React Native calling into encoder/decoder goes between JS and native so it's not guaranteed to be super cheap and it would just be nice to not worry about it for the common case.

gaearon requested review from bnewbold, devinivy, dholms and pfrazee September 14, 2024 17:29

gaearon changed the title ~~Add fast path for UTF8 length counting~~ Add fast path skipping UTF8 length counting Sep 14, 2024

devinivy approved these changes Oct 1, 2024

gaearon added 3 commits December 10, 2024 19:14

Harden UTF8 length test cases

8f9becc

Harden tests to account for new fast path

6488907

Add fast paths that skip UTF8 encoding

d7cc7d2

gaearon force-pushed the len-opt-utf8 branch from 9337c2b to d7cc7d2 December 10, 2024 19:37

gaearon merged commit 5ade78d into main Dec 10, 2024

gaearon deleted the len-opt-utf8 branch December 10, 2024 19:45

gaearon added a commit that referenced this pull request Dec 10, 2024

Add changeset for #2819

d895dd1

gaearon added a commit that referenced this pull request Dec 10, 2024

Add changeset for #2819 (#3223)

9fd65ba

gaearon merged 3 commits into main from len-opt-utf8
