Fast byteLength() #333

jamiebuilds · 2024-07-25T23:58:47Z

What problem are you trying to solve?

new TextEncoder().encode(input).byteLength is an order of magnitude slower than alternatives, including Node's Buffer.byteLength(input) and even handwritten JavaScript implementations.

Benchmarks

./benchmarks/blob.js:              202’345.0 ops/sec (± 13’993.9, p=0.001, o=0/100)
./benchmarks/buffer.js:         57’434’701.2 ops/sec (±425’763.3, p=0.001, o=9/100) severe outliers=5
./benchmarks/implementation.js: 48’441’909.6 ops/sec (±397’249.6, p=0.001, o=5/100) severe outliers=2
./benchmarks/textencoder.js:     2’667’052.4 ops/sec (±564’727.5, p=0.001, o=6/100) severe outliers=2

My benchmark repo includes a JS implementation that I believe is at least close enough to correct for benchmarking purposes, although I'm no expert in UTF-16 so there may be some mistakes.

What solutions exist today?

new Blob([input]).size
new TextEncoder(input).byteLength
Buffer.byteLength(input) (Node only)
Implementations in JS

How would you solve it?

let encoder = new TextEncoder()
let byteLength = encoder.byteLength("Hello, World!")
// >> 13

Anything else?

No response

The text was updated successfully, but these errors were encountered:

WebReflection · 2024-07-26T07:46:15Z

I am not 100% this is correct ... but ... it's also pretty slow and I start wondering if the slowness doesn't come directly from string internal code-points:

"use strict"
module.exports = (input) => {
  let total = 0;
  for (const c of input) {
    const p = c.codePointAt(0);
    if (p < 0x80) total += 1;
    else if (p < 0x800) total += 2;
    else total += (p & 0xD800) ? 4 : 3;
  }
  return total;
};

Results on my laptop:

./benchmarks/blob.js:           405’174.6 ops/sec (±5’563.9, p=0.001, o=6/100) severe outliers=2
./benchmarks/buffer.js:         45’447’421.7 ops/sec (±659’453.6, p=0.001, o=0/100)
./benchmarks/codepoint.js:      15’096’778.8 ops/sec (±185’463.1, p=0.001, o=0/100)
./benchmarks/implementation.js: 65’565’103.6 ops/sec (±1’127’578.0, p=0.001, o=4/100) severe outliers=2
./benchmarks/textencoder.js:    2’698’465.4 ops/sec (±97’198.0, p=0.001, o=0/100)

jakearchibald · 2024-07-26T08:02:44Z

@jamiebuilds can you give usecase(s) where you only care about the byte length and don't need the encoded data?

WebReflection · 2024-07-26T08:02:53Z

Did some extra test to verify if the buffer creation is the reason for such slowdown and indeed this proves it:

new buffer each time

"use strict"
let input = require("../input")
let encoder = new TextEncoder()
module.exports = () => {
  // size as worst case scenario
  const ui8Array = new Uint8Array(input.length * 4);
  return encoder.encodeInto(input, ui8Array).written;
}

This is still faster than encode(input).byteLength:

./benchmarks/textencoder.js:    3’329’442.3 ops/sec (±174’287.7, p=0.001, o=0/100)

Now, if there is no new buffer creation at all:

"use strict"
let input = require("../input")
let encoder = new TextEncoder()
// size as worst case scenario
const ui8Array = new Uint8Array(input.length * 4);
module.exports = () => {
  return encoder.encodeInto(input, ui8Array).written;
}

The result is better than code points loop:

./benchmarks/textencoder.js:    23’922’510.0 ops/sec (±547’321.7, p=0.001, o=4/100) severe outliers=3

I suppose a method to just count bytes length would make it possible to have performance closer to NodeJS buffer.

jamiebuilds · 2024-07-26T18:09:20Z

@jakearchibald I work on an end-to-end encrypted messaging app where we can't inspect the types of payloads being sent between clients on the server, so there are many places where we need to enforce a max byte length on the client to prevent certain types of abuse overloading client apps.

Right now we mostly do encode the data in Node buffers but found that it would be more efficient to catch these things earlier and have the option of dropping payloads that are too large before we start doing anything with that data.

After implementing some of this though, I actually found an even better way of doing this:

function maxLimitCheck(maxByteSize: number) {
	let encoder = new TextEncoder()
  let maxSizeArray = new Uint8Array(maxByteSize + 4)
  return (input: string): boolean => {
    return encoder.encodeInto(input, maxSizeArray).written < maxByteSize
  }
}

let check = maxLimitCheck(5e6) // 5MB

check("a".repeat(5)) // true
check("a".repeat(5e6)) // true
check("a".repeat(5e6 - 1) + "¢") // true
check("a".repeat(5e6 + 1)) // false
check("a".repeat(2 ** 29 - 24)) // false

Testing this out in my benchmark repo with the max size array enforcing a couple different limits:

./benchmarks/blob.js:                        4.8 ops/sec (±0.1, p=0.001, o=0/10)
./benchmarks/buffer.js:                     54.5 ops/sec (±3.0, p=0.001, o=0/10)
./benchmarks/implementation.js:              0.7 ops/sec (±0.0, p=0.001, o=0/10)
./benchmarks/textencoder.js:                11.9 ops/sec (±1.0, p=0.001, o=0/10)

5MB:
6’318.7 ops/sec (±743.3, p=0.001, o=8/100) severe outliers=6

50MB:
551.8 ops/sec (±7.6, p=0.001, o=7/100) severe outliers=4

500MB:
51.5 ops/sec (±4.6, p=0.001, o=6/100) severe outliers=4

I still believe this is a useful function to have, there are more than 10k results for Buffer.byteLength( on GitHub (which looking around mostly seem like strings being passed in, although the API accepts Buffers and other typed arrays too).

Seems like a lot of people are using it for Content-Length headers too

annevk · 2024-08-12T12:35:38Z

I'm a little worried about offering this API as it encourages decoupling the length from the data which can have all kinds of unintended consequences.

Bad legacy APIs such as Content-Length are a reasonable use case though.

We'd only support UTF-8 and as such it also seems like it would be quite straightforward (modulo surrogates) to implement this API yourself in script as you did. Still, we might as well expose it.

(I also considered prefixing with unsafe due to the bad nature of the API, but I can't quite convince myself it's that bad. An instance long long byteLength(DOMString input) method seems reasonable enough.)

cc @mathiasbynens @hsivonen

andreubotella · 2024-08-12T12:46:14Z

I suspect that this API might be used to determine the length of the buffer to use for encodeInto(), which it is not a good fit for, since it would iterate through the string twice unnecessarily. If we think it is useful to add this API, maybe we should also add another maxByteLength method that simply returns 3 * string.length, to avoid steering developers away from the right usage patterns.

mathiasbynens · 2024-08-12T13:14:20Z

In Chrome DevTools we had a need for this functionality and implemented it here: https://source.chromium.org/chromium/chromium/src/+/main:third_party/devtools-frontend/src/front_end/core/platform/StringUtilities.ts;l=216-243;drc=35e203b9cb7890f5bd6ac3f818f01744728ec398 (patch)

jamiebuilds · 2024-08-30T17:44:04Z

Maybe something that indicates that byteLength() API is not constant-time operation the way that .length generally is. This might discourage cases like twice-iterating through the string. Something like countByteLength()

jamiebuilds added addition/proposal New features or enhancements needs implementer interest Moving the issue forward requires implementers to express interest labels Jul 25, 2024

jamiebuilds mentioned this issue Jul 29, 2024

Improve performance of maxBytes() fabian-hiller/valibot#749

Open

annevk added the topic: api label Aug 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fast byteLength() #333

Fast byteLength() #333

jamiebuilds commented Jul 25, 2024

WebReflection commented Jul 26, 2024

jakearchibald commented Jul 26, 2024

WebReflection commented Jul 26, 2024

jamiebuilds commented Jul 26, 2024 •

edited

Loading

annevk commented Aug 12, 2024

andreubotella commented Aug 12, 2024 •

edited

Loading

mathiasbynens commented Aug 12, 2024 •

edited

Loading

jamiebuilds commented Aug 30, 2024

Fast byteLength() #333

Fast byteLength() #333

Comments

jamiebuilds commented Jul 25, 2024

What problem are you trying to solve?

What solutions exist today?

How would you solve it?

Anything else?

WebReflection commented Jul 26, 2024

jakearchibald commented Jul 26, 2024

WebReflection commented Jul 26, 2024

jamiebuilds commented Jul 26, 2024 • edited Loading

annevk commented Aug 12, 2024

andreubotella commented Aug 12, 2024 • edited Loading

mathiasbynens commented Aug 12, 2024 • edited Loading

jamiebuilds commented Aug 30, 2024

jamiebuilds commented Jul 26, 2024 •

edited

Loading

andreubotella commented Aug 12, 2024 •

edited

Loading

mathiasbynens commented Aug 12, 2024 •

edited

Loading