Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffer.indexOf #4

Closed
ronag opened this issue Nov 14, 2022 · 11 comments
Closed

Buffer.indexOf #4

ronag opened this issue Nov 14, 2022 · 11 comments

Comments

@ronag
Copy link
Member

ronag commented Nov 14, 2022

It seems Buffer.indexOf is much slower than Uint8Array.indexOf. Not sure why we implement our own indexOf instead of just using the base class indexOf.

╔════════════════════╤═════════╤═══════════════════╤═══════════╗
║ Slower tests       │ Samples │            Result │ Tolerance ║
╟────────────────────┼─────────┼───────────────────┼───────────╢
║ Buffer.indexOf     │   10000 │ 7600568.83 op/sec │  ± 1.27 % ║
╟────────────────────┼─────────┼───────────────────┼───────────╢
║ Fastest test       │ Samples │            Result │ Tolerance ║
╟────────────────────┼─────────┼───────────────────┼───────────╢
║ Uint8Array.indexOf │   10000 │ 9337722.06 op/sec │  ± 5.53 % ║
╚════════════════════╧═════════╧═══════════════════╧═══════════╝
import cronometro from 'cronometro'
import assert from 'node:assert'

const BUF = Buffer.allocUnsafe(64)
for (let n = 0; n < 64; ++n) {
  BUF[n] = n
}

const Uint8ArrayIndexOf = Uint8Array.prototype.indexOf

cronometro({
  'Buffer.indexOf'() {
    const idx = BUF.indexOf(44)
    assert(idx >= 0)
  },
  'Uint8Array.indexOf'() {
    const idx = Uint8ArrayIndexOf.call(BUF, 44)
    assert(idx >= 0)
  },
})
@anonrig
Copy link
Member

anonrig commented Nov 14, 2022

Buffer.indexOf mainly exists to support different encodings, since different encodings have different char sizes, and from my understanding, Buffer.indexOf handles that correctly. For this particular case, you are using UTF-8, where the char size is 8, therefore, you don't need to take the complex logic of encoding offsets, and just go with Uint8Array. There might be a good optimization path, and just have a fast path for UTF8 on indexOf and this would improve this particular benchmark.

@ronag ronag added the good first issue Good for newcomers label Nov 14, 2022
@tony-go
Copy link
Member

tony-go commented Nov 18, 2022

Hey 👋

Regarding what I read in this issue, I thought that having something like that:

if (encoding === 'utf8' || encoding === 'utf-8') {
  return Uint8ArrayPrototype.indexOf.call(buffer, val, byteOffset);
 }

in https://github.com/nodejs/node/blob/main/lib/buffer.js#L943 could make the trick.

Am I correct?

@anonrig
Copy link
Member

anonrig commented Nov 18, 2022

Yes, but do not forgot also check for encoding === undefined, since if it's undefined, the default value is utf8

@ronag
Copy link
Member Author

ronag commented Nov 19, 2022

You also need to check the type of the search value.

@tony-go
Copy link
Member

tony-go commented Nov 19, 2022

I give it a first try: nodejs/node@main...tony-go:node:buff-utf8-fast-path

It doesn't work, even with this simple test:

const buf = Buffer.from('abcde');

const index = buf.indexOf('b');
console.log('index of b is:', index); // "index of b is: -1"

Whereas when I do this, it works perfectly:

const buf = Buffer.from('abcde');

const index = buf.indexOf('b');
console.log(index); // 1

const index2 = Uint8Array.prototype.indexOf.call(buf, 'b');
console.log(index); // 1

I didn't have more time than that to investigate today, I'll continue later

@ronag
Copy link
Member Author

ronag commented Nov 19, 2022

Val needs to be a number not a string

@tony-go
Copy link
Member

tony-go commented Nov 19, 2022

Val needs to be a number not a string

I misunderstood this: "where the char size is 8", making me think that it was for string 🙈

But yeah, when we look at your benchmark you only use numbers.

Thanks 🙌

@tony-go
Copy link
Member

tony-go commented Nov 20, 2022

I have an issue with these tests:

//  Test truncation of Number arguments to uint8
{                                                           
  const buf = Buffer.from('this is a test');
  assert.strictEqual(buf.indexOf(0x6973), 3);
  assert.strictEqual(buf.indexOf(0x697320), 4);
  assert.strictEqual(buf.indexOf(0x69732069), 2);
  assert.strictEqual(buf.indexOf(0x697374657374), 0);
  assert.strictEqual(buf.indexOf(0x69737374), 0);
  assert.strictEqual(buf.indexOf(0x69737465), 11);
  assert.strictEqual(buf.indexOf(-140), 0); // <======= here!
  assert.strictEqual(buf.indexOf(-152), 1); 
  assert.strictEqual(buf.indexOf(0xff), -1);
  assert.strictEqual(buf.indexOf(0xffff), -1);
}

I updated my code to handle the positive hex ones:

const isUtf8 = encoding === 'utf8' || encoding === 'utf-8' || encoding === undefined;
if (isUtf8 && typeof val === 'number') {
   const method = dir ? 'indexOf' : 'lastIndexOf';
   let value = val % 256;
   return Uint8ArrayPrototype[method].call(buffer, value, byteOffset);                                            }

Indeed, I don't really know how to truncate the number into uint8.

Also, assert.strictEqual(buf.indexOf(-140), 0); don't understand why -140 point to zero when char code for t is equal to 116.

@ronag
Copy link
Member Author

ronag commented Nov 22, 2022

It seems neither V8 nor Node uses simd for this...

Something like:

// 0 for not found, non-zero for found.  (Bit position tells you where).
unsigned contains(__m128i data, uint8_t needle) {
    __m128i k = _mm_set1_epi8(needle);
    __m128i cmp = _mm_cmpeq_epi8(data, k);  // vector mask
    return _mm_movemask_epi8(cmp);          // integer bitmask
}

Could provide a significant speedup.

@tony-go
Copy link
Member

tony-go commented Nov 23, 2022

Thanks for sharing @ronag 👍

Is that: https://www.cs.virginia.edu/~cr4bd/3330/S2018/simdref.html what you refer?

@tony-go
Copy link
Member

tony-go commented Nov 25, 2022

A bit stuck here: https://github.com/nodejs/node/pull/45627/files#r1032668045

Maybe I could cast it with bitwise, I'll browse the web.

EDIT: I think I found something

@anonrig anonrig closed this as completed Nov 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants