src: replace naive search with naive + BMH in Buffer::IndexOf #2539

skomski · 2015-08-25T14:43:16Z

Adds the string search implementation from v8
which uses naive search if pattern length < 8
or to a specific badness then uses Boyer-Moore-Horspool

Added benchmark shows the expected improvements
Added option to use ucs2 encoding with Buffer::IndexOf

Since v8 is BSD licensed it should be no problem to use the code if credited.

buffers/buffer-indexof.js search=@ iter=1: ./iojs: 3523000 ./iojs_naive: 423220 .............................................................. 732.42%
buffers/buffer-indexof.js search=SQ iter=1: ./iojs: 822020 ./iojs_naive: 67540 .............................................................. 1117.08%
buffers/buffer-indexof.js search=10x iter=1: ./iojs: 3168900 ./iojs_naive: 664590 ............................................................ 376.82%
buffers/buffer-indexof.js search=--l iter=1: ./iojs: 167290 ./iojs_naive: 14779 ............................................................. 1031.95%
buffers/buffer-indexof.js search=Alice iter=1: ./iojs: 4228800 ./iojs_naive: 4458200 .......................................................... -5.15%
buffers/buffer-indexof.js search=Gryphon iter=1: ./iojs: 269360 ./iojs_naive: 14296 ......................................................... 1784.19%
buffers/buffer-indexof.js search=Panther iter=1: ./iojs: 218090 ./iojs_naive: 12590 ......................................................... 1632.31%
buffers/buffer-indexof.js search=Ou est ma chatte? iter=1: ./iojs: 99297 ./iojs_naive: 50987 .................................................. 94.75%
buffers/buffer-indexof.js search=found it very iter=1: ./iojs: 18249 ./iojs_naive: 11225 ...................................................... 62.57%
buffers/buffer-indexof.js search=among mad people iter=1: ./iojs: 31875 ./iojs_naive: 8616 ................................................... 269.95%
buffers/buffer-indexof.js search=neighbouring pool iter=1: ./iojs: 17349 ./iojs_naive: 4754.1 ................................................ 264.93%
buffers/buffer-indexof.js search=Soo--oop iter=1: ./iojs: 13625 ./iojs_naive: 12154 ........................................................... 12.10%
buffers/buffer-indexof.js search=aaaaaaaaaaaaaaaaa iter=1: ./iojs: 23441 ./iojs_naive: 4263.1 ................................................ 449.86%
buffers/buffer-indexof.js search=venture to go near the house till she had brought herself down to iter=1: ./iojs: 75611 ./iojs_naive: 20240 . 273.58%
buffers/buffer-indexof.js search=</i> to the Caterpillar iter=1: ./iojs: 26146 ./iojs_naive: 9589.9 .......................................... 172.64%

mscdex · 2015-08-25T14:56:31Z

+1

YurySolovyov · 2015-08-25T16:30:58Z

Would this make it into 4.0 ?

rvagg · 2015-08-26T01:03:43Z

/cc @trevnorris

trevnorris · 2015-08-26T05:27:19Z

src/node_buffer.cc

 void IndexOfString(const FunctionCallbackInfo<Value>& args) {
  ASSERT(args[1]->IsString());
  ASSERT(args[2]->IsNumber());

+  enum encoding enc = ParseEncoding(args.GetIsolate(),
+                                    args[3].As<String>(),


From JS you'll need to do if (!encoding) encoding = 'utf8';. Then from here add CHECK(args[3]->IsString()); just above this.

At least, if args[3].IsEmpty() then .As<String>() will abort in debug mode. Need to make sure it's a string first.

trevnorris · 2015-08-26T06:09:09Z

Not sure how I feel about benchmark/fixtures/alice.html, but also don't have a better idea.

Probably not for this PR, but I think a useful addition (especially now that performance will make this more usable) is the addition of a fromIndex argument. The signature may possibly be
indexOf(value[, fromIndex = 0][, encoding = 'utf8']).

Did a quick initial review. Also want @bnoordhuis to take a look.

@skomski Thanks much for the PR and for fixing my initial naive implementation.

skomski · 2015-08-26T08:01:05Z

@trevnorris Buffer.indexOf already takes the argument byteOffset. It would make sense to rename the current argument byteOffset to fromIndex.

bnoordhuis · 2015-08-26T13:25:15Z

src/string_search.h

+      return kLatin1AlphabetSize;
+    } else {
+      // UC16 needle.
+      return kUC16AlphabetSize;


Can you add a static_assert(sizeof(PatternChar) == sizeof(uint16_t), "sizeof(PatternChar) == sizeof(uint16_t)") here?

trevnorris · 2015-08-26T16:11:49Z

Buffer.indexOf already takes the argument byteOffset. It would make sense to rename the current argument byteOffset to fromIndex.

Oh yup. You're right. Missed the int32_t offset = args[2]->Int32Value(); since it was moved down further. Thanks for pointing it out.

As for changing fromIndex to byteOffset, do you mean just changing the name?

trevnorris · 2015-08-26T16:13:35Z

src/node_buffer.cc

-  node::Utf8Value str(args.GetIsolate(), args[1]);
-  int32_t offset_i32 = args[2]->Int32Value();
-  uint32_t offset;
+  Local<String> needle = args[1]->ToString();


When we've confirmed what the argument type is, use args[1].As<String>(); instead. Skips some internal type checking V8 performs.

skomski · 2015-08-27T14:13:11Z

Always specifying an encoding like below is too much of a performance drop.

+++ lib/buffer.js
@@ -399,6 +399,9 @@ Buffer.prototype.indexOf = function indexOf(val, byteOffset, encoding) {
     byteOffset = -0x80000000;
   byteOffset >>= 0;

+  if (encoding === undefined)
+    encoding = 'utf8';
+
   if (typeof val === 'string') {
     if (encoding === undefined ||
         encoding === 'ucs2' ||
diff --git src/node_buffer.cc src/node_buffer.cc
index ab9d69e..2650d82 100644
--- src/node_buffer.cc
+++ src/node_buffer.cc
@@ -785,9 +785,7 @@ void IndexOfString(const FunctionCallbackInfo<Value>& args) {
   ASSERT(args[1]->IsString());
   ASSERT(args[2]->IsNumber());

-  enum encoding enc =
-      args[3].IsEmpty() ? UTF8 : ParseEncoding(args.GetIsolate(),
-                                               args[3].As<String>(), UTF8);
+  enum encoding enc = ParseEncoding(args.GetIsolate(), args[3].As<String>(), UTF8);

   THROW_AND_RETURN_UNLESS_BUFFER(Environment::GetCurrent(args), args[0]);
   SPREAD_ARG(args[0], ts_obj);

buffers/buffer-indexof.js search=@ iter=1: ./iojs_fastenc: 3203100 ./iojs: 2550500 ............................................................. 25.59%
buffers/buffer-indexof.js search=SQ iter=1: ./iojs_fastenc: 735410 ./iojs: 698660 ............................................................... 5.26%
buffers/buffer-indexof.js search=10x iter=1: ./iojs_fastenc: 2731700 ./iojs: 2237400 ........................................................... 22.10%
buffers/buffer-indexof.js search=--l iter=1: ./iojs_fastenc: 162590 ./iojs: 155260 .............................................................. 4.72%
buffers/buffer-indexof.js search=Alice iter=1: ./iojs_fastenc: 3695700 ./iojs: 2179300 ......................................................... 69.58%
buffers/buffer-indexof.js search=Gryphon iter=1: ./iojs_fastenc: 242620 ./iojs: 245240 ......................................................... -1.07%
buffers/buffer-indexof.js search=Panther iter=1: ./iojs_fastenc: 179150 ./iojs: 154900 ......................................................... 15.66%
buffers/buffer-indexof.js search=Ou est ma chatte? iter=1: ./iojs_fastenc: 106870 ./iojs: 105720 ................................................ 1.08%
buffers/buffer-indexof.js search=found it very iter=1: ./iojs_fastenc: 19512 ./iojs: 19658 ..................................................... -0.74%
buffers/buffer-indexof.js search=among mad people iter=1: ./iojs_fastenc: 33341 ./iojs: 33526 .................................................. -0.55%
buffers/buffer-indexof.js search=neighbouring pool iter=1: ./iojs_fastenc: 18422 ./iojs: 18294 .................................................. 0.70%
buffers/buffer-indexof.js search=Soo--oop iter=1: ./iojs_fastenc: 14654 ./iojs: 14619 ........................................................... 0.24%
buffers/buffer-indexof.js search=aaaaaaaaaaaaaaaaa iter=1: ./iojs_fastenc: 24591 ./iojs: 24638 ................................................. -0.19%
buffers/buffer-indexof.js search=venture to go near the house till she had brought herself down to iter=1: ./iojs_fastenc: 79716 ./iojs: 79876 . -0.20%
buffers/buffer-indexof.js search=</i> to the Caterpillar iter=1: ./iojs_fastenc: 27657 ./iojs: 28041 ........................................... -1.37%

trevnorris · 2015-08-27T17:45:03Z

@skomski From your diff it looks like a default string is always sent. So there's no need to check args[3].IsEmpty().

Though I also have a hard time taking that at face value. The performance drop from that patch should be proportional to the number of time indexOf() is called. Which is not the case.

trevnorris · 2015-08-27T17:59:31Z

lib/buffer.js

+    if (encoding === undefined ||
+        encoding === 'ucs2' ||
+        encoding === 'utf16le' ||
+        encoding === 'utf8') {


the encoding is case insensitive. look at Buffer.isEncoding() to see how we do it now to speed up the common case.

trevnorris · 2015-09-24T18:36:49Z

src/node_buffer.cc

+    }
+
+    result = SearchString(reinterpret_cast<const uint16_t*>(haystack),
+                          haystack_length / 2,


Is it a concern if user does this on an odd length buffer?

trevnorris · 2015-09-24T19:33:59Z

Looks great. Amazing set of tests. If CI is happy then LGTM.

CI: https://ci.nodejs.org/job/node-test-pull-request/374/

jasnell · 2015-10-08T03:59:59Z

Hmm.. previous test run on this failed but it's not obvious if the failure was related. Running another CI run just to make sure. If it passes, will land in master and on v4.x

https://ci.nodejs.org/job/node-test-pull-request/448/

jasnell · 2015-10-08T04:03:00Z

Some lint issues (https://ci.nodejs.org/job/node-linter/893/). I'll fix those manually when I land

Adds the string search implementation from v8 which uses naive search if pattern length < 8 or to a specific badness then uses Boyer-Moore-Horspool Added benchmark shows the expected improvements Added option to use ucs2 encoding with Buffer::IndexOf Reviewed-By: James M Snell <[email protected]> Reviewed-By: Trevor Norris <[email protected]> PR-URL: #2539

jasnell · 2015-10-08T04:21:15Z

Landed in a18dd7b

Adds the string search implementation from v8 which uses naive search if pattern length < 8 or to a specific badness then uses Boyer-Moore-Horspool Added benchmark shows the expected improvements Added option to use ucs2 encoding with Buffer::IndexOf Reviewed-By: James M Snell <[email protected]> Reviewed-By: Trevor Norris <[email protected]> PR-URL: #2539

jasnell · 2015-10-09T13:29:11Z

This change has introduced a regression on Big-Endian machines. Buffer.indexOf with a string parameter is now broken. This is blocking #3258 now and needs a fix before we can progress.

Also, given that this adds a parameter to buffer.indexOf, it really ought to have been marked semver-minor. My bad as I missed that in my review.

@skomski, I believe that the issue has to do with the byte ordering of the string value being passed in to Buffer.indexOf but haven't yet been able to test my theory. A bit later today I'll be able to get into the BE machine and test the theory but if you happen to get in there sooner, please let me know.

skomski · 2015-10-09T16:30:26Z

@jasnell Pushed a fix at #3295

jasnell · 2015-10-09T16:31:15Z

Ok thank you! I'll be able to look shortly.
On Oct 9, 2015 9:30 AM, "Karl Skomski" [email protected] wrote:

@jasnell https://github.com/jasnell Pushed a fix at #3295
#3295

—
Reply to this email directly or view it on GitHub
#2539 (comment).

jasnell · 2015-10-09T18:19:03Z

@mhdawson ... Can you look at this and verify that it fixes the problem you
were seeing. Looks like it should after a quick review
On Oct 9, 2015 9:31 AM, "James M Snell" [email protected] wrote:

Ok thank you! I'll be able to look shortly.
On Oct 9, 2015 9:30 AM, "Karl Skomski" [email protected] wrote:

@jasnell https://github.com/jasnell Pushed a fix at #3295
#3295

—
Reply to this email directly or view it on GitHub
#2539 (comment).

Fishrock123 added buffer Issues and PRs related to the buffer subsystem. c++ Issues and PRs that require attention from people who are familiar with C++. labels Aug 25, 2015

trevnorris reviewed Aug 26, 2015
View reviewed changes

skomski force-pushed the add-fast-string-search branch 2 times, most recently from 0aa8558 to decd947 Compare August 26, 2015 07:45

bnoordhuis reviewed Aug 26, 2015
View reviewed changes

trevnorris reviewed Aug 26, 2015
View reviewed changes

skomski force-pushed the add-fast-string-search branch 2 times, most recently from e59db22 to abb35e9 Compare August 27, 2015 14:01

skomski force-pushed the add-fast-string-search branch from abb35e9 to 07bc1e5 Compare August 27, 2015 14:30

trevnorris reviewed Aug 27, 2015
View reviewed changes

skomski force-pushed the add-fast-string-search branch from 9369e24 to b2088ce Compare September 21, 2015 19:30

trevnorris reviewed Sep 24, 2015
View reviewed changes

trevnorris added the land-on-v4.x label Sep 24, 2015

jasnell closed this Oct 8, 2015

jasnell added the lts-landed-on-v4.x label Oct 8, 2015

jasnell mentioned this pull request Oct 8, 2015

Release proposal: v4.2.0 'Argon' (LTS) #3258

Closed

mhdawson mentioned this pull request Oct 8, 2015

test-buffer-indexof.js failing on pLinux BE #3283

Closed

jasnell added the semver-minor PRs that contain new features and should be released in the next minor version. label Oct 9, 2015

MylesBorins removed the land-on-v4.x label Nov 16, 2015

ZYSzys mentioned this pull request Mar 7, 2019

buffer: improve performance of buffer indexof method #26400

Closed

2 tasks

solebox mentioned this pull request Nov 27, 2023

[Snyk] Fix for 3 vulnerabilities solebox/node#33

Open

UbuntuEvangelist mentioned this pull request Nov 27, 2023

[Snyk] Fix for 3 vulnerabilities UbuntuEvangelist/node#28

Open

s4sc mentioned this pull request Nov 28, 2023

[Snyk] Fix for 3 vulnerabilities s4sc/node#32

Open

enterstudio mentioned this pull request Nov 29, 2023

[Snyk] Fix for 3 vulnerabilities enterstudio/node#32

Open

leonardoadame mentioned this pull request Dec 19, 2023

[Snyk] Fix for 4 vulnerabilities leonardoadame/node#583

Open

ryan-ally mentioned this pull request Dec 19, 2023

[Snyk] Fix for 6 vulnerabilities ryan-ally/node#98

Open

aliscco mentioned this pull request Dec 20, 2023

[Snyk] Fix for 6 vulnerabilities aliscco/alisco-node#1174

Open

baby636 mentioned this pull request Dec 20, 2023

[Snyk] Fix for 6 vulnerabilities baby636/node#124

Open

This was referenced Dec 20, 2023

[Snyk] Fix for 6 vulnerabilities MaxMood96/node#40

Open

[Snyk] Fix for 6 vulnerabilities Mement-Mori/node#36

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src: replace naive search with naive + BMH in Buffer::IndexOf #2539

src: replace naive search with naive + BMH in Buffer::IndexOf #2539

skomski commented Aug 25, 2015

mscdex commented Aug 25, 2015

YurySolovyov commented Aug 25, 2015

rvagg commented Aug 26, 2015

trevnorris Aug 26, 2015

trevnorris Aug 26, 2015

trevnorris commented Aug 26, 2015

skomski commented Aug 26, 2015

bnoordhuis Aug 26, 2015

trevnorris commented Aug 26, 2015

trevnorris Aug 26, 2015

skomski commented Aug 27, 2015

trevnorris commented Aug 27, 2015

trevnorris Aug 27, 2015

trevnorris Sep 24, 2015

trevnorris commented Sep 24, 2015

jasnell commented Oct 8, 2015

jasnell commented Oct 8, 2015

jasnell commented Oct 8, 2015

jasnell commented Oct 9, 2015

skomski commented Oct 9, 2015

jasnell commented Oct 9, 2015

jasnell commented Oct 9, 2015

src: replace naive search with naive + BMH in Buffer::IndexOf #2539

src: replace naive search with naive + BMH in Buffer::IndexOf #2539

Conversation

skomski commented Aug 25, 2015

mscdex commented Aug 25, 2015

YurySolovyov commented Aug 25, 2015

rvagg commented Aug 26, 2015

trevnorris Aug 26, 2015

Choose a reason for hiding this comment

trevnorris Aug 26, 2015

Choose a reason for hiding this comment

trevnorris commented Aug 26, 2015

skomski commented Aug 26, 2015

bnoordhuis Aug 26, 2015

Choose a reason for hiding this comment

trevnorris commented Aug 26, 2015

trevnorris Aug 26, 2015

Choose a reason for hiding this comment

skomski commented Aug 27, 2015

trevnorris commented Aug 27, 2015

trevnorris Aug 27, 2015

Choose a reason for hiding this comment

trevnorris Sep 24, 2015

Choose a reason for hiding this comment

trevnorris commented Sep 24, 2015

jasnell commented Oct 8, 2015

jasnell commented Oct 8, 2015

jasnell commented Oct 8, 2015

jasnell commented Oct 9, 2015

skomski commented Oct 9, 2015

jasnell commented Oct 9, 2015

jasnell commented Oct 9, 2015