
RecyclerBytesStreamOutput using absolute offsets #140303

Merged
DaveCTurner merged 4 commits into elastic:main from
DaveCTurner:2026/01/07/RecyclerBytesStreamOutput-absolute-offset
Jan 9, 2026

Conversation

@DaveCTurner
Contributor

Today `RecyclerBytesStreamOutput` works with a `BytesRef currentPage`
combined with an `int currentPageOffset` tracking the offset within that
page, manipulated such that `currentPageOffset ≤ pageSize`. In practice
we need the absolute offset `currentPage.offset + currentPageOffset`
fairly often, and it is slightly more efficient to track this value
instead and compute a new upper bound for each new page.

Microbenchmarking indicates that this change gives meaningful savings
away from page boundaries, and doesn't make any of the boundary-crossing
cases appreciably worse:

Before:

Benchmark                                                        Mode  Cnt     Score     Error  Units
RecyclerBytesStreamOutputBenchmark.writeByte                     avgt    3  2200.652 ± 186.561  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytes                    avgt    3    56.122 ±   4.262  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary  avgt    3    67.555 ±   3.486  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage           avgt    3  1563.307 ± 185.027  ns/op
RecyclerBytesStreamOutputBenchmark.writeString                   avgt    3   884.288 ±  15.576  ns/op
RecyclerBytesStreamOutputBenchmark.writeVInt                     avgt    3  2517.240 ±  30.936  ns/op

After:

Benchmark                                                        Mode  Cnt     Score    Error  Units
RecyclerBytesStreamOutputBenchmark.writeByte                     avgt    3  1772.697 ± 10.986  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytes                    avgt    3    44.298 ±  0.072  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary  avgt    3    67.949 ±  1.256  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage           avgt    3  1571.635 ± 12.979  ns/op
RecyclerBytesStreamOutputBenchmark.writeString                   avgt    3   798.606 ± 15.008  ns/op
RecyclerBytesStreamOutputBenchmark.writeVInt                     avgt    3  2374.194 ±  4.405  ns/op

Today `RecyclerBytesStreamOutputTests` only covers the case where the
recycler supplies a buffer with zero offset, potentially missing bugs
that only arise with slices of a larger pool of buffers.

This commit strengthens these tests to verify the behaviour when using a
slice of a larger pool, including verification that we never write
outside our buffer and that we do not attempt to read from the buffer
after it is released.
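The bookkeeping change can be illustrated with a minimal plain-Java sketch. The names `writeByteRelative` and `writeByteAbsolute` are hypothetical and this is not the actual `RecyclerBytesStreamOutput` code; it only shows the shape of the two schemes: before, every write adds the page's offset to a page-relative position; after, the writer carries the absolute index directly and only recomputes the upper bound when it acquires a new page.

```java
// Sketch of the two offset-tracking schemes (hypothetical names, not the
// real RecyclerBytesStreamOutput). The page is a non-zero-offset slice of
// a larger pooled buffer, as the strengthened tests now exercise.
public class OffsetSketch {

    // Before: page-relative offset, so every write pays an extra addition.
    static int writeByteRelative(byte[] buf, int pageOffset, int currentPageOffset, byte b) {
        buf[pageOffset + currentPageOffset] = b; // add on every single write
        return currentPageOffset + 1;            // bounded by pageSize
    }

    // After: absolute index into the underlying buffer; the upper bound
    // (pageOffset + pageSize) is computed once per page instead.
    static int writeByteAbsolute(byte[] buf, int currentAbsoluteOffset, byte b) {
        buf[currentAbsoluteOffset] = b;          // no per-write addition
        return currentAbsoluteOffset + 1;        // bounded by pageOffset + pageSize
    }

    public static void main(String[] args) {
        byte[] a = new byte[32];
        byte[] b = new byte[32];
        int pageOffset = 8; // the page starts at index 8 of the pool
        int rel = 0;
        int abs = pageOffset;
        for (byte v = 1; v <= 4; v++) {
            rel = writeByteRelative(a, pageOffset, rel, v);
            abs = writeByteAbsolute(b, abs, v);
        }
        // Both schemes produce identical bytes; only the per-write
        // arithmetic differs.
        assert java.util.Arrays.equals(a, b);
        System.out.println("ok");
    }
}
```

Away from page boundaries this saves one addition per write, which is consistent with the benchmark showing gains on `writeByte`/`writeBytes` but no change on the boundary-crossing cases.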
@DaveCTurner DaveCTurner requested a review from a team as a code owner January 7, 2026 17:54
@DaveCTurner DaveCTurner added >non-issue :Distributed/Network Http and internode communication implementations v9.4.0 labels Jan 7, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Jan 7, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DaveCTurner
Contributor Author

Note this branch also includes #140263, which needs to land first. There are no actual test changes to merge here.

@mhl-b
Contributor

mhl-b commented Jan 7, 2026

Do you know why these two have a large error, while the others stay the same after the change?

RecyclerBytesStreamOutputBenchmark.writeByte                     avgt    3  2200.652 ± 186.561  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage           avgt    3  1563.307 ± 185.027  ns/op

@DaveCTurner
Contributor Author

Do you know why these two have a large error, while the others stay the same after the change?

Not a clue. I ran these on an otherwise-quiet n2-standard-16 instance on GCP but I expect it doesn't take much to disturb the timings in such a tight loop.

Contributor

@mhl-b mhl-b left a comment

My understanding is that this version removes a single instruction per single-byte copy or slice copy. So for cases with larger slices and strings there should be no noticeable improvement, because this single computation is amortized over the size of the slice.

        }
    }

    private record BenchmarkRecycler(AtomicReference<BytesRef> bytesRef) implements Recycler<BytesRef> {
Contributor

I think the allocating recycler introduces noise. Since this benchmark does not care about a real production recycler, I would preallocate all the pages so that obtaining a page takes the least possible work during a benchmark iteration.
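The suggestion above could be sketched in plain Java like this. `Page` and `PreallocatedRecycler` are hypothetical names, not the benchmark's actual `Recycler<BytesRef>` implementation; the idea is just to carve all pages out of one backing array up front so that obtaining a page is a stack pop with no allocation, while still handing out non-zero-offset slices like a pooled recycler would.

```java
import java.util.ArrayDeque;

// Plain-Java analogue of a preallocating recycler (hypothetical types).
// All pages share one backing array, so obtain()/release() do no
// allocation during a benchmark iteration.
public class PreallocSketch {

    record Page(byte[] bytes, int offset, int length) {}

    static final class PreallocatedRecycler {
        private final ArrayDeque<Page> free = new ArrayDeque<>();

        PreallocatedRecycler(int pageCount, int pageSize) {
            byte[] pool = new byte[pageCount * pageSize]; // one backing array
            for (int i = 0; i < pageCount; i++) {
                // Pages are slices with non-zero offsets, like a real pool.
                free.push(new Page(pool, i * pageSize, pageSize));
            }
        }

        Page obtain() {
            return free.pop(); // O(1), no allocation
        }

        void release(Page p) {
            free.push(p); // return the page to the pool for reuse
        }
    }

    public static void main(String[] args) {
        PreallocatedRecycler recycler = new PreallocatedRecycler(4, 16);
        Page p = recycler.obtain();
        assert p.length() == 16;
        recycler.release(p);
        System.out.println("ok");
    }
}
```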

Contributor Author

I'm just using the benchmark that already exists (the only change here is to correct the `BytesRef#length` values). Are you saying you think that a different benchmark would lead you to a different conclusion?

Contributor

The change looks fine; it does a bit less work than before. I'm wondering how much of the claimed savings we actually achieve given such large error margins. I think for this kind of fine tuning we'd be better off with cleaner measurements.

Contributor Author

I looked some more at this and I don't think the benchmark is doing much allocation. Every iteration writes to the same stream, resetting the position with `seek(1)`, but that doesn't release the pages the stream already holds. So the first warmup iteration allocates all the pages needed, and all the other iterations just write to pages that already exist.

Contributor

Right, it does not allocate much. I ran with -prof gc, there are no GC cycles either.

Benchmark                                                                           Mode  Cnt     Score     Error   Units
RecyclerBytesStreamOutputBenchmark.writeByte                                        avgt    3   740.186 ±  23.884   ns/op
RecyclerBytesStreamOutputBenchmark.writeByte:gc.alloc.rate                          avgt    3     0.001 ±   0.001  MB/sec
RecyclerBytesStreamOutputBenchmark.writeByte:gc.alloc.rate.norm                     avgt    3     0.001 ±   0.001    B/op
RecyclerBytesStreamOutputBenchmark.writeByte:gc.count                               avgt    3       ≈ 0            counts
RecyclerBytesStreamOutputBenchmark.writeBytes                                       avgt    3    34.343 ±   0.626   ns/op
RecyclerBytesStreamOutputBenchmark.writeBytes:gc.alloc.rate                         avgt    3     0.001 ±   0.001  MB/sec
RecyclerBytesStreamOutputBenchmark.writeBytes:gc.alloc.rate.norm                    avgt    3    ≈ 10⁻⁵              B/op
RecyclerBytesStreamOutputBenchmark.writeBytes:gc.count                              avgt    3       ≈ 0            counts
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary                     avgt    3    54.514 ±   9.643   ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary:gc.alloc.rate       avgt    3     0.001 ±   0.001  MB/sec
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary:gc.alloc.rate.norm  avgt    3    ≈ 10⁻⁴              B/op
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary:gc.count            avgt    3       ≈ 0            counts
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage                              avgt    3  1546.191 ± 515.694   ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage:gc.alloc.rate                avgt    3     0.001 ±   0.001  MB/sec
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage:gc.alloc.rate.norm           avgt    3     0.001 ±   0.001    B/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage:gc.count                     avgt    3       ≈ 0            counts
RecyclerBytesStreamOutputBenchmark.writeString                                      avgt    3   662.734 ±  35.317   ns/op
RecyclerBytesStreamOutputBenchmark.writeString:gc.alloc.rate                        avgt    3     0.001 ±   0.001  MB/sec
RecyclerBytesStreamOutputBenchmark.writeString:gc.alloc.rate.norm                   avgt    3    ≈ 10⁻³              B/op
RecyclerBytesStreamOutputBenchmark.writeString:gc.count                             avgt    3       ≈ 0            counts
RecyclerBytesStreamOutputBenchmark.writeVInt                                        avgt    3  1488.419 ± 137.158   ns/op
RecyclerBytesStreamOutputBenchmark.writeVInt:gc.alloc.rate                          avgt    3     0.001 ±   0.001  MB/sec
RecyclerBytesStreamOutputBenchmark.writeVInt:gc.alloc.rate.norm                     avgt    3     0.001 ±   0.001    B/op
RecyclerBytesStreamOutputBenchmark.writeVInt:gc.count                               avgt    3       ≈ 0            counts

@DaveCTurner
Contributor Author

So for cases with larger slices and strings there should not be noticeable improvement,

We write a lot of small strings (e.g. index names and field names), vInts (e.g. collection sizes) and single bytes (e.g. generic value type codes) in practice. I mean it's not going to make an astoundingly large difference but scanning through Universal Profiling it looks like we spend a little over 0.6% of all CPU time in serverless in this area.

@mhl-b
Contributor

mhl-b commented Jan 7, 2026

vInts (e.g. collection sizes)

I think we can squeeze a bit more performance by removing the byte fiddling in writeVInt. For example, writing a single-byte integer performs the size check twice. A non-single-byte integer checks the size three times and also uses a scratch buffer, which I don't think is necessary; it can be unrolled.

When I said "byte fiddling" I meant that we can allocate a new page whenever there is not enough space for the whole integer; we don't need to know its exact size. We still write a varInt. In the worst case we waste 3 bytes per page, and the probability of a page having fewer than 4 bytes left when writing a vInt is very low.

    writeVInt(int i) {
      if 4 > remainingBytesInPage then new page;
      StreamOutputHelper.putMultiByteVInt(page, i, pageOffset);
    }

I think StreamOutputHelper.putMultiByteVInt is a pretty tight loop, almost identical to the single-byte check. It could be unrolled into four statements, one per byte, but I doubt that would be faster.
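As a concrete illustration of that scheme, here is a hedged sketch. `writeVIntUnchecked` is a hypothetical helper, not an existing StreamOutput method: the caller reserves worst-case space once (5 bytes for a 32-bit int), and the encoder then emits the standard 7-bits-per-byte vInt encoding with no per-byte bounds checks.

```java
// Sketch of reserve-once-then-write-unchecked vInt encoding (hypothetical
// helper, not an existing StreamOutput method).
public class VIntSketch {

    static final int MAX_VINT_BYTES = 5; // a 32-bit int needs at most 5 vInt bytes

    // Writes i at buf[pos..] in the standard vInt encoding: low 7 bits
    // first, high bit set on every byte except the last. Returns the new
    // position. The caller has already ensured MAX_VINT_BYTES of space,
    // so there are no per-byte bounds checks here.
    static int writeVIntUnchecked(byte[] buf, int pos, int i) {
        while ((i & ~0x7F) != 0) {
            buf[pos++] = (byte) ((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        buf[pos++] = (byte) i;
        return pos;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[MAX_VINT_BYTES];
        int end = writeVIntUnchecked(buf, 0, 300);
        // 300 = 0b1_0010_1100 encodes as 0xAC 0x02
        assert end == 2;
        assert (buf[0] & 0xFF) == 0xAC;
        assert (buf[1] & 0xFF) == 0x02;
        System.out.println("ok");
    }
}
```

The single up-front space check replaces the repeated per-byte checks, at the cost of occasionally starting a new page a few bytes early.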

@DaveCTurner
Copy link
Contributor Author

Yes, I think so too; that's the first item I mentioned in #140257, right? (Although the limit is 5 bytes, not 4.) My intention was for this PR really just to change the offsets but leave the fundamental flow alone, and then I was planning to take a look at other aspects once this and #140263 are merged.

Contributor

@mhl-b mhl-b left a comment

LGTM

@mhl-b
Contributor

mhl-b commented Jan 8, 2026

Yes I think so too, that's the first item I mentioned in #140257 right?

Yes, you did! I missed that.

@DaveCTurner DaveCTurner merged commit c545b94 into elastic:main Jan 9, 2026
35 checks passed
@DaveCTurner DaveCTurner deleted the 2026/01/07/RecyclerBytesStreamOutput-absolute-offset branch January 9, 2026 09:34
szybia added a commit to szybia/elasticsearch that referenced this pull request Jan 9, 2026
jimczi pushed a commit to jimczi/elasticsearch that referenced this pull request Jan 12, 2026
