
RecyclerBytesStreamOutput using absolute offsets #140303

Merged
DaveCTurner merged 4 commits into elastic:main from
DaveCTurner:2026/01/07/RecyclerBytesStreamOutput-absolute-offset
Jan 9, 2026

Conversation

@DaveCTurner
Contributor

Today `RecyclerBytesStreamOutput` works with a `BytesRef currentPage`
combined with an `int currentPageOffset` tracking the offset within that
page, manipulated such that `currentPageOffset ≤ pageSize`. In practice
we need the absolute offset `currentPage.offset + currentPageOffset`
fairly often, and it is slightly more efficient to track this value
instead and compute a new upper bound for each new page.

Microbenchmarking indicates that this change gives meaningful savings
away from page boundaries, and doesn't make any of the boundary-crossing
cases appreciably worse:

Before:

Benchmark                                                        Mode  Cnt     Score     Error  Units
RecyclerBytesStreamOutputBenchmark.writeByte                     avgt    3  2200.652 ± 186.561  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytes                    avgt    3    56.122 ±   4.262  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary  avgt    3    67.555 ±   3.486  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage           avgt    3  1563.307 ± 185.027  ns/op
RecyclerBytesStreamOutputBenchmark.writeString                   avgt    3   884.288 ±  15.576  ns/op
RecyclerBytesStreamOutputBenchmark.writeVInt                     avgt    3  2517.240 ±  30.936  ns/op

After:

Benchmark                                                        Mode  Cnt     Score    Error  Units
RecyclerBytesStreamOutputBenchmark.writeByte                     avgt    3  1772.697 ± 10.986  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytes                    avgt    3    44.298 ±  0.072  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary  avgt    3    67.949 ±  1.256  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage           avgt    3  1571.635 ± 12.979  ns/op
RecyclerBytesStreamOutputBenchmark.writeString                   avgt    3   798.606 ± 15.008  ns/op
RecyclerBytesStreamOutputBenchmark.writeVInt                     avgt    3  2374.194 ±  4.405  ns/op

Today `RecyclerBytesStreamOutputTests` only covers the case where the
recycler supplies a buffer with zero offset, potentially missing bugs
that only arise with slices of a larger pool of buffers.

This commit strengthens these tests to verify the behaviour when using a
slice of a larger pool, including verification that we never write
outside our buffer and that we do not attempt to read from the buffer
after it is released.
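The bookkeeping change can be illustrated with a minimal plain-Java sketch. The names `writeByteRelative` and `writeByteAbsolute` are hypothetical and this is not the actual `RecyclerBytesStreamOutput` code; it only shows the shape of the two schemes: before, every write adds the page's offset to a page-relative position; after, the writer carries the absolute index directly and only recomputes the upper bound when it acquires a new page.

```java
// Sketch of the two offset-tracking schemes (hypothetical names, not the
// real RecyclerBytesStreamOutput). The page is a non-zero-offset slice of
// a larger pooled buffer, as the strengthened tests now exercise.
public class OffsetSketch {

    // Before: page-relative offset, so every write pays an extra addition.
    static int writeByteRelative(byte[] buf, int pageOffset, int currentPageOffset, byte b) {
        buf[pageOffset + currentPageOffset] = b; // add on every single write
        return currentPageOffset + 1;            // bounded by pageSize
    }

    // After: absolute index into the underlying buffer; the upper bound
    // (pageOffset + pageSize) is computed once per page instead.
    static int writeByteAbsolute(byte[] buf, int currentAbsoluteOffset, byte b) {
        buf[currentAbsoluteOffset] = b;          // no per-write addition
        return currentAbsoluteOffset + 1;        // bounded by pageOffset + pageSize
    }

    public static void main(String[] args) {
        byte[] a = new byte[32];
        byte[] b = new byte[32];
        int pageOffset = 8; // the page starts at index 8 of the pool
        int rel = 0;
        int abs = pageOffset;
        for (byte v = 1; v <= 4; v++) {
            rel = writeByteRelative(a, pageOffset, rel, v);
            abs = writeByteAbsolute(b, abs, v);
        }
        // Both schemes produce identical bytes; only the per-write
        // arithmetic differs.
        assert java.util.Arrays.equals(a, b);
        System.out.println("ok");
    }
}
```

Away from page boundaries this saves one addition per write, which is consistent with the benchmark showing gains on `writeByte`/`writeBytes` but no change on the boundary-crossing cases.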
@DaveCTurner DaveCTurner requested a review from a team as a code owner January 7, 2026 17:54
@DaveCTurner DaveCTurner added >non-issue :Distributed/Network Http and internode communication implementations v9.4.0 labels Jan 7, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Jan 7, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DaveCTurner
Contributor Author

Note this branch also includes #140263, which needs to land first. There are no actual test changes to merge here.

@mhl-b
Contributor

mhl-b commented Jan 7, 2026

Do you know why these two have a large error, while the others stay the same after the change?

RecyclerBytesStreamOutputBenchmark.writeByte                     avgt    3  2200.652 ± 186.561  ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage           avgt    3  1563.307 ± 185.027  ns/op

@DaveCTurner
Contributor Author

Do you know why these two have a large error, while the others stay the same after the change?

Not a clue. I ran these on an otherwise-quiet n2-standard-16 instance on GCP but I expect it doesn't take much to disturb the timings in such a tight loop.

Contributor

@mhl-b mhl-b left a comment

My understanding is that this version removes a single instruction per single-byte copy or slice copy. So for cases with larger slices and strings there should be no noticeable improvement, because this single computation is amortized over the size of the slice.

        }
    }

    private record BenchmarkRecycler(AtomicReference<BytesRef> bytesRef) implements Recycler<BytesRef> {
Contributor

I think the allocating recycler introduces noise. Since this benchmark does not care about a real production recycler, I would preallocate all the pages so that obtaining a page takes the least possible work during a benchmark iteration.
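The suggestion above could be sketched in plain Java like this. `Page` and `PreallocatedRecycler` are hypothetical names, not the benchmark's actual `Recycler<BytesRef>` implementation; the idea is just to carve all pages out of one backing array up front so that obtaining a page is a stack pop with no allocation, while still handing out non-zero-offset slices like a pooled recycler would.

```java
import java.util.ArrayDeque;

// Plain-Java analogue of a preallocating recycler (hypothetical types).
// All pages share one backing array, so obtain()/release() do no
// allocation during a benchmark iteration.
public class PreallocSketch {

    record Page(byte[] bytes, int offset, int length) {}

    static final class PreallocatedRecycler {
        private final ArrayDeque<Page> free = new ArrayDeque<>();

        PreallocatedRecycler(int pageCount, int pageSize) {
            byte[] pool = new byte[pageCount * pageSize]; // one backing array
            for (int i = 0; i < pageCount; i++) {
                // Pages are slices with non-zero offsets, like a real pool.
                free.push(new Page(pool, i * pageSize, pageSize));
            }
        }

        Page obtain() {
            return free.pop(); // O(1), no allocation
        }

        void release(Page p) {
            free.push(p); // return the page to the pool for reuse
        }
    }

    public static void main(String[] args) {
        PreallocatedRecycler recycler = new PreallocatedRecycler(4, 16);
        Page p = recycler.obtain();
        assert p.length() == 16;
        recycler.release(p);
        System.out.println("ok");
    }
}
```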

Contributor Author

I'm just using the benchmark that already exists (the only change here is to correct the `BytesRef#length` values). Are you saying you think that a different benchmark would lead you to a different conclusion?

Contributor

The change looks fine; it does a bit less work than before. I'm wondering how much of the claimed savings we actually achieve given such large error margins. I think for this kind of fine tuning we'd be better off with cleaner measurements.

Contributor Author

I looked some more at this and I don't think the benchmark is doing much allocation. Every iteration writes to the same stream, resetting the position with `seek(1)`, but that doesn't release the pages the stream already holds. So the first warmup iteration allocates all the pages needed, and all the other iterations just write to pages that already exist.

Contributor

Right, it does not allocate much. I ran with -prof gc, there are no GC cycles either.

Benchmark                                                                           Mode  Cnt     Score     Error   Units
RecyclerBytesStreamOutputBenchmark.writeByte                                        avgt    3   740.186 ±  23.884   ns/op
RecyclerBytesStreamOutputBenchmark.writeByte:gc.alloc.rate                          avgt    3     0.001 ±   0.001  MB/sec
RecyclerBytesStreamOutputBenchmark.writeByte:gc.alloc.rate.norm                     avgt    3     0.001 ±   0.001    B/op
RecyclerBytesStreamOutputBenchmark.writeByte:gc.count                               avgt    3       ≈ 0            counts
RecyclerBytesStreamOutputBenchmark.writeBytes                                       avgt    3    34.343 ±   0.626   ns/op
RecyclerBytesStreamOutputBenchmark.writeBytes:gc.alloc.rate                         avgt    3     0.001 ±   0.001  MB/sec
RecyclerBytesStreamOutputBenchmark.writeBytes:gc.alloc.rate.norm                    avgt    3    ≈ 10⁻⁵              B/op
RecyclerBytesStreamOutputBenchmark.writeBytes:gc.count                              avgt    3       ≈ 0            counts
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary                     avgt    3    54.514 ±   9.643   ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary:gc.alloc.rate       avgt    3     0.001 ±   0.001  MB/sec
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary:gc.alloc.rate.norm  avgt    3    ≈ 10⁻⁴              B/op
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary:gc.count            avgt    3       ≈ 0            counts
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage                              avgt    3  1546.191 ± 515.694   ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage:gc.alloc.rate                avgt    3     0.001 ±   0.001  MB/sec
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage:gc.alloc.rate.norm           avgt    3     0.001 ±   0.001    B/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage:gc.count                     avgt    3       ≈ 0            counts
RecyclerBytesStreamOutputBenchmark.writeString                                      avgt    3   662.734 ±  35.317   ns/op
RecyclerBytesStreamOutputBenchmark.writeString:gc.alloc.rate                        avgt    3     0.001 ±   0.001  MB/sec
RecyclerBytesStreamOutputBenchmark.writeString:gc.alloc.rate.norm                   avgt    3    ≈ 10⁻³              B/op
RecyclerBytesStreamOutputBenchmark.writeString:gc.count                             avgt    3       ≈ 0            counts
RecyclerBytesStreamOutputBenchmark.writeVInt                                        avgt    3  1488.419 ± 137.158   ns/op
RecyclerBytesStreamOutputBenchmark.writeVInt:gc.alloc.rate                          avgt    3     0.001 ±   0.001  MB/sec
RecyclerBytesStreamOutputBenchmark.writeVInt:gc.alloc.rate.norm                     avgt    3     0.001 ±   0.001    B/op
RecyclerBytesStreamOutputBenchmark.writeVInt:gc.count                               avgt    3       ≈ 0            counts

@DaveCTurner
Contributor Author

So for cases with larger slices and strings there should not be noticeable improvement,

We write a lot of small strings (e.g. index names and field names), vInts (e.g. collection sizes) and single bytes (e.g. generic value type codes) in practice. I mean it's not going to make an astoundingly large difference but scanning through Universal Profiling it looks like we spend a little over 0.6% of all CPU time in serverless in this area.

@mhl-b
Contributor

mhl-b commented Jan 7, 2026

vInts (e.g. collection sizes)

I think we can squeeze a bit more performance by removing the byte fiddling in writeVInt. For example, writing a single-byte integer performs the size check twice. A non-single-byte integer checks the size three times and also uses a scratch buffer, which I don't think is necessary; it can be unrolled.

When I said "byte fiddling" I meant that we can allocate a new page whenever there is not enough space for the whole integer; we don't need to know its exact size. We still write a varInt. In the worst case we waste 3 bytes per page, and the probability of a page having fewer than 4 bytes left when writing a vInt is very low.

    writeVInt(int i) {
      if 4 > remainingBytesInPage then new page;
      StreamOutputHelper.putMultiByteVInt(page, i, pageOffset);
    }

I think StreamOutputHelper.putMultiByteVInt is a pretty tight loop, almost identical to the single-byte check. It could be unrolled into four statements, one per byte, but I doubt that would be faster.
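As a concrete illustration of that scheme, here is a hedged sketch. `writeVIntUnchecked` is a hypothetical helper, not an existing StreamOutput method: the caller reserves worst-case space once (5 bytes for a 32-bit int), and the encoder then emits the standard 7-bits-per-byte vInt encoding with no per-byte bounds checks.

```java
// Sketch of reserve-once-then-write-unchecked vInt encoding (hypothetical
// helper, not an existing StreamOutput method).
public class VIntSketch {

    static final int MAX_VINT_BYTES = 5; // a 32-bit int needs at most 5 vInt bytes

    // Writes i at buf[pos..] in the standard vInt encoding: low 7 bits
    // first, high bit set on every byte except the last. Returns the new
    // position. The caller has already ensured MAX_VINT_BYTES of space,
    // so there are no per-byte bounds checks here.
    static int writeVIntUnchecked(byte[] buf, int pos, int i) {
        while ((i & ~0x7F) != 0) {
            buf[pos++] = (byte) ((i & 0x7F) | 0x80);
            i >>>= 7;
        }
        buf[pos++] = (byte) i;
        return pos;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[MAX_VINT_BYTES];
        int end = writeVIntUnchecked(buf, 0, 300);
        // 300 = 0b1_0010_1100 encodes as 0xAC 0x02
        assert end == 2;
        assert (buf[0] & 0xFF) == 0xAC;
        assert (buf[1] & 0xFF) == 0x02;
        System.out.println("ok");
    }
}
```

The single up-front space check replaces the repeated per-byte checks, at the cost of occasionally starting a new page a few bytes early.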

@DaveCTurner
Copy link
Contributor Author

Yes, I think so too; that's the first item I mentioned in #140257, right? (Although the limit is 5 bytes, not 4.) My intention was for this PR really just to change the offsets but leave the fundamental flow alone, and then I was planning to take a look at other aspects once this and #140263 are merged.

Contributor

@mhl-b mhl-b left a comment

LGTM

@mhl-b
Contributor

mhl-b commented Jan 8, 2026

Yes I think so too, that's the first item I mentioned in #140257 right?

Yes, you did! I missed that.

@DaveCTurner DaveCTurner merged commit c545b94 into elastic:main Jan 9, 2026
35 checks passed
@DaveCTurner DaveCTurner deleted the 2026/01/07/RecyclerBytesStreamOutput-absolute-offset branch January 9, 2026 09:34
szybia added a commit to szybia/elasticsearch that referenced this pull request Jan 9, 2026
jimczi pushed a commit to jimczi/elasticsearch that referenced this pull request Jan 12, 2026
