RecyclerBytesStreamOutput using absolute offsets #140303

DaveCTurner merged 4 commits into elastic:main

Conversation
Today `RecyclerBytesStreamOutputTests` only covers the case where the recycler supplies a buffer with zero offset, potentially missing bugs that only arise with slices of a larger pool of buffers. This commit strengthens these tests to verify the behaviour when using a slice of a larger pool, including verification that we never write outside our buffer and that we do not attempt to read from the buffer after it is released.
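As a hypothetical sketch of that idea (this is not the actual test code; the class and method names here are illustrative), pages can be handed out as slices at nonzero offsets within one large shared array, so any write that ignores `BytesRef#offset` lands in a neighbouring slice and becomes detectable:

```java
// Hypothetical sketch only -- not the actual RecyclerBytesStreamOutputTests code.
// Pages are slices of one shared pool starting at nonzero offsets, so a write
// that ignores BytesRef.offset corrupts a neighbouring slice and can be caught.
import java.util.concurrent.atomic.AtomicInteger;
import org.apache.lucene.util.BytesRef;

class SlicedPagePool {
    private final int pageSize;
    private final byte[] pool;
    private final AtomicInteger nextPage = new AtomicInteger(1); // skip offset zero

    SlicedPagePool(int pageSize, int maxPages) {
        this.pageSize = pageSize;
        this.pool = new byte[pageSize * (maxPages + 1)];
    }

    /** Hands out a page whose offset within the backing array is never zero. */
    BytesRef obtainPage() {
        int page = nextPage.getAndIncrement();
        return new BytesRef(pool, page * pageSize, pageSize);
    }

    /** After the stream is done, verify nothing wrote into the never-handed-out first slice. */
    boolean firstSliceUntouched() {
        for (int i = 0; i < pageSize; i++) {
            if (pool[i] != 0) {
                return false; // something wrote outside its own slice
            }
        }
        return true;
    }
}
```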
Today `RecyclerBytesStreamOutput` works with a `BytesRef currentPage`
combined with an `int currentPageOffset` tracking the offset within that
page, maintained such that `currentPageOffset ≤ pageSize`. In practice
we need the absolute offset `currentPage.offset + currentPageOffset`
fairly often, and it seems slightly more efficient to track this value
directly, computing a new upper bound once per page.
Microbenchmarking indicates that this change gives meaningful savings
away from page boundaries, and doesn't make any of the boundary-crossing
cases appreciably worse:
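To make the bookkeeping change concrete, here is a minimal hypothetical sketch; the field and method names are illustrative, not the real `RecyclerBytesStreamOutput` internals:

```java
// Illustrative sketch of the two schemes; not the actual Elasticsearch code.
class OffsetSchemes {
    private static final int PAGE_SIZE = 16 * 1024;

    private byte[] pageBytes;   // backing array of the current page
    private int pageBase;       // currentPage.offset: where this page's slice starts

    // Before: track the offset relative to the page (currentPageOffset <= PAGE_SIZE),
    // so every write recomputes the absolute position with an extra addition.
    private int currentPageOffset;

    void writeByteBefore(byte b) {
        if (currentPageOffset == PAGE_SIZE) {
            nextPage();
        }
        pageBytes[pageBase + currentPageOffset++] = b; // pageBase added on every write
    }

    // After: track the absolute offset directly and precompute the page's upper
    // bound once per page, removing the addition from the per-byte hot path.
    private int absoluteOffset;
    private int upperBound;

    void writeByteAfter(byte b) {
        if (absoluteOffset == upperBound) {
            nextPage();
        }
        pageBytes[absoluteOffset++] = b; // single indexed store, no extra addition
    }

    private void nextPage() {
        // Obtain a fresh page slice from the recycler (elided) and reset state.
        currentPageOffset = 0;
        absoluteOffset = pageBase;
        upperBound = pageBase + PAGE_SIZE;
    }
}
```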
Before:
Benchmark Mode Cnt Score Error Units
RecyclerBytesStreamOutputBenchmark.writeByte avgt 3 2200.652 ± 186.561 ns/op
RecyclerBytesStreamOutputBenchmark.writeBytes avgt 3 56.122 ± 4.262 ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary avgt 3 67.555 ± 3.486 ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage avgt 3 1563.307 ± 185.027 ns/op
RecyclerBytesStreamOutputBenchmark.writeString avgt 3 884.288 ± 15.576 ns/op
RecyclerBytesStreamOutputBenchmark.writeVInt avgt 3 2517.240 ± 30.936 ns/op
After:
Benchmark Mode Cnt Score Error Units
RecyclerBytesStreamOutputBenchmark.writeByte avgt 3 1772.697 ± 10.986 ns/op
RecyclerBytesStreamOutputBenchmark.writeBytes avgt 3 44.298 ± 0.072 ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary avgt 3 67.949 ± 1.256 ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage avgt 3 1571.635 ± 12.979 ns/op
RecyclerBytesStreamOutputBenchmark.writeString avgt 3 798.606 ± 15.008 ns/op
RecyclerBytesStreamOutputBenchmark.writeVInt avgt 3 2374.194 ± 4.405 ns/op
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
Note this branch also includes #140263, which needs to land first; there are no actual test changes to merge here.
Do you know why these two have large error bars, but the others stay the same after the change?
Not a clue. I ran these on an otherwise-quiet …
mhl-b left a comment:
My understanding is that this version removes a single instruction per single-byte or slice copy. So for larger slices and strings there should be no noticeable improvement, because that single computation is amortized over the size of the slice.
`private record BenchmarkRecycler(AtomicReference<BytesRef> bytesRef) implements Recycler<BytesRef> {`
I think the allocating recycler introduces noise. Since this benchmark does not care about a real production recycler, I would preallocate all pages so that obtaining a page takes the least amount of work during a benchmark iteration.
I'm just using the benchmark that already exists (the only change here is to correct the BytesRef#length values). Are you saying you think that with a different benchmark you'd get to a different conclusion?
The change looks fine; it does a bit less work than before. I'm wondering how meaningful the savings are in the presence of such high error bars. I think for this kind of fine tuning we'd better have cleaner measurements.
I looked some more at this and I don't think the benchmark is doing much allocation. Every iteration writes to the same stream, resetting the position with `seek(1)`, but that doesn't release the pages it already holds. So the first warmup iteration allocates all the pages needed, and all the other iterations just write to pages that already exist.
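A rough sketch of that benchmark shape, for orientation; this is not the actual `RecyclerBytesStreamOutputBenchmark` source, and the `recycler()` helper is a stand-in for however pages are supplied:

```java
// Rough hypothetical sketch of the benchmark shape described above. The stream
// is created once; each iteration rewinds with seek(1) instead of releasing
// pages, so pages are only acquired during the first warmup iteration.
import java.io.IOException;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class WriteBenchSketch {
    private RecyclerBytesStreamOutput out; // assumes a recycler-backed stream as in the PR

    @Setup
    public void setUp() {
        out = new RecyclerBytesStreamOutput(recycler()); // recycler() is a stand-in, elided here
    }

    @Benchmark
    public void writeVInt() throws IOException {
        out.seek(1); // rewind without releasing the pages already held
        for (int i = 0; i < 500; i++) {
            out.writeVInt(i); // after warmup, only writes to already-held pages
        }
    }
}
```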
Right, it does not allocate much. I ran with `-prof gc`; there are no GC cycles either.
Benchmark Mode Cnt Score Error Units
RecyclerBytesStreamOutputBenchmark.writeByte avgt 3 740.186 ± 23.884 ns/op
RecyclerBytesStreamOutputBenchmark.writeByte:gc.alloc.rate avgt 3 0.001 ± 0.001 MB/sec
RecyclerBytesStreamOutputBenchmark.writeByte:gc.alloc.rate.norm avgt 3 0.001 ± 0.001 B/op
RecyclerBytesStreamOutputBenchmark.writeByte:gc.count avgt 3 ≈ 0 counts
RecyclerBytesStreamOutputBenchmark.writeBytes avgt 3 34.343 ± 0.626 ns/op
RecyclerBytesStreamOutputBenchmark.writeBytes:gc.alloc.rate avgt 3 0.001 ± 0.001 MB/sec
RecyclerBytesStreamOutputBenchmark.writeBytes:gc.alloc.rate.norm avgt 3 ≈ 10⁻⁵ B/op
RecyclerBytesStreamOutputBenchmark.writeBytes:gc.count avgt 3 ≈ 0 counts
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary avgt 3 54.514 ± 9.643 ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary:gc.alloc.rate avgt 3 0.001 ± 0.001 MB/sec
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary:gc.alloc.rate.norm avgt 3 ≈ 10⁻⁴ B/op
RecyclerBytesStreamOutputBenchmark.writeBytesAcrossPageBoundary:gc.count avgt 3 ≈ 0 counts
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage avgt 3 1546.191 ± 515.694 ns/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage:gc.alloc.rate avgt 3 0.001 ± 0.001 MB/sec
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage:gc.alloc.rate.norm avgt 3 0.001 ± 0.001 B/op
RecyclerBytesStreamOutputBenchmark.writeBytesMultiPage:gc.count avgt 3 ≈ 0 counts
RecyclerBytesStreamOutputBenchmark.writeString avgt 3 662.734 ± 35.317 ns/op
RecyclerBytesStreamOutputBenchmark.writeString:gc.alloc.rate avgt 3 0.001 ± 0.001 MB/sec
RecyclerBytesStreamOutputBenchmark.writeString:gc.alloc.rate.norm avgt 3 ≈ 10⁻³ B/op
RecyclerBytesStreamOutputBenchmark.writeString:gc.count avgt 3 ≈ 0 counts
RecyclerBytesStreamOutputBenchmark.writeVInt avgt 3 1488.419 ± 137.158 ns/op
RecyclerBytesStreamOutputBenchmark.writeVInt:gc.alloc.rate avgt 3 0.001 ± 0.001 MB/sec
RecyclerBytesStreamOutputBenchmark.writeVInt:gc.alloc.rate.norm avgt 3 0.001 ± 0.001 B/op
RecyclerBytesStreamOutputBenchmark.writeVInt:gc.count avgt 3 ≈ 0 counts
We write a lot of small strings (e.g. index names and field names), vInts (e.g. collection sizes) and single bytes (e.g. generic value type codes) in practice. It's not going to make an astoundingly large difference, but scanning through Universal Profiling it looks like we spend a little over 0.6% of all CPU time in serverless in this area.
I think we can squeeze a bit more performance by removing the bytes fiddling in the vInt write path. When I said "bytes fiddling" I meant that we can allocate a new page if there is not enough space for the whole integer; we don't need to know the exact encoded size. We still write a varInt. In the worst case we waste 3 bytes per page, and the probability of having fewer than 4 bytes left in a page when writing a vInt is very low.
Yes, I think so too; that's the first item I mentioned in #140257, right? (Although the limit is 5, not 4 bytes.) My intention was for this PR really just to change the offsets but leave the fundamental flow alone, and then I was planning to take a look at other aspects after this and #140263 are merged.
Yes, you did! I missed that.
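A hypothetical sketch of that idea, using the 5-byte worst case from the correction above (field names follow the earlier sketch, not the real internals):

```java
// Hypothetical sketch: start a new page whenever fewer than 5 bytes remain
// (a vInt encodes to at most 5 bytes for a 32-bit value), instead of computing
// the exact encoded size up front. Worst case this wastes 4 bytes per page.
void writeVIntSimple(int i) {
    if (upperBound - absoluteOffset < 5) {
        nextPage(); // a fresh page always has room for any vInt
    }
    while ((i & ~0x7F) != 0) {          // while more than 7 significant bits remain
        pageBytes[absoluteOffset++] = (byte) ((i & 0x7F) | 0x80); // continuation bit set
        i >>>= 7;
    }
    pageBytes[absoluteOffset++] = (byte) i; // final byte, high bit clear
}
```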
* upstream/main: (76 commits)
  [Inference API] Get _services skips EIS authorization call if CCM is not configured (elastic#139964)
  Improve TSDB codec benchmarks with full encoder and compression metrics (elastic#140299)
  ESQL: Consolidate test `BlockLoaderContext`s (elastic#140403)
  ESQL: Improve Lookup Join performance with CachedDirectoryReader (elastic#139314)
  ES|QL: Add more examples for the match operator (elastic#139815)
  ESQL: Add timezone to add and sub operators, and ConfigurationAware planning support (elastic#140101)
  ESQL: Updated ToIp tests and generated documentation for map parameters (elastic#139994)
  Disable _delete_by_query and _update_by_query for CCS/stateful (elastic#140301)
  Remove unused method ElasticInferenceService.translateToChunkedResults (elastic#140442)
  logging hot threads on large queue of the management threadpool (elastic#140251)
  Search functions docs cleanup (elastic#140435)
  Unmute 350_point_in_time/point-in-time with index filter (elastic#140443)
  Remove unused methods (elastic#140222)
  Add CPS and `project_routing` support for `_mvt` (elastic#140053)
  Streamline `ShardDeleteResults` collection (elastic#140363)
  Fix Docker build to use --load for single-platform images (elastic#140402)
  Parametrize + test VectorScorerOSQBenchmark (elastic#140354)
  `RecyclerBytesStreamOutput` using absolute offsets (elastic#140303)
  Define bulk float native methods for vector scoring (elastic#139885)
  Make `TimeSeriesAggregate` `TimestampAware` (elastic#140270)
  ...