Coalesce getSortedNumeric calls for ES819 doc values merging by jordan-powers · Pull Request #126732 · elastic/elasticsearch

jordan-powers · 2025-04-11T23:37:25Z

When writing the doc values addresses, we currently perform an iteration over all the sorted numeric doc values to calculate the addresses. When merging sorted segments, this iteration is expensive as it requires performing a merge sort.

This patch removes this iteration by instead calculating the addresses while we are writing the values, writing the addresses to a temporary file. Afterwards, they are copied from the temporary file into the merged segment.

Relates to #126111
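As a rough illustration of the idea described above, the sketch below records each document's start offset in the same loop that writes its values, so no second iteration over the values is needed. It is an in-memory toy, not the ES819TSDBDocValuesConsumer code: the class name `SinglePassAddresses`, the `write` method, and the use of plain lists and arrays in place of Lucene outputs are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; names and types are not from the PR.
final class SinglePassAddresses {

    /**
     * Appends every document's values to valuesOut and returns the addresses:
     * addresses[i] is the index of document i's first value, and the final
     * entry is the total number of values written.
     */
    static long[] write(List<long[]> docs, List<Long> valuesOut) {
        long[] addresses = new long[docs.size() + 1];
        long offset = 0;
        int doc = 0;
        for (long[] docValues : docs) {
            // The address is recorded in the same pass that writes the values.
            addresses[doc++] = offset;
            for (long v : docValues) {
                valuesOut.add(v);
            }
            offset += docValues.length;
        }
        // Closing entry marks the end of the last document's values.
        addresses[doc] = offset;
        return addresses;
    }

    public static void main(String[] args) {
        List<Long> values = new ArrayList<>();
        long[] addresses = write(List.of(new long[] {10, 20}, new long[] {30}), values);
        System.out.println(values + " " + Arrays.toString(addresses));
    }
}
```

In the PR the addresses go to a temporary file rather than an array, since they can only be copied into the merged segment once all values have been written.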

elasticsearchmachine · 2025-04-14T19:58:03Z

Pinging @elastic/es-storage-engine (Team:StorageEngine)

jordan-powers · 2025-04-14T19:59:13Z

server/src/main/java/org/elasticsearch/index/codec/tsdb/es819/ES819TSDBDocValuesConsumer.java

+                String addressDataOutputName = null;
+                try (
+                    var addressMetaOutput = new ByteBuffersIndexOutput(addressMetaBuffer, "meta-temp", "meta-temp");
+                    // TODO: which IOContext should be used here?


This comment was in Martijn's initial implementation. I didn't know the answer, so I left it in. I'd appreciate suggestions.

@dnhatn Do you think usage of IOContext.DEFAULT is ok here or is there a better IOContext that can be used here?

I think we need to do something like this: #126499 (comment)

martijnvg

This looks good Jordan.

Would you be able to change the ES819TSDBDocValuesFormatTests#testForceMergeDenseCase() and ES819TSDBDocValuesFormatTests#testForceMergeSparseCase() tests to also index multi-valued sorted numeric doc values? For example, by randomly indexing the gauge_2 field with multiple values (similar to the tags field).

martijnvg · 2025-04-15T08:54:14Z

I also re-ran the micro benchmark:

Benchmark                                                    (deltaTime)   (nDocs)  (seed)  Mode  Cnt     Score   Error  Units
TSDBDocValuesMergeBenchmark.forceMergeWithOptimizedMerge            1000  20431204      42    ss       4678.076          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithoutOptimizedMerge         1000  20431204      42    ss       7230.848          ms/op

This looks better than what was reported in #125403.

UPDATE:

Running the same micro benchmark from main branch:

Benchmark                                                    (deltaTime)   (nDocs)  (seed)  Mode  Cnt     Score   Error  Units
TSDBDocValuesMergeBenchmark.forceMergeWithOptimizedMerge            1000  20431204      42    ss       5607.886          ms/op
TSDBDocValuesMergeBenchmark.forceMergeWithoutOptimizedMerge         1000  20431204      42    ss       8397.983          ms/op

This is slightly slower than the numbers reported above.
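To put the two runs side by side, the small sketch below computes how much lower each score from the first run is relative to the run from the main branch. It only does arithmetic on the four scores quoted above; the class and method names are made up for the example, and since these are single-shot (ss) measurements with no error column, the percentages should be read as rough indications.

```java
// Hypothetical helper; only the four scores come from the benchmark output above.
final class MergeBenchmarkDelta {

    /** Percentage by which branchMs is lower than mainMs. */
    static double percentLower(double mainMs, double branchMs) {
        return (mainMs - branchMs) / mainMs * 100.0;
    }

    public static void main(String[] args) {
        System.out.printf("forceMergeWithOptimizedMerge: %.1f%% less time%n",
            percentLower(5607.886, 4678.076));
        System.out.printf("forceMergeWithoutOptimizedMerge: %.1f%% less time%n",
            percentLower(8397.983, 7230.848));
    }
}
```

By this calculation the first run took about 16.6% less time than main with the optimized merge and about 13.9% less without it.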

martijnvg

Thanks for iterating. I left two more comments.

martijnvg · 2025-04-17T08:16:33Z

server/src/main/java/org/elasticsearch/index/codec/tsdb/es819/ES819TSDBDocValuesConsumer.java

+                        } catch (final IOException ignored) {
+                            // ignore exception
+                        }


Can this try-catch be removed? This method signature does allow IOException. If addressDataOutputName is not null, then there should be a temp file, I think?

Will do.
I originally added the try-catch because the draft implementation used org.apache.lucene.util.IOUtils.deleteFilesIgnoringExceptions here. That's a forbidden API, so I couldn't use it, but it seemed like we were trying to suppress any IOException that happened during that deletion, so I added the try-catch.

I see that in DISIAccumulator#close() you use @SuppressForbidden to call IOUtils.deleteFilesIgnoringExceptions. I'll do the same.
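The difference between the two cleanup styles discussed in this thread can be sketched with the JDK alone. This is not the Lucene IOUtils helper or the PR's code: `TempFileCleanup`, `deleteIgnoringExceptions`, and `deleteOrThrow` are hypothetical names, and `java.nio.file` stands in for a Lucene Directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in built on java.nio.file; not the Lucene IOUtils API.
final class TempFileCleanup {

    /** Best-effort cleanup: any IOException raised while deleting is swallowed. */
    static void deleteIgnoringExceptions(Path... files) {
        for (Path file : files) {
            if (file == null) {
                continue; // nothing was created, so nothing to delete
            }
            try {
                Files.deleteIfExists(file);
            } catch (IOException ignored) {
                // intentionally ignored: cleanup must not mask an earlier failure
            }
        }
    }

    /** Strict cleanup: a failed delete (including a missing file) propagates. */
    static void deleteOrThrow(Path file) throws IOException {
        Files.delete(file);
    }

    public static void main(String[] args) throws IOException {
        Path temp = Files.createTempFile("address-data", ".tmp");
        deleteIgnoringExceptions(temp, null);
        System.out.println("exists after cleanup: " + Files.exists(temp));
    }
}
```

The best-effort form fits a close() path where a cleanup failure should not hide the original error. The strict form fits a method whose signature already allows IOException, which is the point the reviewer raises.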

martijnvg · 2025-04-17T08:18:42Z

server/src/main/java/org/elasticsearch/index/codec/tsdb/es819/ES819TSDBDocValuesConsumer.java

+                    var addressMetaOutput = new ByteBuffersIndexOutput(addressMetaBuffer, "meta-temp", "meta-temp");
+                    var addressDataOutput = dir.createTempOutput(data.getName(), "address-data", ioContext)


Maybe, similarly to DISIAccumulator, we can encapsulate the accumulation of the addresses in a class that implements Closeable and has a build method that copies the data from the temp file to the actual data file and updates the metadata?

I think this could make the code a little more manageable, similar to the effect DISIAccumulator had.

Makes sense to me, I'll add that
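The shape being suggested can be sketched roughly as follows, using only JDK streams. This is an assumption-laden toy and not the accumulator class added in the PR: `TempFileOffsetsAccumulator`, `addDoc`, `build`, and `tempFile` are hypothetical names, a `java.nio.file` temp file replaces `dir.createTempOutput`, and the metadata update is reduced to returning the number of addresses written.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of a Closeable accumulator; not the class from the PR.
final class TempFileOffsetsAccumulator implements Closeable {
    private final Path tempFile;
    private final DataOutputStream tempOut;
    private long offset = 0;
    private int numDocs = 0;

    TempFileOffsetsAccumulator() throws IOException {
        tempFile = Files.createTempFile("address-data", ".tmp");
        tempOut = new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(tempFile)));
    }

    /** Records the start offset of a document that has valueCount values. */
    void addDoc(int valueCount) throws IOException {
        tempOut.writeLong(offset);
        offset += valueCount;
        numDocs++;
    }

    /**
     * Writes the closing offset, copies the temp file into the real output,
     * and returns how many addresses were copied. Intended to be called once.
     */
    int build(OutputStream data) throws IOException {
        tempOut.writeLong(offset);
        tempOut.flush();
        Files.copy(tempFile, data);
        return numDocs + 1;
    }

    /** Exposed so callers (and tests) can check that cleanup happened. */
    Path tempFile() {
        return tempFile;
    }

    @Override
    public void close() throws IOException {
        tempOut.close();
        Files.deleteIfExists(tempFile);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream data = new ByteArrayOutputStream();
        try (TempFileOffsetsAccumulator acc = new TempFileOffsetsAccumulator()) {
            acc.addDoc(2);
            acc.addDoc(1);
            System.out.println("addresses written: " + acc.build(data));
        }
    }
}
```

With try-with-resources the temp file is removed whether or not build is reached, which keeps the cleanup logic out of the calling method.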

martijnvg

I left one tiny comment. Otherwise LGTM.

martijnvg · 2025-04-18T05:27:13Z

server/src/main/java/org/elasticsearch/index/codec/tsdb/es819/ES819TSDBDocValuesConsumer.java

+        FieldInfo field,
+        TsdbDocValuesProducer valuesProducer,
+        long maxOrd,
+        CheckedIntConsumer<IOException> docCountConsumer


nit: maybe just pass down the OffsetsAccumulator here? No need for an int consumer abstraction at this level.

elasticsearchmachine · 2025-04-18T20:55:33Z

💚 Backport successful

Status	Branch	Result
✅	8.x

Coalesce getSortedNumeric calls for ES819 doc values merging

b1eedf0

jordan-powers added the >non-issue, auto-backport, :StorageEngine/Codec, v8.19.0, and v9.1.0 labels Apr 11, 2025

jordan-powers self-assigned this Apr 11, 2025

jordan-powers added 2 commits April 14, 2025 10:27

Avoid using forbidden lucene IOUtils api

fa51f2c

Merge remote-tracking branch 'upstream/main' into es819-merge-optimization

55c56e2

jordan-powers requested a review from martijnvg April 14, 2025 19:57

jordan-powers marked this pull request as ready for review April 14, 2025 19:57

elasticsearchmachine added the Team:StorageEngine label Apr 14, 2025

jordan-powers commented Apr 14, 2025

martijnvg reviewed Apr 15, 2025

martijnvg mentioned this pull request Apr 15, 2025

Optimize segment merging in the tsdb doc value codec #126111

Closed

5 tasks

jordan-powers added 6 commits April 15, 2025 12:27

Index multi-valued sorted numeric doc values in ES819 force merge tests

187fe48

Merge remote-tracking branch 'upstream/main' into es819-merge-optimization

4e94261

Merge remote-tracking branch 'upstream/main' into es819-merge-optimization

183ccb2

Use IOContext from SegmentWriteState

7237f19

Merge remote-tracking branch 'upstream/main' into es819-merge-optimization

6b733d3

Merge remote-tracking branch 'upstream/main' into es819-merge-optimization

5147b91

martijnvg reviewed Apr 17, 2025

jordan-powers added 6 commits April 17, 2025 10:15

Merge remote-tracking branch 'upstream/main' into es819-merge-optimization

80f0879

Don't ignore IOExceptions in temp file cleanup

d61633b

Encapsulate address accumulation logic in separate class

76f78d2

Merge remote-tracking branch 'upstream/main' into es819-merge-optimization

8aee9ef

Merge remote-tracking branch 'upstream/main' into es819-merge-optimization

de86949

Merge remote-tracking branch 'upstream/main' into es819-merge-optimization

402a25b

martijnvg approved these changes Apr 18, 2025

jordan-powers added 2 commits April 18, 2025 12:40

Just pass the accumulator directly into writeField

682a7a6

Merge remote-tracking branch 'upstream/main' into es819-merge-optimization

1c8b60d

jordan-powers enabled auto-merge (squash) April 18, 2025 19:58

jordan-powers merged commit b972364 into elastic:main Apr 18, 2025
17 checks passed

jordan-powers deleted the es819-merge-optimization branch April 18, 2025 20:53

jordan-powers mentioned this pull request Apr 18, 2025

[8.x] Optimize ES819 doc values address offset calculation (#126732) #127079

Merged

jordan-powers mentioned this pull request Apr 23, 2025

Apply TSDB jump table and offset construction optimizations to binary doc values #127278

Merged

jordan-powers added a commit that referenced this pull request Apr 24, 2025

Apply recent TSDB codec merge optimizations to binary values (#127278)

69c2eda

Applies the merge optimizations from #126499 and #126732 to binary field types for the ES819 codec.

elasticsearchmachine pushed a commit that referenced this pull request Apr 24, 2025

Apply recent TSDB codec merge optimizations to binary values (#127278) (#127346)

1d1e85d

Applies the merge optimizations from #126499 and #126732 to binary field types for the ES819 codec.
