Coalesce getSortedNumeric calls for ES819 doc values merging #126732
jordan-powers merged 17 commits into elastic:main
Pinging @elastic/es-storage-engine (Team:StorageEngine)
    String addressDataOutputName = null;
    try (
        var addressMetaOutput = new ByteBuffersIndexOutput(addressMetaBuffer, "meta-temp", "meta-temp");
        // TODO: which IOContext should be used here?
This comment was in Martijn's initial implementation, and I didn't know the answer, so I left it. I'd appreciate suggestions.
@dnhatn Do you think usage of IOContext.DEFAULT is ok here or is there a better IOContext that can be used here?
I think we need to do something like this: #126499 (comment)
martijnvg left a comment
This looks good Jordan.
Would you be able to change the ES819TSDBDocValuesFormatTests#testForceMergeDenseCase() and ES819TSDBDocValuesFormatTests#testForceMergeSparseCase() tests to also index multi-valued sorted numeric doc values? For example, by randomly indexing the gauge_2 field with multiple values (similar to the tags field)?
I also re-ran the micro benchmark, which looks better, as was reported in #125403. UPDATE: Running the same micro benchmark from the main branch is slightly slower than what was reported above.
martijnvg left a comment
Thanks for iterating. I left two more comments.
    } catch (final IOException ignored) {
        // ignore exception
    }
Can this try-catch be removed? This method signature does allow IOException. If addressDataOutputName is not null, then there should be a temp file, I think?
Will do.
I originally added the try-catch because the draft implementation used org.apache.lucene.util.IOUtils.deleteFilesIgnoringExceptions here. That's a forbidden API, so I couldn't use it, but it seemed like we were trying to suppress any IOException that happened during that deletion, so I added the try-catch.
I see in DISIAccumulator#close(), you use @SuppressForbidden to call IOUtils.deleteFilesIgnoringExceptions, I'll do the same.
    var addressMetaOutput = new ByteBuffersIndexOutput(addressMetaBuffer, "meta-temp", "meta-temp");
    var addressDataOutput = dir.createTempOutput(data.getName(), "address-data", ioContext)
Maybe, similarly to DISIAccumulator, we can encapsulate the accumulation of the addresses in a class that implements Closeable and has a build method that copies data from the temp file to the actual data file and updates the metadata?
I think this could make the code a little more manageable, similar to the effect DISIAccumulator had.
Makes sense to me, I'll add that
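A rough sketch of what such an accumulator could look like, for readers following the suggestion. This is not the class added in the PR: the name `TempFileOffsetsAccumulator`, the `addDoc` method, and the `demo` helper are hypothetical, and it uses plain `java.nio.file` and fixed-width longs in place of Lucene's `Directory` temp outputs and monotonic encoding. It assumes the addresses are cumulative value counts with a leading 0.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Closeable;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in for the suggested accumulator (stdlib only, no Lucene). */
final class TempFileOffsetsAccumulator implements Closeable {
    private final Path tempFile;
    private final DataOutputStream tempOut;
    private long offset = 0;
    private int docCount = 0;

    TempFileOffsetsAccumulator(Path dir) throws IOException {
        tempFile = Files.createTempFile(dir, "address-data", ".tmp");
        tempOut = new DataOutputStream(Files.newOutputStream(tempFile));
        tempOut.writeLong(0L); // assumed layout: the first address is always 0
    }

    /** Called once per document while its values are being written. */
    void addDoc(int valueCount) throws IOException {
        offset += valueCount;
        tempOut.writeLong(offset);
        docCount++;
    }

    /** Copies the accumulated addresses into the final output; returns how many were written. */
    int build(OutputStream data) throws IOException {
        tempOut.close();
        Files.copy(tempFile, data);
        return docCount + 1;
    }

    /** Closes the temp output and removes the temp file, whether or not build() was called. */
    @Override
    public void close() throws IOException {
        try {
            tempOut.close();
        } finally {
            Files.deleteIfExists(tempFile);
        }
    }

    /** Usage example: three documents with 2, 0 and 1 values. */
    static long[] demo() {
        try {
            Path dir = Files.createTempDirectory("offsets-demo");
            ByteArrayOutputStream data = new ByteArrayOutputStream();
            int count;
            try (TempFileOffsetsAccumulator acc = new TempFileOffsetsAccumulator(dir)) {
                acc.addDoc(2);
                acc.addDoc(0);
                acc.addDoc(1);
                count = acc.build(data);
            }
            long[] addresses = new long[count];
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data.toByteArray()));
            for (int i = 0; i < count; i++) {
                addresses[i] = in.readLong();
            }
            Files.delete(dir); // only succeeds if close() removed the temp file
            return addresses;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Using try-with-resources around the accumulator keeps the temp-file cleanup in one place, which is the manageability gain the comment is after.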
martijnvg left a comment
I left one tiny comment. Otherwise LGTM.
    FieldInfo field,
    TsdbDocValuesProducer valuesProducer,
    long maxOrd,
    CheckedIntConsumer<IOException> docCountConsumer
nit: maybe just pass down the OffsetsAccumulator here? No need for an int consumer abstraction at this level.
💚 Backport successful
When writing the doc values addresses, we currently perform an iteration over all the sorted numeric doc values to calculate the addresses. When merging sorted segments, this iteration is expensive as it requires performing a merge sort.
This patch removes this iteration by instead calculating the addresses while we are writing the values, writing the addresses to a temporary file. Afterwards, they are copied from the temporary file into the merged segment.
Relates to #126111
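
To illustrate the idea in the description, here is a minimal in-memory sketch contrasting a separate address pass with addresses computed during the value pass. It is not the ES819 consumer code: `SinglePassAddresses`, `Encoded`, `encode`, and `twoPassAddresses` are hypothetical names, a `List<long[]>` stands in for the merged sorted numeric doc values, and the address layout (cumulative value counts with a leading 0) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of coalescing the address pass into the value pass. */
final class SinglePassAddresses {

    /** Flattened values plus one start offset per document and a final end offset. */
    static final class Encoded {
        final long[] values;
        final long[] addresses;

        Encoded(long[] values, long[] addresses) {
            this.values = values;
            this.addresses = addresses;
        }
    }

    /** One iteration: values are collected and addresses are derived along the way. */
    static Encoded encode(List<long[]> docs) {
        List<Long> values = new ArrayList<>();
        long[] addresses = new long[docs.size() + 1];
        long offset = 0;
        int doc = 0;
        for (long[] docValues : docs) {
            addresses[doc++] = offset;      // where this document's values start
            for (long v : docValues) {
                values.add(v);
            }
            offset += docValues.length;
        }
        addresses[doc] = offset;            // end offset of the last document
        long[] flat = values.stream().mapToLong(Long::longValue).toArray();
        return new Encoded(flat, addresses);
    }

    /** The extra iteration the patch avoids: a second walk over the docs just for addresses. */
    static long[] twoPassAddresses(List<long[]> docs) {
        long[] addresses = new long[docs.size() + 1];
        long offset = 0;
        for (int i = 0; i < docs.size(); i++) {
            offset += docs.get(i).length;
            addresses[i + 1] = offset;
        }
        return addresses;
    }
}
```

Both methods produce the same addresses; the difference is that `encode` needs only one walk over the documents, which matters when each walk over merged sorted segments involves a merge sort.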