Reintroduce compression for binary doc_values #112416
dnhatn wants to merge 9 commits into elastic:main
Conversation
Hi @dnhatn, I've created a changelog YAML for you.
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
```java
meta.writeByte(binaryDVCompressionMode.code);
switch (binaryDVCompressionMode) {
    case NO_COMPRESS -> doAddUncompressedBinary(field, valuesProducer);
    case COMPRESSED_WITH_LZ4 -> doAddCompressedBinary(field, valuesProducer);
```
Does it make sense to compress with zstd?
I see the numbers in the description. It seems zstd offers a substantial improvement over lz4, as usual; I wonder how much risk that would bring here, though.
Yes, I'm not sure why we're keeping zstd behind the feature flag. If we're okay with it, I can switch to zstd.
zstd usage in stored fields is behind a feature flag, specifically for get-by-id performance in the best-speed scenario. Hopefully we can remove the feature flag soon, after we have done a few more experiments with different settings for best-speed mode.
I think in the case of binary doc values we should use zstd instead of lz4?
Thanks @martijnvg. I will switch this to zstd.
It would be worth checking how it affects queries/aggs that need binary doc values, e.g. maybe the geoshape track?
```java
this.addresses = addresses;
this.compressedData = compressedData;
// pre-allocate a byte array large enough for the biggest uncompressed block needed.
this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
```
We need to be careful here. I have seen (when this was first introduced) that this array could get unwieldy big, and then we can have big issues with humongous allocations. This is actually pretty dangerous.
IIRC, as part of #105301 we tried to add compression to binary doc values, but the same concern was raised and we went with an approach that didn't do compression, just in order to allow tsdb codecs to be used for all doc value fields.
Let's initialize the size to something like min(16kB, biggestUncompressedBlockSize) and dynamically resize on read? This will still help small values by never having to resize the array in practice?
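The lazy-growth strategy suggested above could look roughly like the following sketch. This is purely illustrative (the class and method names are not the actual Elasticsearch code): the scratch buffer starts at min(16 KiB, biggestUncompressedBlockSize) and only grows when a block actually needs more room.

```java
// Hypothetical sketch of the suggested lazy-growth strategy; names are
// illustrative, not the actual Elasticsearch implementation.
class UncompressedBlockBuffer {
    private static final int INITIAL_CAP = 16 * 1024; // 16 KiB

    private byte[] block;

    UncompressedBlockBuffer(int biggestUncompressedBlockSize) {
        // Small fields never pay for the biggest block up front.
        block = new byte[Math.min(INITIAL_CAP, biggestUncompressedBlockSize)];
    }

    /** Returns a buffer with at least {@code required} bytes, growing if needed. */
    byte[] ensureCapacity(int required) {
        if (block.length < required) {
            // Grow geometrically to amortize the cost of repeated resizes.
            int newSize = Math.max(required, block.length + (block.length >> 1));
            block = new byte[newSize];
        }
        return block;
    }

    int capacity() {
        return block.length;
    }
}
```

With this shape, fields whose blocks stay under 16 KiB never resize, while larger blocks still only allocate what a read actually needs.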
This might still result in OOMs, I guess.
My point here is that we always add the same number of doc values per block, regardless of the size of the binary doc values, so a block can get pretty big. I think we should limit the block size, so that blocks can hold different numbers of doc values.
iverase
left a comment
In general I am against the change as it is.
We currently add a fixed number of docs per block regardless of the size of the binary doc values, which can lead to very big blocks. We need to make sure those blocks have a size limit, or this becomes very dangerous.
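One way to enforce such a size limit is to flush a block once its accumulated bytes reach a threshold, while guaranteeing at least one value per block. The sketch below is illustrative only (not the actual codec code); it just tracks how many values end up in each block:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of writing a variable number of values per block:
// a block is flushed once its accumulated size reaches a threshold, and
// every block holds at least one value, so a single huge value can never
// force an oversized multi-value block.
class SizeLimitedBlockWriter {
    private final int maxBlockBytes;
    private final List<Integer> valuesPerBlock = new ArrayList<>();
    private int pendingValues = 0;
    private int pendingBytes = 0;

    SizeLimitedBlockWriter(int maxBlockBytes) {
        this.maxBlockBytes = maxBlockBytes;
    }

    void addValue(int valueLength) {
        pendingValues++;
        pendingBytes += valueLength;
        // Flush when the threshold is reached; a lone oversized value
        // still forms its own single-value block.
        if (pendingBytes >= maxBlockBytes) {
            flush();
        }
    }

    void finish() {
        if (pendingValues > 0) {
            flush();
        }
    }

    private void flush() {
        valuesPerBlock.add(pendingValues);
        pendingValues = 0;
        pendingBytes = 0;
    }

    List<Integer> blocks() {
        return valuesPerBlock;
    }
}
```

The memory needed to (de)compress any block is then bounded by roughly the threshold plus one value, instead of 32 times the largest value.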
```java
final BinaryDVCompressionMode binaryDVCompressionMode;

public ES87TSDBDocValuesFormat() {
    this(BinaryDVCompressionMode.NO_COMPRESS);
```
Should this be changed to COMPRESSED_WITH_LZ4? Otherwise compression doesn't get used outside tests?
Thanks, @iverase. I copied this from LUCENE-9211. That's a good point; I'll introduce a chunk size limit for it.
Thanks @dnhatn, that will remove my concern here.
jpountz
left a comment
It seems to me that the NO_COMPRESS option is more about backward compatibility than about enabling users to disable compression on their binary doc values. If so, I wonder if we should fork a new format, e.g. ES816TSDBDocValuesFormat?
```java
final IndexOutput tempBinaryOffsets;

CompressedBinaryBlockWriter() throws IOException {
    tempBinaryOffsets = EndiannessReverserUtil.createTempOutput(
```
We don't need to care about endianness here, do we?
@jpountz I started with forking the format, but it was too much code, so I reverted and applied the diff to the current codec. I will update the PR with a forked codec :)
The keyword doc values field gets an extra binary doc values field that encodes the order in which array values were specified at index time. This also captures duplicate values. It is stored as an offset-to-ordinal array that gets vint-encoded into the binary doc values field. The additional storage required for this will likely be minimized by elastic#112416 (zstd compression for binary doc values).

For example, given the string array ["c", "b", "a", "c"] for a keyword field, the sorted set doc values are ["a", "b", "c"] with ordinals 0, 1 and 2, and the offset array will be [2, 1, 0, 2].

Limitations:
* Only supported for the keyword field mapper.
* Multi-level leaf arrays are flattened. For example: [[b], [c]] -> [b, c]
* Empty arrays ([]) are not recorded.
* Arrays are always synthesized as one type. In the case of a keyword field, [1, 2] gets synthesized as ["1", "2"].

These limitations can be addressed, but some require more complexity and/or additional storage.
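The offset-to-ordinal mapping described above can be sketched as follows. This is illustrative only (not the actual mapper code): the sorted set of unique values assigns each value an ordinal, and a parallel array records, for each original array position, the ordinal of the value that appeared there, preserving both order and duplicates.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Illustrative sketch of the offset-to-ordinal encoding, not the actual
// Elasticsearch mapper code.
class OffsetToOrdinals {
    static int[] offsets(List<String> originalValues) {
        // Sorted set of unique values, as SORTED_SET doc values store them;
        // ordinals are assigned in sorted order.
        Map<String, Integer> ordinals = new HashMap<>();
        for (String value : new TreeSet<>(originalValues)) {
            ordinals.put(value, ordinals.size());
        }
        // For each original position, record the ordinal of the value there.
        int[] offsets = new int[originalValues.size()];
        for (int i = 0; i < offsets.length; i++) {
            offsets[i] = ordinals.get(originalValues.get(i));
        }
        return offsets;
    }
}
```

Running this on the example from the text, ["c", "b", "a", "c"] yields [2, 1, 0, 2].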
Add compression for binary doc values using Zstd and blocks with a variable number of values.

Block-wise LZ4 compression for binary doc values was previously added to Lucene in LUCENE-9211 and subsequently removed in LUCENE-9378 due to query performance issues. We investigated adding the original Lucene implementation to ES in #112416 and #105301. That approach used a constant number of values per block (specifically 32 values). This is nice because it makes it very easy to map a given value index (e.g. docId for dense values) to the block containing it with blockId = docId / 32. Unfortunately, if values are very large we cannot reduce the number of values per block, and (de)compressing a block could cause an OOM. Also, because of this concern, we have to keep the number of values per block lower than ideal.

This PR instead stores a variable number of documents per block. It stores a minimum of 1 document per block and stops adding values when the size of a block exceeds a threshold. Like the previous version, it stores an array of addresses for the start of each block. Additionally, it stores a parallel array with the value index at the start of each block. When looking up a given value index, if it is not in the current block, we binary search the array of value-index starts to find the blockId containing the value, then look up the address of that block.
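The lookup described above can be sketched like this. It is an illustrative sketch, not the actual codec code; `blockFirstValue` is a hypothetical name for the parallel array holding the index of the first value stored in each block:

```java
// Illustrative sketch of the variable-block lookup, not the actual codec
// code: binary search for the greatest first-value index <= the target,
// which identifies the block containing the value; the block's address can
// then be read from the parallel addresses array.
class BlockLookup {
    /** Returns the id of the block whose first-value index is the greatest one <= valueIndex. */
    static int findBlock(long[] blockFirstValue, long valueIndex) {
        int lo = 0;
        int hi = blockFirstValue.length - 1;
        while (lo < hi) {
            // Upper midpoint, so lo = mid always makes progress.
            int mid = (lo + hi + 1) >>> 1;
            if (blockFirstValue[mid] <= valueIndex) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }
}
```

For example, with blocks starting at value indices {0, 4, 9}, value index 6 falls in block 1 and value index 9 starts block 2. The binary search is O(log #blocks), versus the O(1) `docId / 32` of the fixed-size scheme, which is the cost paid for bounding block sizes.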
This change reintroduces compression for binary doc_values from LUCENE-9211 for TSDB and logs indices.
I ran a quick test comparing lz4 and zstd; zstd saved approximately 25% more storage:
Should we consider using zstd instead of lz4 for compression here?
Relates #78266