LUCENE-9211 Add compression for Binary doc value fields #1234
markharwood merged 8 commits into apache:master from
Conversation
FYI - related Jira issue is pending. I can't log in until Jira admins recover my identity which looks to have been accidentally merged with a different "markh".
LOL, that's crazy -- you should go introduce yourself to the other
@markharwood how can we reproduce these benchmarks? What were the log data documents storing as And how can indexing and searching get so much faster when compress/decompress is in the path! These numbers seem a little too good to believe ... but if they hold up, that's incredible. Lucene's default codec doesn't do any compression for
mikemccand left a comment
Thanks @markharwood!
I think our testing of BINARY doc values may not be great ... maybe add a randomized test that sometimes stores very compressible and very incompressible, large, BINARY doc values?
Separately, another way we could increase test coverage is to make a mock codec that implements doc values formats on top of the BINARY format of the underlying codec. I.e. it'd implement NUMERIC by converting the longs to 8 bytes, and save to a separate BINARY field. This should massively improve our test coverage of BINARY ... I'll open an issue for that.
data.writeBytes(v.bytes, v.offset, v.length);
minLength = Math.min(length, minLength);
maxLength = Math.max(length, maxLength);
class CompressedBinaryBlockWriter  implements Closeable {
Extra space character before implements?
start = data.getFilePointer();
meta.writeLong(start);
private void flushData() throws IOException {
if(numDocsInCurrentBlock > 0) {
// already read and uncompressed?
if ( blockId != lastBlockId) {
Remove space after ( before blockId?
}
}
// Position the Bytes ref to the relevant part of the uncompressed block
meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
meta.writeShort(jumpTableEntryCount);
meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
void  addDoc(int doc, BytesRef v) throws IOException {
Remove one of the spaces after void before addDoc?
totalChunks++;
long thisBlockStartPointer = data.getFilePointer();
data.writeVInt(numDocsInCurrentBlock);
for (int i = 0; i < numDocsInCurrentBlock; i++) {
Probably we could (later, separate issue) optimize writing these lengths -- often all docs will have the same length?
fp += filePointersIn.readVLong();
}
if (maxPointer < fp) {
throw new CorruptIndexException("File pointers don't add up", filePointersIn);
Can you include maxPointer and fp in this exception message?
Thanks for looking at this, Mike.
I already reached out and we're working out the divorce proceedings :)
These were elasticsearch log file entries - so each value was a string which could be something short like
This was a test on my macbook with SSD and encrypted FS so perhaps not the best benchmarking setup. Maybe just writing more bytes = more overhead with the OS-level encryption?
Will do. @jimczi has suggested adding support for storing without compression when the content doesn't compress well. I guess that can be a combination of:
I tried benchmarking some straightforward file read and write operations (no Lucene) and couldn't show LZ4 compression being faster (although it wasn't that much slower). Maybe the rate-limited merging in Lucene plays a part, and size therefore matters in that context?
IndexOutput data, meta;
final int maxDoc;
private SegmentWriteState state;
int uncompressedBlockLength = 0;
int maxUncompressedBlockLength = 0;
int numDocsInCurrentBlock = 0;
int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
we usually don't put spaces between the type of the array elements and []
- int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
+ int[] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
int maxUncompressedBlockLength = 0;
int numDocsInCurrentBlock = 0;
int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
byte [] block = new byte [1024 * 16];
- byte [] block = new byte [1024 * 16];
+ byte[] block = new byte [1024 * 16];
totalChunks++;
long thisBlockStartPointer = data.getFilePointer();
data.writeVInt(numDocsInCurrentBlock);
for (int i = 0; i < numDocsInCurrentBlock; i++) {
} catch (Throwable exception) {
IOUtils.closeWhileHandlingException(this); // self-close because constructor caller can't
throw exception;
}
we usually do it like this instead, which helps avoid catching Throwable:
boolean success = false;
try {
  // write header
  success = true;
} finally {
  if (success == false) {
    // close
  }
}
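For completeness, a runnable sketch of that idiom with a simulated header failure (the class and its fields are illustrative, not the PR's actual code). `closed` stands in for what Lucene's `IOUtils.closeWhileHandlingException` would do:

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch of the success/finally pattern suggested above; SafeInit and its
// fields are hypothetical, not the PR's actual code.
public class SafeInit implements Closeable {
    static boolean closed = false;
    private final boolean failHeader;

    SafeInit(boolean failHeader) throws IOException {
        this.failHeader = failHeader;
        boolean success = false;
        try {
            writeHeader();   // may throw
            success = true;  // reached only if the header was fully written
        } finally {
            if (success == false) {
                close();     // self-close: the caller never gets a reference
            }
        }
    }

    private void writeHeader() throws IOException {
        if (failHeader) {
            throw new IOException("simulated header failure");
        }
    }

    @Override
    public void close() {
        closed = true;
    }
}
```

Note the easy-to-forget `success = true;` as the last statement of the try block; without it the finally clause would close the resource even on success. Unlike catching Throwable, the finally clause also runs cleanup for errors such as OutOfMemoryError without needing an explicit rethrow.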
What was the "+1" comment for line 407 about?
I've seen encodings elsewhere that have n+1 offsets to record the start of each value, where the last offset is effectively the end of the last value. In this scenario I'm writing n value lengths.
It was about optimizing for the case that all values have the same length. In that case we could steal one bit of the first length to mean that all values have the same length, for instance?
void addDoc(int doc, BytesRef v) throws IOException {
if (blockAddressesStart < 0) {
blockAddressesStart = data.getFilePointer();
}
it looks like we could set blockAddressesStart in the constructor instead?
I tried that and it didn't work - something else was writing to data in between constructor and addDoc calls
Have you found what this something else is?
data.writeVInt(docLengths[i]);
}
maxUncompressedBlockLength = Math.max(maxUncompressedBlockLength, uncompressedBlockLength);
LZ4.compress(block, 0, uncompressedBlockLength, data, ht);
}

void writeMetaData() throws IOException {
if (blockAddressesStart < 0 ) {
this only happens if there are no values? when do we run into this condition?
Looks to be when merges clear out deleted docs leaving no values.
This makes sense, can you leave a comment about it?
this.compressedData = compressedData;
// pre-allocate a byte array large enough for the biggest uncompressed block needed.
this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
we could initialize uncompressedBytesRef from the uncompressed block:
uncompressedBytesRef = new BytesRef(uncompressedBlock)
and avoid creating new BytesRefs over and over in decode
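A minimal sketch of that suggestion, using a stand-in class for Lucene's BytesRef (it has the same three public fields) since the real class isn't reproduced here:

```java
// Stand-in for org.apache.lucene.util.BytesRef (bytes/offset/length fields).
class Ref {
    byte[] bytes;
    int offset;
    int length;
    Ref(byte[] bytes) { this.bytes = bytes; }
}

// Sketch: allocate the BytesRef once over the reusable block buffer, then
// reposition it per document instead of allocating a new one in decode().
public class ReuseDemo {
    private final byte[] uncompressedBlock = new byte[16 * 1024];
    private final Ref uncompressedBytesRef = new Ref(uncompressedBlock);

    Ref slice(int start, int len) {
        uncompressedBytesRef.offset = start;
        uncompressedBytesRef.length = len;
        return uncompressedBytesRef;   // same instance every call, no garbage
    }
}
```

Callers must consume the returned ref before the next `slice` call, which matches how doc values consumers already treat BytesRef instances as transient.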
I've reclaimed my Jira log-in and opened https://issues.apache.org/jira/browse/LUCENE-9211
There was a suggestion from @jimczi that we fall back to writing raw data if content doesn't compress well. I'm not sure this logic is worth developing for the reasons outlined below: I wrote a compression buffer to see what the compression algo outputs before deciding whether to write the compressed or raw data to disk. The LZ4 compressed versions of this content were only marginally bigger than their raw counterparts (adding 0.4% overhead to the original content e.g. 96,921 compressed vs 96,541 raw bytes).
Did you also test read performance in this incompressible case?
In the case of content that can't be compressed, the compressed data will consist of the number of bytes, followed by the bytes. So decompressing consists of decoding the length and then reading the bytes. The only overhead compared to reading bytes directly is the decoding of the number of bytes, so I would believe that the overhead is rather small. I don't have a strong preference regarding whether this case should be handled explicitly or not. It's true that not having a special "not-compressed" case helps keep the logic simpler.
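To make that concrete, here is a sketch of what reading an incompressible block boils down to: decode a variable-length byte count, then copy the bytes. The vInt helpers mimic the wire format of Lucene's DataOutput.writeVInt but are simplified stand-ins, not Lucene code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

public class RawBlockDemo {
    // Simplified vInt: 7 bits per byte, high bit set on continuation bytes.
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    static int readVInt(ByteArrayInputStream in) {
        int v = 0, shift = 0, b;
        do {
            b = in.read();
            v |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return v;
    }

    // "Decompressing" incompressible content: the only work beyond a raw
    // copy is decoding the length prefix.
    static byte[] decode(byte[] block) {
        ByteArrayInputStream in = new ByteArrayInputStream(block);
        int len = readVInt(in);
        byte[] out = new byte[len];
        in.read(out, 0, len);
        return out;
    }
}
```

The length decode is a handful of byte reads per block, which supports the expectation above that the overhead should be hard to measure.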
Just tried it and it does look 4x faster reading raw random bytes vs compressed random bytes
uncompressedBlockLength += onlyLength;
}
}
uncompressedDocEnds[i] = uncompressedBlockLength;
maybe we could call it uncompressedDocStarts and set the index at i+1, which would then help remove the else block of the docInBlockId > 0 condition below?
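That suggestion in code form (helper names are hypothetical): with an n+1 "starts" array, doc i always spans starts[i]..starts[i+1], so the special case for the first doc in the block disappears:

```java
// Sketch of the docEnds -> docStarts idea; names are hypothetical,
// not the PR's actual code.
public class DocStartsDemo {
    static int[] toStarts(int[] docEnds, int numDocs) {
        int[] starts = new int[numDocs + 1];  // starts[0] stays 0
        for (int i = 0; i < numDocs; i++) {
            starts[i + 1] = docEnds[i];
        }
        return starts;
    }

    // with ends:   offset = (i > 0) ? ends[i - 1] : 0; length = ends[i] - offset
    // with starts: offset = starts[i];                 length below, no branch
    static int length(int[] starts, int i) {
        return starts[i + 1] - starts[i];
    }
}
```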
long blockStartOffset = addresses.get(blockId);
compressedData.seek(blockStartOffset);
numDocsInBlock = compressedData.readVInt();
do we really need to record the number of documents in the block? It should be 32 for all blocks except for the last one? Maybe at index-time we could append dummy values to the last block to make sure it has 32 values too, and we wouldn't need this vInt anymore?
// all other values are the same length
int lengthPlusSameInd = compressedData.readVInt();
int sameIndicator = lengthPlusSameInd & 1;
int firstValLength = lengthPlusSameInd >>1;
Since you are stealing a bit, we should do an unsigned shift (>>>) instead.
This would never be a problem in practice, but imagine that the length was a 31-bit integer. Shifting left by one bit at index time would make this number negative. So here we need an unsigned shift rather than a signed shift that preserves the sign.
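The point can be sketched with hypothetical encode/decode helpers for the stolen-bit header (not the PR's actual code):

```java
// Sketch of the stolen-bit length header discussed above; helper names
// are hypothetical, not the PR's actual code.
public class LengthHeader {
    static int encode(int firstLength, boolean allSameLength) {
        // shift the length left and put the flag in the low bit
        return (firstLength << 1) | (allSameLength ? 1 : 0);
    }

    static boolean allSameLength(int header) {
        return (header & 1) != 0;
    }

    static int firstLength(int header) {
        // >>> not >>: a 31-bit length shifted left at write time sets the
        // sign bit, and a signed shift would preserve it on the way back.
        return header >>> 1;
    }
}
```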
private final IndexInput compressedData;
// Cache of last uncompressed block
private long lastBlockId = -1;
private int []uncompressedDocEnds = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
in the past we've put these constants in the meta file and BinaryEntry so that it's easier to change values over time
@jpountz we should use the same structure while writing the data; in that case you will see all the properties of the class instead of adding comments in the code
numDocsInBlock = compressedData.readVInt();
assert numDocsInBlock <= Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
uncompressedDocEnds = new int[numDocsInBlock];
can we reuse the same array across blocks?
@msokolov FWIW LZ4 only removes duplicate strings from a stream: when it finds one it inserts a reference to a previous sequence of bytes. In the special case that the content is incompressible, the LZ4 compressed data just consists of the number of bytes followed by the bytes, so the only overhead compared to reading the bytes directly is the decoding of the number of bytes, which should be rather low. I don't have a preference regarding whether we should have an explicit "not-compressed" case, but I understand how not having one helps keep things simpler.
Strange that Mark would measure 4x slowdown from decoding the lengths... Perhaps the random bytes are not totally incompressible, just barely compressible?
I may have been too hasty in that reply - I've not been able to reproduce that, and the raw vs compressed timings are very similar in the additional tests I've done, echoing what @jpountz expects. My first (faster) run had random bytes selected in the range 0-20 and not the 0-127 range where I'm seeing parity.
try (CompressedBinaryBlockWriter blockWriter = new CompressedBinaryBlockWriter()){
BinaryDocValues values = valuesProducer.getBinary(field);
long start = data.getFilePointer();
meta.writeLong(start); // dataOffset
I think we should use the BinaryEntry object here, and then just make the object "Writable" to a given DataOutput and "Readable" from a DataInput (which is already the case: readBinaryEntry). This will avoid comments in the code like -2 == docsWithFieldOffset etc.
I like the idea but would prefer doing it in a separate PR.
if (numDocsWithField == 0) {
  meta.writeLong(-2); // docsWithFieldOffset
  meta.writeLong(0L); // docsWithFieldLength
  meta.writeShort((short) -1); // jumpTableEntryCount
  meta.writeByte((byte) -1); // denseRankPower
} else if (numDocsWithField == maxDoc) {
  meta.writeLong(-1); // docsWithFieldOffset
  meta.writeLong(0L); // docsWithFieldLength
  meta.writeShort((short) -1); // jumpTableEntryCount
  meta.writeByte((byte) -1); // denseRankPower
} else {
  long offset = data.getFilePointer();
  meta.writeLong(offset); // docsWithFieldOffset
  values = valuesProducer.getBinary(field);
  final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER);
  meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
  meta.writeShort(jumpTableEntryCount);
  meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
}
Currently I'm working on a refactor of this code, introducing a doc id set iterator serializer capable of providing the correct instance based on the stored metadata. As you can see, this is quite repetitive for the other fields.
static final String META_EXTENSION = "dvm";
static final int VERSION_START = 0;
static final int VERSION_CURRENT = VERSION_START;
static final int VERSION_BIN_COMPRESSED = 1;
This could potentially be in the BinaryDocValuesFormat class
jpountz left a comment
I left some minor comments, but I'm +1 on this PR, it looks great.
void addDoc(int doc, BytesRef v) throws IOException {
if (blockAddressesStart < 0) {
blockAddressesStart = data.getFilePointer();
}
Have you found what this something else is?
b1048ba to a20da17
… metadata, remove per-chunk storage of num docs and assume always same number of values. Change doc end pointers to be doc starts.
…sesStart in CompressedBinaryBlockWriter constructor, serialise shift for number of docs in block,
e830eda to 7d0db34
Stores groups of 32 binary doc values in LZ4-compressed blocks.
Stores groups of 32 binary doc values in LZ4-compressed blocks. (cherry picked from commit f549ee3)
The binary doc values implementation from LUCENE-9211 (apache/lucene-solr#1234) is used here, which stores the values in LZ4-compressed blocks in order to reduce storage usage (in Lucene's default doc values codec, binary doc values are stored without any compression). Follow up from elastic#105301
This PR stores groups of 32 doc values in LZ4 compressed blocks.
Write performance
Test results for loading 680MB of log data (1.8m docs) are as follows:
Read performance
Time taken to read 5,000 random doc IDs from above indices
On this particular test, the read and write speeds and resulting index size were all improved over the current master implementation. Obviously performance will vary in other tests, with the main factors for changes being: