LUCENE-9211 Add compression for Binary doc value fields #1234

Merged: markharwood merged 8 commits into apache:master from markharwood:compressedBinDV on Feb 18, 2020

Conversation

@markharwood (Contributor)

This PR stores groups of 32 doc values in LZ4 compressed blocks.
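
At a high level the write path just buffers incoming values until a block of 32 is full, then compresses the whole block in one go. A minimal sketch of that idea (illustrative only, not the actual Lucene80DocValuesConsumer code; BlockBufferSketch, DOCS_PER_BLOCK and compressLZ4 are made-up names, and the placeholder compressor just copies bytes):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Sketch of the buffering scheme: values are staged in a scratch buffer and
// flushed as one LZ4-compressed block every 32 documents.
class BlockBufferSketch {
    static final int DOCS_PER_BLOCK = 32;

    private final ByteArrayOutputStream out = new ByteArrayOutputStream(); // stand-in for IndexOutput
    private final int[] docLengths = new int[DOCS_PER_BLOCK];
    private byte[] block = new byte[16 * 1024];
    private int numDocsInBlock = 0;
    private int blockLength = 0;

    void addValue(byte[] value) throws IOException {
        if (blockLength + value.length > block.length) {
            block = Arrays.copyOf(block, Math.max(block.length * 2, blockLength + value.length));
        }
        System.arraycopy(value, 0, block, blockLength, value.length);
        docLengths[numDocsInBlock++] = value.length;
        blockLength += value.length;
        if (numDocsInBlock == DOCS_PER_BLOCK) {
            flushBlock();
        }
    }

    private void flushBlock() throws IOException {
        // per-doc lengths first, then the compressed payload for the whole block
        for (int i = 0; i < numDocsInBlock; i++) {
            out.write(docLengths[i]); // the real code uses vInt encoding here
        }
        out.write(compressLZ4(block, blockLength));
        numDocsInBlock = 0;
        blockLength = 0;
    }

    // Placeholder for Lucene's LZ4 compression call; here it just copies the bytes.
    private byte[] compressLZ4(byte[] src, int len) {
        return Arrays.copyOf(src, len);
    }
}

Reads then only pay for decompressing the single 32-value block that contains the requested doc, which is where the access-pattern caveats below come from.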

Write performance

Test results for loading 680MB of log data (1.8M docs) are as follows:

Branch   | Load time (seconds, single thread) | Resulting index size (MB)
master   | 16                                 | 680
this PR  | 11                                 | 78

Read performance

Time taken to read 5,000 random doc IDs from the above indices:

Branch   | Read time (milliseconds, single thread)
master   | 284
this PR  | 63

In this particular test the read and write speeds and the resulting index size all improved over the current master implementation. Performance will obviously vary with other data sets, the main factors being:

  • size of fields,
  • compressibility of field contents,
  • read access patterns (hitting the same compressed blocks vs. different ones),
  • variation from doc to doc in field value sizes.

@markharwood markharwood requested review from jimczi and jpountz February 3, 2020 17:32
@markharwood (Contributor Author)

FYI - the related Jira issue is pending. I can't log in until the Jira admins recover my identity, which looks to have been accidentally merged with a different "markh".

@mikemccand (Member)

my identity which looks to have been accidentally merged with a different "markh".

LOL, that's crazy -- you should go introduce yourself to the other markh ;)

@mikemccand (Member)

@markharwood how can we reproduce these benchmarks? What were the log data documents storing as BINARY doc values fields?

And how can indexing and searching get so much faster when compress/decompress is in the path! These numbers seem a little too good to believe ... but if they hold up, that's incredible. Lucene's default codec doesn't do any compression for BINARY doc values so it'd be great to add that, especially with fun issues looking to give vectors first class support in Lucene (backed by BINARY doc values, I think).

@mikemccand left a comment (Member)

Thanks @markharwood!

I think our testing of BINARY doc values may not be great ... maybe add a randomized test that sometimes stores very compressible and very incompressible, large, BINARY doc values?

Separately, another way we could increase test coverage is to make a mock codec that implements doc values formats on top of the BINARY format of the underlying codec. I.e. it'd implement NUMERIC by converting the longs to 8 bytes, and save to a separate BINARY field. This should massively improve our test coverage of BINARY ... I'll open an issue for that.
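
For example, the NUMERIC-over-BINARY part of such a mock codec would only need a fixed-width conversion along these lines (hypothetical helper names, not an existing Lucene class):

import org.apache.lucene.util.BytesRef;

// Hypothetical helpers for a test-only codec that routes NUMERIC doc values
// through the BINARY format: each long becomes 8 big-endian bytes.
final class LongBytesCodecSketch {
    static BytesRef longToBytes(long value) {
        byte[] bytes = new byte[Long.BYTES];
        for (int i = Long.BYTES - 1; i >= 0; i--) {
            bytes[i] = (byte) value; // emit the low byte into the last position
            value >>>= 8;
        }
        return new BytesRef(bytes);
    }

    static long bytesToLong(BytesRef ref) {
        long value = 0;
        for (int i = 0; i < Long.BYTES; i++) {
            value = (value << 8) | (ref.bytes[ref.offset + i] & 0xFFL);
        }
        return value;
    }
}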

data.writeBytes(v.bytes, v.offset, v.length);
minLength = Math.min(length, minLength);
maxLength = Math.max(length, maxLength);
class CompressedBinaryBlockWriter implements Closeable {
Member

Extra space character before implements?

start = data.getFilePointer();
meta.writeLong(start);
private void flushData() throws IOException {
if(numDocsInCurrentBlock > 0) {
Member

Add space after if before (?



// already read and uncompressed?
if ( blockId != lastBlockId) {
Member

Remove space after ( before blockId?

}
}

// Position the Bytes ref to the relevant part of the uncompressed block
Member

s/Bytes ref/BytesRef?

meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
meta.writeShort(jumpTableEntryCount);
meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
void addDoc(int doc, BytesRef v) throws IOException {
Member

Remove one of the spaces after void before addDoc?

totalChunks++;
long thisBlockStartPointer = data.getFilePointer();
data.writeVInt(numDocsInCurrentBlock);
for (int i = 0; i < numDocsInCurrentBlock; i++) {
Member

Probably we could (later, separate issue) optimize writing these lengths -- often all docs will have the same length?

Contributor

+1

Contributor Author

Done

fp += filePointersIn.readVLong();
}
if (maxPointer < fp) {
throw new CorruptIndexException("File pointers don't add up", filePointersIn);
Member

Can you include maxPointer and fp in this exception message?

@markharwood (Contributor Author)

Thanks for looking at this, Mike.

LOL, that's crazy -- you should go introduce yourself to the other markh ;)

I already reached out and we're working out the divorce proceedings :)

@markharwood how can we reproduce these benchmarks? What were the log data documents storing as BINARY doc values fields?

These were Elasticsearch log file entries - so each value was a string which could be something short like [instance-0000000048] users file [/app/config/users] changed. updating users... or an error with a whole stack trace.
My test rig is here if you want to try it with some other data files.

And how can indexing and searching get so much faster when compress/decompress is in the path!

This was a test on my MacBook with an SSD and an encrypted FS, so perhaps not the best benchmarking setup. Maybe just writing more bytes = more overhead from the OS-level encryption?

I think our testing of BINARY doc values may not be great ... maybe add a randomized test that sometimes stores very compressible and very incompressible, large, BINARY doc values?

Will do. @jimczi has suggested adding support for storing without compression when the content doesn't compress well. I guess that could be a combination of:

  1. A fast heuristic - e.g. if the max value length for the docs in a block is <= 2 then store without compression, and
  2. "Try it and see" compression - buffer the compression output to a byte array and only write the compressed form to disk if it is smaller than the uncompressed input (see the sketch below).

@markharwood (Contributor Author)

And how can indexing and searching get so much faster when compress/decompress is in the path!

I tried benchmarking some straightforward file read and write operations (no Lucene) and couldn't show LZ4 compression being faster (although it wasn't that much slower).

Maybe the rate-limited merging in Lucene plays a part and size therefore matters in that context?


IndexOutput data, meta;
final int maxDoc;
private SegmentWriteState state;
Contributor

make it final?

int uncompressedBlockLength = 0;
int maxUncompressedBlockLength = 0;
int numDocsInCurrentBlock = 0;
int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
Contributor

we usually don't put a space between the array element type and the []

Suggested change
int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
int[] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];

int maxUncompressedBlockLength = 0;
int numDocsInCurrentBlock = 0;
int [] docLengths = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
byte [] block = new byte [1024 * 16];
Contributor

Suggested change
byte [] block = new byte [1024 * 16];
byte[] block = new byte [1024 * 16];

totalChunks++;
long thisBlockStartPointer = data.getFilePointer();
data.writeVInt(numDocsInCurrentBlock);
for (int i = 0; i < numDocsInCurrentBlock; i++) {
Contributor

+1

} catch (Throwable exception) {
IOUtils.closeWhileHandlingException(this); //self-close because constructor caller can't
throw exception;
}
Contributor

we usually do it like this instead, which avoids catching Throwable:

boolean success = false;
try {
  // write header
} finally {
  if (success == false) {
    // close
  }
}

Contributor Author

What was the "+1" comment for line 407 about?
I've seen encodings elsewhere that have n+1 offsets to record the start of each value, where the last offset is effectively the end of the last value. In this scenario I'm writing n value lengths.

Contributor

It was about optimizing for the case where all values have the same length. In that case we could steal one bit of the first length to mean that all values have the same length, for instance?
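
The write side of that suggestion could look roughly like this (a sketch; the exact encoding that landed may differ in detail):

// Sketch: bit 0 carries the "all values in this block share one length" flag,
// the remaining bits carry the first value's length.
void writeFirstLength(DataOutput out, int firstLength, boolean allSameLength) throws IOException {
    int lengthPlusSameInd = (firstLength << 1) | (allSameLength ? 1 : 0);
    out.writeVInt(lengthPlusSameInd);
    // when allSameLength is set, the per-doc lengths that would normally follow can be skipped
}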

void addDoc(int doc, BytesRef v) throws IOException {
if (blockAddressesStart < 0) {
blockAddressesStart = data.getFilePointer();
}
Contributor

it looks like we could set blockAddressesStart in the constructor instead?

Contributor Author

I tried that and it didn't work - something else was writing to data in between the constructor and the addDoc calls.

Contributor

Have you found what this something else is?

data.writeVInt(docLengths[i]);
}
maxUncompressedBlockLength = Math.max(maxUncompressedBlockLength, uncompressedBlockLength);
LZ4.compress(block, 0, uncompressedBlockLength, data, ht);
Contributor

Suggested change
LZ4.compress(block, 0, uncompressedBlockLength, data, ht);
LZ4.compress(block, 0, uncompressedBlockLength, data, ht);

}

void writeMetaData() throws IOException {
if (blockAddressesStart < 0 ) {
Contributor

this only happens if there are no values? when do we run into this condition?

Contributor Author

Looks to be when merges clear out deleted docs leaving no values.

Contributor

This makes sense, can you leave a comment about it?

this.compressedData = compressedData;
// pre-allocate a byte array large enough for the biggest uncompressed block needed.
this.uncompressedBlock = new byte[biggestUncompressedBlockSize];

Contributor

we could initialize uncompressedBytesRef from the uncompressed block:
uncompressedBytesRef = new BytesRef(uncompressedBlock)
and avoid creating new BytesRefs over and over in decode
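
i.e. roughly this (a sketch reusing the field names from the surrounding diff; docStart/docEnd stand for whatever offsets the block decode produced):

// constructor: allocate the scratch block once and wrap it in a reusable BytesRef
this.uncompressedBlock = new byte[biggestUncompressedBlockSize];
this.uncompressedBytesRef = new BytesRef(uncompressedBlock);

// decode(): no per-call allocation, just repoint the reusable ref at the doc's slice
uncompressedBytesRef.offset = docStart;
uncompressedBytesRef.length = docEnd - docStart;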

@markharwood (Contributor Author)

I've reclaimed my Jira log-in and opened https://issues.apache.org/jira/browse/LUCENE-9211

@markharwood (Contributor Author)

There was a suggestion from @jimczi that we fall back to writing raw data if content doesn't compress well. I'm not sure this logic is worth developing for the reasons outlined below:

I wrote a compression buffer to see what the compression algo outputs before deciding whether to write the compressed or raw data to disk.
I tested with the most uncompressible content I could imagine:

public static void fillRandom(byte[] buffer, int length) {
    for (int i = 0; i < length; i++) {
        // Math.random() * Byte.MAX_VALUE produces values in the 0-126 range
        buffer[i] = (byte) (Math.random() * Byte.MAX_VALUE);
    }
}

The LZ4-compressed versions of this content were only marginally bigger than their raw counterparts (adding about 0.4% overhead to the original content, e.g. 96,921 compressed vs 96,541 raw bytes).
On that basis I'm not sure it's worth doubling the memory cost of the indexing logic (we would need a temporary output buffer at least the same size as the raw data being compressed) and adding extra byte shuffling.

@msokolov (Contributor)

msokolov commented Feb 7, 2020

The LZ4 compressed versions of this content were only marginally bigger than their raw counterparts

Did you also test read performance in this incompressible case?

@jpountz (Contributor)

jpountz commented Feb 7, 2020

In the case of content that can't be compressed, the compressed data will consist of the number of bytes, followed by the bytes. So decompressing consists of decoding the length and then reading the bytes. The only overhead compared to reading bytes directly is the decoding of the number of bytes, so I would believe that the overhead is rather small.

I don't have a strong preference regarding whether this case should be handled explicitly or not. It's true that not having a special "not-compressed" case helps keep the logic simpler.

@markharwood (Contributor Author)

Did you also test read performance in this incompressible case?

Just tried it and it does look 4x faster reading raw random bytes vs compressed random bytes.

uncompressedBlockLength += onlyLength;
}
}
uncompressedDocEnds[i] = uncompressedBlockLength;
Contributor

maybe we could call it uncompressedDocStarts and set the index at i+1, which would then let us remove the else block of the docInBlockId > 0 condition below?
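
A sketch of that layout (illustrative; n+1 start offsets mean every doc, including the first in the block, is handled uniformly):

// Build n+1 start offsets while decoding the per-doc lengths of a block.
int[] uncompressedDocStarts = new int[numDocsInBlock + 1];
int offset = 0;
for (int i = 0; i < numDocsInBlock; i++) {
    uncompressedDocStarts[i] = offset;
    offset += docLengthInBlock[i]; // hypothetical array of decoded per-doc lengths
}
uncompressedDocStarts[numDocsInBlock] = offset;

// Any doc's slice, with no special case for docInBlockId == 0:
int start = uncompressedDocStarts[docInBlockId];
int length = uncompressedDocStarts[docInBlockId + 1] - start;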

long blockStartOffset = addresses.get(blockId);
compressedData.seek(blockStartOffset);

numDocsInBlock = compressedData.readVInt();
Contributor

do we really need to record the number of documents in the block? It should be 32 for all blocks except for the last one? Maybe at index-time we could append dummy values to the last block to make sure it has 32 values too, and we wouldn't need this vInt anymore?

// all other values are the same length
int lengthPlusSameInd = compressedData.readVInt();
int sameIndicator = lengthPlusSameInd & 1;
int firstValLength = lengthPlusSameInd >>1;
Contributor

Since you are stealing a bit, we should do an unsigned shift (>>>) instead.

This would never be a problem in practice, but imagine that the length was a 31-bit integer. Shifting it left by one bit at index time would make the number negative, so here we need an unsigned shift rather than a signed shift that preserves the sign.
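
A concrete illustration (lengths will never actually be this large):

int length = 1 << 30;                // a 31-bit length, 1073741824
int encoded = (length << 1) | 1;     // steal bit 0: 0x80000001, now a negative int
int signedDecode = encoded >> 1;     // 0xC0000000 = -1073741824, the sign bit leaks in
int unsignedDecode = encoded >>> 1;  // 0x40000000 = 1073741824, the original length back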

private final IndexInput compressedData;
// Cache of last uncompressed block
private long lastBlockId = -1;
private int []uncompressedDocEnds = new int[Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK];
Contributor

in the past we've put these constants in the meta file and BinaryEntry so that it's easier to change values over time


@jpountz we should use the same structure while writing the data; that way you see all the properties of the class instead of adding comments in the code


numDocsInBlock = compressedData.readVInt();
assert numDocsInBlock <= Lucene80DocValuesFormat.BINARY_DOCS_PER_COMPRESSED_BLOCK;
uncompressedDocEnds = new int[numDocsInBlock];
Contributor

can we reuse the same array across blocks?

@jpountz (Contributor)

jpountz commented Feb 7, 2020

@msokolov FWIW LZ4 only removes duplicate strings from a stream: when it finds one it inserts a reference to a previous sequence of bytes. In the special case that the content is incompressible, the LZ4 compressed data just consists of the number of bytes followed by the bytes, so the only overhead compared to reading the bytes directly is the decoding of the number of bytes, which should be rather low.

I don't have a preference regarding whether we should have an explicit "not-compressed" case, but I understand how not having one helps keep things simpler.

@msokolov (Contributor)

msokolov commented Feb 7, 2020

Strange that Mark would measure 4x slowdown from decoding the lengths... Perhaps the random bytes are not totally incompressible, just barely compressible?

@markharwood (Contributor Author)

markharwood commented Feb 7, 2020

Strange that Mark would measure 4x slowdown from decoding the lengths... Perhaps the random bytes are not totally incompressible, just barely compressible?

I may have been too hasty in that reply - I've not been able to reproduce it, and the raw vs compressed timings are very similar in the additional tests I've done, so they echo what @jpountz expects. My first (faster) run had random bytes selected in the range 0-20 rather than the 0-127 range where I'm seeing parity.

try (CompressedBinaryBlockWriter blockWriter = new CompressedBinaryBlockWriter()){
BinaryDocValues values = valuesProducer.getBinary(field);
long start = data.getFilePointer();
meta.writeLong(start); // dataOffset


I think we should use the BinaryEntry object here, and then just make the object "Writable" to a given DataOutput and "Readable" from a DataInput (which is already the case: readBinaryEntry). This will avoid comments in the code like -2 == docsWithFieldOffset etc.

Contributor

I like the idea but would prefer doing it in a separate PR.

Comment on lines +522 to +538
if (numDocsWithField == 0) {
meta.writeLong(-2); // docsWithFieldOffset
meta.writeLong(0L); // docsWithFieldLength
meta.writeShort((short) -1); // jumpTableEntryCount
meta.writeByte((byte) -1); // denseRankPower
} else if (numDocsWithField == maxDoc) {
meta.writeLong(-1); // docsWithFieldOffset
meta.writeLong(0L); // docsWithFieldLength
meta.writeShort((short) -1); // jumpTableEntryCount
meta.writeByte((byte) -1); // denseRankPower
} else {
long offset = data.getFilePointer();
meta.writeLong(offset); // docsWithFieldOffset
values = valuesProducer.getBinary(field);
final short jumpTableEntryCount = IndexedDISI.writeBitSet(values, data, IndexedDISI.DEFAULT_DENSE_RANK_POWER);
meta.writeLong(data.getFilePointer() - offset); // docsWithFieldLength
meta.writeShort(jumpTableEntryCount);
meta.writeByte(IndexedDISI.DEFAULT_DENSE_RANK_POWER);
}


Currently I'm working on a refactor of this code, introducing a doc id set iterator serializer capable of providing the correct instance based on the stored metadata. As you can see, this is quite repetitive across the other fields.

static final String META_EXTENSION = "dvm";
static final int VERSION_START = 0;
static final int VERSION_CURRENT = VERSION_START;
static final int VERSION_BIN_COMPRESSED = 1;


This could potentially live in the BinaryDocValuesFormat class.

@jpountz left a comment (Contributor)

I left some minor comments, but I'm +1 on this PR, it looks great.


@janhoy janhoy changed the title Add compression for Binary doc value fields LUCENE-9211 Add compression for Binary doc value fields Feb 15, 2020
… metadata, remove per-chunk storage of num docs and assume always same number of values. Change doc end pointers to be doc starts.
…sesStart in CompressedBinaryBlockWriter constructor, serialise shift for number of docs in block,
@markharwood markharwood merged commit ce2959f into apache:master Feb 18, 2020
markharwood added a commit to markharwood/lucene-solr that referenced this pull request Feb 18, 2020
Stores groups of 32 binary doc values in LZ4-compressed blocks.
markharwood added a commit that referenced this pull request Feb 18, 2020
Stores groups of 32 binary doc values in LZ4-compressed blocks.
(cherry picked from commit f549ee3)
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Feb 12, 2024
The binary doc values implementation from LUCENE-9211 (apache/lucene-solr#1234) is used here, which stores the values in LZ4-compressed blocks in order to reduce storage usage (in Lucene's default doc values codec, binary doc values are stored without any compression).

Follow up from elastic#105301