-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Performance optimization: Move all LittleEndianDataInputStream functionality into ByteBufferInputStream #960
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Performance optimization: Move all LittleEndianDataInputStream functionality into ByteBufferInputStream #960
Conversation
…utStream To improve performance, all multi-byte access functionality from LittleEndianDataInputStream has been merged into ByteBufferInputStream. LittleEndianDataInputStream is marked deprecated.
Made exception catch more specific for read() and readByte().
Also fixed bug discovered by that testing
|
I added some tests for this new code. |
parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java
Outdated
Show resolved
Hide resolved
parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java
Outdated
Show resolved
Hide resolved
parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java
Outdated
Show resolved
Hide resolved
Removed unnecessary comment Removed unused wrapper method
Removed unused wrappers and constructors
Removed line of code that was commented out instead of properly deleted
Cleaned up some comments
parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java
Outdated
Show resolved
Hide resolved
parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java
Outdated
Show resolved
Hide resolved
Removed some more unused methods
@deprecated postpones to future PR
Removed unnecessary methods. Reverted whitespace change.
Removed unnecessary methods
Removed whitespace change
|
Is the '5% to 15%' gain from this change or along with other changes? If it is later, can you share the point to other changes? Like to see the overall changes before committing. |
|
That improvement comes from a larget set of changes. I have a design doc that goes over all those changes plus some more that make it possible to get even more performance improvements. https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing |
| } | ||
| } | ||
|
|
||
| public int readUnsignedVarInt() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it copied from BytesUtils.java? I wonder why we don't use that directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not exactly. The one in BytesUtils calls methods that read one byte at a time. This one can take advantage of faster methods that read whole words at a time. This is a critical-path method, so it's a performance win to eliminate the extra level of abstraction and all the extra overhead fetching individual bytes and shifting.
| return ((ch3 << 16) + (ch2 << 8) + (ch1 << 0)); | ||
| } | ||
|
|
||
| public int readIntLittleEndianPaddedOnBitWidth(int bitWidth) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it copied from BytesUtils.java? I wonder why we don't use that directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one that reads three bytes may or may not be a win. A level of abstraction is eliminated by doing this. It's hard to say whether or not the JIT will be smart enough to do that automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant the method here: https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/BytesUtils.java#L120. It is exactly same but just different signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know about that method. The BytesUtils code always reads one byte at a time. My version will read a whole word at a time for short and int. This is faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait. Are you referring to readIntLittleEndianPaddedOnBitWidth or readIntLittleEndianOnThreeBytes?
The former is definitely faster. An argument could be made to remove the latter, although it'll take longer for the JIT to hide the extra layers of virtual calls.
| return Double.longBitsToDouble(readLong()); | ||
| } | ||
|
|
||
| public int readIntLittleEndianOnThreeBytes() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it copied from BytesUtils.java? I wonder why we don't use that directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment on this. These two methods have the the same outcome, but mine is faster. I believe this is warranted for a performance critical path.
| @Override | ||
| public int read(byte[] bytes) { | ||
| return read(bytes, 0, bytes.length); | ||
| public void readFully(byte[] bytes, int off, int len) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you cast a light why we need to add the implementation readFully() here? For performance improvement?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are situations where we need to read an exact number of bytes and throw an exception if not enough are available. This is faster than reading maybe enough and then checking, and this is a performance critical path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we just track the remaining bytes of the stream on the client side and check before reading any bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normally, the user of the class would read exactly the right number of bytes. These checks and exceptions exist only to catch bugs elsewhere. This is one reason why it's important to minimize the overhead of these checks in such performance-critical methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference between this method and read() is mainly to precheck if there is enough remaining length. I believe this can be done by wrapping up the read() method and adding the prechecks. Duplicating the code makes it harder to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll make these changes if you insist. But those prechecks are expensive, which is why I'm trying to avoid them when possible in a performance critical path.
| } | ||
|
|
||
| int bytesRead = 0; | ||
| while (bytesRead < len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems duplicate with above line 244.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two key differences that make it hard to combine them without hurting performance for one, the other, or both, and they're both performance critical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicating code is hard to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, my objective here is to maximize performance. So we have to decide between maintainability and performance. Let's deliberate over this a bit more, and I'll do what you think is best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, my objective here is to maximize performance. So we have to decide between maintainability and performance. Let's deliberate over this a bit more, and I'll do what you think is best.
I deeply understand this is a difficult trade-off. Do you have any evidence on the performance penalty if we wrap read and readFully methods to share some common logic? If the penalty is acceptable, we should definitely go for maintainability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did test a lot of tradeoffs, but I don't think I tested this one thing directly. It's also been quite a while since I did this, so I don't think I'd be able to figure out which spreadsheets have the relevant data.
parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java
Outdated
Show resolved
Hide resolved
|
@sunchao Can you have a review? |
Use constants for byte size of words
|
Is this mostly a refactoring PR? I also don't see |
I initially marked |
parquet-common/src/main/java/org/apache/parquet/bytes/SingleBufferInputStream.java
Outdated
Show resolved
Hide resolved
|
|
||
| try { | ||
| buffer.position(buffer.position() + (int)n); | ||
| } catch (IllegalArgumentException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of try and catch, can we check if the new position is greater than the buffer.limit and throw EOF if so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did it this way on purpose. This way is always faster. Doing the check has to wait until there's profiling info and the C2 compiler gets hold of this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already don't like the fact that I have to check the argument to make sure it's not negative and not bigger than int max. buffer.positiion() already checks for the position going out of bounds and throws an exception, so it would be redundant to have another check for the exact same thing here. A catch for an exception that never happens is basically always free, while a test for a condition that never happens is not free until profiling gets enough info about it for the C2 compiler to eliminate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The try/catch is also more expensive than checking. I agree with Chao to have the check instead of try/catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A try/catch is basically free when no exception is thrown, while a check is not. I have tested this, and the try/catch is empirically faster, since the no-exception case is the common case. Putting in the check means we have to wait until the C2 compiler produces a trace without the branch. But even then, there's always the overhead of a check somewhere to be able to fall back to interpretation if the condition is not correct for the trace. I'm avoiding all of that entirely, making this faster in of the most common case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A try/catch is basically free when no exception is thrown, while a check is not. I have tested this, and the try/catch is empirically faster, since the no-exception case is the common case. Putting in the check means we have to wait until the C2 compiler produces a trace without the branch. But even then, there's always the overhead of a check somewhere to be able to fall back to interpretation if the condition is not correct for the trace. I'm avoiding all of that entirely, making this faster in of the most common case.
That's interesting. A caveat is that try/catch may prohibit code optimization. But I doubt whether it makes significant difference on the simple buffer operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I directly tested this, and it made a small but measurable difference.
| this.buffer.order(java.nio.ByteOrder.LITTLE_ENDIAN); | ||
| } | ||
|
|
||
| SingleBufferInputStream(byte[] inBuf) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems these 3 constructors are not used - can we at least add some tests to cover them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I put this here I think because I'm trying to make BufferInputStream have the same functionality as multiple other things that I'm combining into one. Some tests would be good. I'll see about writing some tests soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this constructor because it's much better to have this be a compile-time error than a runtime error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and added two tests to cover the unused SingleBufferInputStream constructors. I considered just deleting these constructors, but I decided that it might be valuable to include them as documentation on how to do this in a way that is congruent to the behavior of HeapByteBuffer, just in case anyone wanted to do this in the future. There's also the risk that someone would think they have to wrap an array with ByteBuffer before using SingleBufferInputStream, but it would be better to avoid the overhead of ByteBuffer.duplicate(). (ByteBuffer.duplicate() appears to take constant time by making a reference to the same backing array, but it's a big constant with loads of checks.)
| this.startPosition = 0; | ||
| } | ||
|
|
||
| SingleBufferInputStream(List<ByteBuffer> inBufs) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm why we need this constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, I used to have a similar constructor for ByteBufferInputStream, but I was advised to remove it, so I had this here for uniformity. We can remove this.
parquet-common/src/main/java/org/apache/parquet/bytes/SingleBufferInputStream.java
Outdated
Show resolved
Hide resolved
parquet-common/src/main/java/org/apache/parquet/bytes/SingleBufferInputStream.java
Outdated
Show resolved
Hide resolved
parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java
Outdated
Show resolved
Hide resolved
| @Override | ||
| public int read(byte[] bytes) { | ||
| return read(bytes, 0, bytes.length); | ||
| public void readFully(byte[] bytes, int off, int len) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we just track the remaining bytes of the stream on the client side and check before reading any bytes?
parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java
Show resolved
Hide resolved
parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java
Outdated
Show resolved
Hide resolved
Minor code changes on request. Removed redundant code. Fixed code formatting nits. More informative exceptions. Removed constructor that just throws exception.
Minor code changes on request. Removed redundant code. Fixed code formatting nits. More informative exceptions.
…tream. Rather than deleting these, I decided to keep the constructs as documentation on how to do these things according to how HeapByteBuffer wraps arrays. And it would be best if we didn't have to suffer the overhead of creating a ByteBuffer first in order to create a SingleBufferInputStream.
parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java
Outdated
Show resolved
Hide resolved
| readFully(b, 0, b.length); | ||
| } | ||
|
|
||
| public void readFully(byte b[], int off, int len) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't see where it is used. Don't know why it is 'public'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what the method signatures for DataInputStream.readFully look like.
I also have a whole bunch of other performance improvements I want to contribute (https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing), and I think this might get used in some of that code.
I'm very soon going to publish an open preview of all of my proposed changes to a branch of my own fork, so we'll be able to check this out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the complete preview of my changes to ParquetMR: https://github.com/theosib-amazon/parquet-mr/tree/batch-read-optimizations
Changed byte b[] to byte[] b
wgtmac
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, the open source community priorities code maintainability more than performance gain as the project is widely adopted and maintained by many users and developers. However, if there is any concrete benchmark result to provide compelling evidence, it will be helpful to the discussion. @theosib-amazon @shangxinli @sunchao
|
|
||
| try { | ||
| buffer.position(buffer.position() + (int)n); | ||
| } catch (IllegalArgumentException e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A try/catch is basically free when no exception is thrown, while a check is not. I have tested this, and the try/catch is empirically faster, since the no-exception case is the common case. Putting in the check means we have to wait until the C2 compiler produces a trace without the branch. But even then, there's always the overhead of a check somewhere to be able to fall back to interpretation if the condition is not correct for the trace. I'm avoiding all of that entirely, making this faster in of the most common case.
That's interesting. A caveat is that try/catch may prohibit code optimization. But I doubt whether it makes significant difference on the simple buffer operation.
| } | ||
|
|
||
| int bytesRead = 0; | ||
| while (bytesRead < len) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, my objective here is to maximize performance. So we have to decide between maintainability and performance. Let's deliberate over this a bit more, and I'll do what you think is best.
I deeply understand this is a difficult trade-off. Do you have any evidence on the performance penalty if we wrap read and readFully methods to share some common logic? If the penalty is acceptable, we should definitely go for maintainability.
| return delegate.readUnsignedByte(); | ||
| } | ||
|
|
||
| public short readShort() throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea to provide these read functions to enable larger read. BTW, is there any use case to read a batch of shorts (and other numeric types)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use the new batch read methods heavily in some optimizations I made to Trino. As for short, I can't say I recall any uses in Trino of readShorts(). readShort() is used indirectly through a method that reads a variable sized representation.
|
@theosib-amazon Thanks again for your contribution! I see the comments are generally around duplicating code, refactoring, and making code maintainable. If you have a measurement of improvement on this change alone, it would help the reviewers |
I broke up #953 into more digestible pieces. This new PR is the lowest level set of changes. By themselves, these additions to ByteBufferInputStream don't yield much improvement, so future PRs will include modifications to other source files that take advantage of this new functionality.
The complete set of changes (including subsequent PRs) is for performance optimization. In benchmarking with Trino, we find query performance to improve from 5% to 15%, depending on the query, and that includes all the I/O time from S3.
All of LittleEndianDataInputStream functionality is moved into ByteBufferInputStream, without changing any pre-existing interfaces or functionality. These changes yield the following benefits:
This PR also marks LittleEndianDataInputStream as deprecated.
Context:
I've been working on improving Parquet reading performance in Trino, mostly by profiling while running performance benchmarks and TPCDS queries. This PR is a subset of the changes I made that have more than doubled the performance of a lot of TPCDS queries (wall clock time, including the S3 access time). If you are kind enough to accept these changes, I look forward to offering further performance improvements.