Performance optimization: Move all LittleEndianDataInputStream functionality into ByteBufferInputStream #960

theosib-amazon · 2022-04-25T21:14:52Z

I broke up #953 into more digestible pieces. This new PR is the lowest level set of changes. By themselves, these additions to ByteBufferInputStream don't yield much improvement, so future PRs will include modifications to other source files that take advantage of this new functionality.

The complete set of changes (including subsequent PRs) is for performance optimization. In benchmarking with Trino, we see query performance improve by 5% to 15%, depending on the query, and that includes all the I/O time from S3.

All of LittleEndianDataInputStream's functionality is moved into ByteBufferInputStream, without changing any pre-existing interfaces or functionality. These changes yield the following benefits:

Elimination of extra layers of abstraction and method call overhead
Use of intrinsics for readInt, readLong, etc.
Availability of faster access methods like readFully and skipFully, without the need for helper functions
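
To make the first two benefits concrete, here is a minimal sketch that contrasts assembling a little-endian int one byte at a time with a single whole-word read from a ByteBuffer whose order is set to LITTLE_ENDIAN. The class and method names are hypothetical and are not taken from the PR; whether the JIT actually applies an intrinsic to the whole-word read depends on the JVM and the buffer type.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; it is not part of parquet-mr.
final class LittleEndianReadSketch {

  // Wraps an array and sets the byte order that the whole-word read relies on.
  static ByteBuffer wrapLittleEndian(byte[] data) {
    return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
  }

  // Byte-at-a-time style: four separate reads, then shifts to assemble the value.
  static int readIntByteWise(ByteBuffer buf) {
    int b0 = buf.get() & 0xFF;
    int b1 = buf.get() & 0xFF;
    int b2 = buf.get() & 0xFF;
    int b3 = buf.get() & 0xFF;
    return (b3 << 24) | (b2 << 16) | (b1 << 8) | b0;
  }

  // Whole-word style: one call, with the byte order handled by the buffer itself.
  static int readIntWordWise(ByteBuffer buf) {
    return buf.getInt();
  }
}
```

Both methods return the same value for the same input; the difference is only in how many calls and bounds checks sit on the read path.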

This PR also marks LittleEndianDataInputStream as deprecated.

Context:
I've been working on improving Parquet reading performance in Trino, mostly by profiling while running performance benchmarks and TPCDS queries. This PR is a subset of the changes I made that have more than doubled the performance of a lot of TPCDS queries (wall clock time, including the S3 access time). If you are kind enough to accept these changes, I look forward to offering further performance improvements.

theosib-amazon · 2022-04-26T21:40:02Z

I added some tests for this new code.

shangxinli · 2022-05-16T00:22:38Z

Is the '5% to 15%' gain from this change alone, or from this change together with the others? If it is the latter, can you share a pointer to the other changes? I'd like to see the overall changes before committing.

theosib-amazon · 2022-05-16T15:37:02Z

That improvement comes from a larger set of changes. I have a design doc that goes over all those changes plus some more that make it possible to get even more performance improvements.

https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing

shangxinli · 2022-07-24T21:22:21Z

parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java

+    }
+  }
+
+  public int readUnsignedVarInt() throws IOException {


Is it copied from BytesUtils.java? I wonder why we don't use that directly?

Not exactly. The one in BytesUtils calls methods that read one byte at a time. This one can take advantage of faster methods that read whole words at a time. This is a critical-path method, so it's a performance win to eliminate the extra level of abstraction and all the extra overhead fetching individual bytes and shifting.
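
For readers unfamiliar with the encoding under discussion, the following is a minimal sketch of an unsigned varint decoder that reads directly from a ByteBuffer instead of going through a byte-at-a-time stream wrapper. It is an assumption-laden illustration, not the PR's implementation: the class name is hypothetical, and the whole-word strategy mentioned above is not shown in this thread, so it is not reproduced here.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

// Hypothetical illustration class; it is not part of parquet-mr.
final class VarIntSketch {

  // Decodes an unsigned varint: 7 payload bits per byte, least-significant
  // group first, with the high bit set on every byte except the last.
  static int readUnsignedVarInt(ByteBuffer buf) throws IOException {
    int value = 0;
    int shift = 0;
    try {
      while (true) {
        int b = buf.get();
        value |= (b & 0x7F) << shift;
        if ((b & 0x80) == 0) {
          return value;
        }
        shift += 7;
        if (shift > 28) {
          // A 32-bit value needs at most five bytes.
          throw new IOException("Varint is longer than 5 bytes");
        }
      }
    } catch (BufferUnderflowException e) {
      // Running out of input is reported the way a stream would report it.
      throw new EOFException("Buffer ended in the middle of a varint");
    }
  }
}
```

For example, the two bytes 0xAC 0x02 decode to 300: the low seven bits of 0xAC are 44, and 2 shifted left by 7 is 256.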

shangxinli · 2022-07-24T21:24:17Z

parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java

+    return ((ch3 << 16) + (ch2 << 8) + (ch1 << 0));
+  }
+
+  public int readIntLittleEndianPaddedOnBitWidth(int bitWidth)


Is it copied from BytesUtils.java? I wonder why we don't use that directly?

The one that reads three bytes may or may not be a win. A level of abstraction is eliminated by doing this. It's hard to say whether or not the JIT will be smart enough to do that automatically.

I meant the method here: https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/BytesUtils.java#L120. It is exactly the same, just with a different signature.

I know about that method. The BytesUtils code always reads one byte at a time. My version will read a whole word at a time for short and int. This is faster.

Wait. Are you referring to readIntLittleEndianPaddedOnBitWidth or readIntLittleEndianOnThreeBytes?

The former is definitely faster. An argument could be made to remove the latter, although it'll take longer for the JIT to hide the extra layers of virtual calls.
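
The distinction drawn above can be sketched as follows: the padded read picks a byte width from the bit width and, for two and four bytes, uses a single whole-word read on a little-endian ByteBuffer. This is a minimal sketch under assumptions, not the code from the PR; the class name, the ByteBuffer parameter, and the exception type for invalid widths are all hypothetical choices made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; it is not part of parquet-mr.
final class PaddedReadSketch {

  static ByteBuffer wrapLittleEndian(byte[] data) {
    return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
  }

  // Reads an int stored in the smallest whole number of bytes that holds
  // bitWidth bits. The buffer is assumed to be in LITTLE_ENDIAN order.
  static int readIntLittleEndianPaddedOnBitWidth(ByteBuffer buf, int bitWidth) {
    int byteWidth = (bitWidth + 7) / 8;
    switch (byteWidth) {
      case 0:
        return 0;
      case 1:
        return buf.get() & 0xFF;
      case 2:
        // One 16-bit read instead of two byte reads and a shift.
        return buf.getShort() & 0xFFFF;
      case 3: {
        // No 24-bit accessor exists, so combine a 16-bit and an 8-bit read.
        int low = buf.getShort() & 0xFFFF;
        int high = buf.get() & 0xFF;
        return (high << 16) | low;
      }
      case 4:
        // One 32-bit read instead of four byte reads and three shifts.
        return buf.getInt();
      default:
        throw new IllegalArgumentException("Unsupported bit width: " + bitWidth);
    }
  }
}
```

The three-byte case is the one where the benefit is least clear, which matches the uncertainty expressed in the thread about readIntLittleEndianOnThreeBytes.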

shangxinli · 2022-07-24T21:25:32Z

parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java

+    return Double.longBitsToDouble(readLong());
+  }
+
+  public int readIntLittleEndianOnThreeBytes() throws IOException {


Is it copied from BytesUtils.java? I wonder why we don't use that directly?

See my other comments.

Check this https://github.com/apache/parquet-mr/blob/master/parquet-common/src/main/java/org/apache/parquet/bytes/BytesUtils.java#L110

See my other comment on this. These two methods have the same outcome, but mine is faster. I believe this is warranted for a performance critical path.

shangxinli · 2022-07-24T21:33:04Z

parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java

  @Override
-  public int read(byte[] bytes) {
-    return read(bytes, 0, bytes.length);
+  public void readFully(byte[] bytes, int off, int len) throws IOException {


Can you shed some light on why we need to add a readFully() implementation here? Is it for performance?

There are situations where we need to read an exact number of bytes and throw an exception if not enough are available. This is faster than reading maybe enough and then checking, and this is a performance critical path.

Why can't we just track the remaining bytes of the stream on the client side and check before reading any bytes?

Normally, the user of the class would read exactly the right number of bytes. These checks and exceptions exist only to catch bugs elsewhere. This is one reason why it's important to minimize the overhead of these checks in such performance-critical methods.

The difference between this method and read() is mainly to precheck if there is enough remaining length. I believe this can be done by wrapping up the read() method and adding the prechecks. Duplicating the code makes it harder to maintain.

I'll make these changes if you insist. But those prechecks are expensive, which is why I'm trying to avoid them when possible in a performance critical path.
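
The two contracts being compared can be sketched on a single ByteBuffer as follows. This is a minimal illustration under assumptions: the PR's readFully lives in MultiBufferInputStream and walks several buffers, whereas this hypothetical class uses one buffer to keep the contrast between "read up to len" and "read exactly len or fail" visible.

```java
import java.io.EOFException;
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;

// Hypothetical illustration class; it is not part of parquet-mr.
final class ReadFullySketch {

  // read-style contract: copies up to len bytes and returns how many were
  // copied, or -1 when nothing is left. The caller must check the result.
  static int read(ByteBuffer buf, byte[] dst, int off, int len) {
    if (len == 0) {
      return 0;
    }
    int remaining = buf.remaining();
    if (remaining <= 0) {
      return -1;
    }
    int n = Math.min(len, remaining);
    buf.get(dst, off, n);
    return n;
  }

  // readFully-style contract: copies exactly len bytes or throws. The
  // buffer's own bounds check is the only check on the success path.
  static void readFully(ByteBuffer buf, byte[] dst, int off, int len) throws EOFException {
    try {
      buf.get(dst, off, len);
    } catch (BufferUnderflowException e) {
      throw new EOFException("Needed " + len + " bytes, only " + buf.remaining() + " remain");
    }
  }
}
```

ByteBuffer.get(byte[], int, int) transfers nothing when it throws BufferUnderflowException, so a failed readFully in this sketch leaves the buffer position unchanged.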

shangxinli · 2022-07-24T21:35:26Z

parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java

+    }
+
+    int bytesRead = 0;
+    while (bytesRead < len) {


This seems to duplicate the code above at line 244.

There are two key differences that make it hard to combine them without hurting performance for one, the other, or both, and they're both performance critical.

Duplicating code is hard to maintain.

Well, my objective here is to maximize performance. So we have to decide between maintainability and performance. Let's deliberate over this a bit more, and I'll do what you think is best.

I deeply understand this is a difficult trade-off. Do you have any evidence on the performance penalty if we wrap read and readFully methods to share some common logic? If the penalty is acceptable, we should definitely go for maintainability.

I did test a lot of tradeoffs, but I don't think I tested this one thing directly. It's also been quite a while since I did this, so I don't think I'd be able to figure out which spreadsheets have the relevant data.

shangxinli · 2022-07-24T21:45:53Z

@sunchao Can you take a look and review?

sunchao · 2022-07-25T22:56:15Z

Is this mostly a refactoring PR? I also don't see LittleEndianDataInputStream being marked as deprecated.

theosib-amazon · 2022-07-26T14:37:18Z

I initially marked LittleEndianDataInputStream as deprecated. But I seem to recall that I was advised by Ryan Blue to do that in a later PR.

sunchao · 2022-07-28T21:15:35Z

parquet-common/src/main/java/org/apache/parquet/bytes/SingleBufferInputStream.java

+
+    try {
+      buffer.position(buffer.position() + (int)n);
+    } catch (IllegalArgumentException e) {


Instead of try/catch, can we check whether the new position is greater than buffer.limit and throw EOF if so?

I did it this way on purpose. This way is always faster. Doing the check has to wait until there's profiling info and the C2 compiler gets hold of this code.

I already don't like the fact that I have to check the argument to make sure it's not negative and not bigger than int max. buffer.position() already checks for the position going out of bounds and throws an exception, so it would be redundant to have another check for the exact same thing here. A catch for an exception that never happens is basically always free, while a test for a condition that never happens is not free until profiling gets enough info about it for the C2 compiler to eliminate it.

The try/catch is also more expensive than checking. I agree with Chao to have the check instead of try/catch.

A try/catch is basically free when no exception is thrown, while a check is not. I have tested this, and the try/catch is empirically faster, since the no-exception case is the common case. Putting in the check means we have to wait until the C2 compiler produces a trace without the branch. But even then, there's always the overhead of a check somewhere to be able to fall back to interpretation if the condition is not correct for the trace. I'm avoiding all of that entirely, making this faster in the most common case.

That's interesting. A caveat is that try/catch may prohibit code optimization. But I doubt whether it makes significant difference on the simple buffer operation.

I directly tested this, and it made a small but measurable difference.
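
The two alternatives debated in this thread can be written side by side. The sketch below is illustrative only: the class and method names are hypothetical, and it makes no claim about which variant is faster, since no benchmark numbers appear in the thread. Both variants have the same observable behavior; they differ only in whether the bounds check is written explicitly or delegated to ByteBuffer.position(int), which throws IllegalArgumentException for an out-of-range position.

```java
import java.io.EOFException;
import java.nio.ByteBuffer;

// Hypothetical illustration class; it is not part of parquet-mr.
final class SkipFullySketch {

  private static void checkArgument(long n) {
    if (n < 0 || n > Integer.MAX_VALUE) {
      throw new IllegalArgumentException("Skip length out of range: " + n);
    }
  }

  // Variant discussed by the author: rely on the bounds check that
  // position(int) already performs and translate its exception.
  static void skipFullyWithCatch(ByteBuffer buf, long n) throws EOFException {
    checkArgument(n);
    try {
      buf.position(buf.position() + (int) n);
    } catch (IllegalArgumentException e) {
      throw new EOFException("Cannot skip " + n + " bytes, only " + buf.remaining() + " remain");
    }
  }

  // Variant suggested by the reviewers: test explicitly before moving.
  static void skipFullyWithCheck(ByteBuffer buf, long n) throws EOFException {
    checkArgument(n);
    if (n > buf.remaining()) {
      throw new EOFException("Cannot skip " + n + " bytes, only " + buf.remaining() + " remain");
    }
    buf.position(buf.position() + (int) n);
  }
}
```

A microbenchmark comparing the two on the target JVM would be the way to settle the question the reviewers raise.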

sunchao · 2022-07-28T21:23:51Z

parquet-common/src/main/java/org/apache/parquet/bytes/SingleBufferInputStream.java

+    this.buffer.order(java.nio.ByteOrder.LITTLE_ENDIAN);
+  }
+
+  SingleBufferInputStream(byte[] inBuf) {


It seems these 3 constructors are not used. Can we at least add some tests to cover them?

Yeah. I put this here I think because I'm trying to make BufferInputStream have the same functionality as multiple other things that I'm combining into one. Some tests would be good. I'll see about writing some tests soon.

I removed this constructor because it's much better to have this be a compile-time error than a runtime error.

I went ahead and added two tests to cover the unused SingleBufferInputStream constructors. I considered just deleting these constructors, but I decided that it might be valuable to include them as documentation on how to do this in a way that is congruent to the behavior of HeapByteBuffer, just in case anyone wanted to do this in the future. There's also the risk that someone would think they have to wrap an array with ByteBuffer before using SingleBufferInputStream, but it would be better to avoid the overhead of ByteBuffer.duplicate(). (ByteBuffer.duplicate() appears to take constant time by making a reference to the same backing array, but it's a big constant with loads of checks.)

sunchao · 2022-07-28T21:24:19Z

parquet-common/src/main/java/org/apache/parquet/bytes/SingleBufferInputStream.java

+    this.startPosition = 0;
+  }
+
+  SingleBufferInputStream(List<ByteBuffer> inBufs) {


Hmm, why do we need this constructor?

IIRC, I used to have a similar constructor for ByteBufferInputStream, but I was advised to remove it, so I had this here for uniformity. We can remove this.

sunchao · 2022-07-28T22:12:16Z

parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java

  @Override
-  public int read(byte[] bytes) {
-    return read(bytes, 0, bytes.length);
+  public void readFully(byte[] bytes, int off, int len) throws IOException {


Why can't we just track the remaining bytes of the stream on the client side and check before reading any bytes?

…tream. Rather than deleting these, I decided to keep the constructs as documentation on how to do these things according to how HeapByteBuffer wraps arrays. And it would be best if we didn't have to suffer the overhead of creating a ByteBuffer first in order to create a SingleBufferInputStream.

shangxinli · 2022-08-21T22:26:57Z

parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java

+    readFully(b, 0, b.length);
+  }
+
+  public void readFully(byte b[], int off, int len) throws IOException {


I don't see where this is used, and I don't know why it is public.

This is what the method signatures for DataInputStream.readFully look like.

I also have a whole bunch of other performance improvements I want to contribute (https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing), and I think this might get used in some of that code.

I'm very soon going to publish an open preview of all of my proposed changes to a branch of my own fork, so we'll be able to check this out.

Here's the complete preview of my changes to ParquetMR: https://github.com/theosib-amazon/parquet-mr/tree/batch-read-optimizations

wgtmac

IMHO, the open source community prioritizes code maintainability over performance gains, as the project is widely adopted and maintained by many users and developers. However, if there is any concrete benchmark result to provide compelling evidence, it will be helpful to the discussion. @theosib-amazon @shangxinli @sunchao

wgtmac · 2022-11-11T14:34:36Z

parquet-common/src/main/java/org/apache/parquet/bytes/SingleBufferInputStream.java

+
+    try {
+      buffer.position(buffer.position() + (int)n);
+    } catch (IllegalArgumentException e) {


A try/catch is basically free when no exception is thrown, while a check is not. I have tested this, and the try/catch is empirically faster, since the no-exception case is the common case. Putting in the check means we have to wait until the C2 compiler produces a trace without the branch. But even then, there's always the overhead of a check somewhere to be able to fall back to interpretation if the condition is not correct for the trace. I'm avoiding all of that entirely, making this faster in the most common case.

That's interesting. A caveat is that try/catch may prohibit code optimization. But I doubt whether it makes significant difference on the simple buffer operation.

wgtmac · 2022-11-11T15:18:33Z

parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java

+    }
+
+    int bytesRead = 0;
+    while (bytesRead < len) {


Well, my objective here is to maximize performance. So we have to decide between maintainability and performance. Let's deliberate over this a bit more, and I'll do what you think is best.

I deeply understand this is a difficult trade-off. Do you have any evidence on the performance penalty if we wrap read and readFully methods to share some common logic? If the penalty is acceptable, we should definitely go for maintainability.

wgtmac · 2022-11-11T15:21:05Z

parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java

+    return delegate.readUnsignedByte();
+  }
+
+  public short readShort() throws IOException {


I like the idea of providing these read functions to enable larger reads. BTW, is there any use case for reading a batch of shorts (and other numeric types)?

I use the new batch read methods heavily in some optimizations I made to Trino. As for short, I can't say I recall any uses in Trino of readShorts(). readShort() is used indirectly through a method that reads a variable sized representation.
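
As a rough picture of what a batch read of shorts could look like, here is a minimal sketch that copies several little-endian shorts out of a ByteBuffer in one bulk call. The signature of readShorts in the PR is not shown in this thread, so the signature, class name, and error handling below are assumptions made for illustration.

```java
import java.io.EOFException;
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; it is not part of parquet-mr.
final class BatchShortReadSketch {

  static ByteBuffer wrapLittleEndian(byte[] data) {
    return ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
  }

  // Hypothetical signature: fills dst[off .. off + len) with shorts read
  // from buf, then advances buf past the bytes that were consumed.
  static void readShorts(ByteBuffer buf, short[] dst, int off, int len) throws EOFException {
    try {
      // The view starts at buf's current position and inherits its byte
      // order; the bulk get copies len values in a single call.
      buf.asShortBuffer().get(dst, off, len);
    } catch (BufferUnderflowException e) {
      throw new EOFException("Needed " + len + " shorts, only " + buf.remaining() + " bytes remain");
    }
    // The view has its own position, so the source buffer is advanced here.
    buf.position(buf.position() + len * Short.BYTES);
  }
}
```

The same pattern applies to asIntBuffer(), asLongBuffer(), and the other typed views for the remaining numeric types.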

shangxinli · 2022-12-03T19:30:05Z

@theosib-amazon Thanks again for your contribution! I see the comments are generally around duplicating code, refactoring, and making code maintainable. If you have a measurement of the improvement from this change alone, it would help the reviewers.

theosib-amazon and others added 4 commits April 25, 2022 20:50

Move all LittleEndianDataInputStream functionality into ByteBufferInp…

fcc2d75

…utStream To improve performance, all multi-byte access functionality from LittleEndianDataInputStream has been merged into ByteBufferInputStream. LittleEndianDataInputStream is marked deprecated.

Unhacked thrift version requirement in pom

cbcb139

Update MultiBufferInputStream.java

8c6d66e

Made exception catch more specific for read() and readByte().

Added tests for ByteBufferInputStream

832c009

Also fixed bug discovered by that testing

rdblue reviewed Apr 27, 2022

parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java (outdated)

theosib-amazon added 4 commits April 27, 2022 15:24

Update ByteBufferInputStream.java

0dff9a6

Removed unnecessary comment Removed unused wrapper method

Update ByteBufferInputStream.java

cf60c02

Removed unused wrappers and constructors

Update ByteBufferInputStream.java

4ed532c

Removed line of code that was commented out instead of properly deleted

Update SingleBufferInputStream.java

9270501

Cleaned up some comments

rdblue reviewed Apr 27, 2022

parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java (outdated)

rdblue reviewed Apr 27, 2022

parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java (outdated)

theosib-amazon added 5 commits April 27, 2022 16:38

Update ByteBufferInputStream.java

a793ee3

Removed some more unused methods

Update LittleEndianDataInputStream.java

2adac33

@deprecated postpones to future PR

Update MultiBufferInputStream.java

491dd6c

Removed unnecessary methods. Reverted whitespace change.

Update SingleBufferInputStream.java

2042085

Removed unnecessary methods

Update MultiBufferInputStream.java

d59b1f9

Removed whitespace change

parthchandra mentioned this pull request May 18, 2022

PARQUET-2149: Async IO implementation for ParquetFileReader #968

Open

shangxinli reviewed Jul 24, 2022

parquet-common/src/main/java/org/apache/parquet/bytes/MultiBufferInputStream.java (outdated)

Update MultiBufferInputStream.java

8e8bde1

Use constants for byte size of words

sunchao reviewed Jul 28, 2022

theosib-amazon and others added 4 commits August 1, 2022 11:36

Update SingleBufferInputStream.java

736aedd

Minor code changes on request. Removed redundant code. Fixed code formatting nits. More informative exceptions. Removed constructor that just throws exception.

Update MultiBufferInputStream.java

39cb3f3

Minor code changes on request. Removed redundant code. Fixed code formatting nits. More informative exceptions.

Fixed compile and test failures

d94ad68

shangxinli reviewed Aug 21, 2022

parquet-common/src/main/java/org/apache/parquet/bytes/ByteBufferInputStream.java (outdated)

Update ByteBufferInputStream.java

b7049bc

Changed byte b[] to byte[] b

wgtmac reviewed Nov 11, 2022
