Skip to content

Conversation

@theosib-amazon
Copy link
Contributor

I broke up #953 into more digestible pieces. This new PR is the lowest level set of changes. By themselves, these additions to ByteBufferInputStream don't yield much improvement, so future PRs will include modifications to other source files that take advantage of this new functionality.

The complete set of changes (including subsequent PRs) is for performance optimization. In benchmarking with Trino, we find query performance to improve from 5% to 15%, depending on the query, and that includes all the I/O time from S3.

All of LittleEndianDataInputStream functionality is moved into ByteBufferInputStream, without changing any pre-existing interfaces or functionality. These changes yield the following benefits:

  • Elimination of extra layers of abstraction and method call overhead
  • Enable the use of intrinsics for readInt, readLong, etc.
  • Availability of faster access methods like readFully and skipFully, without the need for helper functions

This PR also marks LittleEndianDataInputStream as deprecated.

Context:
I've been working on improving Parquet reading performance in Trino, mostly by profiling while running performance benchmarks and TPCDS queries. This PR is a subset of the changes I made that have more than doubled the performance of a lot of TPCDS queries (wall clock time, including the S3 access time). If you are kind enough to accept these changes, I look forward to offering further performance improvements.

theosib-amazon and others added 4 commits April 25, 2022 20:50
…utStream

To improve performance, all multi-byte access functionality from
LittleEndianDataInputStream has been merged into ByteBufferInputStream.
LittleEndianDataInputStream is marked deprecated.
Made exception catch more specific for read() and readByte().
Also fixed bug discovered by that testing
@theosib-amazon
Copy link
Contributor Author

I added some tests for this new code.

Removed unnecessary comment
Removed unused wrapper method
Removed unused wrappers and constructors
Removed line of code that was commented out instead of properly deleted
Cleaned up some comments
Removed some more unused methods
Removed unnecessary methods. Reverted whitespace change.
Removed unnecessary methods
Removed whitespace change
@shangxinli
Copy link
Contributor

Is the '5% to 15%' gain from this change or along with other changes? If it is later, can you share the point to other changes? Like to see the overall changes before committing.

@theosib-amazon
Copy link
Contributor Author

That improvement comes from a larget set of changes. I have a design doc that goes over all those changes plus some more that make it possible to get even more performance improvements.

https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing

}
}

public int readUnsignedVarInt() throws IOException {
Copy link
Contributor

@shangxinli shangxinli Jul 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it copied from BytesUtils.java? I wonder why we don't use that directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not exactly. The one in BytesUtils calls methods that read one byte at a time. This one can take advantage of faster methods that read whole words at a time. This is a critical-path method, so it's a performance win to eliminate the extra level of abstraction and all the extra overhead fetching individual bytes and shifting.

return ((ch3 << 16) + (ch2 << 8) + (ch1 << 0));
}

public int readIntLittleEndianPaddedOnBitWidth(int bitWidth)
Copy link
Contributor

@shangxinli shangxinli Jul 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it copied from BytesUtils.java? I wonder why we don't use that directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one that reads three bytes may or may not be a win. A level of abstraction is eliminated by doing this. It's hard to say whether or not the JIT will be smart enough to do that automatically.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know about that method. The BytesUtils code always reads one byte at a time. My version will read a whole word at a time for short and int. This is faster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait. Are you referring to readIntLittleEndianPaddedOnBitWidth or readIntLittleEndianOnThreeBytes?

The former is definitely faster. An argument could be made to remove the latter, although it'll take longer for the JIT to hide the extra layers of virtual calls.

return Double.longBitsToDouble(readLong());
}

public int readIntLittleEndianOnThreeBytes() throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it copied from BytesUtils.java? I wonder why we don't use that directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my other comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my other comment on this. These two methods have the the same outcome, but mine is faster. I believe this is warranted for a performance critical path.

@Override
public int read(byte[] bytes) {
return read(bytes, 0, bytes.length);
public void readFully(byte[] bytes, int off, int len) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you cast a light why we need to add the implementation readFully() here? For performance improvement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are situations where we need to read an exact number of bytes and throw an exception if not enough are available. This is faster than reading maybe enough and then checking, and this is a performance critical path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just track the remaining bytes of the stream on the client side and check before reading any bytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, the user of the class would read exactly the right number of bytes. These checks and exceptions exist only to catch bugs elsewhere. This is one reason why it's important to minimize the overhead of these checks in such performance-critical methods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference between this method and read() is mainly to precheck if there is enough remaining length. I believe this can be done by wrapping up the read() method and adding the prechecks. Duplicating the code makes it harder to maintain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make these changes if you insist. But those prechecks are expensive, which is why I'm trying to avoid them when possible in a performance critical path.

}

int bytesRead = 0;
while (bytesRead < len) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems duplicate with above line 244.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two key differences that make it hard to combine them without hurting performance for one, the other, or both, and they're both performance critical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicating code is hard to maintain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, my objective here is to maximize performance. So we have to decide between maintainability and performance. Let's deliberate over this a bit more, and I'll do what you think is best.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, my objective here is to maximize performance. So we have to decide between maintainability and performance. Let's deliberate over this a bit more, and I'll do what you think is best.

I deeply understand this is a difficult trade-off. Do you have any evidence on the performance penalty if we wrap read and readFully methods to share some common logic? If the penalty is acceptable, we should definitely go for maintainability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did test a lot of tradeoffs, but I don't think I tested this one thing directly. It's also been quite a while since I did this, so I don't think I'd be able to figure out which spreadsheets have the relevant data.

@shangxinli
Copy link
Contributor

@sunchao Can you have a review?

Use constants for byte size of words
@sunchao
Copy link
Member

sunchao commented Jul 25, 2022

Is this mostly a refactoring PR? I also don't see LittleEndianDataInputStream being marked as deprecated.

@theosib-amazon
Copy link
Contributor Author

Is this mostly a refactoring PR? I also don't see LittleEndianDataInputStream being marked as deprecated.

I initially marked LittleEndianDataInputStream as deprecated. But I seem to recall that I was advised by Ryan Blue to do that in a later PR.


try {
buffer.position(buffer.position() + (int)n);
} catch (IllegalArgumentException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of try and catch, can we check if the new position is greater than the buffer.limit and throw EOF if so?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way on purpose. This way is always faster. Doing the check has to wait until there's profiling info and the C2 compiler gets hold of this code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already don't like the fact that I have to check the argument to make sure it's not negative and not bigger than int max. buffer.positiion() already checks for the position going out of bounds and throws an exception, so it would be redundant to have another check for the exact same thing here. A catch for an exception that never happens is basically always free, while a test for a condition that never happens is not free until profiling gets enough info about it for the C2 compiler to eliminate it.

Copy link
Contributor

@shangxinli shangxinli Aug 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The try/catch is also more expensive than checking. I agree with Chao to have the check instead of try/catch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A try/catch is basically free when no exception is thrown, while a check is not. I have tested this, and the try/catch is empirically faster, since the no-exception case is the common case. Putting in the check means we have to wait until the C2 compiler produces a trace without the branch. But even then, there's always the overhead of a check somewhere to be able to fall back to interpretation if the condition is not correct for the trace. I'm avoiding all of that entirely, making this faster in of the most common case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A try/catch is basically free when no exception is thrown, while a check is not. I have tested this, and the try/catch is empirically faster, since the no-exception case is the common case. Putting in the check means we have to wait until the C2 compiler produces a trace without the branch. But even then, there's always the overhead of a check somewhere to be able to fall back to interpretation if the condition is not correct for the trace. I'm avoiding all of that entirely, making this faster in of the most common case.

That's interesting. A caveat is that try/catch may prohibit code optimization. But I doubt whether it makes significant difference on the simple buffer operation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I directly tested this, and it made a small but measurable difference.

this.buffer.order(java.nio.ByteOrder.LITTLE_ENDIAN);
}

SingleBufferInputStream(byte[] inBuf) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems these 3 constructors are not used - can we at least add some tests to cover them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I put this here I think because I'm trying to make BufferInputStream have the same functionality as multiple other things that I'm combining into one. Some tests would be good. I'll see about writing some tests soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this constructor because it's much better to have this be a compile-time error than a runtime error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and added two tests to cover the unused SingleBufferInputStream constructors. I considered just deleting these constructors, but I decided that it might be valuable to include them as documentation on how to do this in a way that is congruent to the behavior of HeapByteBuffer, just in case anyone wanted to do this in the future. There's also the risk that someone would think they have to wrap an array with ByteBuffer before using SingleBufferInputStream, but it would be better to avoid the overhead of ByteBuffer.duplicate(). (ByteBuffer.duplicate() appears to take constant time by making a reference to the same backing array, but it's a big constant with loads of checks.)

this.startPosition = 0;
}

SingleBufferInputStream(List<ByteBuffer> inBufs) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm why we need this constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, I used to have a similar constructor for ByteBufferInputStream, but I was advised to remove it, so I had this here for uniformity. We can remove this.

@Override
public int read(byte[] bytes) {
return read(bytes, 0, bytes.length);
public void readFully(byte[] bytes, int off, int len) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just track the remaining bytes of the stream on the client side and check before reading any bytes?

theosib-amazon and others added 4 commits August 1, 2022 11:36
Minor code changes on request. Removed redundant code. Fixed code formatting nits. More informative exceptions. Removed constructor that just throws exception.
Minor code changes on request. Removed redundant code. Fixed code formatting nits. More informative exceptions.
…tream.

Rather than deleting these, I decided to keep the constructs as documentation
on how to do these things according to how HeapByteBuffer wraps arrays. And
it would be best if we didn't have to suffer the overhead of creating a
ByteBuffer first in order to create a SingleBufferInputStream.
readFully(b, 0, b.length);
}

public void readFully(byte b[], int off, int len) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't see where it is used. Don't know why it is 'public'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what the method signatures for DataInputStream.readFully look like.

I also have a whole bunch of other performance improvements I want to contribute (https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing), and I think this might get used in some of that code.

I'm very soon going to publish an open preview of all of my proposed changes to a branch of my own fork, so we'll be able to check this out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed byte b[] to byte[] b
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, the open source community priorities code maintainability more than performance gain as the project is widely adopted and maintained by many users and developers. However, if there is any concrete benchmark result to provide compelling evidence, it will be helpful to the discussion. @theosib-amazon @shangxinli @sunchao


try {
buffer.position(buffer.position() + (int)n);
} catch (IllegalArgumentException e) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A try/catch is basically free when no exception is thrown, while a check is not. I have tested this, and the try/catch is empirically faster, since the no-exception case is the common case. Putting in the check means we have to wait until the C2 compiler produces a trace without the branch. But even then, there's always the overhead of a check somewhere to be able to fall back to interpretation if the condition is not correct for the trace. I'm avoiding all of that entirely, making this faster in of the most common case.

That's interesting. A caveat is that try/catch may prohibit code optimization. But I doubt whether it makes significant difference on the simple buffer operation.

}

int bytesRead = 0;
while (bytesRead < len) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, my objective here is to maximize performance. So we have to decide between maintainability and performance. Let's deliberate over this a bit more, and I'll do what you think is best.

I deeply understand this is a difficult trade-off. Do you have any evidence on the performance penalty if we wrap read and readFully methods to share some common logic? If the penalty is acceptable, we should definitely go for maintainability.

return delegate.readUnsignedByte();
}

public short readShort() throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea to provide these read functions to enable larger read. BTW, is there any use case to read a batch of shorts (and other numeric types)?

Copy link
Contributor Author

@theosib-amazon theosib-amazon Nov 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I use the new batch read methods heavily in some optimizations I made to Trino. As for short, I can't say I recall any uses in Trino of readShorts(). readShort() is used indirectly through a method that reads a variable sized representation.

@shangxinli
Copy link
Contributor

@theosib-amazon Thanks again for your contribution! I see the comments are generally around duplicating code, refactoring, and making code maintainable. If you have a measurement of improvement on this change alone, it would help the reviewers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants