PARQUET-2149: Async IO implementation for ParquetFileReader #968
base: master
Conversation
|
Anyone know why the CI checks are failing with a SocketTimeout exception, and what to do to address this? |
|
@parthchandra do you have a performance benchmark? Thanks |
|
I have some numbers from an internal benchmark using Spark. I didn't see any benchmarks in the Parquet codebase that I could reuse. Here are the numbers from my own benchmark -
After removing the two highest and two lowest runs for each case, and taking the median value: Async: 90 sec |
|
Great effort! Will have a look after the build succeeds. |
|
@parthchandra Would you mind having a look at my I/O performance optimization plan for ParquetMR? I think we should coordinate, since we have some ideas that might overlap in what we touch. |
|
@theosib-amazon I read your document and went thru #960. It looks like, for the most part, #960 and this PR complement each other. The overlap I see is in the changes to |
|
@parthchandra One thing that confuses me a bit is that these buffers have only ByteBuffer inside them. There's no actual I/O, so it's not possible to block. Do you have subclasses that provide some sort of access to real I/O? |
/**
 * Close the page reader. By default it is no-op.
 */
default void close() throws IOException {}
Should we add Closeable as an implemented interface?
Sure.
Cool. A lot of times I check whether something is an instance of Closeable and then call close.
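For reference, a minimal sketch of what extending Closeable might look like here, assuming the quoted default method lives on the page reader interface (called PageReader below; the default no-op body matches the quoted snippet):

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical sketch: extending Closeable lets callers use try-with-resources or an
// "instanceof Closeable" check without a cast, while keeping the default no-op close().
public interface PageReader extends Closeable {

  /**
   * Close the page reader. By default it is a no-op.
   */
  @Override
  default void close() throws IOException {
    // no-op by default; implementations that hold resources override this
  }
}
```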
checkState();
// hack: parent constructor can call this method before this class is fully initialized.
// Just return without doing anything.
if (readFutures == null) {
  return false;
}
Because checkState has synchronization, would it be safe to move the checkState before this somehow or add some kind of less expensive boolean check that we can set to true immediately after the super() call?
Glad you made me look. The synchronization was needed because I was saving any exception that might have occurred in the reader thread. However, that exception is captured by the Future object and thrown as an ExecutionException when Future.get is called at line 135.
So we don't really need to save the exception, and the synchronization around it can also be removed.
Good point. |
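For reference, a minimal, self-contained illustration (not code from the PR) of why the separate exception bookkeeping is unnecessary: an exception thrown inside the submitted task is captured by the Future and resurfaces as an ExecutionException from get().

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Demonstration only: the reader task's exception travels through the Future,
// so no synchronized "saved exception" field is needed.
public class FutureExceptionDemo {
  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    Callable<Void> readTask = () -> {
      throw new IOException("simulated read failure in reader thread");
    };
    Future<Void> readFuture = pool.submit(readTask);
    try {
      readFuture.get();                                    // blocks, then rethrows
    } catch (ExecutionException e) {
      System.out.println("caught cause: " + e.getCause()); // the original IOException
    } finally {
      pool.shutdown();
    }
  }
}
```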
|
@theosib-amazon I applied my PR on top of your PR, ran thru some tests using Spark, and hit no issues. (All unit tests passed as well). |
kazuyukitanimura left a comment
Thank you @parthchandra A few cosmetic comments
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
I can't seem to find where Iterator is used... Should we remove it if that is the case?
Yeah. This and a bunch of other imports are not used. Removing them all.
import java.util.Queue;

import java.util.concurrent.LinkedBlockingDeque;
import org.apache.parquet.ParquetRuntimeException;
Also I am not sure if ParquetRuntimeException is used...
Removed
int fetchIndex = 0;
int readIndex = 0;
ExecutorService threadPool;
LinkedBlockingQueue<Future<Void>> readFutures;
boolean closed = false;

LongAdder totalTimeBlocked = new LongAdder();
LongAdder totalCountBlocked = new LongAdder();
LongAccumulator maxTimeBlocked = new LongAccumulator(Long::max, 0L);
Does it make sense to change these to private if they are not accessed from other places?
Done
import java.util.Map;
import java.util.Optional;
import java.util.PrimitiveIterator;
import java.util.Queue;
I think Queue is no longer used with this change. Should we remove the import?
Hadoop is adding a vectored IO API intended for libraries like ORC and Parquet to use: the application provides an unordered list of ranges and a ByteBuffer supplier, and gets back a list of futures to wait for (see the sketch below). The base implementation simply reads using the readFully API. S3A (and later ABFS) will do full async retrieval itself, using the HTTP connection pool. Both vectored IO and S3A prefetching will ship this summer in Hadoop 3.4.0. I don't see this change conflicting with them, though they may obsolete a lot of it. Have you benchmarked this change with the ABFS or Google GCS connectors to see what difference it makes there? |
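For context, a rough sketch of how the vectored IO API described above is used (Hadoop 3.3.5+; the method and option names here are from memory and may differ slightly in the released API):

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;

import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch only: the caller hands over an unordered list of ranges plus a buffer
// allocator; each range exposes a future that completes when its buffer is filled.
public class VectoredReadSketch {
  static void readColumnChunks(FSDataInputStream in) throws Exception {
    List<FileRange> ranges = Arrays.asList(
        FileRange.createFileRange(4L, 1_000_000),         // e.g. first column chunk
        FileRange.createFileRange(2_500_000L, 800_000));  // e.g. another chunk

    // The stream (or the base-class fallback that uses readFully) fills the buffers.
    in.readVectored(ranges, ByteBuffer::allocate);

    for (FileRange range : ranges) {
      CompletableFuture<ByteBuffer> data = range.getData();
      ByteBuffer buf = data.get();   // wait for this range to complete
      // ... decompress / decode the chunk from buf ...
    }
  }
}
```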
steveloughran left a comment
Minor comments. I would recommend looking at https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/filesystem/fsdatainputstream.html and lifting the tests in hadoop-common to make sure they work with this stream, because, correct or not, the semantics there are what applications often expect.
LOG.debug("ASYNC Stream: READ - {}", timeSpent / 1000.0);
long putStart = System.nanoTime();
long putCompleted = System.nanoTime();
LOG.debug("ASYNC Stream: FS READ (output) BLOCKED - {}",
how does this work?
It doesn't. Sorry, this got left behind after a cleanup.
}

@Override
public int read(ByteBuffer out) {
Are you 100% confident that all uses of read() in Parquet are in a single thread? It is not unknown for apps to read across threads, even though the Java APIs say "don't".
Pretty sure it's always in a single thread. The stream is read from in only one place in ParquetFileReader, and the original code was always single threaded. The async + parallel column reader creates a new instance for every column reader, and no column reader crosses a thread boundary.
good to know.
if (iterator.hasNext()) {
  current = iterator.next();
} else {
  throw new EOFException("End of streams");
InputStream.read mandates returning -1 on EOF
Yes, it does.
I based this on the fact that, in most cases, this class would encapsulate a MultiBufferInputStream, and that class throws an EOFException instead of returning -1. I was afraid that not throwing the exception would cause something to fail because the calling code would not be checking for -1.
Let me try changing this to -1 and see.
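For reference, a tiny self-contained illustration of the contract being discussed (not code from the PR): callers of InputStream.read() typically loop until -1 rather than catching EOFException.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Demonstration only: InputStream.read() signals EOF by returning -1,
// and idiomatic callers check for that rather than handling EOFException.
public class EofContractDemo {
  public static void main(String[] args) throws IOException {
    try (InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3})) {
      int b;
      while ((b = in.read()) != -1) {   // -1 means end of stream; no exception is thrown
        System.out.println("read byte: " + b);
      }
      System.out.println("EOF reached, read() returned -1");
    }
  }
}
```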
|
@steveloughran thank you very much for taking the time to review and provide feedback!
I was working with s3a -
I haven't worked with abfs or gcs. If the connectors do async pre-fetching, that would be great. Essentially, the time the Parquet reader has to block in the file system API would be reduced substantially. In that case, we could turn the async reader on/off and rerun the benchmark to compare. From past experience with MaprFS, which had very aggressive read-ahead in its hdfs client, I would still expect better Parquet speeds. Turning prefetch off when a seek occurs is the usual behaviour, but we may see no benefit from the connector in that case. So a combination of async reader and async connector might end up being a great solution (maybe at slightly greater CPU utilization). We would still have to run a benchmark to see the real effect.
Yes, I became aware of this recently. I'm discussing integration of these efforts in a separate channel. At the moment I see no conflict, but have yet to determine how much of this async work would need to be changed. I suspect we may be able to eliminate or vastly simplify
No, I have not. Would love help from anyone in the community with access to these. I only have access to S3. |
Thanks, that means you are current with all shipping improvements. The main extra one is to use openFile(), passing in the length and requesting random IO; this guarantees ranged GET requests and cuts the initial HEAD probe for the existence/size of the file (see the sketch below).
That I have. FWIW, with the right tuning of ABFS prefetch (4 threads, 128 MB blocks) I can get full FTTH link rate from a remote store: 700 Mbit/s. That's to the base station; once you add WiFi, the bottlenecks move. |
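A hedged sketch of the openFile() call mentioned above (Hadoop 3.3+ builder API; the option names are my best recollection and may vary by release):

```java
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: declaring the read policy and the known file length lets the S3A
// connector skip the HEAD probe for size and issue ranged GET requests.
public class OpenFileSketch {
  static FSDataInputStream openForRandomIO(FileSystem fs, Path path, long knownLength)
      throws Exception {
    return fs.openFile(path)
        .opt("fs.option.openfile.read.policy", "random")               // request random IO
        .opt("fs.option.openfile.length", Long.toString(knownLength))  // avoid HEAD for size
        .build()
        .get();  // the builder returns a CompletableFuture<FSDataInputStream>
  }
}
```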
*
*
* http://www.apache.org/licenses/LICENSE-2.0
*
It looks like you're running into the same bug in IntelliJ as I am, where it makes whitespace changes without authorization. Would you mind commenting on this bug report that I filed?
https://youtrack.jetbrains.com/issue/IDEA-293197/IntelliJ-makes-unauthorized-changes-to-whitespace-in-comments-wo
ack
|
Is byte (and arrays and buffers of bytes) the only datatype you support? My PR is optimizing code paths that pull ints, longs, and other sizes out of the data buffers. Are those not necessary for any of the situations where you're using an async buffer? |
Wow! That is nearly as fast as a local HDD. At this point the bottlenecks in Parquet begin to move towards decompression and decoding, but IO remains the slowest link in the chain. One thing we get with my PR: the ParquetFileReader had assumptions built in that all data must be read before downstream processing can proceed. Some of my changes remove these assumptions and ensure that downstream processing does not block until an entire column is read, so we get efficient pipelining. |
|
Latency is the killer; in an HTTP request you want to read enough but not discard data or break an HTTP connection if the client suddenly does a seek() or readFully() somewhere else. File listings, existence checks, etc.
That'd be great. Now, if you could also handle requesting different columns in parallel and processing them out of order...
This was the ABFS client set to do four GET requests of 128 MB each. This would be awful for column stores where smaller ranges are often requested/processed before another seek is made, but quite often Parquet does do more back-to-back reads than just one read/readFully request.
It would be good to think about vectored IO. And yes, updating Parquet dependencies would be good; Hadoop 3.3.0 should be the baseline. I just sketched out my thoughts on this and have played with some of it in my own branch. I think the next step would be for me to look at the benchmark code to make it targetable elsewhere. https://docs.google.com/document/d/1y9oOSYbI6fFt547zcQJ0BD8VgvJWdyHBveaiCHzk79k/ |
|
This is interesting, because when I did profiling of Trino, I found that although I/O (from S3, over the network no less) was significant, even more time was spent in compute. Maybe you're getting improved performance because you're increasing parallelism between I/O and compute. |
I do. The Parquet file reader API that reads row groups in sync mode reads all columns in sequence. In async mode, it fires off a task for every column, blocking only to read the first page of each column before returning. This part also uses a different thread pool from the IO tasks, so that IO tasks never wait because there are no available threads in the pool (see the sketch below).
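A rough, hypothetical sketch of the two-pool arrangement described above (class and field names are illustrative, not the PR's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: column-processing tasks go to one pool while the blocking reads
// they trigger go to a separate IO pool, so a saturated processing pool can never
// starve the IO tasks of threads.
public class TwoPoolSketch {
  private final ExecutorService ioPool = Executors.newFixedThreadPool(8);
  private final ExecutorService processPool = Executors.newFixedThreadPool(4);

  List<Future<?>> readRowGroup(List<Runnable> perColumnReaders) {
    List<Future<?>> tasks = new ArrayList<>();
    for (Runnable columnReader : perColumnReaders) {
      // Each column gets its own task; inside it, the actual file reads would be
      // submitted to ioPool, and only the first page is waited for eagerly.
      tasks.add(processPool.submit(columnReader));
    }
    return tasks;
  }
}
```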
I think I know how to integrate this PR with the vectored IO, but this is only after a cursory look.
Who can drive this (presumably) non-trivial change? I myself have no karma points :(
This is great. I now have much more context on where you are coming from (and going to)! |
I'm still learning Parquet's structure. So it sounds to me like these buffer input streams are used twice: once to get the data and decompress it, and then once again to decode it into data structures. Is that correct? It sounds like you're optimizing one layer of processing and I'm optimizing the next layer up, and it's kind of a coincidence that we're touching some of the same classes, just because code reuse has been possible here. |
|
BTW, adding more tests for the InputStream implementations. |
Yeah, kind of cool :) |
It may be because I was using Spark's vectorized Parquet decoding, which is an order of magnitude faster than the Parquet library's row-by-row decoding (see the Spark benchmarks). If Trino is not doing vectorized decoding (I took a very quick look and I don't think it is), I would suggest you look into that next. All the cool kids are doing it. |
+1 on upgrading to 3.3.0, although Parquet currently uses Hadoop 2.10.1 as a provided dependency and we need to make sure it continues to work with Hadoop 2.x.
Presto already has a batch reader, but it seems the feature is not in Trino yet. The batch reader did help a lot to reduce the CPU load. See the slides. |
|
@wgtmac Thank you! |
    readResult.get(1, TimeUnit.MILLISECONDS);
  }
} catch (Exception e) {
  // Do nothing
Do you think adding a log makes sense?
It's not really useful to log anything here because there is nothing to be done. Some of the calls in the try block throw exceptions, and we catch them to prevent them from propagating. We are shutting down and cleaning up the futures, so it doesn't matter whether there was an exception or not.
@Override
public boolean nextBuffer() {
  checkState();
  // hack: parent constructor can call this method before this class is fully initialized.
I see the comment with 'hack'. What is the proper implementation?
The fix would be to change the behavior of the parent class. Basically, the parent class implicitly assumes that there will be no derived classes and calls a private method, nextBuffer, in its constructor. I extended the class and made nextBuffer protected, so it is no longer correct to call it in the constructor.
Changing the parent class behavior broke too many unit tests. I could have fixed the unit tests as well, but wasn't sure whether downstream projects might be depending on the behavior.
In the end, it seemed that handling this in the derived class was a safer approach.
It's really a workaround to make things easier to implement. Perhaps I should not have called it a hack.
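To illustrate the initialization-order issue being described, here is a small, self-contained sketch with hypothetical class names (the real classes in the PR differ): the parent constructor calls an overridable method before the subclass's fields are initialized, so the override guards against its own state still being null.

```java
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

// Illustration only of the Java pitfall discussed above.
class BufferInputStream {
  BufferInputStream() {
    nextBuffer();           // runs before any subclass field initializers
  }

  protected boolean nextBuffer() {
    return false;
  }
}

class AsyncBufferInputStream extends BufferInputStream {
  private final LinkedBlockingQueue<Future<Void>> readFutures = new LinkedBlockingQueue<>();

  @Override
  protected boolean nextBuffer() {
    // Guard: when invoked from the parent constructor, readFutures is still null.
    if (readFutures == null) {
      return false;
    }
    // ... normal async logic would go here ...
    return !readFutures.isEmpty();
  }
}
```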
|
@kazuyukitanimura @steveloughran @kbendick @ggershinsky @wgtmac @theosib-amazon Do you still have comments? |
It looks good to me. Please feel free to merge as you see fit. @shangxinli |
|
Hi, I am very interested in this optimization and just have some questions after testing on a cluster with 4 nodes/96 cores using Spark 3.1. Unfortunately, I see little improvement. |
You're likely to see improvement in cases where file I/O is the bottleneck. Most TPC-DS queries are join heavy, and you will see little improvement there. You might do better with TPC-H.
It's probably best to keep the Parquet (read) buffer size untouched. You should keep |
|
My test is spark.sql("select * from store_sales order by ss_customer_sk limit 10"); store_sales is a table from 1 TB TPC-DS. I also ran the following test against a local filesystem using a 100 GB TPC-DS store_sales table, and I see a degradation with the async IO feature. test("parquet reader") { without this feature -> time: 7240 Where did I go wrong, and can you show me the correct way to use this feature? |
|
Looks correct to me. A couple of questions: are you running this on a cluster or on a local system? Also, is the data on SSDs? If you are on a single machine, there might not be enough CPU, and the async threads may be contending for time with the processing threads. We'll need some profiling info to get a better diagnosis. Also, with SSDs, reads from the file system are so fast that the async feature might show very little improvement. |
OK, I will test with debug logging for more details in the next few days. |
|
Also try a query like which avoids the expensive sort |
|
Sorry to interrupt, just wondering if this PR can work with S3AsyncClient by any chance? @parthchandra Thanks! |
This PR uses the Hadoop filesystem interface to S3 (s3a) if the URL is s3a. I don't think that the s3a implementation uses any of the implementations of S3AsyncClient. |
|
@parthchandra Do you have time to resolve the conflicts? I think it would be nice to be included in the next release. |
Done |
|
@hazelnutsgz Hadoop 3.3.5 supports vector IO on an S3 stream: async parallel fetch of blocks, which also works on the local FS (GCS and ABFS are TODO items). We see significant performance increases there. There's a PR for it, though as it is 3.3.5+ only, it is not merged into the ASF Parquet branches unless they move to it, or until we finish a shim library to offer (serialized) support for the API on older releases. |
|
FWIW the Hadoop 3.3.5 vector IO changes might make this PR redundant.
FWIW I'll be at Berlin Buzzwords in June and one of my two talks will be on this... if anyone from the Parquet dev team is around, it'd be great to have you in the session. |
|
@steveloughran 🤚🏻 Looking forward to your talk! |
steveloughran left a comment
Finally sat down to scan through this; I think I'd need to look at this some more to understand it and compare it with vector IO.
);
} catch (InterruptedException e) {
  Thread.currentThread().interrupt();
  throw new RuntimeException(e);
It would be nice if this always raised an UncheckedIOException, here and elsewhere.
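One possible shape of that suggestion, as a hedged sketch (the helper names are hypothetical): wrap the interrupt in an InterruptedIOException and rethrow it as an UncheckedIOException while preserving the thread's interrupt status.

```java
import java.io.InterruptedIOException;
import java.io.UncheckedIOException;
import java.util.concurrent.BlockingQueue;

// Illustration only: callers then see a single unchecked, IO-flavoured exception type.
class InterruptHandlingSketch {
  static <T> T takeUnchecked(BlockingQueue<T> queue) {
    try {
      return queue.take();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();  // preserve the interrupt status
      InterruptedIOException ioe = new InterruptedIOException("read interrupted");
      ioe.initCause(e);
      throw new UncheckedIOException(ioe);
    }
  }
}
```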
ParquetFileReader.setAsyncIOThreadPool(parquetIOThreadPool, false);
ParquetFileReader.setAsyncProcessThreadPool(parquetProcessThreadPool, false);
try {
  conf.set("parquet.read.async.io.enabled", Boolean.toString(true));
or use setBoolean()
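For illustration, the equivalent call using setBoolean(), as suggested (the property name is copied from the quoted snippet):

```java
import org.apache.hadoop.conf.Configuration;

// Same effect as conf.set(..., Boolean.toString(true)), but typed.
public class SetBooleanExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setBoolean("parquet.read.async.io.enabled", true);
  }
}
```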
public static String PARQUET_READ_PARALLELISM = "parquet.metadata.read.parallelism";

private final ParquetMetadataConverter converter;
// Thread pool to read column chunk data from disk. Applications should call setAsyncIOThreadPool
Could we have javadocs here so IDEs can show popups with their details?
private final ParquetMetadataConverter converter;
// Thread pool to read column chunk data from disk. Applications should call setAsyncIOThreadPool
// to initialize this with their own implementations.
public static ExecutorService ioThreadPool;
should these be public?
boolean useColumnIndexFilter,
boolean usePageChecksumVerification,
boolean useBloomFilter,
boolean enableAsyncIOReader,
Life would be a lot simpler if this code moved to a parameter object containing all the args: adding new options would be straightforward. We did this for the S3ClientFactory to stop external implementations having incompatible signatures when new stuff went in.
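A hedged sketch of the parameter-object/builder pattern being suggested (the names are illustrative; parquet-mr's existing ParquetReadOptions follows a similar shape):

```java
// Illustration only: new options are added to the builder without changing the
// reader's constructor signature, so callers keep compiling.
public final class ReadOptionsSketch {
  private final boolean useColumnIndexFilter;
  private final boolean usePageChecksumVerification;
  private final boolean useBloomFilter;
  private final boolean enableAsyncIOReader;

  private ReadOptionsSketch(Builder b) {
    this.useColumnIndexFilter = b.useColumnIndexFilter;
    this.usePageChecksumVerification = b.usePageChecksumVerification;
    this.useBloomFilter = b.useBloomFilter;
    this.enableAsyncIOReader = b.enableAsyncIOReader;
  }

  public boolean isAsyncIOReaderEnabled() {
    return enableAsyncIOReader;
  }

  public static final class Builder {
    private boolean useColumnIndexFilter = true;
    private boolean usePageChecksumVerification = false;
    private boolean useBloomFilter = true;
    private boolean enableAsyncIOReader = false;

    public Builder withAsyncIOReader(boolean enabled) {
      this.enableAsyncIOReader = enabled;
      return this;
    }
    // ... similar with* methods for the other flags ...

    public ReadOptionsSketch build() {
      return new ReadOptionsSketch(this);
    }
  }
}
```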
try {
  compressedPage = compressedPages.take().orElse(null);
} catch (InterruptedException e) {
  throw new RuntimeException("Error reading parquet page data.");
nit: include caught exception
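A minimal illustration of the nit (the surrounding names are assumed from the quoted snippet): passing the caught exception as the cause preserves the original stack trace.

```java
import java.util.Optional;
import java.util.concurrent.LinkedBlockingQueue;

// Illustration only: same logic as the quoted snippet, but with the cause attached.
class TakeWithCause {
  static <T> T takeOrNull(LinkedBlockingQueue<Optional<T>> compressedPages) {
    try {
      return compressedPages.take().orElse(null);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException("Error reading parquet page data.", e); // include the cause
    }
  }
}
```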
break;
case DATA_PAGE:
  DataPageHeader dataHeaderV1 = pageHeader.getData_page_header();
  pageBytes = chunk.readAsBytesInput(compressedPageSize);
I've had to do the pool pair in other code to avoid this. What we also do (and I forget where...) is collect statistics on the time spent waiting for an executor. This helps identify when pool size is limiting performance.
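For illustration, a small sketch of the kind of blocked-time statistics being described, mirroring the LongAdder/LongAccumulator fields quoted earlier in this review (the names and wiring here are illustrative):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.LongAccumulator;
import java.util.concurrent.atomic.LongAdder;

// Illustration only: time every blocking take() so pool-size bottlenecks show up in stats.
public class BlockedTimeStats {
  private final LongAdder totalTimeBlockedNanos = new LongAdder();
  private final LongAdder totalCountBlocked = new LongAdder();
  private final LongAccumulator maxTimeBlockedNanos = new LongAccumulator(Long::max, 0L);

  <T> T takeTimed(LinkedBlockingQueue<T> queue) throws InterruptedException {
    long start = System.nanoTime();
    T item = queue.take();                   // may block waiting on a producer/executor
    long blocked = System.nanoTime() - start;
    totalTimeBlockedNanos.add(blocked);
    totalCountBlocked.increment();
    maxTimeBlockedNanos.accumulate(blocked);
    return item;
  }

  public String summary() {
    return String.format("blocked %d times, total %.3f ms, max %.3f ms",
        totalCountBlocked.sum(),
        totalTimeBlockedNanos.sum() / 1e6,
        maxTimeBlockedNanos.get() / 1e6);
  }
}
```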
|
Thanks for the comments @steveloughran. Not sure if we are still thinking of merging this. I'm not sure how well this will play with vector IO. |
|
Thanks. Really looking at how to maximise Parquet cloud read performance right now. If you have traces, flame graphs, etc., that'd be good, especially for queries with different types and selectivity. |
|
I'll be able to spend some time on this this week (I bet it will not rebase on top of latest without a ton of conflicts). |
on heap or off-heap buffers?
Recent profiling of Hive Tez has shown that, because it closes all FS instances in the worker process after sub-query execution, the time to instantiate the S3A filesystem and negotiate TLS connections becomes significant |
On heap. |
Force-pushed: 5fafc01 to e6647b7
|
@steveloughran I rebased this, so now we can read using either vector IO or the async reader (but not both). Haven't been able to find time to compare. |
|
I think it'd be great to have a benchmark which could target remote stores; the current one uses local storage where the latencies are lower, so the differences are less tangible (though vector IO is fast with SSD, as you can do parallel reads directly into buffers) |

Jira
This PR addresses PARQUET-2149: Implement async IO for the Parquet file reader.
Tests
This PR adds the following unit tests
AsyncMultiBufferInputStream.*
TestMultipleWriteRead.testReadWriteAsync
TestColumnChunkPageWriteStore.testAsync
The PR is also tested by changing the default configuration to make all reads async and then ensuring that all unit tests pass.