HADOOP-11867. Add a high performance vectored read API to file system.#3499
HADOOP-11867. Add a high performance vectored read API to file system.#3499mukund-thakur wants to merge 13 commits intoapache:trunkfrom
Conversation
|
Quick review.
would be good for QE tests
|
hadoop-common-project/benchmark/src/main/java/org/apache/hadoop/benchmark/AsyncBenchmark.java
Outdated
Show resolved
Hide resolved
hadoop-common-project/benchmark/src/main/java/org/apache/hadoop/benchmark/AsyncBenchmark.java
Outdated
Show resolved
Hide resolved
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/PositionedReadable.java
Outdated
Show resolved
Hide resolved
...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/AsyncReaderUtils.java
Outdated
Show resolved
Hide resolved
...-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/CombinedFileRange.java
Outdated
Show resolved
Hide resolved
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java
Outdated
Show resolved
Hide resolved
bogthe
left a comment
There was a problem hiding this comment.
Did an initial review and looks promising.
I'm super curious to see what an async read will bring in terms of performance improvements for a job (even if it's a quick scrappy implementation in the FS).
hadoop-common-project/benchmark/src/main/java/org/apache/hadoop/benchmark/AsyncBenchmark.java
Outdated
Show resolved
Hide resolved
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java
Outdated
Show resolved
Hide resolved
Conflicts: hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BufferedFSInputStream.java hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java pom.xml
1a03999 to
fa3ebc1
Compare
hadoop-common-project/hadoop-common/src/site/markdown/filesystem/fsdatainputstream.md
Outdated
Show resolved
Hide resolved
|
I just realized I had some pending comments on one of the updates; maybe out of date now and they were all minor. Let me re-review. |
steveloughran
left a comment
There was a problem hiding this comment.
here's some comments I had outstanding since 2021 .. sorry. will do a bigger review now
...mon-project/hadoop-common/src/test/java/org/apache/hadoop/fs/impl/TestVectoredReadUtils.java
Show resolved
Hide resolved
...mon-project/hadoop-common/src/test/java/org/apache/hadoop/fs/impl/TestVectoredReadUtils.java
Show resolved
Hide resolved
...-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/VectoredReadUtils.java
Show resolved
Hide resolved
| return ChecksumFSOutputSummer.CHKSUM_AS_FRACTION * size; | ||
| } | ||
|
|
There was a problem hiding this comment.
These are all needless and making the patch a bit bigger than it should be. was yetus complaining?
There was a problem hiding this comment.
I don't know how it is showing up here. In intellJ diff I don't see this.
| FileSystem fs = getFileSystem(); | ||
| List<FileRange> fileRanges = new ArrayList<>(); | ||
| fileRanges.add(new FileRangeImpl(1293, 25837)); | ||
| try (FSDataInputStream in = fs.open(path(VECTORED_READ_FILE_1MB_NAME))) { |
There was a problem hiding this comment.
can you use openFile here? just to make sure that codepath is happy
...oop-common/src/test/java/org/apache/hadoop/fs/contract/AbstractContractVectoredReadTest.java
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
Show resolved
Hide resolved
|
OK, I've had a look at the recent changes, and had some shared screen reviews over zoom here's what I propose
where those changes can be
then, once happy we can do a rebase followed by merge commit into trunk/branch-3. key point: this patch has been going and is ready to stabilize, which can be done with incremental changes in a feature branch. |
mehakmeet
left a comment
There was a problem hiding this comment.
Did an initial Review, looks really good.
I think we should add an integration test around the ordered disjoint list of ranges as well to showcase them not merging and reading or having a few ranges where some are merged and some aren't by validating them.
Also, are both minSeek and maxReadSize going to be configurable? I assume they are hardcoded just so you can test them out..
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
Show resolved
Hide resolved
...oop-common/src/test/java/org/apache/hadoop/fs/contract/AbstractContractVectoredReadTest.java
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
Show resolved
Hide resolved
hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/test/MoreAsserts.java
Show resolved
Hide resolved
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInputStream.java
Show resolved
Hide resolved
...mmon/src/test/java/org/apache/hadoop/fs/contract/localfs/TestLocalFSConractVectoredRead.java
Show resolved
Hide resolved
Yeah already in my TODO. Thanks.
Also Please refer to the new PR now. #3904 |
|
now this is in the feature branch, this can be closed |
Rebased work on top of #1830
Conflicts:
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/BufferedFSInputStream.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/ChecksumFileSystem.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSDataInputStream.java
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/RawLocalFileSystem.java
pom.xml
Description of PR
Adding support for multiple ranged read async api in PositionedReadable. The default iterates through the ranges to read each synchronously, but the intent is that FSDataInputStream subclasses can make more efficient readers especially object stores implementation.
How was this patch tested?
Added benchmarks.
Added UT's
Added new contract tests for new API spec.
For code changes:
LICENSE,LICENSE-binary,NOTICE-binaryfiles?