Improve HadoopCatalog performance/scalability #2124 #2125
rdblue merged 5 commits into apache:master from
Conversation
Move to RemoteIterator for scanning directories. It's not as elegant as using Java 8 streaming, but it works with the prefetching that the s3a and (soon) abfs connectors do, and it bails out more efficiently. Also added a check for access errors that looks for AccessDeniedException as well.
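The win described above can be sketched with a self-contained toy that has no Hadoop dependency: `PagedLister` below is a hypothetical stand-in for a paged object-store listing, and its lazy iterator plays the role Hadoop's `RemoteIterator` plays in the patch. The point is that an early bail-out only pays for the pages actually consumed, while an eager array listing (like `FileSystem.listStatus()`) pays for all of them up front.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy model of a paged listing. PagedLister and its page size are
// illustrative assumptions, not Iceberg or Hadoop code.
public class LazyListingDemo {

  static class PagedLister {
    final int totalEntries;
    final int pageSize;
    int pagesFetched = 0;  // counts remote round trips

    PagedLister(int totalEntries, int pageSize) {
      this.totalEntries = totalEntries;
      this.pageSize = pageSize;
    }

    List<String> fetchPage(int pageIndex) {
      pagesFetched++;
      List<String> page = new ArrayList<>();
      int start = pageIndex * pageSize;
      for (int i = start; i < Math.min(start + pageSize, totalEntries); i++) {
        page.add("entry-" + i);
      }
      return page;
    }

    /** Eager listing: fetches every page up front, like listStatus(). */
    List<String> listAll() {
      List<String> all = new ArrayList<>();
      for (int p = 0; p * pageSize < totalEntries; p++) {
        all.addAll(fetchPage(p));
      }
      return all;
    }

    /** Lazy listing: fetches pages only as the caller consumes them. */
    Iterator<String> listIterator() {
      return new Iterator<String>() {
        int page = 0;
        Iterator<String> current = fetchPage(0).iterator();

        @Override
        public boolean hasNext() {
          while (!current.hasNext() && (page + 1) * pageSize < totalEntries) {
            page++;
            current = fetchPage(page).iterator();
          }
          return current.hasNext();
        }

        @Override
        public String next() {
          return current.next();
        }
      };
    }
  }

  /** Pages fetched when we only need to know the listing is non-empty. */
  public static int pagesForExistenceCheck() {
    PagedLister lister = new PagedLister(10, 2);
    lister.listIterator().hasNext();  // bail out after the first page
    return lister.pagesFetched;
  }

  /** Pages fetched by the eager equivalent of the same check. */
  public static int pagesForEagerListing() {
    PagedLister lister = new PagedLister(10, 2);
    lister.listAll();
    return lister.pagesFetched;
  }

  public static void main(String[] args) {
    System.out.println("lazy check: " + pagesForExistenceCheck() + " page(s)");
    System.out.println("eager list: " + pagesForEagerListing() + " page(s)");
  }
}
```

With 10 entries and a page size of 2, the lazy existence check fetches a single page while the eager listing fetches all five; with a real object store each extra page is an extra HTTPS round trip.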
eb8408f to 4195a3a
* corrected the assertEquals() for the correct error messages
* only using listIterator on the two operations which do nested file I/O underneath; this is where speedups could be observed.
```java
try {
  return fs.listStatus(metadataPath, TABLE_FILTER).length >= 1;
} catch (FileNotFoundException e) {
```
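The line under review is an existence probe: it materializes the whole filtered listing just to ask whether there is at least one entry. The same idea can be shown with the plain `java.nio.file` API (an analogy, not Iceberg's code): a `DirectoryStream` is consumed lazily, so `iterator().hasNext()` touches at most one entry instead of listing everything.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Lazy non-emptiness check using the JDK's DirectoryStream; the
// file names used in demo() are illustrative only.
public class NonEmptyCheck {

  /** True iff the directory contains at least one entry. */
  public static boolean isNonEmpty(Path dir) {
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      return stream.iterator().hasNext();  // stops after the first entry
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Checks an empty temp dir, adds one file, checks again. */
  public static String demo() {
    try {
      Path dir = Files.createTempDirectory("metadata");
      boolean before = isNonEmpty(dir);
      Files.createFile(dir.resolve("v1.metadata.json"));
      boolean after = isNonEmpty(dir);
      return before + "," + after;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());  // prints "false,true"
  }
}
```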
Accidental whitespace change?
Looks good to me overall. Any reason why it is a draft?

Looks like there are checkstyle issues to fix:

Because for some reason some of the tests are failing, and I'm just getting through your build process before I start bothering people for reviews. One of the listIterator changes isn't working, and I want to understand why.

Tests look like they're passing to me?

Updated to fix the checkstyle. Also did a quick fix for Azure exception reporting; this patch will continue to work with branches with and without the Hadoop changes.

A couple more checkstyle errors: The JDK 11 test failure is a flaky test with Hive that we need to track down.

OK, I thought those imports were in use. Will fix; just a bit distracted today.

Looks good. Thanks for fixing this!

Thanks!
Move to RemoteIterator for scanning directories.
It's not as elegant as using the java8 streaming, but it works with
the prefetching that the s3a and (soon) abfs connectors do, as well
as bailing out more efficiently.
Because each directory is probed with its own getFileStatus and list calls, the overhead of the outer list could be entirely swallowed by those inner probes, at least if there is more than one page of results in the listing and the implementation is prefetching.
Also added a check for access errors that looks for AccessDeniedException as well; that is to support other filesystems and to prepare for HADOOP-15710.
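The access-error handling described above can be sketched with the JDK's own `java.nio.file.AccessDeniedException`, which a multi-catch can handle right next to `FileNotFoundException`. Here `probe()` is a hypothetical stand-in for the underlying filesystem call; the point is that both "missing" and "not allowed to see it" make the existence check answer false rather than propagate.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.AccessDeniedException;

// Sketch of treating access errors like not-found; probe() and the
// path prefixes are illustrative assumptions, not Iceberg code.
public class AccessCheck {

  /** Simulated filesystem probe that may fail in different ways. */
  static boolean probe(String path) throws IOException {
    if (path.startsWith("/missing")) {
      throw new FileNotFoundException(path);
    }
    if (path.startsWith("/secret")) {
      throw new AccessDeniedException(path);
    }
    return true;
  }

  /** Missing and inaccessible paths both report "does not exist". */
  public static boolean exists(String path) {
    try {
      return probe(path);
    } catch (FileNotFoundException | AccessDeniedException e) {
      return false;
    } catch (IOException e) {
      throw new RuntimeException("unexpected I/O failure on " + path, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(exists("/data/table"));    // true
    System.out.println(exists("/missing/table")); // false
    System.out.println(exists("/secret/table"));  // false
  }
}
```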