Skip to content

Improve HadoopCatalog performance/scalability #2124#2125

Merged
rdblue merged 5 commits intoapache:masterfrom
steveloughran:outgoing/2124-HadoopCatalog
Jan 28, 2021
Merged

Improve HadoopCatalog performance/scalability #2124#2125
rdblue merged 5 commits intoapache:masterfrom
steveloughran:outgoing/2124-HadoopCatalog

Conversation

@steveloughran
Copy link
Copy Markdown
Contributor

Move to RemoteIterator for scanning directories.
It's not as elegant as using the java8 streaming, but it works with
the prefetching that the s3a and (soon) abfs connectors do, as well
as bailing out more efficiently.

Because each directory is probed with its own getFileStatus and list calls, the overhead of the outer list could be entirely swallowed by those inner probes -at least if there is >1 page of results in the listing and the implementation is prefetching.

Also added check for access errors to also look for AccessDeniedException; that's to support other filesystems and to prepare for HADOOP-15710

@github-actions github-actions bot added the core label Jan 20, 2021
@steveloughran steveloughran marked this pull request as draft January 20, 2021 14:37
Move to RemoteIterator for scanning directories.
It's not as elegant as using the java8 streaming, but it works with
the prefetching that the s3a and (soon) abfs connectors do, as well
as bailing out more efficiently.

Also added check for access errors to also look for AccessDeniedException
@steveloughran steveloughran force-pushed the outgoing/2124-HadoopCatalog branch from eb8408f to 4195a3a Compare January 20, 2021 14:50
* corrected the assertEquals() for the correct error messages
* only using listIterator on the two operation which is doing nested
  file I/O underneath; this is where speedups could be observed.
try {
return fs.listStatus(metadataPath, TABLE_FILTER).length >= 1;
} catch (FileNotFoundException e) {
} catch (FileNotFoundException e) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Accidental whitespace change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jan 22, 2021

Looks good to me overall. Any reason why it is a draft?

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jan 22, 2021

Looks like there are checkstyle issues to fix:

Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:32:8: Unused import - java.util.stream.Collectors. [UnusedImports]
Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:33:8: Unused import - java.util.stream.Stream. [UnusedImports]
Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:177: Line is longer than 120 characters (found 124). [LineLength]
Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:177:15: '||' should be on the previous line. [OperatorWrap]
Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:229:20: Local variable name 's' must match pattern '^[a-z][a-zA-Z0-9]+$'. [LocalVariableName]

@steveloughran
Copy link
Copy Markdown
Contributor Author

Looks good to me overall. Any reason why it is a draft?

cos for some reason some of the tests are failing, and I'm just getting through your build process before I start bothering people for reviews.

one of the listIterator changes isn't working, and I want to understand why

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jan 22, 2021

Tests look like they're passing to me?

@steveloughran steveloughran marked this pull request as ready for review January 25, 2021 13:44
@steveloughran
Copy link
Copy Markdown
Contributor Author

updated to fix the checkstyle. Also did a quick fix for azure exception reporting...this patch will continue to work with branches with and without the hadoop changes
apache/hadoop#2648

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jan 25, 2021

A couple more checkstyle errors:

Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:32:8: Unused import - java.util.stream.Collectors. [UnusedImports]
Error: eckstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:33:8: Unused import - java.util.stream.Stream. [UnusedImports]

The JDK 11 test failure is a flaky test with Hive that we need to track down.

@steveloughran
Copy link
Copy Markdown
Contributor Author

ok. I thought those imports were in use. Will fix; just a bit distracted today.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jan 28, 2021

Looks good. Thanks for fixing this!

@rdblue rdblue merged commit c84d441 into apache:master Jan 28, 2021
@steveloughran
Copy link
Copy Markdown
Contributor Author

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants