Improve HadoopCatalog performance/scalability #2124 by steveloughran · Pull Request #2125 · apache/iceberg

steveloughran · 2021-01-20T14:29:54Z

Move to RemoteIterator for scanning directories.
It's not as elegant as using the java8 streaming, but it works with
the prefetching that the s3a and (soon) abfs connectors do, as well
as bailing out more efficiently.
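
The bail-out benefit can be sketched without a Hadoop dependency. This is a minimal illustration under stated assumptions, not the code in this PR: `RemoteIterator` below is a local stand-in with the same two-method shape as `org.apache.hadoop.fs.RemoteIterator`, and `PagedListing`, `anyMatch`, the file names, and the page size are hypothetical names invented for the example.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

class ListingSketch {
  /** Returns true at the first match and stops consuming the iterator there. */
  static <T> boolean anyMatch(RemoteIterator<T> entries, Predicate<T> wanted) throws IOException {
    while (entries.hasNext()) {
      if (wanted.test(entries.next())) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) throws IOException {
    List<String> names = Arrays.asList("a", "b.metadata.json", "c", "d", "e", "f");
    PagedListing listing = new PagedListing(names, 2);
    boolean found = anyMatch(listing, name -> name.endsWith(".metadata.json"));
    System.out.println("found=" + found + ", pages fetched=" + listing.pagesFetched);
  }
}

/** Local stand-in mirroring the shape of org.apache.hadoop.fs.RemoteIterator. */
interface RemoteIterator<T> {
  boolean hasNext() throws IOException;

  T next() throws IOException;
}

/** Hypothetical paged listing: each page models one remote LIST request. */
class PagedListing implements RemoteIterator<String> {
  private final List<String> entries;
  private final int pageSize;
  private int position = 0;
  private int fetchedUpTo = 0; // entries [0, fetchedUpTo) have been "fetched"
  int pagesFetched = 0;

  PagedListing(List<String> entries, int pageSize) {
    this.entries = entries;
    this.pageSize = pageSize;
  }

  @Override
  public boolean hasNext() {
    return position < entries.size();
  }

  @Override
  public String next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    if (position >= fetchedUpTo) {
      // Simulate fetching the next page only when the caller actually needs it.
      fetchedUpTo = Math.min(entries.size(), fetchedUpTo + pageSize);
      pagesFetched++;
    }
    return entries.get(position++);
  }
}
```

An array-returning call would have to materialize all three pages before the caller could inspect the first entry; the iterator form lets the caller stop after the first page in this example.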

Because each directory is probed with its own getFileStatus and list calls, the overhead of the outer list could be entirely swallowed by those inner probes, at least if there is more than one page of results in the listing and the implementation is prefetching.

The check for access errors now also looks for AccessDeniedException; that is to support other filesystems and to prepare for HADOOP-15710.
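
One way such a check could look is sketched below. It is an assumption-laden illustration rather than the logic in this PR: `AccessErrorSketch` and `isAccessDenied` are hypothetical names, and walking the cause chain is a choice made for the example, not something the description above states. Only `java.nio.file.AccessDeniedException` from the JDK is used.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.AccessDeniedException;

class AccessErrorSketch {
  /**
   * Returns true if the error, or any exception in its cause chain,
   * is a java.nio.file.AccessDeniedException.
   */
  static boolean isAccessDenied(Throwable error) {
    for (Throwable t = error; t != null; t = t.getCause()) {
      if (t instanceof AccessDeniedException) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    IOException direct = new AccessDeniedException("/warehouse/db");
    IOException wrapped = new IOException("list failed", direct);
    IOException missing = new FileNotFoundException("/warehouse/none");

    System.out.println(isAccessDenied(direct));
    System.out.println(isAccessDenied(wrapped));
    System.out.println(isAccessDenied(missing));
  }
}
```

A caller could use a helper like this to treat an unreadable directory as "not a table" instead of failing the whole listing, while still rethrowing other I/O errors.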


rdblue · 2021-01-22T01:46:50Z

core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java

    try {
      return fs.listStatus(metadataPath, TABLE_FILTER).length >= 1;
-    } catch (FileNotFoundException e) {
+    } catch (FileNotFoundException  e) {


Accidental whitespace change?

rdblue · 2021-01-22T01:48:06Z

Looks good to me overall. Any reason why it is a draft?

rdblue · 2021-01-22T01:49:27Z

Looks like there are checkstyle issues to fix:

[checkstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:32:8: Unused import - java.util.stream.Collectors. [UnusedImports]
[checkstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:33:8: Unused import - java.util.stream.Stream. [UnusedImports]
[checkstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:177: Line is longer than 120 characters (found 124). [LineLength]
[checkstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:177:15: '||' should be on the previous line. [OperatorWrap]
[checkstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:229:20: Local variable name 's' must match pattern '^[a-z][a-zA-Z0-9]+$'. [LocalVariableName]

steveloughran · 2021-01-22T11:31:58Z

> Looks good to me overall. Any reason why it is a draft?

Because for some reason some of the tests are failing, and I'm just getting through your build process before I start bothering people for reviews.

One of the listIterator changes isn't working, and I want to understand why.

rdblue · 2021-01-22T18:14:17Z

Tests look like they're passing to me?

steveloughran · 2021-01-25T14:36:02Z

Updated to fix the checkstyle errors. Also did a quick fix for Azure exception reporting. This patch will continue to work with branches with and without the Hadoop changes:
apache/hadoop#2648

rdblue · 2021-01-25T18:26:13Z

A couple more checkstyle errors:

[checkstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:32:8: Unused import - java.util.stream.Collectors. [UnusedImports]
[checkstyle] [ERROR] /home/runner/work/iceberg/iceberg/core/src/main/java/org/apache/iceberg/hadoop/HadoopCatalog.java:33:8: Unused import - java.util.stream.Stream. [UnusedImports]

The JDK 11 test failure is a flaky test with Hive that we need to track down.

steveloughran · 2021-01-26T21:50:15Z

ok. I thought those imports were in use. Will fix; just a bit distracted today.

rdblue · 2021-01-28T17:50:50Z

Looks good. Thanks for fixing this!

steveloughran · 2021-01-29T11:04:31Z

thanks!

github-actions bot added the core label Jan 20, 2021

steveloughran marked this pull request as draft January 20, 2021 14:37

steveloughran force-pushed the outgoing/2124-HadoopCatalog branch from eb8408f to 4195a3a Compare January 20, 2021 14:50

Fix build/test

576e89b

* corrected the assertEquals() calls to expect the correct error messages
* only using listIterator on the two operations which do nested file I/O underneath; this is where speedups could be observed.

rdblue reviewed Jan 22, 2021


rdblue approved these changes Jan 22, 2021


steveloughran marked this pull request as ready for review January 25, 2021 13:44

checkstyle

46a56fa

steveloughran added 2 commits January 27, 2021 12:08

removed unused imports

0579990

review comments: fix whitespace error

6698e42

rdblue merged commit c84d441 into apache:master Jan 28, 2021

rdblue merged 5 commits into apache:master from steveloughran:outgoing/2124-HadoopCatalog
