[HUDI-3834] Fixing performance hits in reading Column Stats Index by alexeykudinkin · Pull Request #5266 · apache/hudi

alexeykudinkin · 2022-04-08T19:24:02Z

Tips

Thank you very much for contributing to Apache Hudi.
Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.

What is the purpose of the pull request

Fixing performance hits in reading Column Stats Index:

[HUDI-3834] Builder classes generated by Avro 1.10 degrade performance substantially by default: they rely on SpecificData.getForSchema, which loads the corresponding model's class using reflection. This is costly on the hot-path: reading the full Column Stats Index of 800k records took 60s, whereas it now takes 3s.
Addressing memory churn from overuse of Hadoop's Path: the Path ctor is not lightweight and produces a lot of short-lived allocations, which adds pressure on GC. This PR cleans up such avoidable allocations.
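
The first item can be illustrated with a small, library-free sketch. It does not use Avro: `RecordBuilder` and `BuilderStubs` are hypothetical names, and the `Class.forName` call only stands in for the reflective model lookup described above. The idea, which matches the "builder stubs" wording of the commits in this PR, is to pay for the lookup once in a lazily created stub and then copy that stub on the hot-path.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a generated record builder; this is NOT Avro's API.
final class RecordBuilder {
    // Counts how often the simulated expensive lookup runs.
    static final AtomicInteger LOOKUPS = new AtomicInteger();

    private final String modelName;
    private String fileName;

    // Default path: resolves the model reflectively on every construction.
    RecordBuilder() {
        LOOKUPS.incrementAndGet();
        this.modelName = resolveModel();
    }

    // Copy constructor: reuses the already-resolved model, no lookup.
    RecordBuilder(RecordBuilder other) {
        this.modelName = other.modelName;
        this.fileName = other.fileName;
    }

    // Assumption: a reflective class lookup stands in for the real cost.
    private static String resolveModel() {
        try {
            return Class.forName("java.lang.String").getName();
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    RecordBuilder setFileName(String fileName) {
        this.fileName = fileName;
        return this;
    }

    String build() {
        return modelName + ":" + fileName;
    }
}

final class BuilderStubs {
    // Initialization-on-demand holder: the stub is created on first use only.
    private static final class Holder {
        static final RecordBuilder STUB = new RecordBuilder();
    }

    // Hot-path entry point: copies the stub instead of repeating the lookup.
    static RecordBuilder newBuilder() {
        return new RecordBuilder(Holder.STUB);
    }
}
```

With this shape, calling `BuilderStubs.newBuilder()` in a loop triggers the simulated lookup once, no matter how many records are built.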

Brief change log

See above

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

nsivabalan

Really good job on the finding and the fix.

nsivabalan · 2022-04-08T22:10:51Z

hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java

   * @return Base path
+   * @deprecated please use {@link #getBasePathV2()}
   */
+  @Deprecated


I see we are using getBasePath() in our baseRelation classes. Do we need to change them to getBasePathV2()?

We should gradually roll over all uses of getBasePath to getBasePathV2 and then rename it.
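
A rough sketch of what such a migration could look like, using `java.net.URI` as a stand-in for Hadoop's `Path`. The class name `MetaClientSketch` and the return types are assumptions for illustration only, not the actual `HoodieTableMetaClient` signatures.

```java
import java.net.URI;

// Hypothetical stand-in for a meta client; names and types are assumptions.
final class MetaClientSketch {
    // Parsed once at construction and then shared with all callers.
    private final URI basePath;

    MetaClientSketch(String basePath) {
        this.basePath = URI.create(basePath);
    }

    /**
     * @return Base path as a string
     * @deprecated please use {@link #getBasePathV2()}
     */
    @Deprecated
    String getBasePath() {
        return basePath.toString();
    }

    /** Returns the already-parsed path object, so callers need not re-parse a string. */
    URI getBasePathV2() {
        return basePath;
    }
}
```

Callers that switch to the V2 accessor get the same object every time, which is the kind of allocation saving the PR description refers to.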

nsivabalan · 2022-04-08T22:12:58Z

hudi-common/src/main/java/org/apache/hudi/hadoop/LazyCachingPath.java

  //       reads/writes to references are always atomic (including 64-bit JVMs)
  //       https://docs.oracle.com/javase/specs/jls/se8/html/jls-17.html#jls-17.7
  private volatile String fileName;
+  private volatile String s;


Minor: rename s -> path.
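
For readers unfamiliar with the pattern in the hunk above, here is a minimal sketch of lazily caching a derived string in a `volatile` field. `CachedNamePath` is a hypothetical class built on `java.net.URI`; it is not the PR's `LazyCachingPath`, whose full source is not shown here.

```java
import java.net.URI;

// Hypothetical illustration of lazy caching; not the PR's actual class.
final class CachedNamePath {
    private final URI uri;
    // Reference writes are atomic; volatile makes the cached value visible
    // to other threads once it has been written.
    private volatile String fileName;

    CachedNamePath(String path) {
        this.uri = URI.create(path);
    }

    String getName() {
        String cached = fileName; // single volatile read
        if (cached == null) {
            String p = uri.getPath();
            cached = p.substring(p.lastIndexOf('/') + 1);
            // Racy but harmless: every thread computes an equal value.
            fileName = cached;
        }
        return cached;
    }
}
```

The race on first access is tolerated rather than locked away: two threads may both compute the name, but they produce equal strings, so no synchronization is needed on the hot-path.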

nsivabalan

LGTM

nsivabalan · 2022-04-09T12:45:03Z

@alexeykudinkin: there are some CI failures. Can you please check them out?

alexeykudinkin · 2022-04-09T19:55:23Z

@hudi-bot run azure

hudi-bot · 2022-04-09T21:19:20Z

CI report:

79639f7 UNKNOWN
a87c363 Azure: FAILURE Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

nsivabalan reviewed Apr 8, 2022

nsivabalan added the priority:blocker Production down; release blocker label Apr 9, 2022

nsivabalan self-assigned this Apr 9, 2022

nsivabalan approved these changes Apr 9, 2022

Alexey Kudinkin added 15 commits April 9, 2022 10:05

Stubbed out all Avro newBuilder invocations on the hot-path to avoid punitive `SpecificData.getForSchema` calls

e5e0506

Cleaned up BaseFile to avoid new Path calls in the hot-path (to avoid memory churn)

eee61c6

FileNameCachingPath > LazyCachingPath

d840703

Rebased HoodieTableMetaClient onto LazyCachingPath to hold pointers to base-/meta-paths to avoid construction of Hadoop's `Path` in the hot-path; Avoid churning `Path` objects

1b5b94e

Fixing tests

5a4956b

Added comments

e6e5a28

Fixing serializability of HoodieTableMetaClient

f5c5f7b

Fixing compilation

c0aa913

LazyCachingPath > CachingPath;

9e06a74

Added SerializablePath; Rebased `HoodieTableMetaClient` onto `SerializablePath`

f9ec572

Missing license

447e252

Removing transient annotations

a6b6d88

Implement missing equals

83979a6

lint

d68663c

Lazy init builder stubs to defer initialization until actually used (to workaround CI issues of building Flink w/ incorrect version of Avro, breaking code-gen)

a87c363

alexeykudinkin force-pushed the ak/col-stat-perf-fix branch from 723c55a to a87c363 on April 9, 2022 17:38

nsivabalan merged commit 7a9d48d into apache:master Apr 10, 2022

nsivabalan merged 15 commits into apache:master from onehouseinc:ak/col-stat-perf-fix
