[HUDI-1552] Improve performance of key lookups from base file in Metadata Table. #2494
Conversation
Force-pushed 39d0051 to 19894f6
Codecov Report
@@             Coverage Diff              @@
##             master    #2494      +/- ##
=============================================
+ Coverage     51.53%   69.43%   +17.89%
+ Complexity     3491      363     -3128
=============================================
  Files           462       53      -409
  Lines         21881     1963    -19918
  Branches       2327      235     -2092
=============================================
- Hits          11277     1363     -9914
+ Misses         9624      466     -9158
+ Partials        980      134      -846
Flags with carried forward coverage won't be shown.
Trying to understand this part. Was the workload trying to fetch all the keys out of the HFile, or just one?
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java (outdated, resolved)
Do you think one line of code warrants a separate method? Not sure if we really need this; just curious.
Removed.
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java (outdated, resolved)
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java (outdated, resolved)
The workload was a commit followed by a Clean operation with num_versions_retained=1, so it cleans all partitions. Hence the number of key lookups should equal the number of partitions, and all the keys should have been read from the HFile.
Force-pushed 19894f6 to 490baff
With enableReuse=false, the caching of readers needs special handling because:
Hence, we essentially have two codepaths:
I have updated the patch to handle both these cases by modifying the openFileSliceIfNeeded function (renamed to getReader), which returns either:
vinothchandar left a comment:
My comments are about going back to using the member variables to always hold the open readers, with getReaders() just assigning new ones if they are null. This is the style that books like Clean Code typically recommend.
Otherwise LGTM.
We are sort of creating two code paths again for reuse and non-reuse. Can we please go back to just always initializing the member variable here?
Then closeIfNeeded() can continue to work with the members alone.
I prefer the previous approach for readability, i.e. have it just close the member variables. If you disagree, maybe we can chat about why you think this is better for reading. I did face this issue when I originally read the code here.
I'm sure you have tried it, but I'm not sure why we can't achieve this without two different sets of variables for the two code paths. Can't we have just one set of reader variables: one that gets closed and reopened every time (if reuse is not enabled), or the same one getting reused (if the config is enabled)?
The reason is that we do not lock the entire getRecordByKeyFromMetadata() function. So if we had one set of reader variables, one thread could be reading a key while another thread calls close() on the readers.
The getRecordByKeyFromMetadata() function does the following:
1. Get the correct readers (open new readers if reuse=false)
2. Read the key from the baseFileReader (reads the key from the HFile)
3. Convert the bytes to a HoodieRecordPayload (the bytes read from the HFile in the step above)
4. Read the key from the logRecordScanner (in-memory lookup)
5. Merge the two payloads to get the final value
6. Close the readers (if reuse=false)
We should only lock during Step 2, as the HFile KeyScanner is not thread-safe. The rest of the steps can be done by multiple threads in parallel for maximum performance.
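[Editor's note: since the discussion turns on where exactly the lock sits, here is a minimal, self-contained sketch of the locking scheme described above. It is not the PR's actual code; all type and method names (BaseFileReader, LogRecordScanner, seekAndRead, and so on) are hypothetical stand-ins for the Hudi classes involved. Only Step 2 takes the lock.]

import java.util.Optional;

// Hypothetical stand-ins for HoodieFileReader and HoodieMetadataMergedLogRecordScanner.
interface BaseFileReader { byte[] seekAndRead(String key); void close(); }
interface LogRecordScanner { Optional<byte[]> get(String key); }

class MetadataKeyLookup {
  private final boolean reuse;
  private final BaseFileReader baseFileReader;
  private final LogRecordScanner logRecordScanner;

  MetadataKeyLookup(boolean reuse, BaseFileReader base, LogRecordScanner log) {
    this.reuse = reuse;
    this.baseFileReader = base;
    this.logRecordScanner = log;
  }

  Optional<byte[]> getRecordByKey(String key) {
    byte[] baseBytes;
    // Step 2: the HFile KeyScanner is not thread-safe, so only this read is locked.
    synchronized (baseFileReader) {
      baseBytes = baseFileReader.seekAndRead(key);
    }
    // Steps 3-5 run unlocked: payload conversion, the in-memory log lookup, and the
    // merge can all proceed in parallel across threads.
    Optional<byte[]> logBytes = logRecordScanner.get(key);
    Optional<byte[]> merged = merge(baseBytes, logBytes);
    if (!reuse) {
      baseFileReader.close();  // Step 6: per-lookup readers are closed immediately.
    }
    return merged;
  }

  private Optional<byte[]> merge(byte[] base, Optional<byte[]> log) {
    // Placeholder merge: a log record, if present, supersedes the base file record.
    return log.isPresent() ? log : Optional.ofNullable(base);
  }
}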
Thanks for the detailed explanation, Prasanth. Will sync up with you offline, but I still have some doubts/clarifications.
Regarding the statement "one thread could be reading a key, while another could be closing the readers":
Let's talk about two scenarios.
a. Reuse is not enabled: every thread is going to reinitialize the reader, close it at the end, and has a local copy of the reader. Hence this should not be an issue.
b. Reuse is enabled: close() is invoked only when the HoodieBackedMetadataTable itself is closed. So I thought no thread would invoke close() while some other thread is reading.
Maybe I am missing something here. As I said, will sync up directly.
Force-pushed 490baff to 48fe1fb
Please correct me if I am wrong: within this method, we don't care whether reuse is enabled or not. So I am wondering if we need to remove this comment?
Are these addressed? Can you please respond?
Fixed the comment.
It is confusing to have the same variable names for both local variables and instance variables. Can you please change it? I was reviewing the code assuming openReaders sets instance vars, only to realize after a long time that these are local vars.
Maybe we could name the instance vars cachedBaseFileReader and cachedLogFileReader. wdyt?
There was an earlier review comment due to which I changed it from cachedBaseFileReader.
What about using localBaseFileReader and logRecordScanner?
Force-pushed a890eed to e2e049b
If we synchronize access here, why synchronize at the lower levels?
The only function called from within this synchronized block is openReaders, which itself is not synchronized.
Can we please go back to the previous style of lazily opening/closing depending on the config and hiding it from the flow of reading and processing the values?
Happy to jump on a call if needed. But typically, the fewer reused variables and the less branching, the better. This adds a significant tax when reading the code. That's why I changed them this way.
I changed to this because the previous style was not clear enough. :|
nsivabalan left a comment:
Looks like you haven't addressed or responded to some of my comments; I can approve once they are addressed. Mostly minor ones.
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java (outdated, resolved)
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java (outdated, resolved)
Are these addressed? Can you please respond?
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java (outdated, resolved)
Do you think this might simplify things?
Pair<HoodieFileReader, HoodieMetadataMergedLogRecordScanner> readers = null;
if (!reuse) {
  readers = openReaders();
} else if (cachedBaseFileReader == null) {
  readers = openReaders();
  // update the cached/instance variables for the readers
} else {
  readers = Pair.of(cachedBaseFileReader, cachedLogFileReader);
}
I suggest we just stick to the instance variables and initialize/close them lazily as needed. This is the typical OO pattern.
@vinothchandar We cannot use the instance variables across multiple threads when reuse=false.
Consider the two cases separately:
reuse=false:
- Each thread should open and close its own copy of the readers
- We cannot use instance variables
reuse=true:
- Each thread should share the same readers
- The first thread should open the readers and initialize the instance variables
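[Editor's note: to make the two cases concrete, here is a minimal sketch of a getReader() behaving as described. The names (ReaderProvider, Readers, openReaders) are hypothetical, not the PR's code, and it reuses the stand-in reader interfaces from the earlier sketch.]

class ReaderProvider {
  private final boolean reuse;
  private volatile Readers cachedReaders;  // shared across threads only when reuse=true

  static final class Readers {
    final BaseFileReader baseFileReader;
    final LogRecordScanner logRecordScanner;
    Readers(BaseFileReader base, LogRecordScanner log) {
      this.baseFileReader = base;
      this.logRecordScanner = log;
    }
  }

  ReaderProvider(boolean reuse) {
    this.reuse = reuse;
  }

  Readers getReader() {
    if (!reuse) {
      return openReaders();  // thread-local copy; the caller opens and closes it
    }
    if (cachedReaders == null) {
      synchronized (this) {  // first thread opens and caches the shared readers
        if (cachedReaders == null) {
          cachedReaders = openReaders();
        }
      }
    }
    return cachedReaders;    // fast path: no locking once initialized
  }

  private Readers openReaders() {
    // Opening the base file reader and the merged log record scanner is elided here.
    throw new UnsupportedOperationException("sketch only");
  }
}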
@vinothchandar: Can we sync up on this sometime and get closure? I would like to have this in before our next release. Maybe we can discuss it at the end of next week's sync meeting.
@prashantwason and I already synced on this. Will be catching up on reviews.
Force-pushed e2e049b to 5607368
@vinothchandar and I discussed simplifying this PR. The following changes are to be implemented:
I am working on updating this PR.
Force-pushed 5607368 to c15080d
[HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
1. Cache the KeyScanner across lookups so that the HFile index does not have to be read for each lookup.
2. Enable block caching in KeyScanner.
3. Move the lock to a limited scope of the code to reduce lock contention.
4. Removed reuse configuration
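[Editor's note: the first two items of this commit are worth making concrete. Below is a minimal sketch of caching the scanner and enabling block caching, written against the HBase 1.x HFile API that HoodieHFileReader builds on; the class and field names are illustrative, and this is not the PR's actual code.]

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;

public class CachedHFileKeyLookup {
  private final HFile.Reader reader;
  private HFileScanner keyScanner;  // change 1: cached across lookups

  public CachedHFileKeyLookup(FileSystem fs, Path hfilePath, Configuration conf) throws IOException {
    this.reader = HFile.createReader(fs, hfilePath, new CacheConfig(conf), conf);
  }

  public synchronized byte[] lookup(String key) throws IOException {
    if (keyScanner == null) {
      // change 2: cacheBlocks=true keeps HFile data blocks in the in-memory block
      // cache, so repeated lookups do not re-read the same blocks from storage.
      keyScanner = reader.getScanner(true, false);
    }
    KeyValue kv = new KeyValue(key.getBytes(StandardCharsets.UTF_8), null, null, null);
    if (keyScanner.seekTo(kv) == 0) {  // 0 means an exact key match
      ByteBuffer value = keyScanner.getValue();
      byte[] bytes = new byte[value.remaining()];
      value.get(bytes);
      return bytes;
    }
    return null;  // key not present in this base file
  }
}

[For brevity this sketch synchronizes the whole lookup; the third change in the commit instead narrows the locked region, as discussed in the review thread above.]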
Force-pushed c15080d to 874cc6d
@vinothchandar PTAL.
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java (resolved)
vinothchandar left a comment:
Let me take a stab at this and push a commit on top, if you don't mind, @prashantwason.
@prashantwason We can remove the reuse configuration, i.e. no need to have this behavior be user controlled. But ultimately we still need to close everything out where the metadata table is opened from executors. I am going to just introduce a boolean variable within HoodieBackedTableMetadata.
Properly close the readers, when metadata table is accessed from executors
- Passing a reuse boolean into HoodieBackedTableMetadata
- Preserve the fast return behavior when reusing and opening from multiple threads (no contention)
- Handle concurrent close() and open readers, for reuse=false, by always synchronizing
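[Editor's note: a minimal continuation of the hypothetical ReaderProvider sketch from earlier, showing the last item of this commit: opening and closing take the same monitor, so a close() issued when the metadata table shuts down cannot race with a thread opening readers, and per-lookup readers (reuse=false) are closed by their owner under the same lock.]

// Methods that would sit alongside getReader() in the ReaderProvider sketch above.
synchronized void close() {
  if (cachedReaders != null) {
    cachedReaders.baseFileReader.close();  // release the cached base file reader
    cachedReaders = null;                  // later getReader() calls reopen lazily
  }
}

void closeIfNeeded(Readers readers) {
  if (!reuse) {
    synchronized (this) {
      readers.baseFileReader.close();  // per-lookup readers never outlive the lookup
    }
  }
}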
Force-pushed 5557e1d to 59b919a
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java (resolved)
Looks good @vinothchandar.
[HUDI-1552] Improve performance of key lookups from base file in Metadata Table. (apache#2494)

* [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
1. Cache the KeyScanner across lookups so that the HFile index does not have to be read for each lookup.
2. Enable block caching in KeyScanner.
3. Move the lock to a limited scope of the code to reduce lock contention.
4. Removed reuse configuration

* Properly close the readers, when metadata table is accessed from executors
- Passing a reuse boolean into HoodieBackedTableMetadata
- Preserve the fast return behavior when reusing and opening from multiple threads (no contention)
- Handle concurrent close() and open readers, for reuse=false, by always synchronizing

Co-authored-by: Vinoth Chandar <[email protected]>
What is the purpose of the pull request
Improves the performance of key lookups from the Metadata Table.
In my scale testing with 150 partitions and 100K+ files on HDFS, the time to read a key was reduced (100ms avg -> 10ms) and the total data read from the HFile was reduced (85MB -> 3MB). The size of the base file was 3MB, so this shows that the in-memory HFile block caching was also working.
Brief change log
Verify this pull request
This pull request is already covered by existing tests, such as:
mvn test -pl hudi-client/hudi-spark-client -Dtest=TestHoodieBackedMetadata
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.