[HUDI-1532] Fixed suboptimal implementation of a magic sequence search #2440

vburenin · 2021-01-13T17:53:20Z

What is the purpose of the pull request

Fixed a suboptimal implementation of the magic sequence search that may take days on files of only a few megabytes.
Instead of using a 6-byte buffer to find the magic sequence, it now uses a much larger buffer, which speeds up the process by roughly 170k times in some cases. The inefficiency is very noticeable when GCS or S3 storage is being used.

Brief change log

Rewrote scanForNextAvailableBlockOffset function to use a large buffer size.
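
The change described above can be illustrated with a minimal sketch. This is not the code merged in this PR: the class `MagicScanner` and the method `findMagic` are hypothetical names, the magic value `#HUDI#` is an assumption (the description only says the sequence is 6 bytes long), and a plain `java.io.InputStream` stands in for the Hadoop stream. The sketch reads large chunks and carries the last 5 bytes over to the next chunk, so a magic sequence that straddles a chunk boundary is still found.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class MagicScanner {
  // Assumed magic value; the PR text only states that the sequence is 6 bytes long.
  static final byte[] MAGIC = "#HUDI#".getBytes(StandardCharsets.US_ASCII);

  /** Returns the absolute offset of the first MAGIC in the stream, or -1 if absent. */
  static long findMagic(InputStream in, int bufferSize) throws IOException {
    if (bufferSize < MAGIC.length) {
      throw new IllegalArgumentException("buffer must hold at least one magic sequence");
    }
    byte[] buf = new byte[bufferSize];
    int filled = 0;      // number of valid bytes currently in buf
    long bufStart = 0;   // absolute stream offset of buf[0]
    while (true) {
      int n = in.read(buf, filled, buf.length - filled);
      if (n < 0) {
        return -1;       // end of stream; every complete window was already checked
      }
      filled += n;
      for (int i = 0; i + MAGIC.length <= filled; i++) {
        if (matchesAt(buf, i)) {
          return bufStart + i;
        }
      }
      // Keep the unchecked tail (at most MAGIC.length - 1 bytes) for the next round.
      int keep = Math.min(filled, MAGIC.length - 1);
      System.arraycopy(buf, filled - keep, buf, 0, keep);
      bufStart += filled - keep;
      filled = keep;
    }
  }

  private static boolean matchesAt(byte[] buf, int pos) {
    for (int j = 0; j < MAGIC.length; j++) {
      if (buf[pos + j] != MAGIC[j]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "xxxxxxxxxx#HUDI#yy".getBytes(StandardCharsets.US_ASCII);
    // 1 MiB buffer, the size used in the patch under review.
    System.out.println(findMagic(new ByteArrayInputStream(data), 1024 * 1024)); // prints 10
  }
}
```

With a buffer of this size the number of calls that reach the storage layer is proportional to the file size divided by the buffer size, rather than to the number of byte offsets probed.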

Verify this pull request

This pull request is already covered by existing tests

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green


vinothchandar · 2021-01-13T21:57:05Z

@n3nash can you please review this?

vinothchandar · 2021-01-13T21:57:54Z

we buffer the underlying reads, correct? why is this happening?

vburenin · 2021-01-13T22:02:35Z

@vinothchandar I suspect the buffering of the underlying reads happens in the FileSystem driver, doesn't it? If so, GCS is clearly not buffering them, which can be seen in the significant time gap (about 60 ms) between calls to the readFully method.

404120 [Executor task launch worker for task 268] INFO  org.apache.hudi.common.table.log.HoodieLogFileReader  - Current magic position: 263
404183 [Executor task launch worker for task 268] INFO  org.apache.hudi.common.table.log.HoodieLogFileReader  - Current magic position: 264
404246 [Executor task launch worker for task 268] INFO  org.apache.hudi.common.table.log.HoodieLogFileReader  - Current magic position: 265
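
As a rough back-of-the-envelope check of the "days" figure, the log lines above are about 63 ms apart, one per byte offset. The sketch below multiplies that by an assumed file size of 4 MiB; the class and method names are hypothetical and the file size is only an example of "a few megabytes".

```java
import java.time.Duration;

class ScanTimeEstimate {
  /** Total scan time if every byte offset costs one round trip of the given length. */
  static Duration estimate(long fileSizeBytes, long millisPerPosition) {
    return Duration.ofMillis(Math.multiplyExact(fileSizeBytes, millisPerPosition));
  }

  public static void main(String[] args) {
    long millisPerPosition = 404183 - 404120;   // gap between two consecutive log lines
    long assumedFileSize = 4L * 1024 * 1024;    // 4 MiB, an assumed example size
    Duration total = estimate(assumedFileSize, millisPerPosition);
    System.out.println(total.toHours() + " hours");
  }
}
```

Under these assumptions the scan takes about 73 hours, roughly three days, which is consistent with the claim in the description.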

vinothchandar · 2021-01-13T22:16:14Z

No I was referring to the BufferedFSInputStream code here

public HoodieLogFileReader(FileSystem fs, HoodieLogFile logFile, Schema readerSchema, int bufferSize,
      boolean readBlockLazily, boolean reverseReader) throws IOException {
    FSDataInputStream fsDataInputStream = fs.open(logFile.getPath(), bufferSize);
    if (fsDataInputStream.getWrappedStream() instanceof FSInputStream) {
      this.inputStream = new TimedFSDataInputStream(logFile.getPath(), new FSDataInputStream(
          new BufferedFSInputStream((FSInputStream) fsDataInputStream.getWrappedStream(), bufferSize)));
    } else {
      // fsDataInputStream.getWrappedStream() maybe a BufferedFSInputStream
      // need to wrap in another BufferedFSInputStream the make bufferSize work?
      this.inputStream = fsDataInputStream;
    }

    this.logFile = logFile;
    this.readerSchema = readerSchema;
    this.readBlockLazily = readBlockLazily;
    this.reverseReader = reverseReader;
    if (this.reverseReader) {
      this.reverseLogFilePosition = this.lastReverseLogFilePosition = fs.getFileStatus(logFile.getPath()).getLen();
    }
    addShutDownHook();
  }

vburenin · 2021-01-13T22:18:59Z

According to this stacktrace it is not the case:

	at com.google.cloud.hadoop.repackaged.gcs.com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:6981)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openStream(GoogleCloudStorageReadChannel.java:967)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openContentChannel(GoogleCloudStorageReadChannel.java:772)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.performLazySeek(GoogleCloudStorageReadChannel.java:763)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:365)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:131)
	- locked <0x0000000616319fb8> (a com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at java.io.DataInputStream.readFully(DataInputStream.java:195)
	at org.apache.hudi.common.table.log.HoodieLogFileReader.hasNextMagic(HoodieLogFileReader.java:339)
	at org.apache.hudi.common.table.log.HoodieLogFileReader.scanForNextAvailableBlockOffset(HoodieLogFileReader.java:280)

vinothchandar · 2021-01-13T22:25:14Z

class GoogleHadoopFSInputStream extends FSInputStream { - Interesting.

Could you try printing out the actual class names with something like this? We can see if we can make it wrap properly

public HoodieLogFileReader(FileSystem fs, HoodieLogFile logFile, Schema readerSchema, int bufferSize,
      boolean readBlockLazily, boolean reverseReader) throws IOException {
    FSDataInputStream fsDataInputStream = fs.open(logFile.getPath(), bufferSize);

    System.err.println(">>>" + fsDataInputStream.getClass().getCanonicalName() + "," + 
        fsDataInputStream.getWrappedStream().getClass().getCanonicalName());

    if (fsDataInputStream.getWrappedStream() instanceof FSInputStream) {
      this.inputStream = new TimedFSDataInputStream(logFile.getPath(), new FSDataInputStream(
          new BufferedFSInputStream((FSInputStream) fsDataInputStream.getWrappedStream(), bufferSize)));
    } else {
      // fsDataInputStream.getWrappedStream() maybe a BufferedFSInputStream
      // need to wrap in another BufferedFSInputStream the make bufferSize work?
      this.inputStream = fsDataInputStream;
    }

vburenin · 2021-01-13T22:31:12Z

In the process of building and trying it.

The original search method is still O(m*n), which is also worth optimizing.

vinothchandar · 2021-01-13T22:31:43Z

Ack.

codecov-io · 2021-01-13T22:46:44Z

Codecov Report

Merging #2440 (e9de8a2) into master (e926c1a) will decrease coverage by 0.02%.
The diff coverage is 43.75%.

@@             Coverage Diff              @@
##             master    #2440      +/-   ##
============================================
- Coverage     50.73%   50.70%   -0.03%     
+ Complexity     3064     3059       -5     
============================================
  Files           419      419              
  Lines         18797    18810      +13     
  Branches       1922     1924       +2     
============================================
+ Hits           9536     9537       +1     
- Misses         8485     8495      +10     
- Partials        776      778       +2

Flag	Coverage Δ	Complexity Δ
hudicli	`37.26% <ø> (ø)`	`0.00 <ø> (ø)`
hudiclient	`100.00% <ø> (ø)`	`0.00 <ø> (ø)`
hudicommon	`51.96% <43.75%> (-0.06%)`	`0.00 <1.00> (ø)`
hudiflink	`10.20% <ø> (ø)`	`0.00 <ø> (ø)`
hudihadoopmr	`33.06% <ø> (ø)`	`0.00 <ø> (ø)`
hudisparkdatasource	`65.90% <ø> (ø)`	`0.00 <ø> (ø)`
hudisync	`48.61% <ø> (ø)`	`0.00 <ø> (ø)`
huditimelineservice	`66.84% <ø> (ø)`	`0.00 <ø> (ø)`
hudiutilities	`69.43% <ø> (ø)`	`0.00 <ø> (ø)`


Impacted Files	Coverage Δ	Complexity Δ
...c/main/java/org/apache/hudi/common/fs/FSUtils.java	`51.88% <ø> (ø)`	`61.00 <0.00> (ø)`
...che/hudi/common/table/log/HoodieLogFileReader.java	`67.85% <43.75%> (-3.22%)`	`22.00 <1.00> (-1.00)`
.../apache/hudi/common/config/SerializableSchema.java	`54.54% <0.00%> (-3.35%)`	`6.00% <0.00%> (ø%)`
...ain/java/org/apache/hudi/avro/HoodieAvroUtils.java	`56.09% <0.00%> (-1.19%)`	`37.00% <0.00%> (-4.00%)`
...rg/apache/hudi/metadata/HoodieMetadataPayload.java	`0.00% <0.00%> (ø)`	`0.00% <0.00%> (ø%)`
...e/hudi/common/table/log/HoodieLogFormatWriter.java	`79.68% <0.00%> (+1.56%)`	`26.00% <0.00%> (ø%)`

vburenin · 2021-01-13T22:54:47Z

LOG.info("Class Name: " + fsDataInputStream.getWrappedStream().getClass().getName());

473840 [Executor task launch worker for task 267] INFO  org.apache.hudi.common.table.log.HoodieLogFileReader  - Class Name: org.apache.hadoop.fs.FSDataInputStream

FSDataInputStream is not an FSInputStream, so the instanceof check fails and the stream is never wrapped in a buffer.

n3nash · 2021-01-14T07:44:25Z

hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java

+    // Make buffer large enough to scan through the file as quick as possible especially if it is on S3/GCS.
+    // Using lower buffer is incurring a lot of API calls thus drastically increasing the cost of the storage
+    // and also may take days to complete scanning trough the large files.
+    byte[] dataBuf = new byte[1024 * 1024];


Instead of this, can we do the following in the constructor ?

if (fsDataInputStream.getWrappedStream() instanceof FSInputStream) {
  this.inputStream = new TimedFSDataInputStream(logFile.getPath(), new FSDataInputStream(
      new BufferedFSInputStream((FSInputStream) fsDataInputStream.getWrappedStream(), bufferSize)));
} else if (fsDataInputStream.getWrappedStream() instanceof FSDataInputStream) {
  <initialize buffered input stream>
}

After this change, we can leave the scanForNextAvailableBlockOffset() unchanged ? Also, please add comments as to why we need to add the extra check to make the inputStream Buffered
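
To illustrate why wrapping the stream in a buffer matters, here is a minimal sketch that counts how many read calls reach the underlying source when 6 bytes are requested at a time. It uses `java.io.BufferedInputStream` as a stand-in for `BufferedFSInputStream`, a hypothetical `CountingStream` in place of the GCS stream, and it does not model the seeks that the real reader performs, so the numbers are only indicative.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class BufferingDemo {
  /** Counts read calls that reach the wrapped stream (a stand-in for remote requests). */
  static final class CountingStream extends FilterInputStream {
    int calls;

    CountingStream(InputStream in) {
      super(in);
    }

    @Override
    public int read() throws IOException {
      calls++;
      return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
      calls++;
      return super.read(b, off, len);
    }
  }

  /** Reads the whole source 6 bytes at a time and reports how many calls hit the source. */
  static int underlyingCalls(int dataSize, boolean buffered, int bufferSize) throws IOException {
    CountingStream source = new CountingStream(new ByteArrayInputStream(new byte[dataSize]));
    InputStream in = buffered ? new BufferedInputStream(source, bufferSize) : source;
    DataInputStream data = new DataInputStream(in);
    byte[] probe = new byte[6];
    try {
      while (true) {
        data.readFully(probe);
      }
    } catch (EOFException endOfData) {
      // Expected: the source is exhausted.
    }
    return source.calls;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("unbuffered calls: " + underlyingCalls(6000, false, 0));
    System.out.println("buffered calls:   " + underlyingCalls(6000, true, 1024));
  }
}
```

Without the wrapper every 6-byte request goes to the source; with it, the source is only touched when the buffer needs refilling.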

I would consider keeping the new one, as it does far fewer memory copies and has much less additional overhead; technically it is more efficient from a resource utilization standpoint.

@vburenin Just to confirm, by "lot less additional overhead" you refer to the in-memory to in-memory bytes copy operation that needs to be done for every 6 bytes vs 1MB in a single go in this implementation ? (Since the number of comparison of byte arrays is the same as looks like here -> https://github.com/apache/hbase/blob/master/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java#L2303). Can we quantify this overhead ?
Additionally, what is the reasoning behind keeping it at 1MB ?

The buffered reader needs to check a few things to copy the right data, the readFully logic itself is not trivial, and there is also a position modification each time it reads 6 bytes, so even without profiling I bet the overhead is significant.
1MB seems like a good number to me: not too much, not too little. In my past experience with FS IO, blocks larger than 1MB gave diminishing returns. The best number would be the one that matches the underlying block read size, but that depends on the reader, which can be anything.

@vburenin According to the current code, we still are using BufferedReader for all cases except GCS, so that doesn't go away with this code in a generic way. Additionally, we need buffered reader code (the one I pointed above) anyways for GCS in the happy code path (without the need to find the magic header in corrupt blocks) since this method is only called when it encounters a corrupt block.

readFully goes through a number of if conditions, so branching could cause some perf degradation here; I don't see any other extra logic apart from copying bytes, which is the same irrespective of doing 6 bytes vs 1MB.

Position modification in BufferedInputStream is a variable assignment which should not cause any overhead.

Agree with you that the best number should be the one that matches the underlying block size. It would be great if you can do some microbenchmarking here. I'm OK to land this once you can add the if check for BufferedInputStream since that is needed anyways ?

All that simple logic runs for every byte offset, and it adds up no matter what. I also forgot to mention that we iterate over the data advancing the position by just 1 byte and then copying the next 6 bytes each time, so it can potentially copy 6 times more data, which is not ideal.

I will add a BufferedInputStream as soon as I get back to work on Tuesday. It is super hard to find even 5 minutes when all kids are home.

n3nash · 2021-01-14T07:48:40Z

@vburenin Left a comment to restructure the code to support buffering, are you going to look into improving the O(m*n) search ?

vburenin · 2021-01-14T16:14:59Z

> @vburenin Left a comment to restructure the code to support buffering, are you going to look into improving the O(m*n) search ?

At this point I think it is not necessary, as the search pattern is trivial and the actual complexity is closer to O(n); the slowest part of the original code is the memory copies and the additional overhead associated with them.
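
The claim that the naive search behaves close to O(n) for this pattern can be checked with a small experiment. The sketch below counts byte comparisons made by a straightforward nested-loop search; the class name is hypothetical and the magic value `#HUDI#` is an assumption. Because the first byte of the pattern rarely matches, most offsets are rejected after a single comparison.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

class NaiveSearchCost {
  // Assumed magic value, used here only to give the pattern a realistic shape.
  static final byte[] MAGIC = "#HUDI#".getBytes(StandardCharsets.US_ASCII);

  /** Counts byte comparisons made by a naive search that visits every offset in data. */
  static long countComparisons(byte[] data) {
    long comparisons = 0;
    for (int i = 0; i + MAGIC.length <= data.length; i++) {
      for (int j = 0; j < MAGIC.length; j++) {
        comparisons++;
        if (data[i + j] != MAGIC[j]) {
          break; // mismatch: move on to the next offset
        }
      }
    }
    return comparisons;
  }

  public static void main(String[] args) {
    byte[] data = new byte[1 << 20];
    new Random(42).nextBytes(data); // pseudo-random content as a stand-in for block data
    long comparisons = countComparisons(data);
    System.out.println("comparisons per byte: " + (double) comparisons / data.length);
  }
}
```

Even for input made entirely of `#` bytes, each offset costs only two comparisons (a match on `#`, then a mismatch on `H`), so the comparison count stays linear in the input size for this pattern.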

vinothchandar · 2021-01-15T00:50:40Z

@n3nash can we take a call on this and get it into the current release. marking as blocker for now.

vinothchandar · 2021-01-15T00:51:25Z

@vburenin do you mind creating a JIRA for this issue? We can give you perms if you can ping us your id from issue.apache.org/jira

n3nash · 2021-01-17T20:37:49Z

@vburenin any chance you can take a look at this soon ? I'd like to get this into 0.7.0

vburenin · 2021-01-18T02:27:35Z

> @vburenin any chance you can take a look at this soon ? I'd like to get this into 0.7.0

I try to respond as soon as I can, but it is super hard to do it over the off days.

vinothchandar · 2021-01-18T02:33:26Z

@n3nash can you please drive this yourself if possible. we can get this into 0.7.0 if we can land tonight/early tomorrow

vburenin · 2021-01-18T04:28:21Z

Found 30 minutes to add a buffered reader. The addition doesn't look elegant though.

n3nash · 2021-01-18T20:31:40Z

@vburenin Thanks for the quick turnaround. I took the liberty to make a couple of changes, if it looks good to you, we can land this. I will timeout on this in the evening.

vburenin · 2021-01-18T22:05:32Z

Changes look good to me. Not sure why I didn't move 1MB into constants block though, it is so obvious.

n3nash · 2021-01-18T23:22:49Z

@vburenin That's fine, thanks for this PR. Like we discussed, let's do another audit of this code for performance issues when you get a chance next week. I will land this once the build succeeds.
@vinothchandar I'll be away till evening, please feel free to land if you notice the build succeed if I'm not around.

apache#2440) * Fixed suboptimal implementation of a magic sequence search on GCS. * Fix comparison. * Added buffered reader around plugged storage plugin such as GCS. * 1. Corrected some comments 2. Refactored GCS input stream check Co-authored-by: volodymyr.burenin <volodymyr.burenin@cloudkitchens.com> Co-authored-by: Nishith Agarwal <nagarwal@uber.com>

…OSS master Summary: [HUDI-1509]: Reverting LinkedHashSet changes to combine fields from oldSchema and newSchema in favor of using only new schema for record rewriting (apache#2424) [MINOR] Bumping snapshot version to 0.7.0 (apache#2435) [HUDI-1533] Make SerializableSchema work for large schemas and add ability to sortBy numeric values (apache#2453) [HUDI-1529] Add block size to the FileStatus objects returned from metadata table to avoid too many file splits (apache#2451) [HUDI-1532] Fixed suboptimal implementation of a magic sequence search (apache#2440) [HUDI-1535] Fix 0.7.0 snapshot (apache#2456) [MINOR] Fixing setting defaults for index config (apache#2457) [HUDI-1540] Fixing commons codec shading in spark bundle (apache#2460) [HUDI 1308] Harden RFC-15 Implementation based on production testing (apache#2441) [MINOR] Remove redundant judgments (apache#2466) [MINOR] Fix dataSource cannot use hoodie.datasource.hive_sync.auto_create_database (apache#2444) [MINOR] Disabling problematic tests temporarily to stabilize CI (apache#2468) [MINOR] Make a separate travis CI job for hudi-utilities (apache#2469) [HUDI-1512] Fix spark 2 unit tests failure with Spark 3 (apache#2412) [HUDI-1511] InstantGenerateOperator support multiple parallelism (apache#2434) [HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client (apache#2375) [HUDI] Add bloom index for hudi-flink-client [MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFlinkStreamer and update docs (apache#2471) [MINOR] Improve code readability,remove the continue keyword (apache#2459) [HOTFIX] Revert upgrade flink verison to 1.12.0 (apache#2473) [HUDI-1453] Fix NPE using HoodieFlinkStreamer to etl data from kafka to hudi (apache#2474) [MINOR] Use skipTests flag for skip.hudi-spark2.unit.tests property (apache#2477) [HUDI-1476] Introduce unit test infra for java client (apache#2478) [MINOR] Update doap with 0.7.0 release (apache#2491) [MINOR]Fix NPE when using HoodieFlinkStreamer with multi 
parallelism (apache#2492) [HUDI-1234] Insert new records to data files without merging for "Insert" operation. (apache#2111) [MINOR] Add Jira URL and Mailing List (apache#2404) [HUDI-1522] Add a new pipeline for Flink writer (apache#2430) [HUDI-1522] Add a new pipeline for Flink writer [HUDI-623] Remove UpgradePayloadFromUberToApache (apache#2455) [HUDI-1555] Remove isEmpty to improve clustering execution performance (apache#2502) [HUDI-1266] Add unit test for validating replacecommit rollback (apache#2418) [MINOR] Quickstart.generateUpdates method add check (apache#2505) [HUDI-1519] Improve minKey/maxKey computation in HoodieHFileWriter (apache#2427) [HUDI-1550] Honor ordering field for MOR Spark datasource reader (apache#2497) [MINOR] Fix method comment typo (apache#2518) [MINOR] Rename FileSystemViewHandler to RequestHandler and corrected the class comment (apache#2458) [HUDI-1335] Introduce FlinkHoodieSimpleIndex to hudi-flink-client (apache#2271) [HUDI-1523] Call mkdir(partition) only if not exists (apache#2501) [HUDI-1538] Try to init class trying different signatures instead of checking its name (apache#2476) [HUDI-1538] Try to init class trying different signatures instead of checking its name. 
[HUDI-1547] CI intermittent failure: TestJsonStringToHoodieRecordMapF… (apache#2521) [MINOR] Fixing the default value for source ordering field for payload config (apache#2516) [HUDI-1420] HoodieTableMetaClient.getMarkerFolderPath works incorrectly on windows client with hdfs server for wrong file seperator (apache#2526) [HUDI-1571] Adding commit_show_records_info to display record sizes for commit (apache#2514) [HUDI-1589] Fix Rollback Metadata AVRO backwards incompatiblity (apache#2543) [MINOR] Fix wrong logic for checking state condition (apache#2524) [HUDI-1557] Make Flink write pipeline write task scalable (apache#2506) [HUDI-1545] Add test cases for INSERT_OVERWRITE Operation (apache#2483) [HUDI-1603] fix DefaultHoodieRecordPayload serialization failure (apache#2556) [MINOR] Fix the wrong comment for HoodieJavaWriteClientExample (apache#2559) [HUDI-1526] Translate the api partitionBy in spark datasource to hoodie.datasource.write.partitionpath.field (apache#2431) [HUDI-1612] Fix write test flakiness in StreamWriteITCase (apache#2567) [HUDI-1612] Fix write test flakiness in StreamWriteITCase [MINOR] Default to empty list for unset datadog tags property (apache#2574) [MINOR] Add clustering to feature list (apache#2568) [HUDI-1598] Write as minor batches during one checkpoint interval for the new writer (apache#2553) [HUDI-1109] Support Spark Structured Streaming read from Hudi table (apache#2485) [HUDI-1621] Gets the parallelism from context when init StreamWriteOperatorCoordinator (apache#2579) [HUDI-1381] Schedule compaction based on time elapsed (apache#2260) [HUDI-1582] Throw an exception when syncHoodieTable() fails, with RuntimeException (apache#2536) [HUDI-1539] Fix bug in HoodieCombineRealtimeRecordReader with reading empty iterators (apache#2583) [HUDI-1315] Adding builder for HoodieTableMetaClient initialization (apache#2534) [HUDI-1486] Remove inline inflight rollback in hoodie writer (apache#2359) [HUDI-1586] [Common Core] [Flink Integration] Reduce 
the coupling of hadoop. (apache#2540) [HUDI-1624] The state based index should bootstrap from existing base files (apache#2581) [HUDI-1477] Support copyOnWriteTable in java client (apache#2382) [MINOR] Ensure directory exists before listing all marker files. (apache#2594) [MINOR] hive sync checks for table after creating db if auto create is true (apache#2591) [HUDI-1620] Add azure pipelines configs (apache#2582) [HUDI-1347] Fix Hbase index to make rollback synchronous (via config) (apache#2188) [HUDI-1637] Avoid to rename for bucket update when there is only one flush action during a checkpoint (apache#2599) [HUDI-1638] Some improvements to BucketAssignFunction (apache#2600) [HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (apache#2227) [HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable (apache#2443) [HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap. (apache#2565) [Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read. (apache#2584) [HUDI-1632] Supports merge on read write mode for Flink writer (apache#2593) [HUDI-1540] Fixing commons codec dependency in bundle jars (apache#2562) [HUDI-1644] Do not delete older rollback instants as part of rollback. Archival can take care of removing old instants cleanly (apache#2610) [HUDI-1634] Re-bootstrap metadata table when un-synced instants have been archived. (apache#2595) [HUDI-1584] Modify maker file path, which should start with the target base path. (apache#2539) [MINOR] Fix default value for hoodie.deltastreamer.source.kafka.auto.reset.offsets (apache#2617) [HUDI-1553] Configuration and metrics for the TimelineService. 
(apache#2495) [HUDI-1587] Add latency and freshness support (apache#2541) [HUDI-1647] Supports snapshot read for Flink (apache#2613) [HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (apache#2611) [HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (apache#2621) [HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (apache#2596) [HUDI-1660] Excluding compaction and clustering instants from inflight rollback (apache#2631) [HUDI-1661] Exclude clustering commits from getExtraMetadataFromLatest API (apache#2632) [MINOR] Fix import in StreamerUtil.java (apache#2638) [HUDI-1618] Fixing NPE with Parquet src in multi table delta streamer (apache#2577) [HUDI-1662] Fix hive date type conversion for mor table (apache#2634) [HUDI-1673] Replace scala.Tule2 to Pair in FlinkHoodieBloomIndex (apache#2642) [MINOR] HoodieClientTestHarness close resources in AfterAll phase (apache#2646) [HUDI-1635] Improvements to Hudi Test Suite (apache#2628) [HUDI-1651] Fix archival of requested replacecommit (apache#2622) [HUDI-1663] Streaming read for Flink MOR table (apache#2640) [HUDI-1678] Row level delete for Flink sink (apache#2659) [HUDI-1664] Avro schema inference for Flink SQL table (apache#2658) [HUDI-1681] Support object storage for Flink writer (apache#2662) [HUDI-1685] keep updating current date for every batch (apache#2671) [HUDI-1496] Fixing input stream detection of GCS FileSystem (apache#2500) [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pacakges for hudi-flink module (apache#2669) [HUDI-1692] Bounded source for stream writer (apache#2674) [HUDI-1552] Improve performance of key lookups from base file in Metadata Table. (apache#2494) [HUDI-1552] Improve performance of key lookups from base file in Metadata Table. 
[HUDI-1695] Fixed the error messaging (apache#2679) [HUDI 1615] Fixing null schema in bulk_insert row writer path (apache#2653) [HUDI-845] Added locking capability to allow multiple writers (apache#2374) [HUDI-1701] Implement HoodieTableSource.explainSource for all kinds of pushing down (apache#2690) [HUDI-1704] Use PRIMARY KEY syntax to define record keys for Flink Hudi table (apache#2694) [HUDI-1688]hudi write should uncache rdd， when the write operation is finnished (apache#2673) [MINOR] Remove unused var in AbstractHoodieWriteClient (apache#2693) [HUDI-1653] Add support for composite keys in NonpartitionedKeyGenerator (apache#2627) [HUDI-1705] Flush as per data bucket for mini-batch write (apache#2695) [1568] Fixing spark3 bundles (apache#2625) [HUDI-1650] Custom avro kafka deserializer. (apache#2619) [HUDI-1667]: Fix a null value related bug for spark vectorized reader. (apache#2636) [HUDI-1709] Improving config names and adding hive metastore uri config (apache#2699) [MINOR][DOCUMENT] Update README doc for integ test (apache#2703) [HUDI-1710] Read optimized query type for Flink batch reader (apache#2702) [HUDI-1712] Rename & standardize config to match other configs (apache#2708) [hotfix] Log the error message for creating table source first (apache#2711) [HUDI-1495] Bump Flink version to 1.12.2 (apache#2718) [HUDI-1728] Fix MethodNotFound for HiveMetastore Locks (apache#2731) [HUDI-1478] Introduce HoodieBloomIndex to hudi-java-client (apache#2608) [HUDI-1729] Asynchronous Hive sync and commits cleaning for Flink writer (apache#2732) [HOTFIX] close spark session in functional test suite and disable spark3 test for spark2 (apache#2727) [HOTFIX] Disable ITs for Spark3 and scala2.12 (apache#2733) [HOTFIX] fix deploy staging jars script [MINOR] Add Missing Apache License to test files (apache#2736) [UBER] Fixed creation of HoodieMetadataClient which now uses a Builder pattern instead of a constructor. 
Reviewers: balajee, O955 Project Hoodie Project Reviewer: Add blocking reviewers!, PHID-PROJ-pxfpotkfgkanblb3detq! JIRA Issues: HUDI-593 Differential Revision: https://code.uberinternal.com/D5867129

Reviewers: balajee Reviewed By: balajee JIRA Issues: HUDI-593 Differential Revision: https://code.uberinternal.com/D5867129

volodymyr.burenin added 2 commits January 13, 2021 11:38

Fixed suboptimal implementation of a magic sequence search that may take days on the file sizes of a few megabytes.

e07dffa

Fix comparison.

a97f30c

vinothchandar assigned n3nash Jan 13, 2021

vinothchandar changed the title ~~Fixed suboptimal implementation of a magic sequence search that may take days~~ Fixed suboptimal implementation of a magic sequence search Jan 13, 2021

n3nash reviewed Jan 14, 2021

vinothchandar added the priority:blocker (Production down; release blocker) label Jan 15, 2021

vburenin changed the title ~~Fixed suboptimal implementation of a magic sequence search~~ [HUDI-1532] Fixed suboptimal implementation of a magic sequence search Jan 15, 2021

Added buffered reader around plugged storage plugin such as GCS.

179b2e6

n3nash self-requested a review January 18, 2021 23:22

n3nash approved these changes Jan 18, 2021

1. Corrected some comments 2. Refactored GCS input stream check

e9de8a2

n3nash merged commit a38612b into apache:master Jan 19, 2021
