
Conversation

@danny0405 danny0405 commented Jan 11, 2021

What is the purpose of the pull request

This is step 1 of RFC-24:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+24%3A+Hoodie+Flink+Writer+Proposal

The goal is to add a new pipeline that avoids the single-parallelism operators.

Brief change log

  • The new pipeline removes the InstantGenerateOperator and CommitSink, and adds
    StreamWriteOperatorCoordinator to achieve the same functionality (see the
    sketch after this list)
  • Add HoodieOptions for configuration of all the Flink operators; this
    is also useful for the SQL connectors that will be introduced in
    follow-up contributions
  • Add UT and IT cases for the new operators
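
For orientation, here is a minimal sketch of how the coordinator-based operator is wired into a DataStream pipeline. StreamWriteOperatorFactory comes from this PR's file list, but the generics, the source stream (hoodieRecords) and the factory construction are assumptions, not the PR's exact code:

```java
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.DataStream;

// Sketch only: hoodieRecords (a DataStream<HoodieRecord>) and operatorFactory
// (a StreamWriteOperatorFactory) are assumed to be built elsewhere.
// transform() installs StreamWriteOperator on every task and registers the
// StreamWriteOperatorCoordinator on the JobManager, which starts instants and
// commits on checkpoint completion -- the roles previously played by the
// parallelism-1 InstantGenerateOperator and CommitSink.
DataStream<Object> pipeline = hoodieRecords
    .transform("hoodie_stream_write", TypeInformation.of(Object.class), operatorFactory)
    .setParallelism(4); // the write operator is no longer pinned to parallelism 1
```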

Verify this pull request

  • StreamWriteFunctionTest tests the StreamWriteFunction and the related components
  • StreamWriteITCase is an IT that tests all the operators

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

Contributor

Little mistake: --partition-path-field

Contributor Author

Oops, thanks for the reminder.


codecov-io commented Jan 11, 2021

Codecov Report

Merging #2430 (2c4fa32) into master (c4afd17) will increase coverage by 0.09%.
The diff coverage is 46.03%.


@@             Coverage Diff              @@
##             master    #2430      +/-   ##
============================================
+ Coverage     50.18%   50.27%   +0.09%     
- Complexity     3050     3119      +69     
============================================
  Files           419      430      +11     
  Lines         18931    19561     +630     
  Branches       1948     2003      +55     
============================================
+ Hits           9500     9835     +335     
- Misses         8656     8923     +267     
- Partials        775      803      +28     
| Flag | Coverage Δ | Complexity Δ |
|---|---|---|
| hudicli | 37.21% <ø> (ø) | 0.00 <ø> (ø) |
| hudiclient | 100.00% <ø> (ø) | 0.00 <ø> (ø) |
| hudicommon | 51.51% <44.44%> (+0.02%) | 0.00 <6.00> (ø) |
| hudiflink | 32.96% <46.05%> (+32.96%) | 0.00 <64.00> (ø) |
| hudihadoopmr | 33.16% <ø> (ø) | 0.00 <ø> (ø) |
| hudisparkdatasource | 65.85% <ø> (ø) | 0.00 <ø> (ø) |
| hudisync | 48.61% <ø> (ø) | 0.00 <ø> (ø) |
| huditimelineservice | 66.49% <ø> (ø) | 0.00 <ø> (ø) |
| hudiutilities | 69.43% <ø> (ø) | 0.00 <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| .../apache/hudi/operator/InstantGenerateOperator.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| ...pache/hudi/operator/KeyedWriteProcessFunction.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| .../org/apache/hudi/operator/StreamWriteOperator.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |
| ...ache/hudi/operator/StreamWriteOperatorFactory.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |
| ...in/java/org/apache/hudi/schema/SchemaProvider.java | 50.00% <ø> (+50.00%) | 1.00 <0.00> (+1.00) |
| ...src/main/java/org/apache/hudi/sink/CommitSink.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (ø) |
| ...di/source/JsonStringToHoodieRecordMapFunction.java | 0.00% <ø> (ø) | 0.00 <0.00> (ø) |
| .../org/apache/hudi/streamer/FlinkStreamerConfig.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |
| .../org/apache/hudi/streamer/HoodieFlinkStreamer.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |
| ...rg/apache/hudi/streamer/HoodieFlinkStreamerV2.java | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |
| ... and 25 more | | |

@yanghua yanghua self-assigned this Jan 11, 2021
@yanghua yanghua self-requested a review January 11, 2021 13:27
@danny0405 danny0405 changed the title from "[HUDI-1522] Remove the single parallelism operator from the Flink writer" to "[HUDI-1522] Add a new pipeline for Flink writer" Jan 13, 2021
@yanghua yanghua left a comment

@danny0405 thanks for your contribution. I just had a quick look and left some comments.

Contributor

And can we extract the logic that is shared between v1 and v2?

Contributor

Can we keep the class-level comments consistent with the existing classes?

Contributor

There is an OperationConverter in org.apache.hudi.utilities.deltastreamer; let us use a single one.

Contributor Author

The OperationConverter in org.apache.hudi.utilities.deltastreamer is in a different module; I would suggest not adding a dependency on it. The hudi-flink module should not depend on hudi-utilities.

Contributor

What I mean is: can we find a way to move this class into a common place?

Contributor Author

We can if we find more usages of this class. In the current code base only two places use it, and the class itself is pretty simple, so it does not deserve to be moved to the common utils.

Contributor

ditto

Contributor

Do we really need to search for a possible config file whose contents we may not know? Or would just creating a new object be enough?

Contributor Author

We need to find the correct Hadoop configuration to create the correct Hadoop FileSystem; this is also useful for HoodieFlinkEngineContext.

Contributor

What I mean is: can we specify a discovery order, e.g. first look for the specified path, and then a batch of default paths?

In scenarios where storage is separated from compute, or in data synchronization scenarios, users may not necessarily want to load the local "default" configuration.

@danny0405 danny0405 Jan 14, 2021

The method first looks for the specified path fs.hdfs.hadoopconf, then the directories HADOOP_CONF_DIR, HADOOP_HOME/conf, and HADOOP_HOME/etc/hadoop from the system environment.

Even if storage is separated from compute, the FileSystem we create is still correct, as long as we set up the hadoop conf files correctly.

In any case, we should not pass an empty hadoop configuration.

Contributor

> The method first looks for the specified path fs.hdfs.hadoopconf, then the directories HADOOP_CONF_DIR, HADOOP_HOME/conf, and HADOOP_HOME/etc/hadoop from the system environment.

I have watched the source code before raising this concern.

> Even if storage is separated from compute, the FileSystem we create is still correct, as long as we set up the hadoop conf files correctly.

> In any case, we should not pass an empty hadoop configuration.

I mean, do we treat the user's explicit parameter assignment as the highest priority, above the default conventions that some users may not know about?

Contributor Author

I think it is reasonable that the explicitly specified path has the highest priority.
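
To summarize the discovery order settled on in this thread, a minimal sketch (the method name and null-return contract are illustrative, not the PR's exact code):

```java
import java.io.File;

// Illustrative lookup order: explicit path first, then well-known environment
// locations; a null return lets the caller fall back to a default Configuration.
static String resolveHadoopConfDir(String explicitDir /* from fs.hdfs.hadoopconf */) {
  if (explicitDir != null && new File(explicitDir).exists()) {
    return explicitDir; // the explicitly specified path has the highest priority
  }
  String confDir = System.getenv("HADOOP_CONF_DIR");
  if (confDir != null && new File(confDir).exists()) {
    return confDir;
  }
  String hadoopHome = System.getenv("HADOOP_HOME");
  if (hadoopHome != null) {
    for (String sub : new String[] {"conf", "etc/hadoop"}) {
      File candidate = new File(hadoopHome, sub);
      if (candidate.exists()) {
        return candidate.getAbsolutePath();
      }
    }
  }
  return null;
}
```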

Contributor

Can we use Files.exists() to test?

Contributor Author

Do you mean java.nio.file.Files? I guess new File(hadoopConfDir).exists() is simpler.

@yanghua yanghua Jan 13, 2021

Why not change the literal to the variable that you defined?

Contributor Author

We should fix the logic for Avro versions higher than 1.10, because the newer versions throw directly if the field name does not exist in the schema.
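
For context, this is the Avro behavior change being referenced: since Avro 1.9/1.10, GenericRecord#get(String) throws AvroRuntimeException for a field name that is not in the schema, where older versions returned null. A defensive lookup along these lines (the helper name is hypothetical) avoids the exception:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

// Hypothetical helper: look the field up in the schema first, because newer
// Avro versions throw from get(String) when the field does not exist,
// whereas older versions returned null.
static Object getFieldValOrNull(GenericRecord record, String fieldName) {
  Schema.Field field = record.getSchema().getField(fieldName);
  return field == null ? null : record.get(field.pos());
}
```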

Contributor

I mean you have defined isDeleteKey; why not change the literal in this line?

Contributor Author

Okay, got your idea, thanks ~

Contributor

Can you make the changes in this file cleaner?

Contributor

Does HoodieFlinkConfigOptions sound better?

Contributor Author

Personally I prefer cleaner and shorter names: "options" is the phrasing of Flink connectors, and the class lives under the package org.apache.hudi.operator; we can make the name more specific if we introduce another config options class.

Contributor

Fine, it's not a key issue; a different opinion is OK.

Member

+1 for adding Flink into the class name. We might add more engine support in the future. Using HoodieOptions would make more sense to me if we added this connector under Flink's codebase.

Contributor Author

Fine to add Flink to the class name.

Contributor

Why can we not merge the next two lines into this line?

Contributor Author

Personally I prefer shorter lines of code. I will refactor it if Hudi thinks long lines are OK.

@danny0405 danny0405 force-pushed the step1 branch 2 times, most recently from 7c96c0c to 3e3ef16 Compare January 16, 2021 01:45
@garyli1019 garyli1019 left a comment

@danny0405 Thanks for your contribution, looks promising! Left some comments.


Member

Can we reuse the config under HoodieWriteConfiguration? It would be great if we could use the same set of configs for both Spark and Flink.

Contributor Author

Agree, thanks.

Member

This data is also available in the WriteStatus. We have TotalWriteBytes and TotalRecords, etc. How are we going to use this later on?

Contributor Author

It is to test the data buffer we received; the statistics in WriteStatus are all about written data, which is different.

Member

Can we add some docs here? I am still a little bit confused about how this will be used.

Contributor

Where do we use this class?

Contributor Author

Removed.

Member

nit: toHoodieRecord?

Contributor Author

Agree, thanks ~

Member

How about using a hoodie internal exception here?

Contributor Author

No, this is needed by the Flink coordinator.

Member

For later use?

Contributor Author

Yes, removed.

Member

This is available in the WriteOperationType. Can we reuse it from there?

Contributor Author

The class is used by the HoodieFlinkStreamer.

Member

nit: this links to Spark code

Member

Where is this method being called? How do we generate the checkpointId?

Contributor Author

The checkpointId is specified by the test cases; the test case needs to invoke the checkpoint function by itself.
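
For illustration, this is roughly how an operator test drives checkpoints by hand with Flink's standard OneInputStreamOperatorTestHarness (harness setup elided; the ids are arbitrary values chosen by the test):

```java
// The test picks the checkpointId itself; no real checkpoint barrier is involved.
harness.processElement(record, 1L);      // feed one element at timestamp 1
harness.snapshot(1L, 100L);              // checkpointId = 1, timestamp = 100
harness.notifyOfCompletedCheckpoint(1L); // simulate the checkpoint-complete callback
```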

Contributor

IMO, this name is too generic. It would be better to make it more readable. What about renaming to FlinkStreamerConfig?

Contributor Author

Fine with the suggested name.

Contributor

There are many minor issues with the description items of these config options, e.g., describe them with normal sentences (take care of the grammar), capitalize the first letter, and so on. In short, keep them consistent.

Contributor Author

Okay

@danny0405

Thanks @garyli1019 @yanghua for the review, I have updated based on your comments.

@danny0405 danny0405 force-pushed the step1 branch 2 times, most recently from b46310d to 9324312 Compare January 19, 2021 08:19
Member

What does "the old version Avro" mean here? This seems to be handled by the check on the return value.

Contributor Author

Hmm, I meant the new Avro version; let me fix the comment.

Member

Maybe add another config to specify the index type?

Contributor Author

We can if we support more index types; currently the new pipeline only supports one. (More are planned, although not in this PR ~ I don't want to add it and then remove it ~)

Member

Can we add some docs here? I am still a little bit confused about how this will be used.

Member

Is this only visible for testing? IIUC, the actual writer needs to know the current instant and add the _hoodie_commit_time field to the record. How does the writer know the current instant?

Contributor Author

> How does the writer know the current instant?

It fetches it through the HoodieFlinkWriteClient, actually the HoodieTableMetaClient.
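
As a hedged sketch of that lookup (not the PR's exact code), the write task can resolve the instant from the timeline instead of receiving it from a parallelism-1 operator:

```java
import org.apache.hudi.common.table.HoodieTableMetaClient;
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.util.Option;

// hadoopConf and basePath are assumed to be available from the write config.
HoodieTableMetaClient metaClient = new HoodieTableMetaClient(hadoopConf, basePath);
Option<HoodieInstant> pending = metaClient.getActiveTimeline()
    .filterInflightsAndRequested()
    .lastInstant();
String instantTime = pending.isPresent() ? pending.get().getTimestamp() : null;
```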

Contributor

Do we need a retry or ack mechanism here? What's the consequence if the coordinator does not receive the event?

Contributor Author

It fails over when the next checkpoint starts. There is a check in StreamWriteOperatorCoordinator.checkpointCoordinator.

@yanghua yanghua left a comment

@danny0405 Left some comments.

Contributor

Unnecessary comment for the constructor? It's easy to understand.

Contributor Author

Removed.

Contributor

block seems hard to understand. Any better word?

Contributor Author

Copied from the old code; how about instant-retry-times?

Contributor

sounds good

This is step 1 of RFC-24:
https://cwiki.apache.org/confluence/display/HUDI/RFC+-+24%3A+Hoodie+Flink+Writer+Proposal

* The new pipeline removes the InstantGenerateOperator and CommitSink,
  and adds StreamWriteOperatorCoordinator to achieve the same functionality
* Add FlinkOptions for configuration of all the Flink operators; this
  is also useful for the SQL connectors that will be introduced in
  follow-up contributions
* Add UT and IT cases for the new operators
@yanghua yanghua left a comment

@danny0405 Left some comments.

.key("write.table.type")
.stringType()
.defaultValue("COPY_ON_WRITE")
.withDescription("Type of table to write. COPY_ON_WRITE (or) MERGE_ON_READ");
Contributor

Remove the "." after "write".

Contributor Author

Replaced it with "," instead.

public static final ConfigOption<String> TABLE_TYPE = ConfigOptions
.key("write.table.type")
.stringType()
.defaultValue("COPY_ON_WRITE")
Contributor

Let us reduce the string literal via HoodieTableType.COPY_ON_WRITE.name()?
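
That is, something along these lines (a sketch of the suggestion; the final code in the PR may differ):

```java
import org.apache.flink.configuration.ConfigOption;
import org.apache.flink.configuration.ConfigOptions;
import org.apache.hudi.common.model.HoodieTableType;

public static final ConfigOption<String> TABLE_TYPE = ConfigOptions
    .key("write.table.type")
    .stringType()
    .defaultValue(HoodieTableType.COPY_ON_WRITE.name()) // no raw string literal
    .withDescription("Type of table to write, COPY_ON_WRITE (or) MERGE_ON_READ");
```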

+ "By default 2000 and it will be doubled by every retry");

public static final ConfigOption<Boolean> IGNORE_FAILED_BATCH = ConfigOptions
.key("write.ignore.failed.batch")
Contributor

Does write.batch.ignore.failed sound better? This config option describes a boolean flag, right?

Contributor Author

I would rather not include 'batch', because Flink may evolve to streaming writes in the near future.

|| WriteOperationType.fromValue(this.config.getString(FlinkOptions.OPERATION)) == WriteOperationType.UPSERT;
GenericRecord gr = (GenericRecord) this.converter.convert(this.avroSchema, record);
HoodieRecordPayload payload = shouldCombine
? StreamerUtil.createPayload(this.config.getString(FlinkOptions.PAYLOAD_CLASS), gr,
Contributor

This createPayload expression is too long and hard to read. Can you make it easier to read?
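
One possible shape, sketched with hypothetical helper names (toPayload, shouldCombine, and orderingVal are illustrative, and the condition shown in the hunk is only part of the real expression):

```java
// Hypothetical refactoring sketch: name the branches instead of inlining them.
private HoodieRecordPayload toPayload(GenericRecord gr) throws IOException {
  String payloadClazz = this.config.getString(FlinkOptions.PAYLOAD_CLASS);
  return shouldCombine()
      ? StreamerUtil.createPayload(payloadClazz, gr, orderingVal(gr)) // 3-arg form, as in the hunk
      : StreamerUtil.createPayload(payloadClazz, gr);                 // assumed 2-arg overload
}

private boolean shouldCombine() {
  // the other half of this condition is elided in the diff hunk above
  return WriteOperationType.fromValue(this.config.getString(FlinkOptions.OPERATION))
      == WriteOperationType.UPSERT;
}
```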

}
}

private void waitFor(long intervalMillis) {
Contributor

waitFor what? Let us give it a more readable name?

Contributor Author

Renamed to sleepFor.

doCommit();
return;
} catch (Throwable throwable) {
String cause = throwable.getCause() == null ? "" : throwable.getCause().toString();
Contributor

Did you check how many exception categories there are? If retrying is not necessary, may we adopt a fail-fast strategy?

Contributor Author

Not sure how many exception categories there are ~, so we'd better keep the retry: Flink is a streaming system and less failover is better. We can add checks for the exception categories in the future if needed (for example, if there is an exception that does not deserve a retry, just fail fast).

The user can disable the retry by setting the retry times to a negative value though ~
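
For reference, the retry shape being discussed, as an illustrative sketch: the interval starts at the configured default of 2000 ms and doubles on every retry, and a negative retry count means a single attempt with no retries:

```java
// Illustrative sketch of commit retry with a doubling back-off.
// HoodieException is org.apache.hudi.exception.HoodieException.
private void commitWithRetry(int retryTimes, long initialIntervalMillis) {
  long interval = initialIntervalMillis; // default 2000 ms, doubled on every retry
  int attempt = 0;
  while (true) {
    try {
      doCommit();
      return;
    } catch (Throwable throwable) {
      if (attempt++ >= retryTimes) { // a negative retryTimes means no retry at all
        throw new HoodieException("Commit failed after " + attempt + " attempt(s)", throwable);
      }
      sleepFor(interval);
      interval *= 2;
    }
  }
}
```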

@garyli1019

@danny0405 Sorry for the delay on the review; I was super busy this week. The bloom index was merged to master; can we add the bloom index option to this PR as well?

@danny0405

> @danny0405 Sorry for the delay on the review; I was super busy this week. The bloom index was merged to master; can we add the bloom index option to this PR as well?

I'm not planning to use the BloomFilter index in the new pipeline; instead, a BloomFilter-backed state index will come in a following PR, which is more suitable for streaming writes.

@yanghua yanghua left a comment

@danny0405 Kind thanks for your contribution and patience. I approve this PR now. If there are some minor issues, let's fix them later. @wangxianghu @garyli1019 Do you still have any concerns? Otherwise I will time out and merge it tomorrow morning!


yanghua commented Jan 27, 2021

@danny0405 oh, CI failed... Please fix it before merging, thanks.

@garyli1019

> @danny0405 Sorry for the delay on the review; I was super busy this week. The bloom index was merged to master; can we add the bloom index option to this PR as well?

> I'm not planning to use the BloomFilter index in the new pipeline; instead, a BloomFilter-backed state index will come in a following PR, which is more suitable for streaming writes.

@danny0405 Yes, using the bloom index in a streaming fashion is what we are planning to do next as well. No worries for now; we can try this PR out and continue the work on top of it. Really appreciate your work! This PR LGTM.

@yanghua yanghua merged commit bc0325f into apache:master Jan 28, 2021
prashantwason pushed a commit to prashantwason/incubator-hudi that referenced this pull request Aug 5, 2021
* [HUDI-1522] Add a new pipeline for Flink writer
prashantwason added a commit to prashantwason/incubator-hudi that referenced this pull request Aug 5, 2021
…OSS master

Summary:
[HUDI-1509]: Reverting LinkedHashSet changes to combine fields from oldSchema and newSchema in favor of using only new schema for record rewriting (apache#2424)
[MINOR] Bumping snapshot version to 0.7.0 (apache#2435)
[HUDI-1533] Make SerializableSchema work for large schemas and add ability to sortBy numeric values (apache#2453)
[HUDI-1529] Add block size to the FileStatus objects returned from metadata table to avoid too many file splits (apache#2451)
[HUDI-1532] Fixed suboptimal implementation of a magic sequence search  (apache#2440)
[HUDI-1535] Fix 0.7.0 snapshot (apache#2456)
[MINOR] Fixing setting defaults for index config (apache#2457)
[HUDI-1540] Fixing commons codec shading in spark bundle (apache#2460)
[HUDI 1308] Harden RFC-15 Implementation based on production testing (apache#2441)
[MINOR] Remove redundant judgments (apache#2466)
[MINOR] Fix dataSource cannot use hoodie.datasource.hive_sync.auto_create_database (apache#2444)
[MINOR] Disabling problematic tests temporarily to stabilize CI (apache#2468)
[MINOR] Make a separate travis CI job for hudi-utilities (apache#2469)
[HUDI-1512] Fix spark 2 unit tests failure with Spark 3 (apache#2412)
[HUDI-1511] InstantGenerateOperator support multiple parallelism (apache#2434)
[HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client (apache#2375)
[HUDI] Add bloom index for hudi-flink-client
[MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFlinkStreamer and update docs (apache#2471)
[MINOR] Improve code readability,remove the continue keyword (apache#2459)
[HOTFIX] Revert upgrade flink verison to 1.12.0 (apache#2473)
[HUDI-1453] Fix NPE using HoodieFlinkStreamer to etl data from kafka to hudi (apache#2474)
[MINOR] Use skipTests flag for skip.hudi-spark2.unit.tests property (apache#2477)
[HUDI-1476] Introduce unit test infra for java client (apache#2478)
[MINOR] Update doap with 0.7.0 release (apache#2491)
[MINOR]Fix NPE when using HoodieFlinkStreamer with multi parallelism (apache#2492)
[HUDI-1234] Insert new records to data files without merging for "Insert" operation.  (apache#2111)
[MINOR] Add Jira URL and Mailing List (apache#2404)
[HUDI-1522] Add a new pipeline for Flink writer (apache#2430)
[HUDI-1522] Add a new pipeline for Flink writer
[HUDI-623] Remove UpgradePayloadFromUberToApache (apache#2455)
[HUDI-1555] Remove isEmpty to improve clustering execution performance (apache#2502)
[HUDI-1266] Add unit test for validating replacecommit rollback (apache#2418)
[MINOR] Quickstart.generateUpdates method add check (apache#2505)
[HUDI-1519] Improve minKey/maxKey computation in HoodieHFileWriter (apache#2427)
[HUDI-1550] Honor ordering field for MOR Spark datasource reader (apache#2497)
[MINOR] Fix method comment typo (apache#2518)
[MINOR] Rename FileSystemViewHandler to RequestHandler and corrected the class comment (apache#2458)
[HUDI-1335] Introduce FlinkHoodieSimpleIndex to hudi-flink-client (apache#2271)
[HUDI-1523] Call mkdir(partition) only if not exists (apache#2501)
[HUDI-1538] Try to init class trying different signatures instead of checking its name (apache#2476)
[HUDI-1538] Try to init class trying different signatures instead of checking its name.
[HUDI-1547] CI intermittent failure: TestJsonStringToHoodieRecordMapF… (apache#2521)
[MINOR] Fixing the default value for source ordering field for payload config (apache#2516)
[HUDI-1420] HoodieTableMetaClient.getMarkerFolderPath works incorrectly on windows client with hdfs server for wrong file seperator (apache#2526)
[HUDI-1571] Adding commit_show_records_info to display record sizes for commit (apache#2514)
[HUDI-1589] Fix Rollback Metadata AVRO backwards incompatiblity (apache#2543)
[MINOR] Fix wrong logic for checking state condition (apache#2524)
[HUDI-1557] Make Flink write pipeline write task scalable (apache#2506)
[HUDI-1545] Add test cases for INSERT_OVERWRITE Operation (apache#2483)
[HUDI-1603] fix DefaultHoodieRecordPayload serialization failure (apache#2556)
[MINOR] Fix the wrong comment for HoodieJavaWriteClientExample (apache#2559)
[HUDI-1526] Translate the api partitionBy in spark datasource to hoodie.datasource.write.partitionpath.field (apache#2431)
[HUDI-1612] Fix write test flakiness in StreamWriteITCase (apache#2567)
[HUDI-1612] Fix write test flakiness in StreamWriteITCase
[MINOR] Default to empty list for unset datadog tags property (apache#2574)
[MINOR] Add clustering to feature list (apache#2568)
[HUDI-1598] Write as minor batches during one checkpoint interval for the new writer (apache#2553)
[HUDI-1109] Support Spark Structured Streaming read from Hudi table (apache#2485)
[HUDI-1621] Gets the parallelism from context when init StreamWriteOperatorCoordinator (apache#2579)
[HUDI-1381] Schedule compaction based on time elapsed (apache#2260)
[HUDI-1582] Throw an exception when syncHoodieTable() fails, with RuntimeException (apache#2536)
[HUDI-1539] Fix bug in HoodieCombineRealtimeRecordReader with reading empty iterators (apache#2583)
[HUDI-1315] Adding builder for HoodieTableMetaClient initialization (apache#2534)
[HUDI-1486] Remove inline inflight rollback in hoodie writer (apache#2359)
[HUDI-1586] [Common Core] [Flink Integration] Reduce the coupling of hadoop. (apache#2540)
[HUDI-1624] The state based index should bootstrap from existing base files (apache#2581)
[HUDI-1477] Support copyOnWriteTable in java client (apache#2382)
[MINOR] Ensure directory exists before listing all marker files. (apache#2594)
[MINOR] hive sync checks for table after creating db if auto create is true (apache#2591)
[HUDI-1620] Add azure pipelines configs (apache#2582)
[HUDI-1347] Fix Hbase index to make rollback synchronous (via config) (apache#2188)
[HUDI-1637] Avoid to rename for bucket update when there is only one flush action during a checkpoint (apache#2599)
[HUDI-1638] Some improvements to BucketAssignFunction (apache#2600)
[HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (apache#2227)
[HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable (apache#2443)
[HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap. (apache#2565)
[Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read. (apache#2584)
[HUDI-1632] Supports merge on read write mode for Flink writer (apache#2593)
[HUDI-1540] Fixing commons codec dependency in bundle jars (apache#2562)
[HUDI-1644] Do not delete older rollback instants as part of rollback. Archival can take care of removing old instants cleanly (apache#2610)
[HUDI-1634] Re-bootstrap metadata table when un-synced instants have been archived. (apache#2595)
[HUDI-1584] Modify maker file path, which should start with the target base path. (apache#2539)
[MINOR] Fix default value for hoodie.deltastreamer.source.kafka.auto.reset.offsets (apache#2617)
[HUDI-1553] Configuration and metrics for the TimelineService. (apache#2495)
[HUDI-1587] Add latency and freshness support (apache#2541)
[HUDI-1647] Supports snapshot read for Flink (apache#2613)
[HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (apache#2611)
[HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (apache#2621)
[HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (apache#2596)
[HUDI-1660] Excluding compaction and clustering instants from inflight rollback (apache#2631)
[HUDI-1661] Exclude clustering commits from getExtraMetadataFromLatest API (apache#2632)
[MINOR] Fix import in StreamerUtil.java (apache#2638)
[HUDI-1618] Fixing NPE with Parquet src in multi table delta streamer (apache#2577)
[HUDI-1662] Fix hive date type conversion for mor table (apache#2634)
[HUDI-1673] Replace scala.Tule2 to Pair in FlinkHoodieBloomIndex (apache#2642)
[MINOR] HoodieClientTestHarness close resources in AfterAll phase (apache#2646)
[HUDI-1635] Improvements to Hudi Test Suite (apache#2628)
[HUDI-1651] Fix archival of requested replacecommit (apache#2622)
[HUDI-1663] Streaming read for Flink MOR table (apache#2640)
[HUDI-1678] Row level delete for Flink sink (apache#2659)
[HUDI-1664] Avro schema inference for Flink SQL table (apache#2658)
[HUDI-1681] Support object storage for Flink writer (apache#2662)
[HUDI-1685] keep updating current date for every batch (apache#2671)
[HUDI-1496] Fixing input stream detection of GCS FileSystem (apache#2500)
[HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pacakges for hudi-flink module (apache#2669)
[HUDI-1692] Bounded source for stream writer (apache#2674)
[HUDI-1552] Improve performance of key lookups from base file in Metadata Table. (apache#2494)
[HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
[HUDI-1695] Fixed the error messaging (apache#2679)
[HUDI 1615] Fixing null schema in bulk_insert row writer path  (apache#2653)
[HUDI-845] Added locking capability to allow multiple writers (apache#2374)
[HUDI-1701] Implement HoodieTableSource.explainSource for all kinds of pushing down (apache#2690)
[HUDI-1704] Use PRIMARY KEY syntax to define record keys for Flink Hudi table (apache#2694)
[HUDI-1688]hudi write should uncache rdd, when the write operation is finnished (apache#2673)
[MINOR] Remove unused var in AbstractHoodieWriteClient (apache#2693)
[HUDI-1653] Add support for composite keys in NonpartitionedKeyGenerator (apache#2627)
[HUDI-1705] Flush as per data bucket for mini-batch write (apache#2695)
[1568] Fixing spark3 bundles (apache#2625)
[HUDI-1650] Custom avro kafka deserializer. (apache#2619)
[HUDI-1667]: Fix a null value related bug for spark vectorized reader. (apache#2636)
[HUDI-1709] Improving config names and adding hive metastore uri config (apache#2699)
[MINOR][DOCUMENT] Update README doc for integ test (apache#2703)
[HUDI-1710] Read optimized query type for Flink batch reader (apache#2702)
[HUDI-1712] Rename & standardize config to match other configs (apache#2708)
[hotfix] Log the error message for creating table source first (apache#2711)
[HUDI-1495] Bump Flink version to 1.12.2 (apache#2718)
[HUDI-1728] Fix MethodNotFound for HiveMetastore Locks (apache#2731)
[HUDI-1478] Introduce HoodieBloomIndex to hudi-java-client (apache#2608)
[HUDI-1729] Asynchronous Hive sync and commits cleaning for Flink writer (apache#2732)
[HOTFIX] close spark session in functional test suite and disable spark3 test for spark2 (apache#2727)
[HOTFIX] Disable ITs for Spark3 and scala2.12 (apache#2733)
[HOTFIX] fix deploy staging jars script
[MINOR] Add Missing Apache License to test files (apache#2736)
[UBER] Fixed creation of HoodieMetadataClient which now uses a Builder pattern instead of a constructor.

Reviewers: balajee

JIRA Issues: HUDI-593

Differential Revision: https://code.uberinternal.com/D5867129
prashantwason added a commit to prashantwason/incubator-hudi that referenced this pull request Aug 5, 2021