merge master by fengjian428 · Pull Request #12 · fengjian428/hudi

fengjian428 · 2022-07-08T17:39:20Z

Tips

Thank you very much for contributing to Apache Hudi.
Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

Added integration tests for end-to-end.
Added HoodieClientWriteTest to verify the change.
Manually verified the change by running a job locally.

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

…mitMetadata` parsing (#5733)

As outlined in HUDI-4176, we hit a roadblock while testing Hudi on a large dataset (~1 TB) with fairly fat commits, where Hudi's commit metadata can reach hundreds of MB. Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is involved, but before the returned Dataset is dereferenced) starts to dominate the execution time of some of our queries.

- Rebased onto new APIs to avoid excessive allocations of Hadoop's Path
- Eliminated hasOperationField completely to avoid repetitive computations
- Cleaned up duplication in HoodieActiveTimeline
- Added caching for common instances of HoodieCommitMetadata
- Made tableStructSchema lazy

… bulk insert row writer with SimpleKeyGen and virtual keys (#5664) The bulk insert row writer code path had a gap with respect to Hive-style partitioning and the default partition when virtual keys are enabled with SimpleKeyGen. This patch fixes the issue.

) - When the async indexer is invoked with only the "FILES" partition, it fails. This patch fixes it to work with the async indexer. Also, if the metadata table itself is not initialized and someone wants to build indexes via AsyncIndexer, they are expected to index the "FILES" partition first, followed by the other partitions. In general, AsyncIndexer is limited to building only one index at a time, so guards have been added to ensure these conditions are met.

- Keys fetched from the metadata table, especially from the base file reader, are not sorted, which may result in an NPE (key prefix search) or unnecessary seeks to the start of the HFile (full key lookups). This patch fixes that. This is not an issue with log blocks, since sorting is taken care of within HoodieHfileDataBlock.
- Commit where the sorting was mistakenly reverted: [HUDI-3760] Adding capability to fetch Metadata Records by prefix #5208

…Integration (#5737) There are multiple issues with our current DataSource V2 integration: because we advertise Hudi tables as V2, Spark expects them to implement certain APIs that are not implemented at the moment. Instead, we use a custom resolution rule (in HoodieSpark3Analysis) to manually fall back to V1 APIs. This commit fixes the issue by reverting the DSv2 APIs and making Spark use V1, except for the schema evaluation logic.

- Upgrade junit to 5.7.2
- Downgrade surefire and failsafe to 2.22.2
- Fix test failures that were previously not reported
- Improve azure pipeline configs

Co-authored-by: liujinhui1994 <965147871@qq.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>

#5790) TestReaderFilterRowKeys needs to get the key from RECORD_KEY_METADATA_FIELD, but the writer in the current UT does not populate the meta field and the schema does not contain meta fields. This fix writes data with a schema that contains meta fields and calls writeAvroWithMetadata for writing. Co-authored-by: xicm <xicm@asiainfo.com>

Add new config key hoodie.deltastreamer.source.kafka.enable.failOnDataLoss:

- when failOnDataLoss=false (current behaviour, the default), log a warning instead of seeking to earliest silently
- when failOnDataLoss is set, fail explicitly
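The two failOnDataLoss behaviours can be sketched roughly as follows. This is a minimal illustration and not the code from the commit: only the config key comes from the message above, while the class `FailOnDataLossSketch`, the method `resolveStartOffset`, and the offset parameters are hypothetical names. It also assumes that the non-failing path still resumes from the earliest available offset, which the message implies but does not spell out.

```java
import java.util.Properties;

class FailOnDataLossSketch {
    // Config key taken from the commit message; everything else here is hypothetical.
    static final String KEY = "hoodie.deltastreamer.source.kafka.enable.failOnDataLoss";

    // Decides where to resume reading when the checkpointed offset may have
    // already been removed from the topic (for example, by retention).
    static long resolveStartOffset(Properties props, long checkpointed, long earliestAvailable) {
        if (checkpointed >= earliestAvailable) {
            return checkpointed; // nothing was lost, resume from the checkpoint
        }
        boolean failOnDataLoss = Boolean.parseBoolean(props.getProperty(KEY, "false"));
        if (failOnDataLoss) {
            // Explicit failure when the flag is set.
            throw new IllegalStateException("Offsets " + checkpointed + ".." + (earliestAvailable - 1)
                + " are no longer available");
        }
        // Default: warn loudly instead of skipping ahead silently.
        System.err.println("WARN: data loss detected, resuming from earliest offset " + earliestAvailable);
        return earliestAvailable;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(resolveStartOffset(props, 5L, 10L));
    }
}
```

Leaving the flag unset keeps the existing behaviour apart from the added warning, so jobs only start failing once the key is explicitly set to true.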

…ion plan at once (#5677)

* [HUDI-4152] Flink offline compaction allow compact multi compaction plan at once
* [HUDI-4152] Fix exception for duplicated uid when multi compaction plan are compacted
* [HUDI-4152] Provide UT & IT for compact multi compaction plan
* [HUDI-4152] Put multi compaction plans into one compaction plan source
* [HUDI-4152] InstantCompactionPlanSelectStrategy allow multi instant by using comma
* [HUDI-4152] Add IT for InstantCompactionPlanSelectStrategy

…on. (#5995)

* Fix updateTableParameters, which was not excluding partition columns, and the updateTableProperties boolean check
* Fix serde parameters getting overridden on table property update
* Remove stale syncConfig

jerryshao and others added 30 commits June 5, 2022 11:05

[HUDI-4168] Add Call Procedure for marker deletion (#5738)

bd26d63

* Add Call Procedure for marker deletion

[HUDI-4190] Include hbase-protocol for shading in the bundles (#5750)

5d18b80

[HUDI-4192] HoodieHFileReader scan top cells after bottom cells throw…

73b0be3

… NullPointerException (#5755) SeekTo top cells avoid NullPointerException

[HUDI-4188] Fix flaky ITTestDataSTreamWrite.testWriteCopyOnWrite (#5749)

22c45a7

[HUDI-4195] Bulk insert should use right keygen for non-partitioned t…

21ab0ff

…able (#5759)

[HUDI-4101] When BucketIndexPartitioner take partition path for dispe…

132c0aa

…rsion may cause the fileID of the task to not be loaded correctly (#5763) Co-authored-by: john.wick <john.wick@vipshop.com>

[HUDI-4171] Fixing Non partitioned with virtual keys in read path (#5747

7da97c8

) - When Non partitioned key gen is used with virtual keys, read path could break since partition path may not exist.

[MINOR] Mark AWSGlueCatalogSyncClient experimental (#5775)

e5710a8

[MINOR][RFC-53] Fix typos (#5764)

4f5cad8

Co-authored-by: yuezhang <yuezhang@freewheel.tv>

[HUDI-4198] Fix hive config for AWSGlueClientFactory (#5768)

1349b59

* HiveConf needs to load fs conf to allow instantiation via AWSGlueClientFactory
* Resolve metastore uri config before loading fs conf
* Skip hiveql due to CI issue

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>

[MINOR][DOCS] Update the README.md file in hudi-examples (#5803)

f5ab921

[MINOR] FlinkStateBackendConverter add more exception message (#5809)

8ff17b0

* [MINOR] FlinkStateBackendConverter add more exception message

[HUDI-4213] Infer keygen clazz for Spark SQL (#5815)

c608dbd

[HUDI-4139]improvement for flink write operator name to identify tabl…

ba47904

…es easily (#5744) Co-authored-by: yanenze <yanenze@keytop.com.cn>

[HUDI-3889] Do not validate table config if save mode is set to Overw…

2b3a855

…rite (#5619) Co-authored-by: xicm <xicm@asiainfo.com>

[HUDI-4221] Fixing getAllPartitionPaths perf hit w/ FileSystemBackedM…

08fe281

…etadata (#5829)

[HUDI-4223] Fix NullPointerException from getLogRecordScanner when re…

97ccf5d

…ading metadata table (#5840) When explicitly specifying the metadata table path for reading in spark, the "hoodie.metadata.enable" is overwritten to true for proper read behavior.

[HUDI-4205] Fix NullPointerException in HFile reader creation (#5841)

fd8f7c5

Replace SerializableConfiguration with SerializableWritable for broadcasting the hadoop configuration before initializing HFile readers

[MINOR] fix AvroSchemaConverter duplicate branch in 'switch' (#5813)

c82e346

Strip extra spaces when creating new configuration (#5849)

14d8735

Co-authored-by: superche <superche@tencent.com>

[HUDI-3863] Add UT for drop partition column in deltastreamer testsui…

0d859fe

…te (#5727)

[HUDI-4207] HoodieFlinkWriteClient.getOrCreateWriteHandle throws an e… (

264b15d

#5788) Adding more logs to assist in debugging with HoodieFlinkWriteClient.getOrCreateWriteHandle throwing exception

zhangyue19921010 and others added 29 commits June 29, 2022 01:43

[HUDI-1575] Claim RFC-56: Early Conflict Detection For Multi-writer (#…

637660b

…6002) Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net>

[MINOR] Make CLI 'commit rollback' using rollbackUsingMarkers false a…

e71f047

…s default (#5174) Co-authored-by: yuezhang <yuezhang@freewheel.tv>

[HUDI-4331] Allow loading external config file from class loader (#5987)

03a94d9

Co-authored-by: Wenning Ding <wenningd@amazon.com>

[HUDI-4336] Fix records overwritten bug with binary primary key (#5996)

3948b89

[MINOR] Following #2070, Fix BindException when running tests on shar…

6a01f70

…ed machines. (#5951)

[HUDI-4346] Fix params not update BULKINSERT_ARE_PARTITIONER_RECORDS_…

cdaaa3c

…SORTED (#5999)

[HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeseria… (

8547899

#5907)

* [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer
* add ut

Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>
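The reason a rewind is needed after a relative get can be seen in a small standalone sketch. This is not the AvroDeserializer code. The class `ByteBufferRewindSketch` and its `readAll` helper are hypothetical, and only `ByteBuffer#get(byte[])` and `ByteBuffer#rewind()` are standard JDK calls.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ByteBufferRewindSketch {
    // Copies the remaining bytes out of the buffer. The relative bulk get
    // advances the buffer's position to its limit, so a second reader would
    // see an empty buffer unless the position is reset.
    static byte[] readAll(ByteBuffer buffer, boolean rewindAfter) {
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        if (rewindAfter) {
            buffer.rewind(); // position back to 0, limit unchanged
        }
        return bytes;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.wrap("hudi".getBytes(StandardCharsets.UTF_8));
        readAll(buffer, false);
        System.out.println("remaining without rewind: " + buffer.remaining());
        buffer.rewind();
        readAll(buffer, true);
        System.out.println("remaining with rewind: " + buffer.remaining());
    }
}
```

Without the rewind, any later consumer of the same ByteBuffer instance reads zero bytes, which is the kind of silent data loss the commit title points at.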

[HUDI-3984] Remove mandatory check of partiton path for cli command (#…

397fd30

…5458)

[HUDI-3634] Could read empty or partial HoodieCommitMetaData in downs…

62a0c96

…tream if using HDFS (#5048) Add the differentiated logic of creating immutable file in HDFS by first creating the file.tmp and then renaming the file
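The write-to-tmp-then-rename idea described above can be sketched with the JDK's local file API. This is only an illustration of the pattern: the commit itself targets HDFS through Hadoop's file system API, which is not used here. The class `TmpThenRenameSketch`, its methods, and the file name are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class TmpThenRenameSketch {
    // Writes the full content to "<name>.tmp" first, then renames it to the
    // final name, so readers never observe an empty or partially written file
    // under the final name.
    static void writeViaTmp(Path target, byte[] content) {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        try {
            Files.write(tmp, content);
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a file back as UTF-8 text.
    static String read(Path path) {
        try {
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Runs the pattern in a fresh temporary directory and returns the final path.
    static Path demo() {
        try {
            Path dir = Files.createTempDirectory("tmp-rename-sketch");
            Path target = dir.resolve("example.commit"); // hypothetical file name
            writeViaTmp(target, "commit metadata".getBytes(StandardCharsets.UTF_8));
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(demo()));
    }
}
```

The rename is the only step that makes the file visible under its final name, which is why downstream readers cannot pick up an empty or half-written file.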

[HUDI-3953]Flink Hudi module should support low-level source and sink…

bdf73b2

… api (#5445) Co-authored-by: jerryyue <jerryyue@didiglobal.com>

[HUDI-4353] Column stats data skipping for flink (#6026)

47792a3

[HUDI-3505] Add call procedure for UpgradeOrDowngradeCommand (#6012)

c00ea84

Co-authored-by: superche <superche@tencent.com>

[HUDI-3730] Improve meta sync class design and hierarchies (#5854)

c0e1587

* [HUDI-3730] Improve meta sync class design and hierarchies (#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>

[HUDI-3511] Add call procedure for MetadataCommand (#6018)

e095404

[HUDI-3730] Add ConfigTool#toMap UT (#6035)

c091e4c

Co-authored-by: voonhou.su <voonhou.su@shopee.com>

[MINOR] Improve variable names (#6039)

6187622

[HUDI-3116]Add a new HoodieDropPartitionsTool to let users drop table…

45fdcf6

… partitions through a standalone job. (#4459) Co-authored-by: yuezhang <yuezhang@freewheel.tv>

[HUDI-4360] Fix HoodieDropPartitionsTool based on refactored meta sync (

fbda4ad

#6043)

[HUDI-3836] Improve the way of fetching metadata partitions from table (

23c9c5c

#5286) Co-authored-by: xicm <xicm@asiainfo.com>

[HUDI-4359] Support show_fs_path_detail command on Call Produce Comma…

8570c3a

…nd (#6042)

[HUDI-4356] Fix the error when sync hive in CTAS (#6029)

3670e82

[HUDI-4219] Merge Into when update expression "col=s.col+2" on precom…

b18c323

…bine cause exception (#5828)

[HUDI-4357] Support flink 1.15.x (#6050)

7eeaff9

[HUDI-4309] fix spark32 repartition error (#6033)

5673819

[HUDI-4366] Synchronous cleaning for flink bounded source (#6051)

c744848

[minor] following 4152, refactor the clazz about plan selection strat…

a998586

…egy (#6060)

[HUDI-4367] Support copyToTable on call (#6054)

f20acb8

fengjian428 merged commit 424c1da into fengjian428:master Jul 8, 2022

fengjian428 merged 113 commits into fengjian428:master from apache:master

20 participants
