merge master#12

Merged
fengjian428 merged 113 commits into fengjian428:master from apache:master
Jul 8, 2022

Conversation

@fengjian428
Owner

Tips

What is the purpose of the pull request

(For example: This pull request adds quick-start document.)

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

jerryshao and others added 30 commits June 5, 2022 11:05
* Add Call Procedure for marker deletion
… NullPointerException (#5755)

SeekTo top cells avoid NullPointerException
…rsion may cause the fileID of the task to not be loaded correctly (#5763)

Co-authored-by: john.wick <john.wick@vipshop.com>
…mitMetadata` parsing (#5733)

As outlined in HUDI-4176, we've hit a roadblock while testing Hudi on a large dataset (~1 TB) with pretty fat commits, where Hudi's commit metadata could reach 100s of MB.
Given the size of some of our commit metadata instances, Spark's parsing and resolving phase (when spark.sql(...) is involved, but before the returned Dataset is dereferenced) starts to dominate some of our queries' execution time.

- Rebased onto new APIs to avoid excessive Hadoop's Path allocations
- Eliminated hasOperationField completely to avoid repetitive computations
- Cleaning up duplication in HoodieActiveTimeline
- Added caching for common instances of HoodieCommitMetadata
- Made tableStructSchema lazy;
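
The caching and lazy-evaluation items above can be illustrated with a minimal Python sketch. It is not Hudi's implementation: `CommitMetadataCache`, `Relation`, `read_raw`, and `derive_schema` are hypothetical names, and the metadata is assumed to be JSON keyed by instant time.

```python
import json
from functools import cached_property


class CommitMetadataCache:
    """Hypothetical helper: parses each commit's metadata at most once."""

    def __init__(self, read_raw):
        self._read_raw = read_raw  # callable: instant time -> raw JSON bytes
        self._parsed = {}
        self.parse_count = 0       # how often the expensive parse really ran

    def get(self, instant_time):
        if instant_time not in self._parsed:
            self.parse_count += 1
            self._parsed[instant_time] = json.loads(self._read_raw(instant_time))
        return self._parsed[instant_time]


class Relation:
    """Hypothetical relation whose schema is derived only on first access."""

    def __init__(self, derive_schema):
        self._derive_schema = derive_schema

    @cached_property
    def table_struct_schema(self):
        return self._derive_schema()


# Assumed sample data, for illustration only.
raw_store = {"20220605110500": b'{"operationType": "UPSERT"}'}
cache = CommitMetadataCache(raw_store.__getitem__)
first = cache.get("20220605110500")
second = cache.get("20220605110500")  # served from the cache, no re-parse

schema_calls = []


def derive_schema():
    schema_calls.append(1)
    return ["_row_key", "ts"]


relation = Relation(derive_schema)  # nothing is derived yet
```

Keying the cache by instant time rather than by the raw bytes avoids hashing payloads that may be hundreds of megabytes.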
… bulk insert row writer with SimpleKeyGen and virtual keys (#5664)

The bulk insert row writer code path had a gap with respect to Hive-style partitioning and the default partition when virtual keys are enabled with SimpleKeyGen. This patch fixes the issue.
)

- When the async indexer is invoked with only the "FILES" partition, it fails; this patch fixes it to work with the async indexer. Also, if the metadata table itself is not initialized and someone wants to build indexes via AsyncIndexer, they are expected to index the "FILES" partition first, followed by other partitions. In general, AsyncIndexer is limited to building only one index at a time. Guards have been added to ensure these conditions are met.
)

- When the non-partitioned key gen is used with virtual keys, the read path could break since the partition path may not exist.
Co-authored-by: yuezhang <yuezhang@freewheel.tv>
- Keys fetched from the metadata table, especially via the base file reader, are not sorted, which may result in an NPE (key prefix search) or unnecessary seeks back to the start of the HFile (full key lookups). This patch fixes both. This is not an issue with log blocks, since sorting is taken care of within HoodieHfileDataBlock.
- The sorting was mistakenly reverted in the commit "[HUDI-3760] Adding capability to fetch Metadata Records by prefix" (#5208)
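
Why sorting the keys matters can be shown with a toy model. This is only a sketch: `ForwardReader` is a hypothetical stand-in for a reader over a sorted file with a forward-moving cursor, not the HFile reader API.

```python
from bisect import bisect_left


class ForwardReader:
    """Toy reader over sorted (key, value) entries with a forward-only cursor."""

    def __init__(self, entries):
        self._keys = [k for k, _ in entries]    # assumed sorted ascending
        self._values = [v for _, v in entries]
        self._pos = 0
        self._last = None
        self.rewinds = 0  # times the cursor had to go back to the start

    def get(self, key):
        if self._last is not None and key < self._last:
            # The requested key lies behind the cursor: start over.
            self._pos = 0
            self.rewinds += 1
        self._last = key
        i = bisect_left(self._keys, key, self._pos)
        self._pos = i
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None


def lookup(reader, keys):
    return {k: reader.get(k) for k in keys}


entries = [("k1", "a"), ("k2", "b"), ("k3", "c"), ("k4", "d")]
wanted = ["k3", "k1", "k4", "k2"]

unsorted_reader = ForwardReader(entries)
unsorted_result = lookup(unsorted_reader, wanted)

sorted_reader = ForwardReader(entries)
sorted_result = lookup(sorted_reader, sorted(wanted))
```

Both lookups return the same records, but only the unsorted one forces the cursor back to the start of the file.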
* HiveConf needs to load fs conf to allow instantiation via AWSGlueClientFactory

* Resolve metastore uri config before loading fs conf

* Skip hiveql due to CI issue

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
…Integration (#5737)

There are multiple issues with our current DataSource V2 integration: because we advertise Hudi tables as V2, Spark expects them to implement certain APIs that are not implemented at the moment, so we rely on a custom resolution rule (in HoodieSpark3Analysis) to manually fall back to V1 APIs. This commit fixes the issue by reverting DSv2 APIs and making Spark use V1, except for schema evaluation logic.
* [MINOR] FlinkStateBackendConverter add more  exception message
…es easily (#5744)


Co-authored-by: yanenze <yanenze@keytop.com.cn>
…rite (#5619)


Co-authored-by: xicm <xicm@asiainfo.com>
…ading metadata table (#5840)

When the metadata table path is explicitly specified for reading in Spark, "hoodie.metadata.enable" is overwritten to true for proper read behavior.
Replace SerializableConfiguration with SerializableWritable for broadcasting the hadoop configuration before initializing HFile readers
- Upgrade junit to 5.7.2
- Downgrade surefire and failsafe to 2.22.2
- Fix test failures that were previously not reported
- Improve azure pipeline configs

Co-authored-by: liujinhui1994 <965147871@qq.com>
Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
Co-authored-by: superche <superche@tencent.com>
#5790)

TestReaderFilterRowKeys needs to get the key from RECORD_KEY_METADATA_FIELD, but the writer in the current UT does not populate the meta field and the schema does not contain meta fields.

This fix writes data with schema which contains meta fields and calls writeAvroWithMetadata for writing.

Co-authored-by: xicm <xicm@asiainfo.com>
add new config key hoodie.deltastreamer.source.kafka.enable.failOnDataLoss
when failOnDataLoss=false (current behaviour, the default), log a warning instead of seeking to earliest silently
when failOnDataLoss is set, fail explicitly
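
The behaviour described above can be sketched as follows. This is a Python illustration under assumptions, not the DeltaStreamer code: `resolve_start_offset` and `DataLossError` are hypothetical names, and data loss is modelled as a checkpointed offset that is older than the earliest offset still available.

```python
import logging

logger = logging.getLogger("kafka_offsets")


class DataLossError(RuntimeError):
    """Hypothetical error raised when offsets were lost and failing is requested."""


def resolve_start_offset(checkpoint, earliest, fail_on_data_loss=False):
    """Pick the offset to resume from for one partition."""
    if checkpoint >= earliest:
        return checkpoint  # nothing was lost, resume where we stopped
    message = (
        f"Offsets {checkpoint}..{earliest - 1} are no longer available; "
        "some data may have been lost."
    )
    if fail_on_data_loss:
        raise DataLossError(message)  # fail explicitly
    logger.warning(message)           # default: warn, then seek to earliest
    return earliest
```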
#5788)

Adding more logs to help debug exceptions thrown by HoodieFlinkWriteClient.getOrCreateWriteHandle
zhangyue19921010 and others added 29 commits June 29, 2022 01:43
…6002)

Co-authored-by: yuezhang <yuezhang@yuezhang-mac.freewheelmedia.net>
…s default (#5174)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
Co-authored-by: Wenning Ding <wenningd@amazon.com>
#5907)

* [HUDI-4285] add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer

* add ut

Co-authored-by: wangzixuan.wzxuan <wangzixuan.wzxuan@bytedance.com>
…tream if using HDFS (#5048)

Add HDFS-specific logic for creating an immutable file: first create file.tmp, then rename it to the final file name
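
The create-then-rename pattern can be sketched as below. The sketch uses the local filesystem as a stand-in for HDFS, and `create_immutable_file` is a hypothetical helper name rather than anything in Hudi.

```python
import os
import tempfile


def create_immutable_file(path, content):
    """Write content to path + '.tmp', then rename it into place."""
    if os.path.exists(path):
        raise FileExistsError(path)  # an immutable file is never rewritten
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(content)
        f.flush()
        os.fsync(f.fileno())
    # Readers see either no file or the complete file, never a partial one.
    os.rename(tmp_path, path)


with tempfile.TemporaryDirectory() as base:
    target = os.path.join(base, "hoodie.properties")  # file name chosen for illustration
    create_immutable_file(target, b"k=v\n")
    with open(target, "rb") as f:
        written = f.read()
    tmp_left_behind = os.path.exists(target + ".tmp")
```

The existence check here is not atomic with the rename; it only keeps the sketch from overwriting a file on local POSIX filesystems.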
… api (#5445)

Co-authored-by: jerryyue <jerryyue@didiglobal.com>
Co-authored-by: superche <superche@tencent.com>
* [HUDI-3730] Improve meta sync class design and hierarchies (#5754)
* Implements class design proposed in RFC-55

Co-authored-by: jian.feng <fengjian428@gmial.com>
Co-authored-by: jian.feng <jian.feng@shopee.com>
Co-authored-by: voonhou.su <voonhou.su@shopee.com>
… partitions through a standalone job. (#4459)

Co-authored-by: yuezhang <yuezhang@freewheel.tv>
…ion plan at once (#5677)

* [HUDI-4152] Flink offline compaction allow compact multi compaction plan at once

* [HUDI-4152] Fix exception for duplicated uid when multi compaction plan are compacted

* [HUDI-4152] Provider UT & IT for compact multi compaction plan

* [HUDI-4152] Put multi compaction plans into one compaction plan source

* [HUDI-4152] InstantCompactionPlanSelectStrategy allow multi instant by using comma

* [HUDI-4152] Add IT for InstantCompactionPlanSelectStrategy
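
Selecting several compaction plans from a comma-separated list of instants might look like the sketch below. It is only an illustration in Python: `select_compaction_plans` is a hypothetical function, and the actual behaviour of InstantCompactionPlanSelectStrategy (ordering, handling of unknown instants) is not confirmed by the commit messages above.

```python
def select_compaction_plans(pending_instants, instant_arg):
    """Return the pending compaction instants named in a comma-separated argument."""
    requested = {part.strip() for part in instant_arg.split(",") if part.strip()}
    # Keep the timeline order of the pending instants; skip unknown ones.
    return [instant for instant in pending_instants if instant in requested]


pending = ["20220601000000", "20220602000000", "20220603000000"]
selected = select_compaction_plans(pending, "20220603000000, 20220601000000")
```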
…on. (#5995)

* fix for updateTableParameters which is not excluding partition columns and updateTableProperties boolean check

* Fix - serde parameters getting overridden on table property update

* removing stale syncConfig
@fengjian428 fengjian428 merged commit 424c1da into fengjian428:master Jul 8, 2022