[DNM] Release 0.13.1 rc1 testing #8709
This commit adds the missing Apache License header to some source files.
This commit fixes `scripts/release/validate_staged_release.sh` to skip checking `release/release_guide*` for "Binary Files Check" and "Licensing Check".
Recently we have seen more flakiness in our CI runs, so this takes a stab at fixing some of the most frequently failing tests. Tests fixed in TestHoodieClientOnMergeOnReadStorage: testReadingMORTableWithoutBaseFile, testCompactionOnMORTable, testLogCompactionOnMORTable, testLogCompactionOnMORTableWithoutBaseFile. Reason for the flakiness: we generate only 10 inserts in these tests, which does not guarantee records in all 3 partitions (HoodieTestDataGenerator). Fixes: HoodieTestDataGenerator was choosing a random partition from the list of partitions while generating insert records; it now does round robin (see the sketch below). Also bumped the number of records inserted in some of the flaky tests from 10 to 100, and fixed the respective MOR tests to disable small file handling.
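A minimal sketch of the round-robin selection described above, assuming a simple standalone picker class; the names here are illustrative and not the actual HoodieTestDataGenerator code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper contrasting random vs. round-robin partition selection for
// generated insert records; with round robin, even 10 inserts cover every partition.
public class RoundRobinPartitionPicker {
  private final String[] partitionPaths;
  private final AtomicInteger nextIndex = new AtomicInteger(0);

  public RoundRobinPartitionPicker(String[] partitionPaths) {
    this.partitionPaths = partitionPaths;
  }

  // Instead of new Random().nextInt(partitionPaths.length), cycle deterministically
  // through the partition list so a small batch of inserts touches every partition.
  public String nextPartition() {
    int idx = Math.floorMod(nextIndex.getAndIncrement(), partitionPaths.length);
    return partitionPaths[idx];
  }
}
```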
…om Metadata Table (apache#7642) Recently, trying to use the Metadata Table in the Bloom Index resulted in failures due to exhaustion of the S3 connection pool, no matter how (reasonably) large we set the pool size (we tested up to 3k connections). This PR focuses on optimizing the Bloom Index lookup sequence when it leverages the Bloom Filter partition in the Metadata Table. The premise of this change is based on the following observations:
- Increasing the size of a batch of requests to the MT amortizes the cost of processing it (the bigger the batch, the lower the per-request cost). Having too few partitions in the Bloom Index path, however, starts to hurt parallelism when we actually probe individual files for the target keys. The solution is to split these into two stages with drastically different parallelism levels: constrain parallelism when reading from the MT (tens of tasks) and keep the current level for probing individual files (hundreds of tasks).
- The current way of partitioning records (relying on Spark's default partitioner) means that, with high likelihood, every Spark executor will open (and process) every file group of the MT Bloom Filter partition. To alleviate that, the same hashing algorithm used by the MT should be used to partition records into Spark's partitions, so that each task opens no more than one file group in the Bloom Filter partition of the MT.

To achieve that, the following changes in the Bloom Index sequence (leveraging the MT) are implemented:
- Bloom Filter probing and actual file probing are split into two separate operations, so that the parallelism of each can be controlled individually.
- Requests to the MT are replaced with batch API calls.
- A custom partitioner, AffineBloomIndexFileGroupPartitioner, is introduced, repartitioning the dataset of filenames with corresponding record keys in a way that is affine with the MT Bloom Filters' partitioning (allowing us to open no more than a single file group per Spark task); a hedged sketch of this idea follows below.

Additionally, this PR addresses some low-hanging performance optimizations that could considerably improve the Bloom Index lookup sequence, such as mapping file-comparison pairs to a PairRDD (where the key is the file name and the value is the record key) instead of an RDD, so that we can:
- Do in-partition sorting by filename (to make sure we check all records within a file at once) within a single Spark partition instead of globally (reducing shuffling as well).
- Avoid re-shuffling (by re-mapping from RDD to PairRDD later).
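The partitioner below is a hedged illustration of the affinity idea only: it assumes the Bloom Filter partition of the MT has a fixed number of file groups and that record keys map to them by a simple hash. The real AffineBloomIndexFileGroupPartitioner reuses Hudi's metadata-table key-to-file-group mapping rather than `hashCode()`.

```java
import org.apache.spark.Partitioner;

// Routes each record key to the Spark partition corresponding to the metadata-table
// file group that holds its bloom filter, so a task opens at most one file group.
public class AffineFileGroupPartitionerSketch extends Partitioner {
  private final int numFileGroups;

  public AffineFileGroupPartitionerSketch(int numFileGroups) {
    this.numFileGroups = numFileGroups;
  }

  @Override
  public int numPartitions() {
    return numFileGroups;
  }

  @Override
  public int getPartition(Object key) {
    // Placeholder hash; the actual implementation must use the same hashing
    // algorithm as the metadata table's key-to-file-group assignment.
    return Math.floorMod(key.hashCode(), numFileGroups);
  }
}
```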
…#7476) This change switches the default Write Executor to SIMPLE, i.e. one that bypasses reliance on any kind of queue (either BoundedInMemory or Disruptor's). This should considerably trim down runtime (compared to BIMQ) and wasted compute (compared to BIMQ and Disruptor), since it eliminates the unnecessary intermediary "staging" of records in a queue (for example, in Spark such in-memory enqueueing already occurs at the ingress points, i.e. shuffling) and allows record writing to be handled in one pass (even avoiding making copies of the records in the future); a sketch of the idea follows below.
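A hedged, illustrative-only sketch of the "simple" approach; this class is not the actual Hudi executor, it just shows that the simple variant consumes the input iterator directly with no intermediate buffering.

```java
import java.util.Iterator;
import java.util.function.Consumer;

// Sketch of a "simple" executor: records flow straight from the (already shuffled)
// input iterator into the write handler, with no BoundedInMemory/Disruptor queue
// and no extra copies staged in between.
public final class SimpleExecutorSketch<T> {
  public void execute(Iterator<T> records, Consumer<T> writeHandler) {
    while (records.hasNext()) {
      writeHandler.accept(records.next());
    }
  }
}
```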
Fixing flaky Parquet projection tests. Added a 10% margin for the expected bytes from column projection.
Change logging mode names for the CDC feature to:
- op_key_only
- data_before
- data_before_after
…er-bundle` to root pom (apache#7774)" (apache#7782) This reverts commit 7352661.
…che#7759) Updates the HoodieAvroRecordMerger to use the new precombine API instead of the deprecated one. This fixes issues with backwards compatibility with certain payloads.
We introduced a new way to scan log blocks in LogRecordReader and named its config "hoodie.log.record.reader.use.scanV2". This renames the config to something more elegant: "hoodie.optimized.log.blocks.scan.enable", and fixes the corresponding Metadata config as well.
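A small, hedged example of enabling the renamed flag through writer properties; the surrounding wiring is illustrative, and only the config keys quoted above come from the change itself.

```java
import java.util.Properties;

public class LogScanConfigExample {
  public static Properties optimizedLogScanProps() {
    Properties props = new Properties();
    // New name, replacing the old "hoodie.log.record.reader.use.scanV2" flag.
    props.setProperty("hoodie.optimized.log.blocks.scan.enable", "true");
    return props;
  }
}
```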
Fix tests and artifact deployment for metaserver.
…7784) Fixes deploy_staging_jars.sh to generate all hudi-utilities-slim-bundle.
Co-authored-by: hbg <[email protected]>
Cleaning up some of the recently introduced configs:
- Shortening the file-listing mode override for Spark's FileIndex
- Fixing Disruptor's write buffer limit config
- Scoping the CANONICALIZE_NULLABLE config to HoodieSparkSqlWriter
…ache#7790) - Ensures that Hudi CLI commands which require launching Spark can be executed with hudi-cli-bundle
…rs (apache#8558) Each writer updates the checkpoint in the commit metadata with its own batchId info only. When checking whether to skip the current batch, we walk back in the timeline and find the current writer's last committed batchId (see the sketch below). Also fixed the bulk insert row writer path for checkpoint management with streaming writes.
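A hedged sketch of the walk-back described above; the types and method names are illustrative, not Hudi's actual timeline API. The point is that each writer reads only the checkpoint entry keyed by its own identifier.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

public class LastCheckpointLookup {

  // commitMetadataNewestFirst: extra-metadata maps of completed commits, newest first.
  // Returns the most recent batchId recorded by this particular writer, if any.
  public Optional<String> lastCommittedBatchId(
      List<Map<String, String>> commitMetadataNewestFirst, String writerCheckpointKey) {
    for (Map<String, String> extraMetadata : commitMetadataNewestFirst) {
      String batchId = extraMetadata.get(writerCheckpointKey);
      if (batchId != null) {
        return Optional.of(batchId);
      }
    }
    return Optional.empty();
  }
}
```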
…en archive timeline (apache#8443) Co-authored-by: hbg <[email protected]>
…he#8595) Passes the Hadoop config options from the per-job configuration to the FileIndex correctly.
…stractTableFileSystemView (apache#8585)
The test TestIncrementalReadWithFullTableScan#testFailEarlyForIncrViewQueryForNonExistingFiles can fail due to changes in archival behavior because of hard-coded parameters. This commit improves the test to be more robust.
This commit removes `scala-maven-plugin` from `hudi-client-common` and `hudi-flink-client` which don't have any scala files yet have scala-maven-plugin specified in their pom files. Without the fix it would fail when building `hudi-trino-bundle` with JDK 17. Co-authored-by: Shawn Chang <[email protected]>
This commit fixes a bug introduced by apache#6847. apache#6847 extends the InProcessLockProvider to support multiple tables in the same process, by having an in-memory static final map storing the mapping of the table base path to the read-write reentrant lock, so that the writer uses the corresponding lock based on the base path. When closing the lock provider, close() removes the lock entry. Since close() is called when closing the write client, the lock is removed and subsequent concurrent writers will get a different lock instance on the same table, causing the locking mechanism on the same table to be useless. The fix gets rid of the lock removal operation in the `close()` call since it has to be kept for concurrent writers. A new test `TestInProcessLockProvider#testLockIdentity` based on the above scenario is added to guard the behavior.
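A minimal sketch of the table-scoped locking described in the commit above, and of why close() must not evict the map entry: the static map is shared by every writer in the process, so removing the entry would hand a later writer a different lock instance for the same table. Class and method names are illustrative, not the actual InProcessLockProvider code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class PerTableLockRegistrySketch {
  // One read-write lock per table base path, shared across all writers in the JVM.
  private static final ConcurrentHashMap<String, ReentrantReadWriteLock> LOCKS =
      new ConcurrentHashMap<>();

  public static ReentrantReadWriteLock lockFor(String basePath) {
    // Every writer on the same base path must resolve to the same lock instance.
    return LOCKS.computeIfAbsent(basePath, p -> new ReentrantReadWriteLock());
  }

  public void close() {
    // Intentionally does NOT remove the entry: other concurrent writers on the same
    // table may still be holding, or waiting on, this very lock instance.
  }
}
```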
… Glue Sync (apache#8388)
- Avoid loading the archived timeline during Hive and Glue sync.
- Add a fallback mechanism in the Hive and Glue catalog sync so that if the last synced commit time falls before the start of the Hudi table's active timeline, the sync gets all partition paths on storage and resolves the difference against what's in the metastore, instead of reading the archived timeline (see the sketch below).
- Enhance the tests to cover the new logic.
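A hedged sketch of that fallback decision and diff, under the assumption that Hudi instant times are lexicographically ordered timestamp strings; the method and parameter names are illustrative, not the actual sync-tool API.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CatalogSyncFallbackSketch {

  // True when the last synced instant has already been archived away, i.e. it is
  // older than the first instant still present in the active timeline.
  public boolean shouldFallBackToFullListing(String lastCommitTimeSynced, String firstActiveInstant) {
    return lastCommitTimeSynced.compareTo(firstActiveInstant) < 0;
  }

  // Fallback path: list partitions on storage and diff against the metastore,
  // instead of replaying the archived timeline.
  public Set<String> partitionsMissingInMetastore(List<String> partitionsOnStorage,
                                                  List<String> partitionsInMetastore) {
    Set<String> missing = new HashSet<>(partitionsOnStorage);
    missing.removeAll(partitionsInMetastore);
    return missing;
  }
}
```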
Fix typos and format text-blocks properly.
…g in duplicate data (apache#8503)
…out ACTION_STATE field (apache#8607)
apache#8631)
* Use correct zone id while calculating earliestTimeToRetain
* Use metaClient table config
…ition field (apache#7355) * Partition query in hive3 returns null for Hive 3.x.
* Disable vectorized reader for Spark 3.3.2 only
* Keep compile version to be Spark 3.3.1

Co-authored-by: Rahil Chertara <[email protected]>
This commit adds the bundle validation on Spark 3.3.2 in GitHub Java CI to ensure compatibility after we fixed the compatibility issue in apache#8082.
…E_UPSERT is disabled (apache#7998)
There was a bug where delete records were assumed to be marked by "_hoodie_is_deleted"; however, custom CDC payloads use an "op" field to mark deletes, so the AWS DMS payload and the Debezium payload failed on deletes. This commit fixes the issue by adding a new API, isDeleteRecord(GenericRecord genericRecord), in BaseAvroPayload to allow a payload to implement custom logic indicating whether a record is a delete record. Co-authored-by: Raymond Xu <[email protected]>
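A hedged example of how a payload might use the new extension point: an "op"-based payload overrides isDeleteRecord instead of relying on "_hoodie_is_deleted". The class below is illustrative only (the extends clause and constructor plumbing of BaseAvroPayload are omitted), not the shipped AWS DMS or Debezium payload code.

```java
import org.apache.avro.generic.GenericRecord;

public class OpFieldDeleteAwarePayload /* extends BaseAvroPayload */ {

  // Treat a record as a delete when its CDC "op" field flags it as such
  // (Debezium emits "d", AWS DMS emits "D"), rather than checking "_hoodie_is_deleted".
  protected boolean isDeleteRecord(GenericRecord record) {
    Object op = record.get("op");
    return op != null && "d".equalsIgnoreCase(op.toString());
  }
}
```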
Co-authored-by: hbg <[email protected]>
…ype as not nullable (apache#8728)
Closing this as this is testing 0.13.1 RC1 only.
### Change Logs

As above

### Impact

Testing only

### Risk level

none

### Documentation Update

N/A

### Contributor's checklist