-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-5739] Review Hudi Config Defaults #7912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
vinothchandar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pass 1. Till HoodieWriteConfig.
| public static final String DYNAMODB_BASED_LOCK_PROPERTY_PREFIX = LOCK_PREFIX + "dynamodb."; | ||
|
|
||
| // Ethan: Can we provide a default value here that is less likely to collide, | ||
| // So this config also becomes optional? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the lock provider should do use sth like hudi_locks as the default table name? then the user can override if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5782
| .withDocumentation("For DynamoDB based lock provider, the region used in endpoint for Amazon DynamoDB service." | ||
| + " Would try to first get it from AWS_REGION environment variable. If not find, by default use us-east-1"); | ||
|
|
||
| // Ethan: Need to confirm these are good defaults for cost |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sg
| .sinceVersion("0.10.0") | ||
| .withDocumentation("For DynamoDB based lock provider, write capacity units when using PROVISIONED billing mode"); | ||
|
|
||
| // Ethan: 10 min. Is this timeout too long? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shorten to 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5782
| .sinceVersion("0.10.0") | ||
| .withDocumentation("For DynamoDB based lock provider, the maximum number of milliseconds to wait for creating DynamoDB table"); | ||
|
|
||
| // Ethan: Move the infer logic here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for sane default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5892
| public class HoodieArchivalConfig extends HoodieConfig { | ||
| //cfg | ||
|
|
||
| // Ethan: Should we remove this? Archive should always be enabled? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this become an unsafe/advanced config instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5893
| .withDocumentation("When enabled, data validation checks are performed during merges to ensure expected " | ||
| + "number of records after merge operation."); | ||
|
|
||
| // Ethan: should "hoodie.combine.before.insert" controls the behavior this one intends to control as well? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logically makes sense. but I would check the code for side effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll go through the code. @nsivabalan do you have any concern around this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5892 for now
| .withDocumentation("When enabled, we allow duplicate keys even if inserts are routed to merge with an existing file (for ensuring file sizing)." | ||
| + " This is only relevant for insert operation, since upsert, delete operations will ensure unique key constraints are maintained."); | ||
|
|
||
| // Ethan: should we make the default 0, to reduce the write amplication and make MOR write fast? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave this alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sg
| * Given the importance of supporting such cases for the user's migration to 0.5.x, we are proposing a safety flag | ||
| * (disabled by default) which will allow this old behavior. | ||
| */ | ||
| // Ethan: is this still needed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is used by uber. we should confirm with them.
| .withDocumentation("Whether to allow generation of empty commits, even if no data was written in the commit. " | ||
| + "It's useful in cases where extra metadata needs to be published regardless e.g tracking source offsets when ingesting data"); | ||
|
|
||
| // Ethan: maybe this can auto turned on for CDC? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idk if we add this for CDC. do we?
| .sinceVersion("0.11.0") | ||
| .withDocumentation("Control to enable release all persist rdds when the spark job finish."); | ||
|
|
||
| // Ethan: shall we remove this and do auto adjustment everywhere? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should to the extent that its reliable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsivabalan let's discuss this too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5782
yihua
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a few JIRA tasks to scope the work to be done and parallelized.
| public static final String DYNAMODB_BASED_LOCK_PROPERTY_PREFIX = LOCK_PREFIX + "dynamodb."; | ||
|
|
||
| // Ethan: Can we provide a default value here that is less likely to collide, | ||
| // So this config also becomes optional? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5782
| .sinceVersion("0.10.0") | ||
| .withDocumentation("For DynamoDB based lock provider, write capacity units when using PROVISIONED billing mode"); | ||
|
|
||
| // Ethan: 10 min. Is this timeout too long? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5782
| .sinceVersion("0.10.0") | ||
| .withDocumentation("For DynamoDB based lock provider, the maximum number of milliseconds to wait for creating DynamoDB table"); | ||
|
|
||
| // Ethan: Move the infer logic here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5892
| public class HoodieAWSConfig extends HoodieConfig { | ||
| // AWS Access Configs | ||
|
|
||
| // Ethan: These configs should only be necessary if no environmental variables are not set (to confirm with code) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5892
| public class HoodieArchivalConfig extends HoodieConfig { | ||
| //cfg | ||
|
|
||
| // Ethan: Should we remove this? Archive should always be enabled? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5893
| .sinceVersion("0.11.0") | ||
| .withDocumentation("Control to enable release all persist rdds when the spark job finish."); | ||
|
|
||
| // Ethan: shall we remove this and do auto adjustment everywhere? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nsivabalan let's discuss this too
| .sinceVersion("0.11.0") | ||
| .withDocumentation("Control to enable release all persist rdds when the spark job finish."); | ||
|
|
||
| // Ethan: shall we remove this and do auto adjustment everywhere? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5782
| .sinceVersion("0.11.0") | ||
| .withDocumentation("Auto adjust lock configurations when metadata table is enabled and for async table services."); | ||
|
|
||
| // Ethan: should we remove this and make the validation mandatory? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5782
| + "This table maintains the metadata about a given Hudi table (e.g file listings) " | ||
| + " to avoid overhead of accessing cloud storage, during queries.") | ||
| public final class HoodieMetadataConfig extends HoodieConfig { | ||
| //cfg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tacked in HUDI-5900
| .noDefaultValue() | ||
| .withDocumentation("Version of timeline used, by the table."); | ||
|
|
||
| // Ethan: to be merged with write payload class |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked in HUDI-5892
| .defaultValue("gzip") | ||
| .withDocumentation("Compression Codec for parquet files"); | ||
|
|
||
| // Ethan: Need to annotate the supported Hudi versions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"hoodie.parquet.compression.codec" default to "snappy"?
|
Closing this as we're done with the review of important configs. |
Change Logs
Describe context and summary for this change. Highlight if any code was copied.
Impact
Describe any public API or user-facing feature change or any performance impact.
Risk level (write none, low medium or high below)
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist