Skip to content

Conversation

@jonvex
Copy link
Contributor

@jonvex jonvex commented Feb 9, 2023

Change Logs

Describe context and summary for this change. Highlight if any code was copied.

Impact

Describe any public API or user-facing feature change or any performance impact.

Risk level (write none, low medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua yihua self-assigned this Feb 9, 2023
@yihua yihua changed the title marked config classes Review Hudi Config Defaults Feb 10, 2023
@yihua yihua changed the title Review Hudi Config Defaults [HUDI-5739] Review Hudi Config Defaults Feb 14, 2023
@yihua yihua force-pushed the mark_all_configs branch from a7c9a35 to a5e96b5 Compare March 1, 2023 18:47
Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pass 1. Till HoodieWriteConfig.

public static final String DYNAMODB_BASED_LOCK_PROPERTY_PREFIX = LOCK_PREFIX + "dynamodb.";

// Ethan: Can we provide a default value here that is less likely to collide,
// So this config also becomes optional?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the lock provider should do use sth like hudi_locks as the default table name? then the user can override if needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5782

.withDocumentation("For DynamoDB based lock provider, the region used in endpoint for Amazon DynamoDB service."
+ " Would try to first get it from AWS_REGION environment variable. If not find, by default use us-east-1");

// Ethan: Need to confirm these are good defaults for cost
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sg

.sinceVersion("0.10.0")
.withDocumentation("For DynamoDB based lock provider, write capacity units when using PROVISIONED billing mode");

// Ethan: 10 min. Is this timeout too long?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shorten to 2?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5782

.sinceVersion("0.10.0")
.withDocumentation("For DynamoDB based lock provider, the maximum number of milliseconds to wait for creating DynamoDB table");

// Ethan: Move the infer logic here
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for sane default

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5892

public class HoodieArchivalConfig extends HoodieConfig {
//cfg

// Ethan: Should we remove this? Archive should always be enabled?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this become an unsafe/advanced config instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5893

.withDocumentation("When enabled, data validation checks are performed during merges to ensure expected "
+ "number of records after merge operation.");

// Ethan: should "hoodie.combine.before.insert" controls the behavior this one intends to control as well?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logically makes sense. but I would check the code for side effects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go through the code. @nsivabalan do you have any concern around this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5892 for now

.withDocumentation("When enabled, we allow duplicate keys even if inserts are routed to merge with an existing file (for ensuring file sizing)."
+ " This is only relevant for insert operation, since upsert, delete operations will ensure unique key constraints are maintained.");

// Ethan: should we make the default 0, to reduce the write amplication and make MOR write fast?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave this alone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sg

* Given the importance of supporting such cases for the user's migration to 0.5.x, we are proposing a safety flag
* (disabled by default) which will allow this old behavior.
*/
// Ethan: is this still needed?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is used by uber. we should confirm with them.

.withDocumentation("Whether to allow generation of empty commits, even if no data was written in the commit. "
+ "It's useful in cases where extra metadata needs to be published regardless e.g tracking source offsets when ingesting data");

// Ethan: maybe this can auto turned on for CDC?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idk if we add this for CDC. do we?

.sinceVersion("0.11.0")
.withDocumentation("Control to enable release all persist rdds when the spark job finish.");

// Ethan: shall we remove this and do auto adjustment everywhere?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should to the extent that its reliable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsivabalan let's discuss this too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5782

@hudi-bot
Copy link
Collaborator

hudi-bot commented Mar 2, 2023

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created a few JIRA tasks to scope the work to be done and parallelized.

public static final String DYNAMODB_BASED_LOCK_PROPERTY_PREFIX = LOCK_PREFIX + "dynamodb.";

// Ethan: Can we provide a default value here that is less likely to collide,
// So this config also becomes optional?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5782

.sinceVersion("0.10.0")
.withDocumentation("For DynamoDB based lock provider, write capacity units when using PROVISIONED billing mode");

// Ethan: 10 min. Is this timeout too long?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5782

.sinceVersion("0.10.0")
.withDocumentation("For DynamoDB based lock provider, the maximum number of milliseconds to wait for creating DynamoDB table");

// Ethan: Move the infer logic here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5892

public class HoodieAWSConfig extends HoodieConfig {
// AWS Access Configs

// Ethan: These configs should only be necessary if no environmental variables are not set (to confirm with code)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5892

public class HoodieArchivalConfig extends HoodieConfig {
//cfg

// Ethan: Should we remove this? Archive should always be enabled?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5893

.sinceVersion("0.11.0")
.withDocumentation("Control to enable release all persist rdds when the spark job finish.");

// Ethan: shall we remove this and do auto adjustment everywhere?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nsivabalan let's discuss this too

.sinceVersion("0.11.0")
.withDocumentation("Control to enable release all persist rdds when the spark job finish.");

// Ethan: shall we remove this and do auto adjustment everywhere?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5782

.sinceVersion("0.11.0")
.withDocumentation("Auto adjust lock configurations when metadata table is enabled and for async table services.");

// Ethan: should we remove this and make the validation mandatory?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5782

+ "This table maintains the metadata about a given Hudi table (e.g file listings) "
+ " to avoid overhead of accessing cloud storage, during queries.")
public final class HoodieMetadataConfig extends HoodieConfig {
//cfg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tacked in HUDI-5900

.noDefaultValue()
.withDocumentation("Version of timeline used, by the table.");

// Ethan: to be merged with write payload class
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked in HUDI-5892

.defaultValue("gzip")
.withDocumentation("Compression Codec for parquet files");

// Ethan: Need to annotate the supported Hudi versions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"hoodie.parquet.compression.codec" default to "snappy"?

@yihua
Copy link
Contributor

yihua commented Apr 14, 2023

Closing this as we're done with the review of important configs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

5 participants