
Conversation

@amrishlal (Contributor)

Change Logs

Add a user-defined feature flag for disabling prepped merge.

Impact

New feature flag ENABLE_OPTIMIZED_MERGE_WRITES
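
For reference, the flag ends up exposed (per the discussion below) as the session config hoodie.spark.sql.optimized.writes.enable, defaulting to true. A minimal usage sketch for disabling the prepped merge path:

// Disable the optimized (prepped) write path for this Spark session.
spark.sql("set hoodie.spark.sql.optimized.writes.enable=false")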

Risk level (write none, low, medium, or high below)

Low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here, and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan (Contributor) left a comment

Let's also parameterize some tests for MIT as well.

@jonvex (Contributor) commented Jul 17, 2023

We can set ENABLE_PREPPED_MERGE_WRITES in MergeIntoHoodieTableCommand using hoodie.spark.sql.writes.optimized.enable. That way we don't need to introduce another public config.
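
A rough sketch of that approach (illustrative only; baseWriteOpts and the surrounding plumbing are hypothetical, not the merged code):

// Inside MergeIntoHoodieTableCommand: derive the internal prepped-merge flag
// from the existing public config instead of introducing a new one.
val optimizedWritesEnabled = sparkSession.conf
  .get("hoodie.spark.sql.writes.optimized.enable", "true").toBoolean
// Propagate it as an internal write option only; nothing new is user-facing.
val writeOpts = baseWriteOpts + (SQL_MERGE_INTO_WRITES.key -> optimizedWritesEnabled.toString)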

@amrishlal (Contributor, Author)

> Let's also parameterize some tests for MIT as well.

Parameterized the test cases to run with hoodie.spark.sql.optimized.merge.enable set to both true and false.

> We can set ENABLE_PREPPED_MERGE_WRITES in MergeIntoHoodieTableCommand using hoodie.spark.sql.writes.optimized.enable. That way we don't need to introduce another public config.

Moved SQL_MERGE_INTO_WRITES back to HoodieSparkSqlWriter.scala as it was originally, and created ENABLE_OPTIMIZED_SQL_MERGE_WRITES in DataSourceOptions.scala to serve as the public config.
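
For illustration, the parameterization might look like this (a sketch; withTempDir mirrors the test helpers quoted later in this thread):

Seq("true", "false").foreach { optimizedSqlEnabled =>
  test(s"Test merge into with optimized sql merge enable = $optimizedSqlEnabled") {
    withTempDir { tmp =>
      // Run the same MERGE INTO assertions under both settings of the flag.
      spark.sql(s"set hoodie.spark.sql.optimized.merge.enable=$optimizedSqlEnabled")
      // ... MERGE INTO statements and assertions ...
    }
  }
}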

@nsivabalan (Contributor)

@hudi-bot run azure

@codope changed the title from "[HUDI-6315] [WIP] Feature flag for disabling prepped merge." to "[HUDI-6315] Feature flag for disabling prepped merge." on Jul 18, 2023
@amrishlal (Contributor, Author)

@hudi-bot run azure

@nsivabalan (Contributor) left a comment

I realized we have two different configs across MIT and UPDATE/DELETEs. Not a good user experience, IMO; let's unify them.

So, we will have one config, "hoodie.spark.sql.optimized.writes.enable", in HoodieWriteConfig for users to enable or disable the optimized flow. The default value is true.

But internally, we can use two different configs: one for MIT (_hoodie.datasource.merge.into.prepped) and one for UPDATE and DELETE (_hoodie.datasource.writes.prepped).
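
A minimal sketch of that shape (illustrative only; it follows Hudi's ConfigProperty builder pattern but is not the merged code):

// One public switch for users, default true.
val SPARK_SQL_OPTIMIZED_WRITES: ConfigProperty[String] = ConfigProperty
  .key("hoodie.spark.sql.optimized.writes.enable")
  .defaultValue("true")
  .withDocumentation("Controls whether Spark SQL prepped (optimized) writes are enabled.")

// Two internal, undocumented keys so MIT and UPDATE/DELETE can still be told apart.
val MERGE_INTO_PREPPED_KEY = "_hoodie.datasource.merge.into.prepped"
val WRITES_PREPPED_KEY = "_hoodie.datasource.writes.prepped"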

.withDocumentation("Controls whether spark sql optimized update is enabled.")
.withDocumentation("Controls whether spark sql prepped update and delete is enabled.")

val ENABLE_OPTIMIZED_SQL_MERGE_WRITES: ConfigProperty[String] = ConfigProperty
Contributor:

I am also thinking, from a user standpoint, that we should have just one config to enable or disable the optimized flow (irrespective of whether it's MIT or UPDATE/DELETE), but internally we can use different configs if we wish to differentiate MIT from the rest.

Contributor (Author):

Fixed.

+ "The class must be a subclass of `org.apache.hudi.callback.HoodieClientInitCallback`."
+ "By default, no Hudi client init callback is executed.");

public static final String WRITE_PREPPED_MERGE_KEY = "_hoodie.datasource.merge.into.prepped";
Contributor:

Can we add Javadocs calling out the purpose of this?

Contributor:

Sorry, let's name this _hoodie.spark.sql.merge.into.prepped

Contributor:

Let's name the variable SPARK_SQL_MERGE_INTO_PREPPED

Contributor (Author):

Fixed.

val df = if (optParams.getOrDefault(DATASOURCE_WRITE_PREPPED_KEY,
optParams.getOrDefault(SQL_MERGE_INTO_WRITES.key(), SQL_MERGE_INTO_WRITES.defaultValue().toString))
.equalsIgnoreCase("true")) {
val df = if (optParams.getOrDefault(DATASOURCE_WRITE_PREPPED_KEY, "false").toBoolean || optParams.getOrDefault(WRITE_PREPPED_MERGE_KEY, "false").toBoolean) {
Contributor:

Let's also rename the variable DATASOURCE_WRITE_PREPPED_KEY to SPARK_SQL_WRITE_PREPPED_KEY

Contributor:

And the config key should be _hoodie.spark.sql.writes.prepped

Contributor (Author):

Fixed.

test("Test pkless complex merge cond") {
withRecordType()(withTempDir { tmp =>
spark.sql("set hoodie.payload.combined.schema.validate = true")
spark.sql("set hoodie.spark.sql.optimized.merge.enable=true")
Contributor:

Don't we need to fix all of these?

Contributor (Author):

Fixed.

@amrishlal (Contributor, Author)

@hudi-bot run azure

public static HoodieIndex createIndex(HoodieWriteConfig config) {
boolean mergeIntoWrites = config.getProps().getBoolean(HoodieInternalConfig.SQL_MERGE_INTO_WRITES.key(),
HoodieInternalConfig.SQL_MERGE_INTO_WRITES.defaultValue());
boolean mergeIntoWrites = config.getProps().getBoolean(HoodieWriteConfig.SPARK_SQL_MERGE_INTO_PREPPED_KEY, false);
Contributor:

Let's align the variable name with the config: boolean sqlMergeIntoPrepped

Contributor (Author):

Replaced all instances of mergeIntoWrites with sqlMergeIntoPrepped (except for the original reference in deduceWriterSchema).

* Config key with boolean value that indicates whether record being written is already prepped.
*/
val DATASOURCE_WRITE_PREPPED_KEY = "_hoodie.datasource.write.prepped";
val SPARK_SQL_WRITE_PREPPED_KEY = "_hoodie.spark.sql.writes.prepped";
Contributor:

Minor: SPARK_SQL_WRITES_PREPPED_KEY

Contributor (Author):

Fixed.

val mergeIntoWrites = parameters.getOrDefault(SQL_MERGE_INTO_WRITES.key(),
SQL_MERGE_INTO_WRITES.defaultValue.toString).toBoolean
val isPrepped = hoodieConfig.getBooleanOrDefault(SPARK_SQL_WRITE_PREPPED_KEY, false)
val mergeIntoWrites = parameters.getOrDefault(SPARK_SQL_MERGE_INTO_PREPPED_KEY, "false").toBoolean
Contributor:

Let's fix the variable name.

Contributor (Author):

Fixed.

hoodieConfig.setDefaultValue(DROP_PARTITION_COLUMNS)
hoodieConfig.setDefaultValue(KEYGENERATOR_CONSISTENT_LOGICAL_TIMESTAMP_ENABLED)
hoodieConfig.setDefaultValue(ENABLE_OPTIMIZED_SQL_WRITES)
hoodieConfig.setDefaultValue(SPARK_SQL_OPTIMIZED_WRITES)
Contributor:

Let's check if this is really required.

Contributor (Author):

Removed. From what I can see, it is not really needed.

.option(DataSourceWriteOptions.ASYNC_CLUSTERING_ENABLE().key(), "true")
.option(DataSourceWriteOptions.ENABLE_OPTIMIZED_SQL_WRITES().key(), "true")
.option(DataSourceWriteOptions.SPARK_SQL_OPTIMIZED_WRITES().key(), "true")
Contributor:

For defaults, let's try to use the actual default instead of hard-coding it: DataSourceWriteOptions.SPARK_SQL_OPTIMIZED_WRITES().default()
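
That is, something along these lines (a sketch; the exact accessor on the option is assumed to be defaultValue()):

.option(DataSourceWriteOptions.SPARK_SQL_OPTIMIZED_WRITES().key(),
  DataSourceWriteOptions.SPARK_SQL_OPTIMIZED_WRITES().defaultValue())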

Contributor (Author):

Fixed.


// test with optimized sql writes enabled / disabled.
spark.sql(s"set hoodie.spark.sql.writes.optimized.enable=$optimizedSqlEnabled")
spark.sql(s"set hoodie.spark.sql.optimized.writes.enable=$optimizedSqlEnabled")
Contributor:

Let's see if we can use the variable to avoid any missteps.
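
For example (a sketch, assuming the config constant is in scope in the test):

// Build the SET statement from the config constant rather than a hard-coded key string.
spark.sql(s"set ${DataSourceWriteOptions.SPARK_SQL_OPTIMIZED_WRITES.key}=$optimizedSqlEnabled")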

Contributor (Author):

Fixed.

@amrishlal (Contributor, Author)

@hudi-bot run azure

@hudi-bot (Collaborator)

CI report:

Bot commands supported by @hudi-bot:
  • @hudi-bot run azure: re-run the last Azure build

@nsivabalan (Contributor)

CI failed due to known flaky tests. Going ahead with merging.

@nsivabalan merged commit f2fdf8a into apache:master on Jul 20, 2023
@amrishlal deleted the prepped-mit-feature-flag branch on August 7, 2023 19:41