@slfan1989
Contributor

@slfan1989 slfan1989 commented Apr 17, 2023

Change Logs

JIRA: HUDI-6086
The code in HiveSchemaUtil#generateCreateDDL is hard to read. With Danny's help, we decided to refactor it using StringBuilder. I have also added detailed comments to each section of the code to make it more readable.

  • Before Refactor:

image

  • After Refactor:

image

Impact

none.

Risk level (write none, low, medium or high below)

none.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default values of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@slfan1989
Contributor Author

@danny0405 Can you help review this pr? Thank you very much!

@danny0405 danny0405 changed the title from "[HUDI-6086]. Improve HiveSchemaUtil#generateCreateDDL With ST." to "[HUDI-6086] Improve HiveSchemaUtil#generateCreateDDL With ST" on Apr 18, 2023
+ "<" + ROW_FORMAT + ">\n"
+ "<" + LOCATION_BLOCK + ">"
+ "TBLPROPERTIES (\n"
+ "<" + PROPERTIES + ">)";
Contributor

Two concerns here:

  1. Where does the antlr jar come from? Do we need to package the antlr jar explicitly?
  2. The antlr grammar template is not that straightforward. Would a refactoring that uses a StringBuilder with params be enough for now?

Contributor Author

@slfan1989 slfan1989 Apr 19, 2023

@danny0405 Thank you very much for helping review the code!

  1. Antlr was introduced into the Hudi project in [HUDI-4111] Bump ANTLR runtime version in Spark 3.x (#5606). The antlr reference can be found in the parent pom.xml:
<antlr.version>4.8</antlr.version>

hudi-hive-sync references hive-exec, so we can use antlr directly.

  2. Thank you for your question! I agree that the antlr grammar template is not that straightforward. For generating SQL, however, I think we can use it, because a template describes the components of the SQL statement more clearly.

HiveSchemaUtil#CREATE_TABLE_TEMPLATE

private static final String CREATE_TABLE_TEMPLATE =
      "CREATE <" + EXTERNAL + ">TABLE <if(" + DATABASE_NAME + ")>`<" + DATABASE_NAME + ">`.<endif>"
      + "`<" + TABLE_NAME + ">`(\n"
      + "<" + LIST_COLUMNS + ">)\n"
      + "<" + PARTITIONS + ">\n"
      + "<" + BUCKETS + ">\n"
      + "<" + ROW_FORMAT + ">\n"
      + "<" + LOCATION_BLOCK + ">"
      + "TBLPROPERTIES (\n"
      + "<" + PROPERTIES + ">)";

The template shows at a glance which elements the generated SQL contains, such as the table name, the database name, and the various attributes of the table. That makes this part of the code easier to read.

We will not use very complicated templates; the goal is only to improve code readability.

Contributor

Not a fan of antlr: it introduces unnecessary complexity and a dependency on the antlr jar. How about we refactor the code using just a string builder? Each sub-clause can be built separately.
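
To make the "each sub-clause built separately" idea concrete, here is a minimal sketch. It is not code from this PR: the class `SubClauseSketch` and the helpers `partitionedBy`, `location`, and `joinClauses` are hypothetical names chosen for illustration. Each optional clause gets its own small helper that returns an empty string when the clause does not apply, and the caller joins whatever is left.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: one helper per optional DDL sub-clause.
final class SubClauseSketch {

  // Builds "PARTITIONED BY (`col` type, ...)" or "" when there are no partition columns.
  static String partitionedBy(Map<String, String> partitionColumns) {
    if (partitionColumns.isEmpty()) {
      return "";
    }
    return partitionColumns.entrySet().stream()
        .map(e -> "`" + e.getKey() + "` " + e.getValue())
        .collect(Collectors.joining(", ", "PARTITIONED BY (", ")"));
  }

  // Builds "LOCATION '<path>'" or "" when no path is given.
  static String location(String path) {
    return path == null ? "" : "LOCATION '" + path + "'";
  }

  // Joins the non-empty clauses with single spaces, preserving their order.
  static String joinClauses(String... clauses) {
    return Arrays.stream(clauses)
        .filter(clause -> !clause.isEmpty())
        .collect(Collectors.joining(" "));
  }
}
```

With helpers like these, a missing clause never leaves stray whitespace or keywords behind, because empty results are filtered out before joining.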

Contributor Author

I agree with you. I will refactor this part of the code with a string builder.

Contributor Author

@danny0405 Thank you very much for helping to review the code and providing suggestions for improvement! I have refactored this part of the code using StringBuilder, as we discussed, to improve its readability. I also added some comments to provide additional context and clarification. Sorry for the delayed response.

When you have time, could you please review this PR again? Thank you very much!

@danny0405 danny0405 self-assigned this Apr 18, 2023
@danny0405 danny0405 added the type:refactor (Code refactoring and cleanup) and priority:medium (Moderate impact; usability gaps) labels on Apr 18, 2023
@slfan1989 slfan1989 requested a review from danny0405 April 27, 2023 02:41
@slfan1989 slfan1989 changed the title from "[HUDI-6086] Improve HiveSchemaUtil#generateCreateDDL With ST" to "[HUDI-6086] Improve HiveSchemaUtil#generateCreateDDL With StringBuilder" on Apr 27, 2023
@slfan1989
Contributor Author

slfan1989 commented Apr 28, 2023

@hudi-bot run azure

@danny0405
Contributor

Thanks for the contribution, I have reviewed and created a patch:
6086.patch.zip

I saw that the positions of some sub-clauses have changed, such as LOCATION and CLUSTERED BY. Is that expected?

* @param serdeClass serdeClass.
* @param serdeProperties serdeProperties.
* @param tableProperties tableProperties.
*/
Contributor

Maybe just remove the param docs if there is not much to say:

  /**
   * Create a table with the given params.
   */

Contributor Author

Thank you for your suggestion! I will simplify the comment.

@slfan1989
Contributor Author

@danny0405

Thank you very much for your reminder! I carefully read the code again and modified it.

The concatenation order is consistent between the original and modified versions, both of which consist of 7 steps.

  1. Append Create Table
  2. Append Columns
  3. Append Partitions
  4. Append Bucket
  5. Append Row Format
  6. Append Location
  7. Append TblProperties
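
For readers without access to the screenshots, the seven steps can be sketched roughly as follows. This is a hypothetical, simplified illustration and not the actual `HiveSchemaUtil#generateCreateDDL` code: the class name, method signature, parameters, and exact whitespace are all assumptions.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of a seven-step CREATE TABLE builder; not the Hudi implementation.
final class CreateTableDdlSketch {

  private static String tick(String identifier) {
    return "`" + identifier + "`";
  }

  // Renders "`name` type, `name` type, ..." in map iteration order.
  private static String columnList(Map<String, String> columns) {
    StringJoiner joiner = new StringJoiner(", ");
    columns.forEach((name, type) -> joiner.add(tick(name) + " " + type));
    return joiner.toString();
  }

  // Renders "'key'='value', ..." in map iteration order.
  private static String propertyList(Map<String, String> properties) {
    StringJoiner joiner = new StringJoiner(", ");
    properties.forEach((key, value) -> joiner.add("'" + key + "'='" + value + "'"));
    return joiner.toString();
  }

  static String generateCreateDdl(boolean external, String database, String table,
                                  Map<String, String> columns, Map<String, String> partitions,
                                  String bucketSpec, String serdeClass, String location,
                                  Map<String, String> tableProperties) {
    StringBuilder sb = new StringBuilder();
    // 1. Append Create Table
    sb.append(external ? "CREATE EXTERNAL TABLE IF NOT EXISTS " : "CREATE TABLE IF NOT EXISTS ");
    sb.append(tick(database)).append('.').append(tick(table));
    // 2. Append Columns
    sb.append(" (").append(columnList(columns)).append(")");
    // 3. Append Partitions (optional)
    if (!partitions.isEmpty()) {
      sb.append(" PARTITIONED BY (").append(columnList(partitions)).append(")");
    }
    // 4. Append Bucket (optional, passed through as a ready-made spec)
    if (bucketSpec != null) {
      sb.append(' ').append(bucketSpec);
    }
    // 5. Append Row Format
    sb.append(" ROW FORMAT SERDE '").append(serdeClass).append("'");
    // 6. Append Location
    sb.append(" LOCATION '").append(location).append("'");
    // 7. Append TblProperties (optional)
    if (!tableProperties.isEmpty()) {
      sb.append(" TBLPROPERTIES (").append(propertyList(tableProperties)).append(")");
    }
    return sb.toString();
  }
}
```

Keeping the appends in one method, in a fixed order, is what makes it straightforward to compare the clause order before and after a refactor.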

Original Version.

image

Modified Version.

image

@slfan1989
Contributor Author

I saw that the positions of some sub-clauses have changed, such as LOCATION and CLUSTERED BY. Is that expected?

@danny0405 Thank you for your thorough review! The modification to the LOCATION code should be as expected, but there were some issues with the modification to the BUCKET code. I have reverted the BUCKET code back to its original state.

@danny0405
Contributor

I saw that the positions of some sub-clauses have changed, such as LOCATION and CLUSTERED BY. Is that expected?

@danny0405 Thank you for your thorough review! The modification to the LOCATION code should be as expected, but there were some issues with the modification to the BUCKET code. I have reverted the BUCKET code back to its original state.

We had better have a straightforward string example of what the DDL looks like before and after the change, in case there are discrepancies.

@slfan1989
Contributor Author

I will add a set of unit tests to compare the generated SQL before and after modification.
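
One possible shape for such a before/after comparison is sketched below in plain Java, without a test framework. Both `legacyDdl` and `refactoredDdl` are hypothetical stand-ins for the old and new code paths, reduced to three clauses; they are not the methods in this PR.

```java
// Hypothetical sketch: check that a refactored builder emits exactly the same DDL string.
final class DdlEquivalenceCheck {

  // Stand-in for the pre-refactor code path: plain string concatenation.
  static String legacyDdl(String database, String table, String columns, String location) {
    return "CREATE EXTERNAL TABLE IF NOT EXISTS `" + database + "`.`" + table + "`"
        + " (" + columns + ")"
        + " LOCATION '" + location + "'";
  }

  // Stand-in for the post-refactor code path: the same clauses via StringBuilder.
  static String refactoredDdl(String database, String table, String columns, String location) {
    StringBuilder sb = new StringBuilder();
    sb.append("CREATE EXTERNAL TABLE IF NOT EXISTS ");
    sb.append('`').append(database).append("`.`").append(table).append('`');
    sb.append(" (").append(columns).append(")");
    sb.append(" LOCATION '").append(location).append("'");
    return sb.toString();
  }

  // Throws if the two code paths disagree; returns the common DDL otherwise.
  static String checkSameDdl(String database, String table, String columns, String location) {
    String before = legacyDdl(database, table, columns, location);
    String after = refactoredDdl(database, table, columns, location);
    if (!before.equals(after)) {
      throw new AssertionError("DDL changed:\n before: " + before + "\n after:  " + after);
    }
    return after;
  }

  public static void main(String[] args) {
    System.out.println(checkSameDdl("test_database", "test_table", "`id` bigint", "/tmp/test_table"));
  }
}
```

Comparing whole strings rather than individual fragments also catches changes in clause order, which was the concern raised above.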

String expectedCreateDataBaseSQL = "CREATE DATABASE IF NOT EXISTS `test_database`";
String testDataBase = HiveSchemaUtil.generateCreateDataBaseDDL("test_database");
assertEquals(expectedCreateDataBaseSQL, testDataBase);
}
Contributor

Can we also have a test case for create table?

Contributor Author

Thanks for your suggestion, I will submit the test code as soon as possible.

@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build


/**
* Create a table with the given params.
*/
Contributor

Is the doc correct?


/**
* Create a table with the given params.
*/
Contributor

ditto


public static String generateSchemaString(MessageType storageSchema, List<String> colsToSkip, boolean supportTimestamp) throws IOException {
/**
* Generates the Column DDL string for creating a Hive table from the given schema.
Contributor

I love that the original doc included a SQL statement demo for the method. Can we keep that?
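
As an illustration of what a Javadoc with an embedded SQL demo can look like, here is a small sketch. The original Javadoc text is not shown in this thread, so the wording, the class `ColumnDdlDocSketch`, and the method `columnDdl` are all hypothetical.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch of a method whose Javadoc carries a small SQL demo.
final class ColumnDdlDocSketch {

  /**
   * Generates the column DDL fragment for a CREATE TABLE statement.
   *
   * <p>For the columns {@code id bigint} and {@code name string}, the result is:
   * <pre>{@code
   * `id` bigint, `name` string
   * }</pre>
   *
   * @param columns column names mapped to Hive types, in output order
   * @return the comma-separated, backtick-quoted column list
   */
  static String columnDdl(Map<String, String> columns) {
    StringJoiner joiner = new StringJoiner(", ");
    columns.forEach((name, type) -> joiner.add("`" + name + "` " + type));
    return joiner.toString();
  }
}
```

A demo like this lets a reader see the output format without running the code or reading the body.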

@github-actions github-actions bot added the size:M (PR with lines of changes in (100, 300]) label on Feb 26, 2024