-
Notifications
You must be signed in to change notification settings - Fork 2.5k
[HUDI-6086] Improve HiveSchemaUtil#generateCreateDDL With StringBuilder #8478
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@danny0405 Can you help review this pr? Thank you very much! |
| + "<" + ROW_FORMAT + ">\n" | ||
| + "<" + LOCATION_BLOCK + ">" | ||
| + "TBLPROPERTIES (\n" | ||
| + "<" + PROPERTIES + ">)"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two concerns here:
- Where does the antlr jar comes from? Do we need to package antlr jar explicitly
- the antlr grammar template is not that straight-forward, should refactoring like using the stringBuilder with params be enough now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danny0405 Thank you very much for helping review the code!
- Antlr was introduced into the Hudi project in [HUDI-4111] Bump ANTLR runtime version in Spark 3.x ([HUDI-4111] Bump ANTLR runtime version in Spark 3.x #5606), we can find the reference of antlr in the parent pom.xml
<antlr.version>4.8</antlr.version>
hudi-hive-sync references hive-exec, so we can use antlr directly.
- Thank you for your question! I agree with your point of view, the antlr grammar template is not that straight-forward. But for generating sql, I think we can use it, because the template can better describe the components of sql.
HiveSchemaUtil#CREATE_TABLE_TEMPLATE
private static final String CREATE_TABLE_TEMPLATE =
"CREATE <" + EXTERNAL + ">TABLE <if(" + DATABASE_NAME + ")>`<" + DATABASE_NAME + ">`.<endif>"
+ "`<" + TABLE_NAME + ">`(\n"
+ "<" + LIST_COLUMNS + ">)\n"
+ "<" + PARTITIONS + ">\n"
+ "<" + BUCKETS + ">\n"
+ "<" + ROW_FORMAT + ">\n"
+ "<" + LOCATION_BLOCK + ">"
+ "TBLPROPERTIES (\n"
+ "<" + PROPERTIES + ">)";
Through this template, we can know the elements of the sql to be generated, such as the name of the table, the name of the database, and various attributes of the table. It is easier to read this part of the code.
We will not use very complicated templates, just to improve code readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a fan of antlr, this introduces unnecessary complexity and dependency of the antlr jar, how about we refactor the code using just a string builder, each sub-clause can be built separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. I will refactor this part of code with string builder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danny0405 Thank you very much for helping to review the code and providing suggestions for improvement! I have refactored this part of the code using StringBuilder, as we discussed, to improve its readability. I also added some comments to provide additional context and clarification. Sorry for the delayed response.
When you have time, could you please review this PR again? Thank you very much!
|
@hudi-bot run azure |
|
Thanks for the contribution, I have reviewed and created a patch: I saw some sub-claused locations are changed, like the |
| * @param serdeClass serdeClass. | ||
| * @param serdeProperties serdeProperties. | ||
| * @param tableProperties tableProperties. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just remove the patam docs if there is no much to address:
/**
* Create a table with the given params.
*/There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your suggestion! I will simplify the comment information.
|
Thank you very much for your reminder! I carefully readed the code again and modify the code. The concatenation order is consistent between the original and modified versions, both of which consist of
|
@danny0405 Thank you for your thorough review! The modification to the |
We better have a straight-forward string paradigm of what the DDL looks like before and after the change, in case there are some discrepencies. |
|
I will add a set of unit tests to compare the generated SQL before and after modification. |
| String expectedCreateDataBaseSQL = "CREATE DATABASE IF NOT EXISTS `test_database`"; | ||
| String testDataBase = HiveSchemaUtil.generateCreateDataBaseDDL("test_database"); | ||
| assertEquals(expectedCreateDataBaseSQL, testDataBase); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also have a test case for create table?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion, I will submit the test code as soon as possible.
|
|
||
| /** | ||
| * Create a table with the given params. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the doc correct?
|
|
||
| /** | ||
| * Create a table with the given params. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
|
|
||
| public static String generateSchemaString(MessageType storageSchema, List<String> colsToSkip, boolean supportTimestamp) throws IOException { | ||
| /** | ||
| * Generates the Column DDL string for creating a Hive table from the given schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I love the original document that we add a SQL statement demo for the method, can we keep that?


Change Logs
JIRA: HUDI-6086
HiveSchemaUtil#generateCreateDDL code is not easy to read. With Danny's help, we decided to refactor this part of the code using StringBuilder. I have written detailed comments for each section of the code to make it more readable.
Impact
none.
Risk level (write none, low medium or high below)
none.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change
ticket number here and follow the instruction to make
changes to the website.
Contributor's checklist