-
Notifications
You must be signed in to change notification settings - Fork 2.9k
Spark 4.0: Add Support for PartitionStatistics Files in RewriteTablePath. #13956
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spark 4.0: Add Support for PartitionStatistics Files in RewriteTablePath. #13956
Conversation
9c651b8 to
c695ae2
Compare
...k/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
Outdated
Show resolved
Hide resolved
|
also cc @szehon-ho @dramaticlly |
@huaxingao, thank you for helping to review the code! @dramaticlly @szehon-ho, could you please take a look at this PR? Thank you very much! |
...k/v4.0/spark/src/test/java/org/apache/iceberg/spark/actions/TestRewriteTablePathsAction.java
Outdated
Show resolved
Hide resolved
|
CC: @gaborkaszab - Another place where the stat files need to be considered |
dramaticlly
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just one small nit on tests
| if (partitionColumn != null && !partitionColumn.isEmpty()) { | ||
| sqlStr = String.format("%s USING iceberg PARTITIONED BY (%s)", sqlStr, partitionColumn); | ||
| } | ||
| if (!location.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this starts to look a bit hard to read with many configurable parameter on table creation, Can we try move it out to a separate method for table creation, something like
private Table createMetastoreTable(
String location,
Map<String, String> properties,
String namespace,
String tableName,
int snapshotNumber,
String partitionColumn) {
spark.conf().set("spark.sql.catalog.hive", SparkCatalog.class.getName());
spark.conf().set("spark.sql.catalog.hive.type", "hive");
spark.conf().set("spark.sql.catalog.hive.default-namespace", "default");
spark.conf().set("spark.sql.catalog.hive.cache-enabled", "false");
sql("DROP TABLE IF EXISTS hive.%s.%s", namespace, tableName);
// create table
sql(tableDDL(location, properties, namespace, tableName, partitionColumn));
// Insert test data
for (int i = 0; i < snapshotNumber; i++) {
sql("INSERT INTO hive.%s.%s VALUES (%s, 'AAAAAAAAAA', 'AAAA')", namespace, tableName, i);
}
return catalog.loadTable(TableIdentifier.of(namespace, tableName));
}
private String tableDDL(
String location,
Map<String, String> properties,
String namespace,
String tableName,
String partitionColumn) {
String tblProperties =
properties.entrySet().stream()
.map(entry -> String.format("'%s'='%s'", entry.getKey(), entry.getValue()))
.collect(Collectors.joining(","));
StringBuilder createTableSql = new StringBuilder();
createTableSql
.append("CREATE TABLE hive.")
.append(namespace)
.append(".")
.append(tableName)
.append(" (c1 bigint, c2 string, c3 string)");
if (partitionColumn != null && !partitionColumn.isEmpty()) {
createTableSql.append(" USING iceberg PARTITIONED BY (").append(partitionColumn).append(")");
} else {
createTableSql.append(" USING iceberg");
}
if (!location.isEmpty()) {
createTableSql.append(" LOCATION '").append(location).append("'");
}
if (!tblProperties.isEmpty()) {
createTableSql.append(" TBLPROPERTIES (").append(tblProperties).append(")");
}
return createTableSql.toString();
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dramaticlly Thank you for the feedback! Optimized according to suggestions:
- Method renamed to generateCreateTableSQL (more semantic)
- Added complete method comments
|
@huaxingao @dramaticlly Could you please review this PR again? Thank you very much! |
huaxingao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@huaxingao @dramaticlly Thanks a lot for reviewing this PR! Are we good to merge it into the master branch? |
|
Thanks @slfan1989 for the PR! Thanks @dramaticlly for the review! |
Thank you again for your help! |
Related Issues