[SPARK-39912][SPARK-39828][SQL] Refine CatalogImpl #37287
Conversation
dongjoon-hyun left a comment
Oh, this looks like more than code cleanup. This is a kind of major refactoring, isn't it?
Does this mean the previous code makeFunction(FunctionIdentifier(functionName, Option(dbName))) had a backward compatibility bug before?
This really depends on whether the SQL analyzer respects the current catalog when resolving UnresolvedFunc.
The previous code is good, as it always goes through the v1 code path. The new code appends the catalog name spark_catalog and goes through the v2 code path, which is good as well.
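As a rough sketch of the difference (the variable names below are hypothetical, not the actual CatalogImpl code):

    val dbName = "default"       // hypothetical inputs
    val functionName = "my_func"
    // v1 path (previous code): resolve directly against the session catalog.
    //   makeFunction(FunctionIdentifier(functionName, Option(dbName)))
    // v2 path (new code): qualify with the session catalog name up front,
    // then resolve through the v2 command framework.
    val nameParts: Seq[String] = Seq("spark_catalog", dbName, functionName)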
This seems to cause a compilation failure during this transition.
[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:321:10: constructor cannot be instantiated to expected type;
[error] found : org.apache.spark.sql.catalyst.analysis.ResolvedNamespace
[error] required: org.apache.spark.sql.connector.catalog.CatalogPlugin
[error] case ResolvedNamespace(catalog: CatalogPlugin, namespace) =>
To clarify: so UnresolvedNamespace will follow the current catalog when the input catalog is Nil?
Yea, this is the same as SQL (when we run SHOW DATABASES without anything after it).
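For example, a sketch assuming spark is an active SparkSession whose current catalog is the default spark_catalog:

    // With no catalog specified, SHOW DATABASES resolves against the current
    // catalog; listDatabases() now goes through the same resolution.
    spark.sql("SHOW DATABASES").show()
    spark.catalog.listDatabases().show()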
maybe also API doc cleanup in the Catalog interface?
no need to check the global temp view db. It doesn't belong to any catalog, and v2 commands take care of it as well.
If you don't mind, could you avoid using / here? You can literally use or. Otherwise / could be read as another multi-layer separator, unlike the table/view case. We are not confused by table/view, but this new sentence looks a little confusing, to me at least :)
yeah, it becomes tricky to document because:
- namespace could be multiple layers, e.g. ns1.ns2.ns3.ns4
- technically people may think a database is a single name (and . could be treated as part of the name rather than a layer separator), but keeping database maintains the understanding people have gotten used to.
I was thinking of a few options:
- database or namespace
- namespace (database)
- database (namespace)
I am not sure if there is a clearer way to document it.
Since we are not going to rename database in the APIs, I'd prefer database (namespace).
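For illustration, applying that choice to an existing doc comment (the signature is from the public Catalog API; the exact merged wording may differ):

    /**
     * Returns the current database (namespace) in this session.
     */
    def currentDatabase: String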
Is ...
I think so, by always passing the fully qualified name to ...
thanks for the confirmation!
This was never documented before and is a hidden assumption. To cover this case I have to call TableCatalog.loadTable manually instead of running the v2 command.
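A minimal sketch of that manual lookup, as it would look inside Spark's own code (sessionState is internal; the catalog and table names here are hypothetical):

    import org.apache.spark.sql.connector.catalog.{Identifier, TableCatalog}

    val plugin = spark.sessionState.catalogManager.catalog("spark_catalog")
    val tableCatalog = plugin.asInstanceOf[TableCatalog]
    // Load the table directly via the v2 API instead of running a v2 command.
    val table = tableCatalog.loadTable(Identifier.of(Array("default"), "my_table"))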
The test is failing, for example, on ... I think the actual result is expected as of now.
This is another instance that ...
Got it.
ready for review, cc @zhengruifeng @HyukjinKwon
amaliujia left a comment
Left several questions that are worth discussing.
Once we are good on those open questions, we can go into a detailed review. Overall it looks good, but we need to check for typos, etc.
  expect_error(listColumns("zxwtyswklpf", "default"),
-              paste("Error in listColumns : analysis error - Table",
-                    "'zxwtyswklpf' does not exist in database 'default'"))
+              paste("Table or view not found: spark_catalog.default.zxwtyswklpf"))
This actually is a user behavior change, as it returns a different error message now?
I don't think we treat error message changes as behavior changes. We change error messages from time to time.
SG
  /**
-  * Returns the current default database in this session.
+  * Returns the current database (namespace) in this session.
there is a different way to refer to this: schema.
Have we decided to use namespace in Spark?
namespace is more like the official name; database/schema is only for the Hive catalog. We can change database to database/schema though.
sounds good. Just wanted to confirm that we don't miss anything obvious.
dongjoon-hyun left a comment
cc @sunchao since he is an expert in this area as an Apache Hive PMC member.
amaliujia left a comment
The refactoring looks reasonable. There are some comments that document key decisions, thanks for adding those.
  /**
-  * Returns a list of columns for the given table/view in the specified database.
+  * Returns a list of columns for the given table/view in the specified database under the Hive
+  * Metastore.
+1, it's nice to explicitly say HMS here.
Did you include the test in https://github.com/apache/spark/pull/37241/files to test if ...
  // we assume `dbName` will not include the catalog mame. e.g. if you call
  // setCurrentDatabase("catalog.db") it will search for a database catalog.db in the catalog.
  val ident = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(dbName)
  sparkSession.sessionState.catalogManager.setCurrentNamespace(ident.toArray)
mame -> name
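A small sketch of the documented assumption (assuming spark is an active SparkSession; the namespaces named here are made up):

    // The argument is parsed as a (possibly multi-part) namespace inside the
    // current catalog; it is never split into catalog + database.
    spark.catalog.setCurrentDatabase("db")         // current namespace becomes ["db"]
    spark.catalog.setCurrentDatabase("catalog.db") // looks for namespace ["catalog", "db"]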
  // List user functions.
  val plan1 = ShowFunctions(UnresolvedNamespace(namespace),
    userScope = true, systemScope = false, None)
  sparkSession.sessionState.executePlan(plan1).toRdd.collect().foreach { row =>
do we need to check whether the function is temp here?
btw, what about adding some test cases for user-defined temp functions?
the ShowFunctions command prints temp function names as a single part, and persistent function names as qualified names. Here we parse the name and then look it up, which doesn't care whether it's temp or not.
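A sketch of that parse-then-lookup step, as it would look inside Spark's own code (sessionState is internal; the sample function names are made up):

    val parser = spark.sessionState.sqlParser
    // Temp functions print as a single part; persistent ones as qualified names.
    Seq("my_temp_func", "spark_catalog.default.my_func").foreach { name =>
      val parts = parser.parseMultipartIdentifier(name)
      println(parts.mkString("."))  // the lookup treats both uniformly
    }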
And we already have tests for listFunctions with temp functions in CatalogSuite.
sounds good
I've added tests for both listTables and listFunctions.
thanks for the review, merging to master!
What changes were proposed in this pull request?
CatalogImpl has been updated quite a bit recently, to support v2 catalogs. This PR revisits the recent changes and refines the code a little bit (a sketch of the control-flow change follows this list):
- ... qualified name with catalog.
- ... CatalogImpl, and it's unnecessary to keep the v1 code path in CatalogImpl. It also makes sure the behavior is consistent between db.table and spark_catalog.db.table. Previously it was not consistent in some cases; see the updated tests for functions.
- Changes try {v1 code path} catch {... v2 code path} to val name = if (table exists in HMS) {name qualified with spark_catalog} else {parsed name}; v2 code path.
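A sketch of the shape of that last change (all helpers below are hypothetical placeholders, not Spark APIs):

    // Hypothetical helpers, only to illustrate the control-flow change.
    def existsInHiveMetastore(parts: Seq[String]): Boolean = ???
    def runV2Command(parts: Seq[String]): Unit = ???

    def lookupTable(parsed: Seq[String]): Unit = {
      // Before: try { v1 code path } catch { case _ => v2 code path }
      // After: qualify the name up front, then always take the v2 code path.
      val name = if (existsInHiveMetastore(parsed)) "spark_catalog" +: parsed else parsed
      runV2Command(name)
    }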
Why are the changes needed?
code cleanup.
Does this PR introduce any user-facing change?
no
How was this patch tested?
existing tests