Conversation

@cloud-fan (Contributor) commented Jul 26, 2022

What changes were proposed in this pull request?

CatalogImpl has been updated quite a bit recently, to support v2 catalogs. This PR revisits the recent changes and refines the code a little bit:

  1. Fix the naming "3 layer namespace". The Spark catalog plugin supports n-part namespaces, so this PR changes the term to "qualified name with catalog".
  2. Always use the v2 code path. Today the v2 code path already covers all the functionality of CatalogImpl, so it is unnecessary to keep the v1 code path there. It also makes sure the behavior is consistent between db.table and spark_catalog.db.table; previously it was not consistent in some cases (see the updated tests for functions).
  3. Simplify the try {v1 code path} catch {... v2 code path} pattern to: val name = if (table exists in HMS) {name qualified with spark_catalog} else {parsed name}, then run the v2 code path. A sketch of this pattern follows this list.
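
To make item 3 concrete, here is a minimal sketch of the resolve-then-dispatch pattern, written in the style of the internal CatalogImpl code. The helper name and the exact checks are illustrative assumptions, not the PR's literal code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connector.catalog.CatalogManager

// Hypothetical helper: decide the qualified name up front, then always run
// the v2 code path, instead of try { v1 } catch { v2 }.
def qualifiedNameParts(spark: SparkSession, tableName: String): Seq[String] = {
  val parser = spark.sessionState.sqlParser
  val nameParts = parser.parseMultipartIdentifier(tableName)
  if (nameParts.length <= 2 &&
      spark.sessionState.catalog.tableExists(parser.parseTableIdentifier(tableName))) {
    // The table exists in the HMS-backed session catalog: qualify it with
    // spark_catalog so that db.table behaves exactly like spark_catalog.db.table.
    CatalogManager.SESSION_CATALOG_NAME +: nameParts
  } else {
    nameParts
  }
}
// ...the caller then runs the v2 code path with the returned name parts.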

Why are the changes needed?

code cleanup.

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@dongjoon-hyun (Member) left a comment

Oh, this looks like more than code cleanup. This is a kind of major refactoring, isn't it?

Member

Does this mean the previous code makeFunction(FunctionIdentifier(functionName, Option(dbName))) had a backward-compatibility bug?

@amaliujia (Contributor) commented Jul 26, 2022

This really depends on whether the SQL analyzer respects the current catalog when resolving UnresolvedFunc.

Contributor Author

The previous code is fine, as it always goes through the v1 code path. The new code appends the catalog name spark_catalog and goes through the v2 code path, which is fine as well.

Member

This seems to cause a compilation failure during this transition.

[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:321:10: constructor cannot be instantiated to expected type;
[error]  found   : org.apache.spark.sql.catalyst.analysis.ResolvedNamespace
[error]  required: org.apache.spark.sql.connector.catalog.CatalogPlugin
[error]     case ResolvedNamespace(catalog: CatalogPlugin, namespace) =>

Contributor

To clarify: so UnresolvedNamespace will follow the current catalog when the input catalog is Nil?

Contributor Author

Yes, this is the same as SQL (when we run SHOW DATABASES without specifying anything after it).
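
For illustration, a rough sketch of how this looks from the Catalog API side, assuming the internal plan nodes named in this thread (sparkSession here stands for the session held by CatalogImpl):

import org.apache.spark.sql.catalyst.analysis.UnresolvedNamespace
import org.apache.spark.sql.catalyst.plans.logical.ShowNamespaces

// An UnresolvedNamespace with no name parts is resolved against the current
// catalog, exactly like running a bare SHOW DATABASES statement.
val plan = ShowNamespaces(UnresolvedNamespace(Nil), pattern = None)
val namespaces = sparkSession.sessionState.executePlan(plan).toRdd.collect()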

@amaliujia (Contributor)

Maybe we should also clean up the API docs in the Catalog interface?

@cloud-fan cloud-fan force-pushed the catalog branch 2 times, most recently from 059438f to 9ce3e4c on July 28, 2022 14:36
Contributor Author

No need to check the global temp view database: it doesn't belong to any catalog, and the v2 commands take care of it as well.

@cloud-fan cloud-fan changed the title [WIP] code cleanup for CatalogImpl [WIP] Refine CatalogImpl Jul 28, 2022
@cloud-fan cloud-fan changed the title [WIP] Refine CatalogImpl [SPARK-39912][SQL] Refine CatalogImpl Jul 28, 2022
@dongjoon-hyun (Member) commented Jul 28, 2022

If you don't mind, could you avoid using / here? You could literally use the word "or". Otherwise, / could be read as another multi-layer separator, unlike the table/view case. We are not confused by table/view, but this new sentence looks a little confusing, at least to me :)

Contributor

Yeah, it becomes tricky to document because:

  1. a namespace can have multiple layers, e.g. ns1.ns2.ns3.ns4
  2. technically, people usually think of a database as a single name (so . could be treated as a part of the name rather than a layer separator).

But keeping "database" maintains the established understanding, as people have gotten used to it.

I was thinking of a few options:

  1. database or namespace
  2. namespace (database)
  3. database (namespace)

I am not sure whether there is a clearer way to document it.

Contributor Author

Since we are not going to rename database in the APIs, I'd prefer database (namespace).

@amaliujia (Contributor)

Is the issue that listTables() does not respect the current catalog fixed in this PR?

@cloud-fan (Contributor Author)

> Is the issue that listTables() does not respect the current catalog fixed in this PR?

I think so, by always passing the fully qualified name to getTable in listTables. We can add tests later, to make this PR a pure refinement.
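
A rough sketch of that fix; the variable names are illustrative, and quoteIfNeeded stands for the quoting helper available to the internal code:

import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quoteIfNeeded

// Inside listTables: qualify each table name returned by SHOW TABLES with its
// catalog and namespace before looking it up, so getTable resolves against
// the right catalog instead of implicitly assuming spark_catalog.
val nameParts = catalogName +: namespaceParts :+ tableName
val table = getTable(nameParts.map(quoteIfNeeded).mkString("."))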

@amaliujia (Contributor)

> Is the issue that listTables() does not respect the current catalog fixed in this PR?
>
> I think so, by always passing the fully qualified name to getTable in listTables. We can add tests later, to make this PR a pure refinement.

thanks for the confirmation!

Contributor Author

This was never documented before and is a hidden assumption. To cover this case, I have to call TableCatalog.loadTable manually instead of running the v2 command.

@amaliujia (Contributor)

The test is failing, for example, on:

Expected:
    Database(name='default', catalog=None, description='default database', ...
Got:
    Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/__w/spark/spark/python/target/4ab5b07b-a1fd-4be3-b29b-6bfc1ac33d6d/e75ca4c7-9d5e-409d-b319-fba011a1ad51')

I think the actual result is the expected behavior as of now.

@cloud-fan (Contributor Author) commented Aug 2, 2022

This is another instance where db.tbl is inconsistent with spark_catalog.db.tbl. This PR fixes it.
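
A hedged illustration of the consistency this enforces; the catalog field values follow the test output above, and the qualified-name call is an assumption about the updated API:

// Both forms now go through the v2 code path, so both report the session
// catalog in the returned Database's catalog field.
spark.catalog.getDatabase("default").catalog                // "spark_catalog"
spark.catalog.getDatabase("spark_catalog.default").catalog  // "spark_catalog"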

@dongjoon-hyun (Member)

Got it.

@cloud-fan (Contributor Author)

ready for review, cc @zhengruifeng @HyukjinKwon

@amaliujia (Contributor) left a comment

Left several questions that are worth discussing.

Once we are good on those open questions, we can move on to a detailed review. Overall it looks good, but we need to check for typos, etc.

expect_error(listColumns("zxwtyswklpf", "default"),
-            paste("Error in listColumns : analysis error - Table",
-                  "'zxwtyswklpf' does not exist in database 'default'"))
+            paste("Table or view not found: spark_catalog.default.zxwtyswklpf"))
Contributor

This is actually a user-facing behavior change, since it returns a different error message now?

Contributor Author

I don't think we treat error-message changes as behavior changes. We change error messages from time to time.

Contributor

SG


/**
- * Returns the current default database in this session.
+ * Returns the current database (namespace) in this session.
Contributor

There is another way to refer to this: schema.

Have we decided to use namespace in Spark?

Contributor Author

namespace is more like the official name; database/schema is only for the Hive catalog. We could change database to database/schema though.

Contributor

Sounds good. Just wanted to confirm that we aren't missing anything obvious.

@dongjoon-hyun (Member) left a comment

cc @sunchao since he is an expert in this area as an Apache Hive PMC member.

@amaliujia (Contributor) left a comment

The refactoring looks reasonable. There are some comments that document key decisions, thanks for adding those.

/**
- * Returns a list of columns for the given table/view in the specified database.
+ * Returns a list of columns for the given table/view in the specified database under the Hive
+ * Metastore.
Contributor

+1, it's nice that this explicitly says HMS.

@amaliujia (Contributor)

Did you include the test in https://github.com/apache/spark/pull/37241/files to check whether listTables respects the current catalog?
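
For reference, a hedged sketch of what such a test could look like; the in-memory catalog class name and the DDL details are assumptions, not the linked PR's code:

// Register a v2 catalog, make it current, and verify that listTables
// resolves against the current catalog instead of always using spark_catalog.
spark.conf.set("spark.sql.catalog.testcat",
  "org.apache.spark.sql.connector.catalog.InMemoryCatalog")
spark.sql("CREATE NAMESPACE testcat.ns")
spark.sql("CREATE TABLE testcat.ns.tbl (i INT)")
spark.sql("USE testcat.ns")
assert(spark.catalog.listTables().collect().map(_.name).contains("tbl"))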

// we assume `dbName` will not include the catalog mame. e.g. if you call
// setCurrentDatabase("catalog.db") it will search for a database catalog.db in the catalog.
val ident = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(dbName)
sparkSession.sessionState.catalogManager.setCurrentNamespace(ident.toArray)
Contributor

mame -> name

// List user functions.
val plan1 = ShowFunctions(UnresolvedNamespace(namespace),
  userScope = true, systemScope = false, None)
sparkSession.sessionState.executePlan(plan1).toRdd.collect().foreach { row =>
Contributor

Do we need to check whether the function is temp here?
BTW, what about adding some test cases for user-defined temp functions?

Contributor Author

The ShowFunctions command prints temp function names as a single part, and persistent function names as qualified names. Here we parse the name and then look it up, which doesn't care whether it's temp or not.
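
A small illustration of that point (the function names here are made up):

val parser = sparkSession.sessionState.sqlParser
// A temp function prints as a single part; a persistent one as a qualified
// name. Parsing the printed name works the same way for both.
parser.parseMultipartIdentifier("my_temp_func")
// => Seq("my_temp_func")
parser.parseMultipartIdentifier("spark_catalog.default.my_func")
// => Seq("spark_catalog", "default", "my_func")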

Contributor Author

And we already have tests for listFunctions with temp functions in CatalogSuite.

Contributor

sounds good

@cloud-fan cloud-fan changed the title [SPARK-39912][SQL] Refine CatalogImpl [SPARK-39912][SPARK-39828][SQL] Refine CatalogImpl Aug 8, 2022
@cloud-fan (Contributor Author)

> Did you include the test in ...

I've added tests for both listTables and listFunctions

@cloud-fan (Contributor Author)

Thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in 5c9175c Aug 8, 2022