Conversation

@cloud-fan (Contributor) commented Jul 26, 2022

What changes were proposed in this pull request?

CatalogImpl has been updated quite a bit recently, to support v2 catalogs. This PR revisits the recent changes and refines the code a little bit:

  1. Fix the naming "3 layer namespace". The Spark catalog plugin supports n-part namespaces, so this PR changes the term to "qualified name with catalog".
  2. Always use the v2 code path. Today the v2 code path already covers all the functionality of CatalogImpl, so it is unnecessary to keep the v1 code path there. It also makes sure the behavior is consistent between db.table and spark_catalog.db.table; previously it was not consistent in some cases (see the updated tests for functions).
  3. Simplify the try {v1 code path} catch {... v2 code path} pattern to: val name = if (table exists in HMS) {name qualified with spark_catalog} else {parsed name}, then run the v2 code path. A sketch of this pattern follows this list.
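
To make item 3 concrete, here is a minimal sketch of the resolve-then-dispatch pattern, written in the style of the internal CatalogImpl code. The helper name and the exact checks are illustrative assumptions, not the PR's literal code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.connector.catalog.CatalogManager

// Hypothetical helper: decide the qualified name up front, then always run
// the v2 code path, instead of try { v1 } catch { v2 }.
def qualifiedNameParts(spark: SparkSession, tableName: String): Seq[String] = {
  val parser = spark.sessionState.sqlParser
  val nameParts = parser.parseMultipartIdentifier(tableName)
  if (nameParts.length <= 2 &&
      spark.sessionState.catalog.tableExists(parser.parseTableIdentifier(tableName))) {
    // The table exists in the HMS-backed session catalog: qualify it with
    // spark_catalog so that db.table behaves exactly like spark_catalog.db.table.
    CatalogManager.SESSION_CATALOG_NAME +: nameParts
  } else {
    nameParts
  }
}
// ...the caller then runs the v2 code path with the returned name parts.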

Why are the changes needed?

code cleanup.

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@dongjoon-hyun (Member) left a comment

Oh, this looks like more than code cleanup. This is a kind of major refactoring, isn't it?

Member

Does this mean the previous code makeFunction(FunctionIdentifier(functionName, Option(dbName))) had a backward-compatibility bug?

@amaliujia (Contributor) commented Jul 26, 2022

This really depends on whether the SQL analyzer respects the current catalog when resolving UnresolvedFunc.

Contributor Author

The previous code is fine, as it always goes through the v1 code path. The new code appends the catalog name spark_catalog and goes through the v2 code path, which is fine as well.

Member

This seems to cause a compilation failure during this transition.

[error] /home/runner/work/spark/spark/sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala:321:10: constructor cannot be instantiated to expected type;
[error]  found   : org.apache.spark.sql.catalyst.analysis.ResolvedNamespace
[error]  required: org.apache.spark.sql.connector.catalog.CatalogPlugin
[error]     case ResolvedNamespace(catalog: CatalogPlugin, namespace) =>

Contributor

To clarify: so UnresolvedNamespace will follow the current catalog when the input catalog is Nil?

Contributor Author

Yes, this is the same as SQL (when we run SHOW DATABASES without specifying anything after it).
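
For illustration, a rough sketch of how this looks from the Catalog API side, assuming the internal plan nodes named in this thread (sparkSession here stands for the session held by CatalogImpl):

import org.apache.spark.sql.catalyst.analysis.UnresolvedNamespace
import org.apache.spark.sql.catalyst.plans.logical.ShowNamespaces

// An UnresolvedNamespace with no name parts is resolved against the current
// catalog, exactly like running a bare SHOW DATABASES statement.
val plan = ShowNamespaces(UnresolvedNamespace(Nil), pattern = None)
val namespaces = sparkSession.sessionState.executePlan(plan).toRdd.collect()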

@amaliujia (Contributor)

Maybe we should also clean up the API docs in the Catalog interface?

@cloud-fan cloud-fan force-pushed the catalog branch 2 times, most recently from 059438f to 9ce3e4c on July 28, 2022 14:36
Contributor Author

No need to check the global temp view database: it doesn't belong to any catalog, and the v2 commands take care of it as well.

@cloud-fan cloud-fan changed the title [WIP] code cleanup for CatalogImpl [WIP] Refine CatalogImpl Jul 28, 2022
@cloud-fan cloud-fan changed the title [WIP] Refine CatalogImpl [SPARK-39912][SQL] Refine CatalogImpl Jul 28, 2022
@dongjoon-hyun (Member) commented Jul 28, 2022

If you don't mind, could you avoid using / here? You could literally use the word "or". Otherwise, / could be read as another multi-layer separator, unlike the table/view case. We are not confused by table/view, but this new sentence looks a little confusing, at least to me :)

Contributor

Yeah, it becomes tricky to document because:

  1. a namespace can have multiple layers, e.g. ns1.ns2.ns3.ns4
  2. technically, people usually think of a database as a single name (so . could be treated as a part of the name rather than a layer separator).

But keeping "database" maintains the established understanding, as people have gotten used to it.

I was thinking of a few options:

  1. database or namespace
  2. namespace (database)
  3. database (namespace)

I am not sure whether there is a clearer way to document it.

Contributor Author

Since we are not going to rename database in the APIs, I'd prefer database (namespace).

@amaliujia (Contributor)

Is the issue that listTables() does not respect the current catalog fixed in this PR?

@cloud-fan (Contributor Author)

> Is the issue that listTables() does not respect the current catalog fixed in this PR?

I think so, by always passing the fully qualified name to getTable in listTables. We can add tests later, to make this PR a pure refinement.
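
A rough sketch of that fix; the variable names are illustrative, and quoteIfNeeded stands for the quoting helper available to the internal code:

import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quoteIfNeeded

// Inside listTables: qualify each table name returned by SHOW TABLES with its
// catalog and namespace before looking it up, so getTable resolves against
// the right catalog instead of implicitly assuming spark_catalog.
val nameParts = catalogName +: namespaceParts :+ tableName
val table = getTable(nameParts.map(quoteIfNeeded).mkString("."))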

@amaliujia (Contributor)

> Is the issue that listTables() does not respect the current catalog fixed in this PR?
>
> I think so, by always passing the fully qualified name to getTable in listTables. We can add tests later, to make this PR a pure refinement.

thanks for the confirmation!

Contributor Author

This was never documented before and is a hidden assumption. To cover this case, I have to call TableCatalog.loadTable manually instead of running the v2 command.

@amaliujia (Contributor)

The test is failing, for example, on:

Expected:
    Database(name='default', catalog=None, description='default database', ...
Got:
    Database(name='default', catalog='spark_catalog', description='default database', locationUri='file:/__w/spark/spark/python/target/4ab5b07b-a1fd-4be3-b29b-6bfc1ac33d6d/e75ca4c7-9d5e-409d-b319-fba011a1ad51')

I think the actual result is the expected behavior as of now.

@cloud-fan (Contributor Author) commented Aug 2, 2022

This is another instance where db.tbl is inconsistent with spark_catalog.db.tbl. This PR fixes it.
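
A hedged illustration of the consistency this enforces; the catalog field values follow the test output above, and the qualified-name call is an assumption about the updated API:

// Both forms now go through the v2 code path, so both report the session
// catalog in the returned Database's catalog field.
spark.catalog.getDatabase("default").catalog                // "spark_catalog"
spark.catalog.getDatabase("spark_catalog.default").catalog  // "spark_catalog"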

@dongjoon-hyun (Member)

Got it.

@cloud-fan (Contributor Author)

ready for review, cc @zhengruifeng @HyukjinKwon

@amaliujia (Contributor) left a comment

Left several questions that are worth discussing.

Once we are good on those open questions, we can move on to a detailed review. Overall it looks good, but we need to check for typos, etc.

expect_error(listColumns("zxwtyswklpf", "default"),
-            paste("Error in listColumns : analysis error - Table",
-                  "'zxwtyswklpf' does not exist in database 'default'"))
+            paste("Table or view not found: spark_catalog.default.zxwtyswklpf"))
Contributor

This is actually a user-facing behavior change, since it returns a different error message now?

Contributor Author

I don't think we treat error-message changes as behavior changes. We change error messages from time to time.

Contributor

SG


/**
- * Returns the current default database in this session.
+ * Returns the current database (namespace) in this session.
Contributor

There is another way to refer to this: schema.

Have we decided to use namespace in Spark?

Contributor Author

namespace is more like the official name; database/schema is only for the Hive catalog. We could change database to database/schema though.

Contributor

Sounds good. Just wanted to confirm that we aren't missing anything obvious.

@dongjoon-hyun (Member) left a comment

cc @sunchao since he is an expert in this area as an Apache Hive PMC member.

@amaliujia (Contributor) left a comment

The refactoring looks reasonable. There are some comments that document key decisions, thanks for adding those.

/**
- * Returns a list of columns for the given table/view in the specified database.
+ * Returns a list of columns for the given table/view in the specified database under the Hive
+ * Metastore.
Contributor

+1, it's nice that this explicitly says HMS.

@amaliujia (Contributor)

Did you include the test in https://github.com/apache/spark/pull/37241/files to check whether listTables respects the current catalog?
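
For reference, a hedged sketch of what such a test could look like; the in-memory catalog class name and the DDL details are assumptions, not the linked PR's code:

// Register a v2 catalog, make it current, and verify that listTables
// resolves against the current catalog instead of always using spark_catalog.
spark.conf.set("spark.sql.catalog.testcat",
  "org.apache.spark.sql.connector.catalog.InMemoryCatalog")
spark.sql("CREATE NAMESPACE testcat.ns")
spark.sql("CREATE TABLE testcat.ns.tbl (i INT)")
spark.sql("USE testcat.ns")
assert(spark.catalog.listTables().collect().map(_.name).contains("tbl"))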

// we assume `dbName` will not include the catalog mame. e.g. if you call
// setCurrentDatabase("catalog.db") it will search for a database catalog.db in the catalog.
val ident = sparkSession.sessionState.sqlParser.parseMultipartIdentifier(dbName)
sparkSession.sessionState.catalogManager.setCurrentNamespace(ident.toArray)
Contributor

mame -> name

// List user functions.
val plan1 = ShowFunctions(UnresolvedNamespace(namespace),
  userScope = true, systemScope = false, None)
sparkSession.sessionState.executePlan(plan1).toRdd.collect().foreach { row =>
Contributor

Do we need to check whether the function is temp here?
BTW, what about adding some test cases for user-defined temp functions?

Contributor Author

The ShowFunctions command prints temp function names as a single part, and persistent function names as qualified names. Here we parse the name and then look it up, which doesn't care whether it's temp or not.
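
A small illustration of that point (the function names here are made up):

val parser = sparkSession.sessionState.sqlParser
// A temp function prints as a single part; a persistent one as a qualified
// name. Parsing the printed name works the same way for both.
parser.parseMultipartIdentifier("my_temp_func")
// => Seq("my_temp_func")
parser.parseMultipartIdentifier("spark_catalog.default.my_func")
// => Seq("spark_catalog", "default", "my_func")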

Contributor Author

And we already have tests for listFunctions with temp functions in CatalogSuite.

Contributor

sounds good

@cloud-fan cloud-fan changed the title [SPARK-39912][SQL] Refine CatalogImpl [SPARK-39912][SPARK-39828][SQL] Refine CatalogImpl Aug 8, 2022
@cloud-fan (Contributor Author)

> Did you include the test in ...

I've added tests for both listTables and listFunctions

@cloud-fan (Contributor Author)

Thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in 5c9175c Aug 8, 2022