Conversation

@dtenedor
Contributor

@dtenedor dtenedor commented Mar 1, 2022

What changes were proposed in this pull request?

Implement the parser changes needed to support DEFAULT column values, as tracked in https://issues.apache.org/jira/browse/SPARK-38334.

Note that these are the parser changes only. Analysis support will arrive in a follow-up PR.

Background: in the future, CREATE TABLE and ALTER TABLE invocations will support setting column default values for later operations. Subsequent INSERT, UPDATE, and MERGE statements may then reference the value with the DEFAULT keyword as needed.

Examples:

CREATE TABLE T(a INT, b INT NOT NULL);

-- The implicit default value is NULL
INSERT INTO T VALUES (DEFAULT, 0);
INSERT INTO T(b)  VALUES (1);
SELECT * FROM T;
(NULL, 0)
(NULL, 1)

-- Adding a default to a table with existing rows sets the value for the
-- existing rows (the "exist default") and for new rows (the "current default").
ALTER TABLE T ADD COLUMN c INT DEFAULT 5;
INSERT INTO T VALUES (1, 2, DEFAULT);
SELECT * FROM T;
(NULL, 0, 5)
(NULL, 1, 5)
(1, 2, 5) 
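The examples above follow standard SQL default-value semantics. A minimal sketch of the same behavior can be reproduced with SQLite's `sqlite3` module, used here purely as a stand-in since Spark's support is what this PR begins to add (SQLite has no DEFAULT keyword in VALUES lists, so columns are omitted by name instead):

```python
import sqlite3

# Illustrative stand-in: SQLite already implements the default-value
# semantics described above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE T(a INT, b INT NOT NULL)")

# Omitting column `a` fills it with its implicit default, NULL.
cur.execute("INSERT INTO T(b) VALUES (0)")
cur.execute("INSERT INTO T(b) VALUES (1)")

# Adding a column with a default back-fills existing rows with that default.
cur.execute("ALTER TABLE T ADD COLUMN c INT DEFAULT 5")
cur.execute("INSERT INTO T(a, b) VALUES (1, 2)")

cur.execute("SELECT * FROM T")
print(cur.fetchall())  # [(None, 0, 5), (None, 1, 5), (1, 2, 5)]
```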

Why are the changes needed?

This new feature helps users write DDL and DML statements more easily: columns omitted from an INSERT, or referenced via the DEFAULT keyword, are filled with their declared default values.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit test coverage in DDLParserSuite.scala

@github-actions github-actions bot added the SQL label Mar 1, 2022
@dtenedor dtenedor marked this pull request as ready for review March 1, 2022 20:09
@github-actions github-actions bot added the DOCS label Mar 1, 2022
@dtenedor dtenedor changed the title [SPARK-38334][SQL] Implement parser support for DEFAULT column values [SPARK-38335][SQL] Implement parser support for DEFAULT column values Mar 2, 2022
Member

@gengliangwang gengliangwang left a comment


LGTM except minor comments

@AmplabJenkins

Can one of the admins verify this patch?

@dtenedor
Contributor Author

dtenedor commented Mar 2, 2022

jenkins merge

@gengliangwang
Member

gengliangwang commented Mar 3, 2022

Supporting default column values is very common among DBMSs. However, this will be a breaking change for Spark SQL.
Currently, Spark SQL behaves as follows:

> create table t(i int, j int);
> insert into t values(1);
Error in query: `default`.`t` requires that the data to be inserted have the same number of columns as the target table: target table has 2 column(s) but the inserted data has 1 column(s), including 0 partition column(s) having constant value(s).

After supporting default column value:

> create table t(i int, j int);
> insert into t values(1);
> select * from t;
1	NULL

> create table t2(i int, j int default 0);
> insert into t2 values(1);
> select * from t2;
1	0
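The before/after contrast above can be sketched with `sqlite3` in Python. SQLite, like current Spark SQL, rejects a VALUES list that is shorter than the table, but fills declared defaults when columns are named explicitly; it is used here only as an illustration, not as Spark behavior:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t2(i INT, j INT DEFAULT 0)")

# Like current Spark SQL, SQLite rejects a VALUES list shorter than the table.
try:
    cur.execute("INSERT INTO t2 VALUES (1)")
except sqlite3.OperationalError as e:
    print("rejected:", e)

# Naming the columns lets the declared DEFAULT fill the gap.
cur.execute("INSERT INTO t2(i) VALUES (1)")
cur.execute("SELECT * FROM t2")
print(cur.fetchall())  # [(1, 0)]
```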

I am +1 with the change.
Before merging this PR, I would like to collect the opinions of more committers. We can send SPIP for voting if necessary.
cc @cloud-fan @dongjoon-hyun @viirya @dbtsai @huaxingao @maropu @zsxwing @wangyum @yaooqinn WDYT?

Member

@dongjoon-hyun dongjoon-hyun left a comment


Thank you for pinging me, @gengliangwang .

@dongjoon-hyun
Member

cc @aokolnychyi , @RussellSpitzer , @rdblue , too.

Member

@viirya viirya left a comment


Thanks for the work.

The parser change itself looks okay. As this is a breaking change, I'd like to see some clarification on why it is necessary. What issue do we have without it (given that we have gone without default values for a long time), and is there any workaround today?

Member

@dongjoon-hyun dongjoon-hyun left a comment


This PR doesn't seem to have the full body yet. What is your release target for this, @dtenedor and @gengliangwang? I'm curious about the general error handling:

  • Creating NULL default value for NOT NULL column
  • Type mismatch between default value literal and column type.
  • Upcasting or not in case of type mismatch
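One data point on the first question: SQLite accepts the declaration `NOT NULL DEFAULT NULL` and only errors when the default is actually applied at insert time; Spark could instead choose to reject the declaration at CREATE/ALTER time. A hedged sketch of the SQLite behavior:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# SQLite accepts declaring a NULL default on a NOT NULL column...
cur.execute("CREATE TABLE t(a INT NOT NULL DEFAULT NULL, b INT)")

# ...and only errors when the default is actually applied at insert time.
# (Spark could instead reject the declaration itself during analysis.)
try:
    cur.execute("INSERT INTO t(b) VALUES (1)")
    failed = False
except sqlite3.IntegrityError as e:
    failed = True
    print("insert rejected:", e)
```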

@dtenedor
Contributor Author

dtenedor commented Mar 3, 2022

> Creating NULL default value for NOT NULL column
> Type mismatch between default value literal and column type
> Upcasting or not in case of type mismatch

IMO:

  • A NOT NULL column can't have a NULL default.
  • Type mismatch between the default value literal and the column type: we can simply forbid this. Note that we have many numeric types (Byte/Short/Int/Long/Decimal/Float/Double). If both the default value literal type and the column type are numeric, it is not considered a mismatch.
  • Upcasting in case of type mismatch: casting can happen if both the literal type and the column type are numeric.

@dtenedor WDYT?

Good questions; see my reply above as well. We can perform type coercion from the provided type to the required type, or return an error if the types are not coercible. We can reuse the analyzer's existing type coercion rules for this, for consistency with the rest of Spark. For example, coercing an integer to floating-point should work, but coercing a floating-point to boolean should return an error to the user.
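The coercion rule discussed here (numeric-to-numeric widening allowed, incompatible pairs rejected) could be sketched as a small predicate. The function name and type tags below are hypothetical illustrations, not Spark's actual analyzer API:

```python
# Hypothetical sketch of the proposed rule; not Spark analyzer code.
NUMERIC = {"byte", "short", "int", "long", "decimal", "float", "double"}

def can_coerce_default(literal_type: str, column_type: str) -> bool:
    """Return True if a default literal of literal_type may be stored
    in a column of column_type under the rule discussed above."""
    if literal_type == column_type:
        return True
    # Any numeric literal may be widened or cast to any numeric column type.
    return literal_type in NUMERIC and column_type in NUMERIC

print(can_coerce_default("int", "double"))      # True: integer -> floating point
print(can_coerce_default("double", "boolean"))  # False: reject with an error
```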

Member

@dongjoon-hyun dongjoon-hyun left a comment


cc @MaxGekk too, since he is the release manager for Apache Spark 3.3 and needs to cut branch-3.3.

@dtenedor
Contributor Author

dtenedor commented Mar 4, 2022

This is ready for another review round @gengliangwang @viirya @wangyum @HyukjinKwon @dongjoon-hyun :)

@dtenedor
Contributor Author

dtenedor commented Mar 6, 2022

(Note the prior merge conflict in the lexer has now been resolved.)

Member

@HyukjinKwon HyukjinKwon left a comment


I'm good w/ this change.

@cloud-fan
Contributor

This is a parser-only change and the feature is not implemented yet, so it is definitely not a breaking change. But I'd like to confirm: is every new SQL feature a breaking change? E.g., adding a new SQL function means a query that previously failed with "function not found" now succeeds. That doesn't seem like a breaking change to me. The same applies to accepting more parameters in a SQL function, accepting more parameter types, etc.

The code change itself LGTM.

@gengliangwang
Member

@dtenedor Thanks for the first contribution!
@HyukjinKwon @dongjoon-hyun @viirya @wangyum @cloud-fan thanks for the input! I am merging this parser-only PR to unblock @dtenedor's work on this feature.

@dongjoon-hyun
Member

@gengliangwang and @cloud-fan, why do we have a Spark 3.4 patch in the master branch for Apache Spark 3.3?

> @dongjoon-hyun This is for Spark 3.4

Are we going to revert this from branch-3.3? cc @MaxGekk

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 8, 2022

As I mentioned in #35690 (review), I assumed we would wait until @MaxGekk cut branch-3.3.

@gengliangwang
Member

@dongjoon-hyun I am merging this one to unblock @dtenedor's work on the actual changes in catalogs.
I will:

  • revert this one on branch-3.3
  • make sure no new related PRs are merged to master until branch-3.3 is cut

@MaxGekk
Member

MaxGekk commented Mar 8, 2022

Are we going to revert this from branch-3.3? cc @MaxGekk

If there is a risk that it can hurt stability, let's revert it. I will open a blocker for 3.3 so that we don't forget this.

@dongjoon-hyun
Member

Thank you for your confirmations, @gengliangwang and @MaxGekk !

@dongjoon-hyun
Member

Hey, @MaxGekk. Did you create the blocker issue?
To @gengliangwang and @HyukjinKwon: I saw the JIRA was resolved as 3.3 because we wanted to avoid our merge script showing 3.4 as a new version.

[Screenshot, 2022-03-15: the JIRA issue shown as resolved with fix version 3.3]

Since today is the feature freeze day, I have reset it to 3.4.

@HyukjinKwon
Member

👍

@MaxGekk
Member

MaxGekk commented Mar 16, 2022

Hey, @MaxGekk . Did you make a block issue?

Here is the blocker, SPARK-38566, and the PR that reverts the commit: #35875

MaxGekk added a commit that referenced this pull request Mar 17, 2022
…support

### What changes were proposed in this pull request?
Revert the commit e21cb62 from `branch-3.3`.

### Why are the changes needed?
See discussion in the PR #35690.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By existing test suites.

Closes #35885 from MaxGekk/revert-default-column-support-3.3.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
