@beliefer
Contributor

@beliefer beliefer commented Mar 4, 2023

What changes were proposed in this pull request?

#40252 added support for some of the JDBC APIs by reusing the proto message DataSource. DataFrameReader also has another kind of JDBC API that is unrelated to loading a data source.

Why are the changes needed?

This PR adds a new proto message, PartitionedJDBC, to support the remaining JDBC API.

Does this PR introduce any user-facing change?

'No'.
New feature.

How was this patch tested?

New test cases.

Contributor

Suggested change
- * @since 1.4.0
+ * @since 3.4.0

@beliefer
Contributor Author

beliefer commented Mar 6, 2023

Contributor

Can we just put the predicates into the DataSource message?

Contributor Author

But the transform path is very different from DataSource.

Contributor

I think it's simple to add an if-else in transformReadRel if we can reuse the existing DataSource message (with a new field, predicates).

Contributor Author

OK. Let's put the predicates into the DataSource message.

@beliefer beliefer force-pushed the SPARK-42555_followup branch from 7627a0a to 4d48895 Compare March 6, 2023 03:33
// (Optional) A list of path for file-system backed data sources.
repeated string paths = 4;

// (Optional) Condition in the where clause for each partition.
Contributor

Please add a comment that this currently only works for JDBC.

Contributor Author

OK
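
For context on the field being documented here: each entry in `predicates` is the WHERE-clause condition for one partition, so N predicates produce N partitions. The sketch below only illustrates that mapping in plain Scala. `partitionQueries` and the generated SQL text are hypothetical and do not reproduce how Spark's JDBC source builds its queries internally.

```scala
// Illustrative only: one predicate per partition, each becoming a WHERE clause.
// `partitionQueries` is a hypothetical helper, not part of Spark.
object PredicatePartitions {
  def partitionQueries(table: String, predicates: Seq[String]): Seq[String] =
    predicates.map(p => s"SELECT * FROM $table WHERE $p")

  def main(args: Array[String]): Unit = {
    val queries = partitionQueries(
      "orders",
      Seq("order_date < '2023-01-01'", "order_date >= '2023-01-01'"))
    // Print the query each partition would conceptually run.
    queries.zipWithIndex.foreach { case (q, i) => println(s"partition $i: $q") }
  }
}
```

Predicates that overlap would read the same rows in more than one partition, so callers normally choose conditions that are mutually exclusive.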

table: String,
predicates: Array[String],
connectionProperties: Properties): DataFrame = {
sparkSession.newDataFrame { builder =>
Contributor

Can you please set the format to JDBC? We are now relying on the presence of predicates to figure out that something is a JDBC table. That relies far too heavily on the client doing the right thing. For example, what would happen if you set format = parquet and still define predicates?

Contributor Author

Yeah, we can't rely on the client.
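
A rough sketch of what the requested client-side change could look like, using a hypothetical `DataSourceMsg` case class in place of the generated proto builder. The field names are assumptions made for illustration, and the real client code goes through `sparkSession.newDataFrame` as shown in the diff above.

```scala
import java.util.Properties

// Hypothetical stand-in for the DataSource proto message; field names are assumed.
final case class DataSourceMsg(
    format: Option[String],
    options: Map[String, String],
    predicates: Seq[String])

object JdbcClientSketch {
  // Always tags the message as JDBC so the server never has to infer the
  // source type from the mere presence of predicates.
  def jdbcWithPredicates(
      url: String,
      table: String,
      predicates: Array[String],
      connectionProperties: Properties): DataSourceMsg = {
    // Copy the string-valued connection properties into an immutable map.
    val props = Map.newBuilder[String, String]
    val names = connectionProperties.stringPropertyNames().iterator()
    while (names.hasNext) {
      val key = names.next()
      props += key -> connectionProperties.getProperty(key)
    }
    DataSourceMsg(
      format = Some("jdbc"),
      options = props.result() ++ Map("url" -> url, "dbtable" -> table),
      predicates = predicates.toSeq)
  }
}
```

With the format set explicitly, a message that carries predicates but names a different format can be rejected by the server instead of being silently treated as JDBC.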

case s: StructType => reader.schema(s)
case other => throw InvalidPlanInput(s"Invalid schema $other")

if (rel.getDataSource.getPredicatesCount == 0) {
Contributor

Please make the logic a bit like this:

if (format == "jdbc" && rel.getDataSource.getPredicatesCount > 0) {
  // Plan JDBC with predicates
} else if (rel.getDataSource.getPredicatesCount == 0) {
  // Plan datasource
} else {
  throw InvalidPlanInput(s"Predicates are not supported for $format data sources.")
}
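
The branching suggested above can be tried out in isolation. This is a minimal sketch under stated assumptions: `DataSourceRel`, `Planned`, and the local `InvalidPlanInput` are hypothetical stand-ins (the last one mirrors the name used in the diff context), whereas the real planner reads these values from the proto message and returns a logical plan.

```scala
// Hypothetical stand-ins; the real planner works on proto messages and
// returns a logical plan rather than a marker object.
final case class InvalidPlanInput(message: String) extends Exception(message)
final case class DataSourceRel(format: String, predicates: Seq[String])

object ReadRelDispatch {
  sealed trait Planned
  case object JdbcWithPredicates extends Planned
  case object GenericDataSource extends Planned

  def plan(rel: DataSourceRel): Planned =
    if (rel.format == "jdbc" && rel.predicates.nonEmpty) {
      // JDBC read partitioned by the supplied predicates.
      JdbcWithPredicates
    } else if (rel.predicates.isEmpty) {
      // Ordinary data source load, including JDBC without predicates.
      GenericDataSource
    } else {
      // Predicates combined with a non-JDBC format are rejected explicitly.
      throw InvalidPlanInput(
        s"Predicates are not supported for ${rel.format} data sources.")
    }
}
```

The third branch is the point of the suggestion: a request with `format = parquet` and predicates fails loudly instead of depending on the client to send a consistent message.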


// (Optional) Condition in the where clause for each partition.
//
// Only work for JDBC data source.
Contributor

This is only supported by the JDBC data source.

@beliefer
Contributor Author

beliefer commented Mar 8, 2023

@hvanhovell Do you have any other advice? cc @HyukjinKwon @zhengruifeng @dongjoon-hyun

Contributor

@hvanhovell hvanhovell left a comment

LGTM

hvanhovell pushed a commit that referenced this pull request Mar 8, 2023
… remaining jdbc API

Closes #40277 from beliefer/SPARK-42555_followup.

Authored-by: Jiaan Geng
Signed-off-by: Herman van Hovell
(cherry picked from commit 39a5512)
Signed-off-by: Herman van Hovell
@hvanhovell hvanhovell closed this in 39a5512 Mar 8, 2023
@beliefer
Contributor Author

beliefer commented Mar 9, 2023

@hvanhovell @zhengruifeng Thank you.

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
… remaining jdbc API
