
Conversation

@JustinPihony (Contributor)

What changes were proposed in this pull request?

This change modifies the implementation of DataFrameWriter.save such that it works with jdbc, and the call to jdbc merely delegates to save.

How was this patch tested?

This was tested via the unit tests in JDBCWriteSuite, to which I added one new test covering this scenario.

Additional details

@rxin This seems to have been most recently touched by you and was also commented on in the JIRA.

This contribution is my original work and I license the work to the project under the project's open source license.

@HyukjinKwon (Member) commented Apr 22, 2016

@JustinPihony I think we haven't reached a conclusion yet, and haven't got any feedback from anyone who knows this part well (or from committers) on whether we should deprecate read.jdbc() or support write.format("jdbc") in SPARK-14525.

Usually, we should discuss the problem and the proper way to fix it in JIRA first, and then open a PR.

As I said in the JIRA, if we go for deprecating read.jdbc(), we might need to close this JIRA and create another one.

@HyukjinKwon (Member)

I think the Additional details section could go in a comment rather than in the PR description, because the description should describe what the PR is. The Additional details part does not seem related to the PR itself.


dataSource.write(mode, df)
dataSource.providingClass.newInstance() match {
  case jdbc: execution.datasources.jdbc.DefaultSource =>
@HyukjinKwon (Member), Apr 22, 2016

It looks like a new method is introduced. I don't think we necessarily have to introduce this new function; we could use the existing interfaces instead, e.g. CreatableRelationProvider in interfaces.

@JustinPihony (Contributor, PR author)

I agree, and admit I was being lazy in not trying to figure out how to make the current implementation return a BaseRelation. I'll take another look today at just turning DefaultSource into a CreatableRelationProvider.
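
For reference, the existing interface mentioned above has roughly the following shape (a sketch of org.apache.spark.sql.sources.CreatableRelationProvider; implementing it lets a data source handle the write itself and hand back a BaseRelation):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.BaseRelation

// A data source implementing this trait writes `data` out under the given
// SaveMode and returns a BaseRelation describing the written data.
trait CreatableRelationProvider {
  def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation
}
```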

@HyukjinKwon (Member) commented Apr 22, 2016

BTW, it looks like Properties is generally used just like a HashMap[String, String] in most cases.

Firstly, I just checked the java.sql.Driver API (which the JDBC data source uses), and it describes the Properties argument as below:

info - a list of arbitrary string tag/value pairs as connection arguments. Normally at least a "user" and "password" property should be included.

Secondly, Spark apparently uses the following methods of Properties:

public Set<String> stringPropertyNames()
public String getProperty(String key, String defaultValue)
public void store(OutputStream out, String comments)  // This converts keys and values to String internally.
public synchronized Object setProperty(String key, String value)

It looks like they all use String for keys and values.

So, I think it might be OK to support write.format("jdbc"). I believe read.format("jdbc") is already supported; the only related JIRA issue I could find about passing options to read.format("jdbc") is SPARK-10625.

@HyukjinKwon (Member) commented Apr 22, 2016

I think this can be reworked based on this PR because it is already open anyway. Excuse my ping, @rxin @JoshRosen.

@HyukjinKwon (Member) commented Apr 22, 2016

Question: I found a PPT by @marmbrus which I think was used at Spark Summit. On page 30, write.format("jdbc") is used as an example. Is there any way to support this?
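
For concreteness, the usage being asked about looks like this once the write path supports the "jdbc" format (the connection values below are placeholders):

```scala
// Write a DataFrame through the generic save path instead of df.write.jdbc(...).
// The url, dbtable, user and password values are placeholders.
df.write
  .format("jdbc")
  .option("url", "jdbc:postgresql:dbserver")
  .option("dbtable", "schema.tablename")
  .option("user", "username")
  .option("password", "password")
  .save()
```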

@JustinPihony (Contributor, PR author)

@HyukjinKwon You will notice that I opted not to deprecate jdbc, as I don't think that would be the correct path anyway (unless all format methods were to be deprecated). I'm not sure what gave the impression that I wanted to deprecate it, but I think multiple methods accomplishing the same goal are perfectly fine. This change merely alters the underlying implementation so that both methods, write.format("jdbc") and write.jdbc, work. So, I think this realistically addresses all of your other comments surrounding the deprecation concern.

As to Additional details, I was simply following the contributor directions. I only put it under Additional details because that is what it was to me.

@HyukjinKwon (Member) commented Apr 22, 2016

@JustinPihony A possible problem was noticed in the JIRA before this PR (are keys and values in Properties guaranteed to be convertible to String?), and no evidence (like the points above) was provided before this PR.

Also, it is minor, but it still does not look sensible to me that Additional details is in the description: it does not look directly related to the PR itself, and cc'ing someone for review is not really part of the changes in this PR.

@JustinPihony (Contributor, PR author)

@HyukjinKwon I just posted on the JIRA the background of Properties and how reasonable it is to assume it can be converted to a String.

@maropu (Member) commented Apr 27, 2016

Even so, I think we need some kind of wrapper to safely convert Properties to Map<String, String>, because new users could easily and wrongly put non-string values into Properties. (See the sketch below.)
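
A minimal sketch of such a wrapper, assuming we only want the string-valued entries; stringPropertyNames() already skips keys or values that are not strings, which guards against that misuse:

```scala
import java.util.Properties
import scala.collection.JavaConverters._

// Convert a Properties object into a Map[String, String], keeping only the
// entries whose key and value are both strings.
def propertiesToMap(props: Properties): Map[String, String] = {
  props.stringPropertyNames().asScala.map { key =>
    key -> props.getProperty(key)
  }.toMap
}
```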

@HyukjinKwon (Member) commented May 3, 2016

@rxin I also noticed that the Python API supports properties, for reading and writing, as a dict of "arbitrary string tag/value" pairs (see here). Could we go ahead with this?

@JustinPihony (Contributor, PR author)

I can finish this next Monday (fix the conflicts that now exist), and will do so given the above comments. I'd still like an opinion on whether I should change the code to be a CreatableRelationProvider or not. I like the idea, but believe there is much more room for error with that approach.

@JustinPihony (Contributor, PR author)

I just updated the branch to have no conflicts. Again, either the code looks good to merge, or I can make JDBC a CreatableRelationProvider (but that comes with additional baggage as already discussed...might be better in a separate change if at all?)

val resolvedSchema = JDBCRDD.resolveTable(url, table, properties)
providedSchemaOption match {
  case Some(providedSchema) =>
    if (providedSchema.sql.toLowerCase == resolvedSchema.sql.toLowerCase) resolvedSchema
@JustinPihony (Contributor, PR author)

This is the only area I'm unsure about. I'd like a second opinion on whether this seems ok, or if I need to build something more custom for schema comparison.

@HyukjinKwon (Member), Jun 12, 2016

I guess it would make sense not to try to apply the resolved schema, but just to use the specified one when the schema is explicitly set, like the other data sources do.

@JustinPihony (Contributor, PR author)

I can easily do a simpler getOrElse, as is done in spark-xml, which has the benefit of being lazier (see the sketch below). But if an error does occur due to a mismatch, the error then surfaces further from the original issue. I'm fine with either approach, but wanted to give the other side of this one. Thoughts?
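
A sketch of that getOrElse alternative, reusing the names from the snippet above (providedSchemaOption, url, table and properties are assumed to be in scope):

```scala
// Trust the user-provided schema when one is given; only query the database
// to resolve the table's schema when no schema was specified.
val schema = providedSchemaOption.getOrElse {
  JDBCRDD.resolveTable(url, table, properties)
}
```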

Member

I think JDBCRDD.resolveTable needs another query execution. Although it would be less expensive than inferring schemas in CSV or JSON, it would still be a bit of overhead. I am not 100% sure about this either. So, I think it might be better to be consistent with the other data sources in this case.

@JustinPihony (Contributor, PR author)

Bump :) Anybody able to review this one for me please?

val partitionColumn = parameters.getOrElse("partitionColumn", null)
val lowerBound = parameters.getOrElse("lowerBound", null)
val upperBound = parameters.getOrElse("upperBound", null)
val numPartitions = parameters.getOrElse("numPartitions", null)
@HyukjinKwon (Member)

There is a class for those options, JDBCOptions. It would be nicer if those options were managed in a single place.

@JustinPihony (Contributor, PR author)

@HyukjinKwon Thanks, I did not know about this. Before I push code, I am curious why JDBCOptions does not include the partitioning validation? That also seems like a point of duplication.

@HyukjinKwon (Member)

I think the validation can be done together in JDBCOptions.
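
A rough sketch of what folding the option handling and the partitioning validation into one place could look like (the field names here are assumed and may not match the actual JDBCOptions class exactly):

```scala
// Centralize JDBC option parsing and validation in a single class.
class JDBCOptions(parameters: Map[String, String]) {
  val url = parameters.getOrElse("url", sys.error("Option 'url' is required."))
  val table = parameters.getOrElse("dbtable", sys.error("Option 'dbtable' is required."))
  val partitionColumn = parameters.get("partitionColumn")
  val lowerBound = parameters.get("lowerBound")
  val upperBound = parameters.get("upperBound")
  val numPartitions = parameters.get("numPartitions")

  // Partitioning validation: if a partition column is given, the bounds and
  // the number of partitions must be given as well.
  require(
    partitionColumn.isEmpty ||
      (lowerBound.isDefined && upperBound.isDefined && numPartitions.isDefined),
    "If 'partitionColumn' is specified, then 'lowerBound', 'upperBound' and " +
      "'numPartitions' are required.")
}
```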

@JustinPihony (Contributor, PR author)

Bump @HyukjinKwon, I have replied to your comments. Could you please review them so that I can push my changes?

@gatorsmile (Member)

@JustinPihony Sorry, I did not realize you had submitted a PR for the same issue. Could you please review my PR #14077? I think my solution might be cleaner and simpler. Thanks!

@JustinPihony (Contributor, PR author)

@gatorsmile I did just review it and still prefer mine...a simpler PR does not necessarily mean it is more correct.

@JustinPihony (Contributor, PR author)

Bumping my JIRA comment here for @rxin to respond, please:

@rxin Given the bug found in SPARK-16401, the CreatableRelationProvider is not necessary. However, it might be nice to have now that I've already implemented it. I can reduce the code by removing the CreatableRelationProvider aspect, so I would love your feedback on this PR, even if just to say whether the code should be reduced or not.

@JustinPihony (Contributor, PR author)

@srowen I had to fix something on my local machine to get proper test results, but this should be good to go now.

@JustinPihony (Contributor, PR author)

@srowen Ping. I don't think there is anything on my plate. This should be mergeable

@srowen (Member) commented Sep 22, 2016

OK, this isn't really my area. It looks reasonable to me. @gatorsmile ?
I think #12601 (comment) was suggesting that some short notes should be added to the docs at http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases though I see the examples do in a way document the new behavior.

@JustinPihony (Contributor, PR author)

@HyukjinKwon @gatorsmile Could you please review the documentation that I added so that we can put this to bed :)

@gatorsmile (Member)

Sure, I will build the documentation on my local machine and review it soon. Thanks!

@gatorsmile (Member)

@JustinPihony The documentation changes for Scala, Java and Python look good to me, but could you please also add the examples for SQL and R?

@gatorsmile (Member)

Not sure if you already knew this; I just want to share the commands for building the docs:

SKIP_API=1 jekyll build
SKIP_API=1 jekyll serve

After the second command, you can visit the generated document:

    Server address: http://127.0.0.1:4000/

// $example off:schema_merging$

import java.util.Properties;

@HyukjinKwon (Member), Sep 24, 2016

I think we should put the java.util.List and java.util.Properties imports together without an additional newline. It seems you already know, but just in case, see the import guideline.

@JustinPihony (Contributor, PR author)

Should this really be added to the example, though?

@HyukjinKwon (Member)

Is there a reason not to follow the guideline?

@JustinPihony (Contributor, PR author)

@gatorsmile I added the R and SQL documentation. I took the SQL portion from https://github.com/apache/spark/pull/6121/files

import java.util.List;
import java.util.Properties;
// $example off:schema_merging$

@HyukjinKwon (Member), Sep 24, 2016

Oh, maybe my previous comment was not clear. I meant:

import java.util.List;
// $example off:schema_merging$
import java.util.Properties;

I haven't tried to build the docs against the current state of this PR, but I guess we won't need this import for Parquet's schema merging example.

@JustinPihony (Contributor, PR author)

@HyukjinKwon Yes, that is what I was talking about...just fixed it back

@HyukjinKwon (Member) commented Sep 24, 2016

Thanks for mentioning me. It looks good to me except for the few comments above, in my personal view.

{% highlight sql %}

CREATE TEMPORARY VIEW jdbcTable
CREATE TEMPORARY TABLE jdbcTable
Member

Please change it back. CREATE TEMPORARY TABLE is deprecated; you will get a parser error:

CREATE TEMPORARY TABLE is not supported yet. Please use CREATE TEMPORARY VIEW as an alternative.(line 1, pos 0)

@JustinPihony (Contributor, PR author)

Done, thanks. I had been going off of the tests


df.write.format("jdbc")
.options(Map("url" -> url, "dbtable" -> "TEST.SAVETEST"))
.save
Member

Nit: save -> save()

@JustinPihony (Contributor, PR author)

Done

this.extraOptions = this.extraOptions ++ (connectionProperties.asScala)
// explicit url and dbtable should override all
this.extraOptions += ("url" -> url, "dbtable" -> table)
format("jdbc").save
Member

Parentheses on a method call should be omitted only when the method has no side effects.

Thus, please change it to save()

@gatorsmile (Member)

Mostly LGTM, except three minor comments.

Thank you for your hard work, @JustinPihony !

@SparkQA

SparkQA commented Sep 24, 2016

Test build #65858 has finished for PR 12601 at commit 06c1cba.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 24, 2016

Test build #65860 has finished for PR 12601 at commit 8fb86b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JustinPihony (Contributor, PR author)

@srowen The doc changes have been reviewed, so this should be good to go

@SparkQA

SparkQA commented Sep 26, 2016

Test build #65891 has finished for PR 12601 at commit 724bbe2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented Sep 26, 2016

Merged to master

@asfgit closed this in 50b89d0 on Sep 26, 2016
}

/*
* The following structure applies to this code:
Contributor

What does this table mean? What are CreateTable, saveTable, and BaseRelation?

Member

Now at least three reviewers are confused by this bit. Do you mind if I submit a PR to clean up this part?

Member

If the table does not exist and the mode is OVERWRITE, we create a table, then insert rows into the table, and finally return a BaseRelation.

Contributor

I also took a look at @gatorsmile's approach; I think it's easier to understand. Why was it rejected? We can also get rid of the return:

if (tableExists) {
  mode match {
    case SaveMode.Ignore =>
    ......
  }
} else {
  ......
}
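
For illustration, a fleshed-out version of that suggested structure might look like the following (helper names such as createTable, dropTable and saveTable, and the surrounding variables, are assumed for the sketch, not necessarily the ones in the codebase):

```scala
import org.apache.spark.sql.SaveMode

if (tableExists) {
  mode match {
    case SaveMode.Ignore =>
      // The table already exists and mode is Ignore: leave it untouched.
    case SaveMode.ErrorIfExists =>
      sys.error(s"Table $table already exists.")
    case SaveMode.Overwrite =>
      // Replace the existing table, then write the data.
      dropTable(conn, table)
      createTable(df.schema, url, table, conn)
      saveTable(df, url, table, properties)
    case SaveMode.Append =>
      // Keep the existing table and append the new rows.
      saveTable(df, url, table, properties)
  }
} else {
  // No table yet: create it and write the data regardless of mode.
  createTable(df.schema, url, table, conn)
  saveTable(df, url, table, properties)
}
```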

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…SparkR

## What changes were proposed in this pull request?

The `write.df`/`read.df` APIs require a path, which is not actually always necessary in Spark. Currently, this only affects the data sources implementing `CreatableRelationProvider`. Spark does not have internal data sources implementing this yet, but it would affect other external data sources.

In addition, we'd be able to use this in Spark's JDBC data source after apache#12601 is merged.

**Before**

 - `read.df`

  ```r
> read.df(source = "json")
Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)",  :
  argument "x" is missing with no default
```

  ```r
> read.df(path = c(1, 2))
Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)",  :
  argument "x" is missing with no default
```

  ```r
> read.df(c(1, 2))
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String
	at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:300)
	at
...
In if (is.na(object)) { :
...
```

 - `write.df`

  ```r
> write.df(df, source = "json")
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘write.df’ for signature ‘"function", "missing"’
```

  ```r
> write.df(df, source = c(1, 2))
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
```

  ```r
> write.df(df, mode = TRUE)
Error in (function (classes, fdef, mtable)  :
  unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
```

**After**

- `read.df`

  ```r
> read.df(source = "json")
Error in loadDF : analysis error - Unable to infer schema for JSON at . It must be specified manually;
```

  ```r
> read.df(path = c(1, 2))
Error in f(x, ...) : path should be charactor, null or omitted.
```

  ```r
> read.df(c(1, 2))
Error in f(x, ...) : path should be charactor, null or omitted.
```

- `write.df`

  ```r
> write.df(df, source = "json")
Error in save : illegal argument - 'path' is not specified
```

  ```r
> write.df(df, source = c(1, 2))
Error in .local(df, path, ...) :
  source should be charactor, null or omitted. It is 'parquet' by default.
```

  ```r
> write.df(df, mode = TRUE)
Error in .local(df, path, ...) :
  mode should be charactor or omitted. It is 'error' by default.
```

## How was this patch tested?

Unit tests in `test_sparkSQL.R`

Author: hyukjinkwon <[email protected]>

Closes apache#15231 from HyukjinKwon/write-default-r.