
Conversation

@LantaoJin (Contributor) commented Mar 12, 2018

What changes were proposed in this pull request?

SPARK-4871 already added the SQL statement to the job description when using spark-sql, but it has some problems:

  1. A long SQL statement cannot be fully displayed in the description column.
    [screenshot: screen shot 2018-03-12 at 14 25 51]

  2. SQL statements submitted via spark-shell or spark-submit are not covered.

At eBay, most Spark applications, such as ETL jobs, use spark-submit to schedule their jobs with a few SQL files. The SQL statements in those applications cannot be seen in the current Spark UI. Even if we get the SQL files, they contain many variables, such as "select * from ${workingBD}.table where data_col=${TODAY}". So this patch captures the actual executed SQL text and displays it on the SQL page:

[screenshot: screen shot 2018-03-12 at 20 16 23]

How was this patch tested?

[screenshot: screen shot 2018-03-12 at 20 16 14]

@LantaoJin (Contributor, Author)

@gatorsmile @cloud-fan Could you take a look and leave some comments?

@AmplabJenkins

Can one of the admins verify this patch?

@cloud-fan (Contributor)

what if an SQL execution triggers multiple jobs?

@wangyum (Member) commented Mar 13, 2018

  1. Double-clicking the SQL statement shows the full statement: [SPARK-8145][WebUI] Trigger a double click on the span to show full job description. #6646
  2. What if this SQL statement contains --hiveconf or --hivevar?

@LantaoJin (Contributor, Author)

@cloud-fan one SQL execution has only one SQL statement, no matter how many jobs it triggers.

@LantaoJin (Contributor, Author) commented Mar 13, 2018

What if this SQL statement contains --hiveconf or --hivevar?

What do you mean? Can you give an example?

@wangyum (Member) commented Mar 13, 2018

cat <<EOF > test.sql
select '\${a}', '\${b}';
EOF

spark-sql --hiveconf a=avalue --hivevar b=bvalue -f test.sql

Will the SQL text be select ${a}, ${b} or select avalue, bvalue?

private val executionIdToSqlText = new ConcurrentHashMap[Long, String]()

def setSqlText(sqlText: String): Unit = {
  executionIdToSqlText.putIfAbsent(_nextExecutionId.get(), sqlText)
Contributor:

Does the executionId used here match the current execution? IIUC, the execution id is incremented in withNewExecutionId, and the one you use here most likely refers to the previous execution; please correct me if I'm wrong.

@LantaoJin (Contributor, Author) Mar 13, 2018:

setSqlText is invoked before withNewExecutionId. The first time, _nextExecutionId is 0 by default, so setSqlText stores (0, x) in the map. When withNewExecutionId is invoked, val executionId = SQLExecution.nextExecutionId increments the execution id and returns the previous value, 0. Then val sqlText = getSqlText(executionId) returns the SQL text mapped to 0, i.e. x. The next time setSqlText is invoked, _nextExecutionId.get() returns the incremented id, 1, so the new SQL text is stored in the map as (1, y).
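
In other words, a minimal self-contained sketch of this handoff (the names follow the PR, but the bodies are simplified; the real withNewExecutionId does much more):

import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

object ExecutionIdHandoffSketch {
  private val _nextExecutionId = new AtomicLong(0)
  private val executionIdToSqlText = new ConcurrentHashMap[Long, String]()

  // Invoked first: stores the text under the id the NEXT execution will take.
  def setSqlText(sqlText: String): Unit =
    executionIdToSqlText.putIfAbsent(_nextExecutionId.get(), sqlText)

  // Invoked afterwards: getAndIncrement returns the same id setSqlText saw,
  // so the lookup finds the text stored above.
  def withNewExecutionId[T](body: => T): T = {
    val executionId = _nextExecutionId.getAndIncrement()
    val sqlText = executionIdToSqlText.get(executionId)
    println(s"execution $executionId runs: $sqlText")
    body
  }
}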

Contributor:

Ohh, I see. Sorry I misunderstood it.

@LantaoJin (Contributor, Author)

@wangyum Good point. Unfortunately it is select ${a}, ${b}. Let me fix it.

@cloud-fan (Contributor)

So this patch duplicates the SQL text info from the jobs page onto the SQL query page. I think it's good and more user-friendly, but we need to make sure the underlying implementation reuses the code, to avoid problems like the missing --hivevar substitution.

@LantaoJin (Contributor, Author) commented Mar 14, 2018

Thanks a lot, @cloud-fan. Problems like the missing --hivevar substitution also exist in the current implementation (displaying SQL text on the jobs pages); I will try to fix that in my ticket. To be more accurate, this patch not only moves the SQL text from the jobs page to the SQL query page, but also solves the problem that SQL text cannot be captured from bin/spark-submit or bin/spark-shell. bin/spark-sql (client deploy mode) is mostly used in ad-hoc scenarios; many other Spark SQL scenarios, such as daily warehouse jobs and ETL jobs that need to be submitted to the cluster, are not covered by SPARK-4871.

@LantaoJin (Contributor, Author)

Hi @wangyum, the problem with variable substitution is now resolved.
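
Concretely, the fix runs the raw text through Spark's internal substitution before capturing it, so the captured text matches what actually executes. A minimal sketch, assuming org.apache.spark.sql.internal.VariableSubstitution and a conf entry standing in for --hiveconf a=avalue:

import org.apache.spark.sql.internal.{SQLConf, VariableSubstitution}

val conf = new SQLConf
conf.setConfString("a", "avalue") // stands in for --hiveconf a=avalue
val substitutor = new VariableSubstitution(conf)

// The raw template is resolved before being stored, so the UI shows
// "select 'avalue'" rather than "select '${a}'".
val captured = substitutor.substitute("select '${a}'")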

}

def getSqlText(executionId: Long): String = {
  executionIdToSqlText.get(executionId)
Contributor:

what if this execution doesn't have SQL text?

@LantaoJin (Contributor, Author):

It shows nothing

* @since 2.0.0
*/
def sql(sqlText: String): DataFrame = {
  SQLExecution.setSqlText(substitutor.substitute(sqlText))
Contributor:

I think the most difficult part is how to connect the SQL text to the execution. I don't think the current approach works, e.g.

val df = spark.sql("xxxxx")
spark.range(10).count()

You set the SQL text for the next execution, but the next execution may not happen on this dataframe.

I think SQL text should belong to a DataFrame, and executions on this dataframe show the SQL text. e.g.

val df = spark.sql("xxxxxx")
df.collect() // this should show sql text on the UI
df.count() // shall we show the sql text?
df.show() // this adds a limit on top of the query plan, but ideally we should show the sql text.
df.filter(...).collect() // how about this?

@LantaoJin (Contributor, Author):

@cloud-fan, binding the SQL text to the DataFrame is a good idea. I'm trying to fix the cases you listed above.

Contributor:

It's better to answer the list first. Strictly speaking, except for collect, most DataFrame operations create another DataFrame and execute that one: e.g. .count() creates a new DataFrame with an aggregate, and .show() creates a new DataFrame with a limit.

It seems like df.count should not show the SQL text, but df.show should, as it's very common.

@LantaoJin (Contributor, Author)

@cloud-fan, please review.
The test result is:
val df = spark.sql("xxxxx")
spark.range(10).count() // nothing shows in UI
df.collect() // show sql text "xxxxx" on the UI
df.count() // show sql text "xxxxx" on the UI
df.show() // show sql text "xxxxx" on the UI
df.filter(...).collect() // show sql text "xxxxx" on the UI

@LantaoJin (Contributor, Author)

@cloud-fan @jerryshao In the last commit, it seems I hit a Scala bug. :-(

[error] /Users/lajin/git/my/spark/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:63: in object Dataset, multiple overloaded alternatives of define default arguments
[error] Error occurred in an application involving default arguments.
[error] private[sql] object Dataset {
[error] ^

https://stackoverflow.com/questions/24991209/scala-2-11-complains-with-multiple-overloaded-alternatives-of-method
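
For reference, the limitation reproduces with just two overloads that both declare default arguments:

// Minimal reproduction: Scala forbids two overloaded alternatives
// that both define default arguments.
object Dataset {
  def apply(name: String, limit: Int = 10): String = s"$name limit $limit"
  def apply(id: Long, limit: Int = 10): String = s"$id limit $limit"
  // error: in object Dataset, multiple overloaded alternatives of
  //        method apply define default arguments
}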

@cloud-fan (Contributor)

Sorry, I didn't state it clearly enough. I was not suggesting showing the SQL text for all of these cases, but trying to raise a discussion about when we should show it. E.g., for df.count() and df.filter(...).count it seems we should not show it.

@LantaoJin (Contributor, Author) commented Mar 19, 2018

@cloud-fan, please review.
Now the test result is:
val df = spark.sql("xxxxx")
spark.range(10).collect() // nothing shows in UI
df.collect() // shows sql text "xxxxx"
df.count() // nothing shows in UI
df.show() // shows sql text "xxxxx"
df.filter(...).collect() // shows sql text "xxxxx"
df.filter(...).count() // nothing shows in UI

  @DeveloperApi @InterfaceStability.Unstable @transient val queryExecution: QueryExecution,
- encoder: Encoder[T])
+ encoder: Encoder[T],
+ val sqlText: String = "")
Contributor:

what's the exact rule you defined to decide whether or not we should propagate the sql text?

Contributor:

And how does the SQL shell execute commands like SELECT * FROM ...? Does it display all the rows, or add a LIMIT before displaying? Generally we should not propagate the SQL text: a new DataFrame usually means the plan has changed, so the SQL text is no longer accurate.

@LantaoJin (Contributor, Author) Mar 20, 2018:

Thanks for your review. I agree with this comment. Before the discussion, let me describe the scenario our company faces. Team A developed a framework to submit applications with SQL statements in a file:

spark-submit --master yarn-cluster --class com.ebay.SQLFramework -s biz.sql

In biz.sql, there are many SQL statements like

create or replace temporary view view_a as select xx from ${old_db}.table_a where dt=${check_date};
insert overwrite table ${new_db}.table_a select xx from view_a join ${new_db}.table_b;
...

There is no case like
val df = spark.sql("xxxxx")
spark.range(10).collect()
df.filter(..).count()

Team B (Platform) needs to capture the actual SQL statements executed across the whole cluster, since the SQL files from Team A contain many variables. A better way is to record the actual SQL statements in the event log.

OK, back to the discussion. The original purpose is to display the SQL statement the user inputs. Neither spark.range(10).collect() nor df.filter(..).count() is a SQL statement the user inputs; only "xxxxx" is. So I have two proposals and a further thought.

  1. Change the display behavior: only display SQL statements that trigger an action, like "create table", "insert overwrite", etc., and ignore plain select statements. Then the SQL text is never propagated, and the test case above would show nothing in the SQL UI. Also, the UI would show "SQL text which triggers this execution" instead of "SQL text".
  2. Add a SQLCommandEvent and post an event carrying the SQL statement in SparkSession.sql(); then the EventLoggingListener just logs it to the event log. I am not sure whether we can still get the SQL text in the UI this way.

Furthermore, what about opening another ticket to add a command-line option --sqlfile biz.sql to spark-submit? biz.sql must be a file consisting of SQL statements. Based on this implementation, not only client mode but also cluster mode could run pure SQL.

What do you think? @cloud-fan

Contributor:

spark-submit --master yarn-cluster --class com.ebay.SQLFramework -s biz.sql

How does com.ebay.SQLFramework process the SQL file? Does it just call spark.sql(xxxx).show, or something else?

@LantaoJin (Contributor, Author):

Your speculation is almost right. It first calls val df = spark.sql(...), then classifies the SQL text by pattern matching into three types: count, limit, and other. If count, it invokes df.showString(2, 20); if limit, it just invokes df.limit(1).foreach; the last type, other, does nothing further.
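
A hedged sketch of that loop (com.ebay.SQLFramework is internal, so everything below is illustrative; df.show stands in for the private[sql] df.showString(2, 20), and df.limit(1).collect() for the foreach call):

import scala.io.Source
import org.apache.spark.sql.SparkSession

object SqlFrameworkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val statements =
      Source.fromFile(args(0)).mkString.split(";").map(_.trim).filter(_.nonEmpty)

    statements.foreach { stmt =>
      val df = spark.sql(stmt) // DDL/DML side effects happen here
      // Classify the statement text by pattern matching, as described above.
      stmt.toLowerCase match {
        case s if s.contains("count(")  => df.show(2)
        case s if s.contains(" limit ") => df.limit(1).collect()
        case _                          => () // the "other" type does nothing further
      }
    }
  }
}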

@LantaoJin (Contributor, Author) commented Mar 21, 2018

I have decoupled the sqlText from the SQL execution. In the current implementation, when the user invokes spark.sql(xx), a new SparkListenerSQLTextCaptured event is posted to the listener bus. Then in SQLAppStatusListener the information is stored, and all the SQL statements are displayed in AllExecutionsPage, ordered by submission time, instead of in each ExecutionPage. I will upload the commit after testing. (Or would it be better to create a new PR?)
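
A sketch of the shape of this approach (the event's field names are illustrative, and posting it assumes Spark-internal access to the listener bus):

import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import scala.collection.mutable.ArrayBuffer

// The new event carrying the substituted SQL text and its submission time.
case class SparkListenerSQLTextCaptured(sqlText: String, submissionTime: Long)
  extends SparkListenerEvent

// SQLAppStatusListener would pick the event up via onOtherEvent and keep
// the texts ordered by submission time for AllExecutionsPage.
class SqlTextListenerSketch extends SparkListener {
  private val captured = ArrayBuffer.empty[(Long, String)]

  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLTextCaptured =>
      captured += ((e.submissionTime, e.sqlText))
    case _ =>
  }
}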

@LantaoJin (Contributor, Author)

[screenshot: screen shot 2018-03-21 at 23 22 07]

def sql(sqlText: String): DataFrame = {
- Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
+ Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText),
+   substitutor.substitute(sqlText))
Member:

Hi, @LantaoJin.
What you need is just grabbing the initial SQL text here; you can use a Spark extension. Please refer to the Spark Atlas Connector for sample code.

Member:

You may want to refactor this PR into a ParserExtension part and a UI part. I think that would be less intrusive than the current implementation.
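
For illustration, a sketch against the Spark 2.3-era ParserInterface; the wrapper and extension names are made up, and the capture point would post an event or write to the event log rather than print:

import org.apache.spark.sql.SparkSessionExtensions
import org.apache.spark.sql.catalyst.{FunctionIdentifier, TableIdentifier}
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.parser.ParserInterface
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.types.{DataType, StructType}

// Records the SQL text of every parsePlan call, then delegates to the real parser.
class SqlTextCapturingParser(delegate: ParserInterface) extends ParserInterface {
  override def parsePlan(sqlText: String): LogicalPlan = {
    println(s"captured SQL: $sqlText") // capture point
    delegate.parsePlan(sqlText)
  }
  // Everything else just delegates.
  override def parseExpression(sqlText: String): Expression =
    delegate.parseExpression(sqlText)
  override def parseTableIdentifier(sqlText: String): TableIdentifier =
    delegate.parseTableIdentifier(sqlText)
  override def parseFunctionIdentifier(sqlText: String): FunctionIdentifier =
    delegate.parseFunctionIdentifier(sqlText)
  override def parseTableSchema(sqlText: String): StructType =
    delegate.parseTableSchema(sqlText)
  override def parseDataType(sqlText: String): DataType =
    delegate.parseDataType(sqlText)
}

// Enabled with --conf spark.sql.extensions=SqlTextExtension
class SqlTextExtension extends (SparkSessionExtensions => Unit) {
  override def apply(ext: SparkSessionExtensions): Unit =
    ext.injectParser((_, parser) => new SqlTextCapturingParser(parser))
}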

@dongjoon-hyun (Member) Mar 22, 2018:

BTW, in general, the initial SQL text easily becomes meaningless when other operations are added. In your example, the following case shows a misleading, wrong SQL statement instead of the SQL plan that really executed.

val df = spark.sql("xxxxx")
df.filter(...).collect() // shows sql text "xxxxx"

As another example, please try the following. It will show you select a,b from t1.

scala> spark.sql("select a,b from t1").select("a").show
+---+
|  a|
+---+
|  1|
+---+

@LantaoJin (Contributor, Author):

the following case shows a misleading, wrong SQL statement instead of the SQL plan that really executed.

Yes, we know this; that's why the current implementation, which binds the SQL text to the DataFrame, is not good.

@LantaoJin (Contributor, Author)

Hi @jerryshao @cloud-fan @dongjoon-hyun, I would like to close this PR and open another one, #20876. Would you please move to that one?
