
Conversation

@zsxwing
Member

@zsxwing zsxwing commented Feb 1, 2018

What changes were proposed in this pull request?

Sort jobs/stages/tasks/queries by completion timestamp before cleaning them up, to make the behavior consistent with 2.2.

How was this patch tested?

  • Jenkins.
  • Manually ran the following code and checked the UI for jobs/stages/tasks/queries.
```
spark.ui.retainedJobs 10
spark.ui.retainedStages 10
spark.sql.ui.retainedExecutions 10
spark.ui.retainedTasks 10
```

```
new Thread() {
  override def run() {
    spark.range(1, 2).foreach { i =>
      Thread.sleep(10000)
    }
  }
}.start()

Thread.sleep(5000)

for (_ <- 1 to 20) {
  new Thread() {
    override def run() {
      spark.range(1, 2).foreach { i =>
      }
    }
  }.start()
}

Thread.sleep(15000)
spark.range(1, 2).foreach { i =>
}

sc.makeRDD(1 to 100, 100).foreach { i =>
}
```

@zsxwing
Member Author

zsxwing commented Feb 1, 2018

cc @vanzin @cloud-fan

```
 val iter = view.closeableIterator()
 try {
-  iter.asScala.filter(filter).take(max).toList
+  iter.asScala.filter(filter).toList.sortBy(sorter).take(max)
```
Contributor

So, aside from the two closure parameters making the calls super ugly, this is more expensive than the previous version.

Previously:

  • filter as you iterate over view
  • limit iteration
  • materialize "max" elements

Now:

  • filter as you iterate over view
  • materialize all elements that pass the filter
  • sort and take "max" elements

This will, at least, make replaying large apps a lot slower, given the filter in the task cleanup method.

```
// Try to delete finished tasks only.
val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
  !live || t.status != TaskState.RUNNING.toString()
}
```

So, when replaying, every time you need to do a cleanup of tasks, you'll deserialize all tasks for the stage. If you have a stage with 10s of thousands of tasks, that's super expensive.

If all you want to change here is the sorting of jobs, I'd recommend adding a new index to JobDataWrapper that sorts them by end time. Then you can do the sorting before you even call this method, by setting up the view appropriately.

If you also want to sort the others (stages, tasks, and sql executions), you could also create indices for those.

Or you could find a way to do this that is not so expensive on the replay side...

If adding indices, though, I'd probably try to get this into 2.3.0 since it would change the data written to disk.
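
For concreteness, a rough sketch of the index approach (illustrative only, not the final patch; the annotation style and the `KVUtils.viewToSeq` usage mirror the surrounding diffs):

```scala
import com.fasterxml.jackson.annotation.JsonIgnore
import org.apache.spark.JobExecutionStatus
import org.apache.spark.util.kvstore.KVIndex

// In JobDataWrapper: expose the completion time as a secondary index so
// the store hands back jobs already sorted by it (running jobs sort last).
@JsonIgnore @KVIndex("completionTime")
private def completionTime: Long =
  info.completionTime.map(_.getTime).getOrElse(Long.MaxValue)

// In the cleanup path: iterate oldest-completed first, no in-memory sort.
val toDelete = KVUtils.viewToSeq(
  kvstore.view(classOf[JobDataWrapper]).index("completionTime"),
  countToDelete) { j => j.info.status != JobExecutionStatus.RUNNING }
```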

Member Author

@vanzin Yeah, I understand the expensive sort. However, adding indices needs more work. Do you have time to try it since I'm not familiar with LevelDB?


@zsxwing
Member Author

zsxwing commented Feb 2, 2018

@vanzin just updated it. I didn't fix the task order, as I think it's already using the stage index and I cannot iterate tasks using two indices.

@vanzin
Contributor

vanzin commented Feb 2, 2018

> I cannot iterate tasks using two indices.

You can actually; indices can have a parent index, and there's actually a bunch of examples in TaskDataWrapper.

Use them like this.

view.index("your index").parent(stageKey).blah blah blah
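
Spelled out (a sketch; the `TaskIndexNames.COMPLETION_TIME` constant and the `Array(stageId, stageAttemptId)` key shape are assumptions based on the diffs in this PR):

```scala
// Iterate one stage's tasks through a child index: parent() restricts
// the completion-time index to entries belonging to that stage.
val view = kvstore.view(classOf[TaskDataWrapper])
  .index(TaskIndexNames.COMPLETION_TIME)
  .parent(Array(stageId, stageAttemptId))
val toDelete = KVUtils.viewToSeq(view, countToDelete) { t =>
  !live || t.status != TaskState.RUNNING.toString()
}
```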

```
private def id: Int = info.jobId

@JsonIgnore @KVIndex("completionTime")
private def completionTime: Long = info.completionTime.map(_.getTime).getOrElse(Long.MaxValue)
```
Contributor

This is fine, but if you want you can probably replace the filters in the listener by setting this to -1 for running jobs / stages / others, and starting iteration at "0".

@zsxwing
Member Author

zsxwing commented Feb 2, 2018

> This is fine, but if you want you can probably replace the filters in the listener by setting this to -1 for running jobs / stages / others, and starting iteration at "0".

Not sure if I get it correctly. I just made completionTime return 0 for running jobs/... and started iteration at "0".
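
Concretely, the pattern being discussed looks something like this (a sketch, assuming the -1 sentinel for running items that the patch ends up using):

```scala
// Running items are indexed at -1, so starting at 0 visits only
// completed ones, already in completion-time order; no filter needed.
val toDelete = KVUtils.viewToSeq(
  kvstore.view(classOf[JobDataWrapper])
    .index("completionTime")
    .first(0L),
  countToDelete) { _ => true }
```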

```
final val STAGE = "stage"
final val STATUS = "sta"
final val TASK_INDEX = "idx"
final val COMPLETION_TIME = "completionTime"
```
Contributor

Could you use a shorter name like the others? It saves a little bit more space on disk because there are so many tasks in large apps.
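
For example, a short alias in the style of the neighboring constants (the exact name is whatever the patch settles on; `"ct"` is just illustrative):

```scala
final val COMPLETION_TIME = "ct"
```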

Contributor

+1

Contributor

I've asked this before: is it possible to put an ID instead of the index name into the kvstore? Then we could use long index names.

Contributor

No, right now there's no support for that for indices.

Member

+1

@vanzin
Contributor

vanzin commented Feb 2, 2018

The logic looks ok. Did you look at adding a test in AppStatusListenerSuite for this? There's already a test for the cleanup, it'd be nice if it were tweaked to cover the changes here.

@SparkQA

SparkQA commented Feb 2, 2018

Test build #86947 has finished for PR 20481 at commit 761f1ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@jiangxb1987 jiangxb1987 left a comment

LGTM, only tiny nits. Also, it would be great to have a simple test case for this.

```
private def id: Int = info.jobId

@JsonIgnore @KVIndex("completionTime")
private def completionTime: Long = info.completionTime.map(_.getTime).getOrElse(-1)
```
Contributor

nit: -1 -> -1L

```
private def active: Boolean = info.status == StageStatus.ACTIVE

@JsonIgnore @KVIndex("completionTime")
private def completionTime: Long = info.completionTime.map(_.getTime).getOrElse(-1)
```
Contributor

nit: -1 -> -1L

@SparkQA

SparkQA commented Feb 2, 2018

Test build #86952 has finished for PR 20481 at commit f0de4be.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 2, 2018

Test build #86953 has finished for PR 20481 at commit 0424c1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 2, 2018

Test build #86954 has finished for PR 20481 at commit 4c1080a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

LGTM

@zsxwing
Member Author

zsxwing commented Feb 2, 2018

> There's already a test for the cleanup, it'd be nice if it were tweaked to cover the changes here.

I created separate tests for jobs/stages/tasks/executions, as the existing cleanup test is already pretty complicated and mixing more logic into it would make it much harder to understand.

@SparkQA

SparkQA commented Feb 3, 2018

Test build #87005 has finished for PR 20481 at commit b83b396.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Feb 5, 2018
…d timestamp before cleaning up them

## What changes were proposed in this pull request?

Sort jobs/stages/tasks/queries by completion timestamp before cleaning them up, to make the behavior consistent with 2.2.

## How was this patch tested?

- Jenkins.
- Manually ran the following code and checked the UI for jobs/stages/tasks/queries.

```
spark.ui.retainedJobs 10
spark.ui.retainedStages 10
spark.sql.ui.retainedExecutions 10
spark.ui.retainedTasks 10
```

```
new Thread() {
  override def run() {
    spark.range(1, 2).foreach { i =>
      Thread.sleep(10000)
    }
  }
}.start()

Thread.sleep(5000)

for (_ <- 1 to 20) {
  new Thread() {
    override def run() {
      spark.range(1, 2).foreach { i =>
      }
    }
  }.start()
}

Thread.sleep(15000)
spark.range(1, 2).foreach { i =>
}

sc.makeRDD(1 to 100, 100).foreach { i =>
}
```

Author: Shixiong Zhu <zsxwing@gmail.com>

Closes #20481 from zsxwing/SPARK-23307.

(cherry picked from commit a6bf3db)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in a6bf3db Feb 5, 2018
@zsxwing zsxwing deleted the SPARK-23307 branch February 6, 2018 06:36
@zsxwing
Member Author

zsxwing commented Feb 20, 2018

I didn't know InMemoryStore doesn't have any indices... Just saw that the UI is very slow on a large cluster and it causes read timeouts.

@cloud-fan
Contributor

IIUC we use indices a lot in the UI code. @vanzin, is it possible to support indices for the in-memory kv store?

@vanzin
Contributor

vanzin commented Feb 20, 2018

Indices in the in-memory store are kinda dumb right now. It should be possible to do something smarter but that would increase memory usage.

It would also be good to know what "large cluster" means. Large clusters shouldn't affect the UI responsiveness. Large apps might; but I tried apps with 100k+ tasks on each stage and things seemed fine.

@zsxwing
Member Author

zsxwing commented Feb 20, 2018

@vanzin NVM. I was wrong. The issue is indeed not related to this PR. Please take a look at https://issues.apache.org/jira/browse/SPARK-23470
