[SPARK-20644][core] Initial ground work for kvstore UI backend. #19582

vanzin · 2017-10-26T22:37:37Z

There are two somewhat unrelated things going on in this patch, but
both are meant to make integration of individual UI pages later on
much easier.

The first part tweaks the listener code so that it performs fewer
updates of the kvstore for data that changes fast; for example, it
avoids writing to the store on every task-related event, since those
can arrive very quickly at times. Instead, for these kinds of events,
it only flushes if a certain interval has passed since the last write.
The interval is based on how often the current spark-shell code updates
the progress bar for jobs, so that users can get reasonably accurate
data.
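
As a rough illustration of the throttling described above, here is a minimal sketch. All names in it (`CountingStore`, `LiveCounter`, `ThrottledListener`, `onTaskEvent`, `flush`) are hypothetical and are not the classes in this patch; the injectable `clock` parameter is an assumption made only so the behavior can be checked deterministically.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the kvstore: it only records values and counts writes.
final class CountingStore {
  val data = mutable.Map.empty[String, Long]
  var writes = 0
  def write(key: String, value: Long): Unit = {
    data(key) = value
    writes += 1
  }
}

// Hypothetical live entity: mutated in memory on every event, written out rarely.
final class LiveCounter(val key: String) {
  var value = 0L
  // None until the entity has been written at least once.
  var lastWriteTime: Option[Long] = None
}

// The clock is injectable so the sketch can be exercised without real time passing.
final class ThrottledListener(
    store: CountingStore,
    updatePeriodNs: Long,
    clock: () => Long = () => System.nanoTime()) {

  // Fast-changing events: always update memory, only sometimes touch the store.
  def onTaskEvent(entity: LiveCounter): Unit = {
    entity.value += 1
    maybeUpdate(entity)
  }

  // Events that should be persisted immediately bypass the throttle.
  def flush(entity: LiveCounter): Unit = write(entity, clock())

  private def maybeUpdate(entity: LiveCounter): Unit = {
    // A negative period disables periodic writes in this sketch.
    if (updatePeriodNs >= 0) {
      val now = clock()
      // Write if never written, or if more than one period has elapsed.
      if (entity.lastWriteTime.forall(last => now - last > updatePeriodNs)) {
        write(entity, now)
      }
    }
  }

  private def write(entity: LiveCounter, now: Long): Unit = {
    store.write(entity.key, entity.value)
    entity.lastWriteTime = Some(now)
  }
}
```

With a period of 100 ns and task events at times 1000, 1050, 1099, 1101 and 1150, only the first and fourth event reach the store in this sketch; a final `flush` then writes the latest in-memory value.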

When replaying apps in the history server, the code also delays hitting
the underlying kvstore for as long as possible. This avoids unnecessary
writes to disk.

The second set of changes prepares the history server and SparkUI for
integrating with the kvstore. A new class, AppStatusStore, is used
for translating between the stored data and the types used in the
UI / API. The SHS now populates a kvstore with data loaded from
event logs when an application UI is requested.

Because this store can hold references to disk-based resources, the
code was modified to retrieve data from the store under a read lock.
This allows the SHS to detect when the store is still being used, and
only update it (e.g. because an updated event log was detected) when
there is no other thread using the store.
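
To make the locking idea concrete, here is a minimal sketch using `java.util.concurrent.locks.ReentrantReadWriteLock`. The wrapper `GuardedStore` and its methods `withStore` and `tryReplace` are hypothetical names chosen for illustration; the actual patch may structure this differently.

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock

// Hypothetical wrapper around a store of type S; not a class from this patch.
final class GuardedStore[S](initial: S) {
  private val lock = new ReentrantReadWriteLock()
  private var store: S = initial

  // Readers hold the read lock for the whole duration of their access,
  // so the store cannot be swapped out from under them.
  def withStore[T](fn: S => T): T = {
    lock.readLock().lock()
    try fn(store) finally lock.readLock().unlock()
  }

  // Replace the store only if nobody is reading it right now.
  // Returns true if the replacement happened, false if the store was in use.
  def tryReplace(newStore: S): Boolean = {
    if (lock.writeLock().tryLock()) {
      try {
        store = newStore
        true
      } finally {
        lock.writeLock().unlock()
      }
    } else {
      false
    }
  }
}
```

In this sketch a caller that gets `false` back from `tryReplace` would simply retry later, which corresponds to deferring the update until the store is no longer in use.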

This change ended up creating a lot of churn in the ApplicationCache
code, which was cleaned up a lot in the process. I also removed some
metrics which don't make too much sense with the new code.

Tested with existing and added unit tests, and by making sure the SHS
still works on a real cluster.


vanzin · 2017-10-27T00:50:00Z

For context:

Project link: https://issues.apache.org/jira/browse/SPARK-18085
Upcoming PRs that build on this code: https://github.com/vanzin/spark/pulls

A special note about this PR: this marks a sort of "point of no return" for the UI. Once this is in, the UI will be in a weird franken-state until vanzin#51 / SPARK-20653 is committed. Until then, there will be duplicate listeners collecting data, which can slow things down a bit in the event bus, and also increase memory usage.

SparkQA · 2017-10-27T02:04:24Z

Test build #83096 has finished for PR 19582 at commit f73af34.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

vanzin · 2017-11-01T18:38:36Z

@squito

squito

sorry took me a while to get to this.

squito · 2017-11-03T06:06:01Z

core/src/main/scala/org/apache/spark/status/AppStatusListener.scala

+  /** Update a live entity only if it hasn't been updated in the last configured period. */
+  private def maybeUpdate(entity: LiveEntity): Unit = {
+    if (liveUpdatePeriodNs >= 0) {
+      val now = System.nanoTime()


System.nanoTime() can be somewhat expensive, right? There are a few places where you call it repeatedly; you might as well call nanoTime() once outside of this method and pass it in.

I didn't notice that method ever showing up when profiling; my guess is it's just a read from some CPU register (TSC?) and so reasonably cheap.

I think it can vary a lot with the OS (and maybe the hardware?). Unfortunately, when searching now, most of the references are really dated, and I have no idea what info is obsolete. But there is enough evidence that it seems prudent to avoid calling it a lot, in case it's slow in some situations.

squito · 2017-11-03T16:49:54Z

core/src/main/scala/org/apache/spark/status/LiveEntity.scala

+
  def write(store: KVStore): Unit = {
    store.write(doUpdate())
+    lastWriteTime = System.nanoTime()


you could also pass the nanoTime down into this, to avoid calling it again
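
A minimal sketch of what this suggestion amounts to: read the clock once per event and pass the value down to every helper that needs it. The names `TimedEntity`, `OncePerEventListener` and `onTaskEnd` are hypothetical, and the injectable `clock` is an assumption that makes it possible to count how often the clock is read.

```scala
// Hypothetical entity that records when it was last written.
final class TimedEntity(val name: String) {
  var lastWriteTime: Option[Long] = None
  var writes = 0

  // The timestamp is passed in instead of being read again here.
  def write(now: Long): Unit = {
    writes += 1
    lastWriteTime = Some(now)
  }
}

final class OncePerEventListener(periodNs: Long, clock: () => Long) {
  val job = new TimedEntity("job")
  val stage = new TimedEntity("stage")

  def onTaskEnd(): Unit = {
    // Single clock read for the whole event, shared by all updates below.
    val now = clock()
    maybeUpdate(job, now)
    maybeUpdate(stage, now)
  }

  private def maybeUpdate(entity: TimedEntity, now: Long): Unit = {
    if (periodNs >= 0 && entity.lastWriteTime.forall(last => now - last > periodNs)) {
      entity.write(now)
    }
  }
}
```

In production code the clock argument would presumably be `() => System.nanoTime()`; here each call to `onTaskEnd` reads it exactly once, no matter how many entities are touched.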

squito

lgtm

One suggestion on comments, one teeny nit.

squito · 2017-11-03T19:54:29Z

core/src/main/scala/org/apache/spark/status/storeTypes.scala

 import org.apache.spark.util.kvstore.KVIndex

+private[spark] case class AppStatusStoreMetadata(
+    val version: Long)


nit: val is unnecessary

squito · 2017-11-03T21:54:41Z

core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala

+      // Invalidate the existing UI for the reloaded app attempt, if any. Note that this does
+      // not remove the UI from the active list; that has to be done in onUIDetached, so that
+      // cleanup of files can be done in a thread-safe manner. It does mean the UI will remain
+      // in memory for longer than it should.


It took me some time to figure out how the cache & invalidation worked, mostly because I wasn't looking in the right places. I don't think you've made this any more confusing than it was before (in fact it's probably better), but this seems like a good opportunity to improve the comments a little. I think it might help to have one comment in the code where the entire sequence is described (here on mergeApplicationListing, or on AppCache, or ApplicationCacheCheckFilter; it doesn't really matter, but they could all reference the longer comment). If I understand correctly, it would be something like:

Logs of incomplete apps are regularly polled to see if they have been updated (based on an increase in file size). If they have, the existing data for that app is marked as invalid in LoadedAppUI. However, no memory is freed and no files are cleaned up at this time, nor is a new UI built. On each request for one app's UI, the application cache is checked to see if it has a valid LoadedAppUI. If there is data in the cache and it's valid, then it's served. If there is data in the cache but it is invalid, then the UI is rebuilt from the raw event logs. If there is nothing in the cache, then the UI is built from the raw event logs and added to the cache. This may kick another entry out of the cache; if that entry is for an incomplete app, then any KVStore data written to disk is deleted (as the KVStore for an incomplete app is always regenerated from scratch anyway).

Done. Also updated a bunch of other stale comments.
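
The sequence described in the review comment above can be sketched roughly as follows. This is an illustration under stated assumptions, not the ApplicationCache code: `LoadedUI`, `UICache`, `build` and `onEvict` are hypothetical names, eviction here is simply oldest-inserted-first, and calling `onEvict` when a stale entry is replaced is an assumption of the sketch.

```scala
import scala.collection.mutable

// Hypothetical loaded UI with a validity flag that polling can clear.
final class LoadedUI(val appId: String, val generation: Int) {
  @volatile var valid = true
  def invalidate(): Unit = { valid = false }
}

// Hypothetical cache. `build` stands in for replaying the event log;
// `onEvict` stands in for cleaning up any on-disk data of an entry.
final class UICache(maxEntries: Int, build: String => LoadedUI, onEvict: LoadedUI => Unit) {
  require(maxEntries >= 1, "cache must hold at least one entry")

  // Insertion-ordered map, so `head` is the oldest entry.
  private val entries = mutable.LinkedHashMap.empty[String, LoadedUI]

  // Polling only marks the entry invalid: nothing is freed or rebuilt here.
  def invalidate(appId: String): Unit = synchronized {
    entries.get(appId).foreach(_.invalidate())
  }

  // On request: serve valid data, rebuild stale data, build missing data.
  def get(appId: String): LoadedUI = synchronized {
    entries.get(appId) match {
      case Some(ui) if ui.valid => ui
      case Some(stale) =>
        entries.remove(appId)
        onEvict(stale)
        load(appId)
      case None => load(appId)
    }
  }

  private def load(appId: String): LoadedUI = {
    val ui = build(appId)
    if (entries.size >= maxEntries) {
      // Adding a new entry may kick the oldest one out of the cache.
      val (oldId, old) = entries.head
      entries.remove(oldId)
      onEvict(old)
    }
    entries(appId) = ui
    ui
  }
}
```

The point of the sketch is the split of responsibilities: invalidation is cheap and happens during polling, while the expensive rebuild and any cleanup happen only when a request actually arrives or an entry is evicted.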

vanzin · 2017-11-03T23:45:02Z

retest this please

SparkQA · 2017-11-04T03:07:01Z

Test build #83424 has finished for PR 19582 at commit 537c7b4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

squito · 2017-11-06T14:46:35Z

merged to master

squito reviewed Nov 3, 2017

Call System.nanoTime() only once per event.

eaf3c85

squito approved these changes Nov 3, 2017

Style, more comments.

537c7b4

asfgit closed this in c7f38e5 Nov 6, 2017

vanzin deleted the SPARK-20644 branch November 6, 2017 18:38


3 participants
