
Conversation

@devaraj-kavali

What changes were proposed in this pull request?

`hsync` was added as part of SPARK-19531 to make the latest data visible in the history server UI, but it causes performance overhead and leads to many dropped history log events. `hsync` uses `FileChannel.force` to sync the data to disk, and this happens across the data pipeline; it is a costly operation that makes the application incur overhead and drop events.

Getting the latest data into the history server can be done in a different way, with no impact on the application while it writes events. The `DFSInputStream.getFileLength()` API returns the file length including the `lastBlockBeingWrittenLength` (unlike `FileStatus.getLen()`). When the file status length and the previously cached length are equal, this API can be used to verify whether any new data has been written; if the length has grown, the history server can reload the in-progress history log. I also made this change configurable, with a default value of false, so it can be enabled on the history server when users want to see updated data in the UI.
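
A minimal sketch of that check (not the exact patch): the method name `hasNewData` and the `cachedFileSize` parameter are illustrative, and it assumes the underlying filesystem is HDFS so that the wrapped stream is a `DFSInputStream`.

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem}
import org.apache.hadoop.hdfs.DFSInputStream

// Sketch: decide whether an in-progress event log has grown since it was last read.
def hasNewData(fs: FileSystem, status: FileStatus, cachedFileSize: Long): Boolean = {
  if (status.getLen > cachedFileSize) {
    // FileStatus already reports more bytes than we have processed.
    true
  } else {
    // FileStatus shows no growth; ask HDFS for the length that includes the
    // last block still being written.
    val in = fs.open(status.getPath)
    try {
      in.getWrappedStream match {
        case dfsIn: DFSInputStream => dfsIn.getFileLength > cachedFileSize
        case _ => false // not an HDFS stream, nothing more to check
      }
    } finally {
      in.close()
    }
  }
}
```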

How was this patch tested?

Added a new test and verified manually. With the added conf `spark.history.fs.inProgressAbsoluteLengthCheck.enabled=true`, the history server reads the logs including the last block that is still being written and updates the Web UI with the latest data.

FsHistoryProvider to read lastBlockBeingWritten data for logs
@vanzin
Contributor

vanzin commented Oct 17, 2018

add to whitelist

@vanzin
Contributor

vanzin commented Oct 17, 2018

ok to test

@SparkQA

SparkQA commented Oct 17, 2018

Test build #97474 has finished for PR 22752 at commit a3f53c4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

- if (info.fileSize < entry.getLen()) {
+ if (info.fileSize < entry.getLen() || checkAbsoluteLength(info, entry)) {
Member

@gengliangwang gengliangwang Oct 17, 2018

I think we can create a function to get the length of a given file:

  1. If the new conf is enabled and the input is a DFSInputStream, use getFileLength (or max(getFileLength, entry.getLen())).
  2. Otherwise, use entry.getLen().

The logic can then be simpler.
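
For illustration, a minimal sketch of such a helper (the name `bestKnownLength`, the `absoluteLengthCheck` flag, and the exact `max` behavior are illustrative, not the merged code; it uses Spark's internal `Utils.tryWithResource` to close the stream, so it only compiles inside the Spark code base):

```scala
import org.apache.hadoop.fs.{FileStatus, FileSystem}
import org.apache.hadoop.hdfs.DFSInputStream
import org.apache.spark.util.Utils

// Sketch: return the best-known length of an event log file.
def bestKnownLength(fs: FileSystem, entry: FileStatus, absoluteLengthCheck: Boolean): Long = {
  if (!absoluteLengthCheck) {
    entry.getLen
  } else {
    Utils.tryWithResource(fs.open(entry.getPath)) { in =>
      in.getWrappedStream match {
        // Include the last block still being written, but never report less than FileStatus.
        case dfsIn: DFSInputStream => math.max(dfsIn.getFileLength, entry.getLen)
        case _ => entry.getLen
      }
    }
  }
}
```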

Contributor

Have you looked at this getFileLength() call to see how well it updates?

FWIW, HADOOP-15606 proposes adding a method like this for all streams, though that proposal includes the need for specification and tests. Generally the HDFS team are a bit lax about that spec -> test workflow, which doesn't help downstream code or other implementations.

Author

Thanks @gengliangwang for looking into this. There is no need to call checkAbsoluteLength if FileStatus.getLen() is already greater than the cached fileSize; if we change it to max(getFileLength, entry.getLen()), the absolute length would be checked every time, which may not be necessary.

Author

Thanks @steveloughran for looking into this.

Have you looked at this getFileLength() call to see how well it updates?

I looked at the `DFSInputStream.getFileLength()` API; it returns `locatedBlocks.getFileLength() + lastBlockBeingWrittenLength`, where `locatedBlocks.getFileLength()` is the value obtained from the NameNode for all the completed blocks, and `lastBlockBeingWrittenLength` is the length of the last, not-yet-completed block obtained from the DataNode.
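
As a rough illustration of that composition (a hedged paraphrase in Scala with illustrative parameter names; the real HDFS implementation is Java and carries more bookkeeping):

```scala
// completedBlocksLength: length the NameNode reports for all finalized blocks
// lastBlockBeingWrittenLength: length of the unfinished last block, read from a DataNode
def getFileLength(completedBlocksLength: Long, lastBlockBeingWrittenLength: Long): Long =
  completedBlocksLength + lastBlockBeingWrittenLength
```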

FWIW, HADOOP-15606 proposes adding a method like this for all streams

Thanks for the pointer; once that is available, we can update this to use it.

Contributor

...there's no timetable for that getLength thing, but if HDFS already supports the API, I'm more motivated to implement it. It has benefits in cloud stores in general:

  1. It saves apps an up-front HEAD/getFileStatus() call to find out how long their data is; the GET should return it.
  2. For S3 Select, you get back the filtered data, so you don't know how much you will see until the GET is issued.


val IN_PROGRESS_ABSOLUTE_LENGTH_CHECK =
  ConfigBuilder("spark.history.fs.inProgressAbsoluteLengthCheck.enabled")
    .doc("Enable to check the absolute length of the in-progress" +
Member

Could you explain this in a little more detail, so that general users can know what the benefit is?

Author

Do you have anything in mind to make it better? Thanks.

}
}

private[history] def checkAbsoluteLength(info: LogInfo, entry: FileStatus): Boolean = {
Contributor

The name of the method and the return value are a little cryptic. What does it mean to check?

Might be better to call it something like shouldReloadLog. You could also move the existing check into this function and make the call site simpler.
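
For illustration, a minimal sketch of that suggestion (assuming it sits inside FsHistoryProvider, where `LogInfo` and the existing `checkAbsoluteLength` probe are in scope; this is not necessarily the code that was merged):

```scala
// Fold the cheap FileStatus comparison and the absolute-length probe into one predicate.
private def shouldReloadLog(info: LogInfo, entry: FileStatus): Boolean = {
  // Cheap check first: FileStatus already reports more bytes than we have processed.
  info.fileSize < entry.getLen() ||
    // Otherwise fall back to the HDFS length that includes the block still being written.
    checkAbsoluteLength(info, entry)
}

// The call site then reduces to:
// if (shouldReloadLog(info, entry)) { ... reload the in-progress log ... }
```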

try {
  result = Utils.tryWithResource(fs.open(entry.getPath)) { in =>
    in.getWrappedStream match {
      case dfsIn: DFSInputStream => dfsIn.getFileLength > info.fileSize
Contributor

Ignoring the discussion about the getFileLength API: if this is not a DFSInputStream you'll get a MatchError. Better to handle the default case explicitly.
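
For example, the match in the snippet above could cover the non-HDFS case along these lines (a sketch; returning `false` simply preserves the existing FileStatus-based behavior):

```scala
in.getWrappedStream match {
  case dfsIn: DFSInputStream => dfsIn.getFileLength > info.fileSize
  case _ => false // not an HDFS stream; rely on FileStatus.getLen() alone
}
```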

.createWithDefaultString("1m")

val IN_PROGRESS_ABSOLUTE_LENGTH_CHECK =
  ConfigBuilder("spark.history.fs.inProgressAbsoluteLengthCheck.enabled")
Contributor

Is there any disadvantage in just leaving this always on? Otherwise this doesn't need to be configurable.

Author

This new check adds overhead to the checkForLogs thread, so I made it disabled by default, since most users may not want to see the history UI for running applications; they can enable it explicitly if they want to see in-progress apps in the history UI. I can remove this config if you think it is not very useful.

Contributor

How much overhead are we talking about?

That thread is not really in any performance critical path, and in general people won't have so many running apps that this should become a problem... but that kinda depends on how bad this new call is.

Author

This call makes two invocations: one to get the block info from the NameNode and another to get the last block info from the DataNode. I agree this is not a performance-critical path; I will remove the config in the update.

@SparkQA

SparkQA commented Oct 19, 2018

Test build #97554 has finished for PR 22752 at commit c2f2705.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@devaraj-kavali
Author

@vanzin can you check the updated changes? Thanks.

@vanzin
Copy link
Contributor

vanzin commented Oct 25, 2018

Merging to master / 2.4 (will run a couple of tests on 2.4 before merging there).

@asfgit asfgit closed this in 46d2d2c Oct 25, 2018
asfgit pushed a commit that referenced this pull request Oct 25, 2018
…istoryProvider to read lastBlockBeingWritten data for logs

Closes #22752 from devaraj-kavali/SPARK-24787.

Authored-by: Devaraj K <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit 46d2d2c)
Signed-off-by: Marcelo Vanzin <[email protected]>
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…istoryProvider to read lastBlockBeingWritten data for logs

Closes apache#22752 from devaraj-kavali/SPARK-24787.

Authored-by: Devaraj K <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>