Commit 554600c
[SPARK-33841][CORE] Fix issue with jobs disappearing intermittently from the SHS under high load
### What changes were proposed in this pull request?
Mark SHS event log entries that were `processing` at the beginning of the `checkForLogs` run as not stale and check for this mark before deleting an event log. This fixes the issue when a particular job was displayed in the SHS and disappeared after some time, but then, in several minutes showed up again.
### Why are the changes needed?
The issue is caused by [SPARK-29043](https://issues.apache.org/jira/browse/SPARK-29043), which is designated to improve the concurrent performance of the History Server. The [change](https://github.com/apache/spark/pull/25797/files#) breaks the ["app deletion" logic](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R563) because of missing proper synchronization for `processing` event log entries. Since SHS now [filters out](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) all `processing` event log entries, such entries do not have a chance to be [updated with the new `lastProcessed`](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R472) time and thus any entity that completes processing right after [filtering](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) and before [the check for stale entities](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R560) will be identified as stale and will be deleted from the UI until the next `checkForLogs` run. This is because [updated `lastProcessed` time is used as criteria](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R557), and event log entries that missed to be updated with a new time, will match that criteria.
The issue can be reproduced by generating a big number of event logs and uploading them to the SHS event log directory on S3. Essentially, around 236(26.7 MB) copies of an event log directory were created using [shs-monitor](https://github.com/vladhlinsky/shs-monitor/tree/spark-master) script. Strange behavior of SHS counting the total number of applications was noticed - at first, the number was increasing as expected, but with the next page refresh, the total number of applications decreased. No errors were logged by SHS.
58 entities are displayed at `17:35:35`:

25 entities are displayed at `17:36:40`:

### Does this PR introduce _any_ user-facing change?
Yes, SHS users won't face the behavior when the number of displayed applications decreases periodically.
### How was this patch tested?
Tested using [shs-monitor](https://github.com/vladhlinsky/shs-monitor/tree/spark-master) script:
* Build SHS with the proposed change
* Download Hadoop AWS and AWS Java SDK
* Prepare S3 bucket and user for programmatic access, grant required roles to the user. Get access key and secret key
* Configure SHS to read event logs from S3
* Start [monitor](https://github.com/vladhlinsky/shs-monitor/blob/spark-master/monitor.sh) script to query SHS API
* Run 5 [producers](https://github.com/vladhlinsky/shs-monitor/blob/spark-master/producer.sh) for ~5 mins, create 125(14.2 MB) event log directory copies
* Wait for SHS to load all the applications
* Verify that the number of loaded applications increases continuously over time
For more details, please refer to the [shs-monitor](https://github.com/vladhlinsky/shs-monitor/tree/spark-master) repository.
> This version of the reproduction uses event log directories instead of single files, since recent optimization
> [SPARK-33790](https://issues.apache.org/jira/browse/SPARK-33790) makes it hard to reproduce the issue with single event log files.
Closes #30845 from vladhlinsky/SPARK-33841.
Authored-by: Vlad Glinsky <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>1 parent c17c76d commit 554600c
File tree
1 file changed
+20
-6
lines changed- core/src/main/scala/org/apache/spark/deploy/history
1 file changed
+20
-6
lines changedLines changed: 20 additions & 6 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
474 | 474 | | |
475 | 475 | | |
476 | 476 | | |
| 477 | + | |
| 478 | + | |
| 479 | + | |
| 480 | + | |
| 481 | + | |
477 | 482 | | |
478 | 483 | | |
479 | | - | |
| 484 | + | |
| 485 | + | |
| 486 | + | |
| 487 | + | |
| 488 | + | |
| 489 | + | |
| 490 | + | |
| 491 | + | |
480 | 492 | | |
481 | 493 | | |
482 | 494 | | |
| |||
576 | 588 | | |
577 | 589 | | |
578 | 590 | | |
579 | | - | |
580 | | - | |
581 | | - | |
582 | | - | |
| 591 | + | |
| 592 | + | |
| 593 | + | |
| 594 | + | |
| 595 | + | |
| 596 | + | |
| 597 | + | |
583 | 598 | | |
584 | | - | |
585 | 599 | | |
586 | 600 | | |
587 | 601 | | |
| |||
0 commit comments