Commit 7881622
[SPARK-33841][CORE][3.0] Fix issue with jobs disappearing intermittently from the SHS under high load
### What changes were proposed in this pull request?
Mark SHS event log entries that were `processing` at the beginning of the `checkForLogs` run as not stale and check for this mark before deleting an event log. This fixes the issue when a particular job was displayed in the SHS and disappeared after some time, but then, in several minutes showed up again.
### Why are the changes needed?
The issue is caused by [SPARK-29043](https://issues.apache.org/jira/browse/SPARK-29043), which is designated to improve the concurrent performance of the History Server. The [change](https://github.com/apache/spark/pull/25797/files#) breaks the ["app deletion" logic](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R563) because of missing proper synchronization for `processing` event log entries. Since SHS now [filters out](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) all `processing` event log entries, such entries do not have a chance to be [updated with the new `lastProcessed`](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R472) time and thus any entity that completes processing right after [filtering](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462) and before [the check for stale entities](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R560) will be identified as stale and will be deleted from the UI until the next `checkForLogs` run. This is because [updated `lastProcessed` time is used as criteria](https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R557), and event log entries that missed to be updated with a new time, will match that criteria.
The issue can be reproduced by generating a big number of event logs and uploading them to the SHS event log directory on S3. Essentially, around 800(82.6 MB) copies of an event log file were created using [shs-monitor](https://github.com/vladhlinsky/shs-monitor) script. Strange behavior of SHS counting the total number of applications was noticed - at first, the number was increasing as expected, but with the next page refresh, the total number of applications decreased. No errors were logged by SHS.
241 entities are displayed at `20:50:42`:

203 entities are displayed at `20:52:17`:

The number of loaded applications over time:

### Does this PR introduce _any_ user-facing change?
Yes, SHS users won't face the behavior when the number of displayed applications decreases periodically.
### How was this patch tested?
Tested using [shs-monitor](https://github.com/vladhlinsky/shs-monitor) script:
* Build SHS with the proposed change
* Download Hadoop AWS and AWS Java SDK
* Prepare S3 bucket and user for programmatic access, grant required roles to the user. Get access key and secret key
* Configure SHS to read event logs from S3
* Start [monitor](https://github.com/vladhlinsky/shs-monitor/blob/main/monitor.sh) script to query SHS API
* Run 8 [producers](https://github.com/vladhlinsky/shs-monitor/blob/main/producer.sh) for ~10 mins, create 805(83.1 MB) event log copies
* Wait for SHS to load all the applications
* Verify that the number of loaded applications increases continuously over time

For more details, please refer to the [shs-monitor](https://github.com/vladhlinsky/shs-monitor) repository.
Closes #30842 from vladhlinsky/SPARK-33841-branch-3.0.
Authored-by: Vlad Glinsky <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>1 parent f67c3c2 commit 7881622
File tree
1 file changed
+20
-6
lines changed- core/src/main/scala/org/apache/spark/deploy/history
1 file changed
+20
-6
lines changedLines changed: 20 additions & 6 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
460 | 460 | | |
461 | 461 | | |
462 | 462 | | |
| 463 | + | |
| 464 | + | |
| 465 | + | |
| 466 | + | |
| 467 | + | |
463 | 468 | | |
464 | 469 | | |
465 | | - | |
| 470 | + | |
| 471 | + | |
| 472 | + | |
| 473 | + | |
| 474 | + | |
| 475 | + | |
| 476 | + | |
| 477 | + | |
466 | 478 | | |
467 | 479 | | |
468 | 480 | | |
| |||
562 | 574 | | |
563 | 575 | | |
564 | 576 | | |
565 | | - | |
566 | | - | |
567 | | - | |
568 | | - | |
| 577 | + | |
| 578 | + | |
| 579 | + | |
| 580 | + | |
| 581 | + | |
| 582 | + | |
| 583 | + | |
569 | 584 | | |
570 | | - | |
571 | 585 | | |
572 | 586 | | |
573 | 587 | | |
| |||
0 commit comments