@dongjoon-hyun dongjoon-hyun commented Jun 12, 2025

What changes were proposed in this pull request?

This PR adds support for spark.eventLog.excludedPatterns, a configuration that excludes specific Spark events from the event log. It has two goals.

  1. Reduce the cost of event log processing.
  2. Give users full control over whether an event is saved, on top of the existing logEvent flag.

trait SparkListenerEvent {
  /* Whether output this event to the event log */
  protected[spark] def logEvent: Boolean = true
}
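
To illustrate the second goal, here is a minimal sketch of how a pattern-based exclusion could sit on top of the per-event logEvent flag. `Event`, `SimpleEvent`, and `EventLogFilter` are hypothetical stand-ins rather than Spark classes, and substring matching on the event name is an assumption; the description above does not state how patterns are matched.

```scala
// Hypothetical stand-in for SparkListenerEvent; not the real Spark trait.
trait Event {
  def name: String
  // Mirrors the per-event flag quoted above.
  def logEvent: Boolean = true
}

final case class SimpleEvent(name: String, override val logEvent: Boolean = true)
  extends Event

object EventLogFilter {
  // Assumption: a pattern excludes an event when it occurs as a substring
  // of the event name. The real matching rule may differ.
  def shouldWrite(event: Event, excludedPatterns: Seq[String]): Boolean =
    event.logEvent && !excludedPatterns.exists(p => event.name.contains(p))
}
```

In this sketch an event is written only if both checks pass, so the exclusion list can drop an event that the event class itself marks as loggable, but it cannot re-enable an event whose logEvent is false.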

Why are the changes needed?

Historically, Apache Spark has provided multiple ways to manage event logs and reduce storage cost.

  • spark.history.fs.cleaner.maxAge: Delete old event logs by age
  • spark.history.fs.cleaner.maxNum: Delete old event logs once the total number of files exceeds a limit
  • spark.history.fs.eventLog.rolling.maxFilesToRetain: Compact rolling event logs (decompress, compact, then compress back)

For example, after compaction, a Spark event log contains only the following events.

// The extracted event names from a `compacted` event log file
"SparkListenerLogStart"
"SparkListenerResourceProfileAdded"
"org.apache.spark.sql.connect.service.SparkListenerConnectServiceStarted"
"SparkListenerBlockManagerAdded"
"SparkListenerEnvironmentUpdate"
"SparkListenerApplicationStart"
"SparkListenerExecutorAdded"
"SparkListenerBlockManagerAdded"
"SparkListenerExecutorAdded"
"SparkListenerBlockManagerAdded"
"SparkListenerJobStart"
"SparkListenerStageSubmitted"
"SparkListenerTaskStart"
"SparkListenerTaskEnd"
"SparkListenerStageCompleted"
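
A list like the one above can be produced from any event log, since each line of the log is a JSON object with an "Event" field. The following is a rough sketch, not the tool used in this PR: `EventNames` is a hypothetical helper, and it assumes the top-level "Event" field appears before any nested field of the same name on the line.

```scala
import scala.util.matching.Regex

object EventNames {
  // Matches the "Event" field of a JSON event-log line and captures its value.
  private val EventField: Regex = "\"Event\"\\s*:\\s*\"([^\"]+)\"".r

  // Returns the event name of every line that has one, in file order.
  // Lines without an "Event" field are skipped.
  def extract(lines: Iterator[String]): List[String] =
    lines.flatMap(line => EventField.findFirstMatchIn(line).map(_.group(1))).toList
}
```

For a decompressed log file, `EventNames.extract(scala.io.Source.fromFile(path).getLines())` would yield the names in order, where `path` is whatever file is being inspected.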

This PR provides a simple alternative that lets users skip specific Spark events completely, such as the following.

org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate
org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart
org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionEnd
org.apache.spark.sql.execution.ui.SparkListenerDriverAccumUpdates
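
As a sketch of how these four names might be turned into a configuration value: the helper below joins them into a single string. `ExcludedPatternsConf` is a hypothetical helper, and the comma-separated value format is an assumption that should be checked against the Spark configuration documentation.

```scala
object ExcludedPatternsConf {
  val Key = "spark.eventLog.excludedPatterns"

  // The four SQL UI events listed above.
  val sqlUiEvents: Seq[String] = Seq(
    "org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate",
    "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart",
    "org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionEnd",
    "org.apache.spark.sql.execution.ui.SparkListenerDriverAccumUpdates"
  )

  // Assumption: the value is a comma-separated list of event names.
  // Blank entries and duplicates are dropped before joining.
  def value(patterns: Seq[String]): String =
    patterns.map(_.trim).filter(_.nonEmpty).distinct.mkString(",")

  // Renders the setting as a spark-submit style argument.
  def asSubmitArg(patterns: Seq[String]): String =
    s"--conf $Key=${value(patterns)}"
}
```

Calling `ExcludedPatternsConf.asSubmitArg(ExcludedPatternsConf.sqlUiEvents)` gives one `--conf` argument covering all four events.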

Does this PR introduce any user-facing change?

No. This is a new feature.

How was this patch tested?

Pass the CIs with the newly added test case.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CORE label Jun 12, 2025
@dongjoon-hyun
Member Author

I addressed your comments.

$ build/sbt "core/testOnly *EventLogFile*Writer*Suite -- -z SPARK-52458"
...
[info] RollingEventLogFilesWriterSuite:
[info] - SPARK-52458: Support spark.eventLog.excludedPatterns (117 milliseconds)
[info] SingleEventLogFileWriterSuite:
[info] - SPARK-52458: Support spark.eventLog.excludedPatterns (17 milliseconds)
[info] Run completed in 1 second, 129 milliseconds.
[info] Total number of tests run: 2
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 2, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 10 s, completed Jun 11, 2025, 8:05:36 PM

@yaooqinn
Member

QQ: Is it possible that logs become unable to be rendered if some events are missing?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 12, 2025

> QQ: Is it possible that logs become unable to be rendered if some events are missing?

Yes, of course. Users need to provide a meaningful configuration.

For example, event names like the following should not be excluded. As shown in the PR description, these events also survive compaction.

"SparkListenerLogStart"
"SparkListenerResourceProfileAdded"
"org.apache.spark.sql.connect.service.SparkListenerConnectServiceStarted"
"SparkListenerBlockManagerAdded"
"SparkListenerEnvironmentUpdate"
"SparkListenerApplicationStart"
"SparkListenerExecutorAdded"
"SparkListenerBlockManagerAdded"
"SparkListenerExecutorAdded"
"SparkListenerBlockManagerAdded"
"SparkListenerJobStart"
"SparkListenerStageSubmitted"
"SparkListenerTaskStart"
"SparkListenerTaskEnd"
"SparkListenerStageCompleted"

A meaningful and intuitive set to exclude is the Spark UI events from the PR description, listed below. Related events should be excluded together; for example, SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd belong to the same pair.

org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate
org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart
org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionEnd
org.apache.spark.sql.execution.ui.SparkListenerDriverAccumUpdates
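
The two rules in this comment (do not exclude the events that survive compaction, and exclude related events together) could be checked before a configuration is applied. The sketch below is purely illustrative: `ExclusionCheck` is hypothetical and not part of Spark, it assumes exact event names rather than patterns, and it knows only the one Start/End pair named above.

```scala
object ExclusionCheck {
  private val SqlUi = "org.apache.spark.sql.execution.ui"

  // Event names quoted above as ones that should not be excluded.
  val essential: Set[String] = Set(
    "SparkListenerLogStart",
    "SparkListenerResourceProfileAdded",
    "org.apache.spark.sql.connect.service.SparkListenerConnectServiceStarted",
    "SparkListenerBlockManagerAdded",
    "SparkListenerEnvironmentUpdate",
    "SparkListenerApplicationStart",
    "SparkListenerExecutorAdded",
    "SparkListenerJobStart",
    "SparkListenerStageSubmitted",
    "SparkListenerTaskStart",
    "SparkListenerTaskEnd",
    "SparkListenerStageCompleted"
  )

  // Related events; only this pair is named in the comment above.
  val pairs: Seq[(String, String)] = Seq(
    s"$SqlUi.SparkListenerSQLExecutionStart" -> s"$SqlUi.SparkListenerSQLExecutionEnd"
  )

  // Returns one warning per violated rule; empty means no issue was found.
  def warnings(excluded: Set[String]): Seq[String] = {
    val essentialHits = excluded.intersect(essential).toSeq.sorted
      .map(e => s"essential event excluded: $e")
    val pairHits = pairs.collect {
      case (a, b) if excluded(a) != excluded(b) => s"exclude together: $a and $b"
    }
    essentialHits ++ pairHits
  }
}
```

An empty result only means that neither rule was tripped; it does not guarantee that the resulting log renders correctly in the UI.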

In addition, this can be used to keep user-defined Spark event classes out of the persistent event logs.

Member

@yaooqinn yaooqinn left a comment

+1, LGTM

@dongjoon-hyun
Member Author

Thank you, @yaooqinn !

@dongjoon-hyun
Member Author

Merged to master for Apache Spark 4.1.0.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-52458 branch June 12, 2025 05:39
dongjoon-hyun added a commit that referenced this pull request Oct 16, 2025
### What changes were proposed in this pull request?

This PR aims to document newly added `core` module configurations as a part of Apache Spark 4.1.0 preparation.

### Why are the changes needed?

To help the users use new features easily.

- #47856
- #51130
- #51163
- #51604
- #51630
- #51708
- #51885
- #52091
- #52382

### Does this PR introduce _any_ user-facing change?

No behavior change because this is a documentation update.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52626 from dongjoon-hyun/SPARK-53926.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025