-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-54583][SS] Add SQLConf to enable use of OffsetMap #53311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/checkpointing/OffsetSeq.scala
Outdated
Show resolved
Hide resolved
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/OffsetSeqLogSuite.scala
Show resolved
Hide resolved
| } | ||
|
|
||
| test("STREAMING_OFFSET_LOG_FORMAT_VERSION config - default VERSION_1") { | ||
| import testImplicits._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could the import testImplicits._ be at the test class level so it does not need to be duplicated for the tests?
| } | ||
| } | ||
|
|
||
| test("STREAMING_OFFSET_LOG_FORMAT_VERSION config - checkpoint wins on restart") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we also test the inverse? Start with v2 and ensure it remains v2 ?
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/OffsetSeqLogSuite.scala
Show resolved
Hide resolved
| confInOffsetLog.key -> sessionConf.get(confInSession.key) | ||
| }.toMap | ||
| OffsetSeqMetadata(batchWatermarkMs, batchTimestampMs, confs++ confsFromRebind) | ||
| val version = sessionConf.get(STREAMING_OFFSET_LOG_FORMAT_VERSION.key).toInt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the version here refer to the OffsetSeqMetadata version or the OffsetSeqBase version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OffsetSeqBase - we don't have multiple metadata versions here.
What changes were proposed in this pull request?
Add the SQLConf that allows us to use the OffsetMap format instead of OffsetSeq.
Why are the changes needed?
We need to allow users to set the offset log format version which will enable features like source naming in the future.
Does this PR introduce any user-facing change?
What changes?
This PR adds a new public configuration
spark.sql.streaming.offsetLog.formatVersionthat allows users to control the offset log format versionused in streaming query checkpoints.
Previous behavior:
Streaming queries could only use VERSION_1 offset log format (OffsetSeq), which stores offsets as an ordered sequence. Although VERSION_2 format
(OffsetMap) was implemented internally, there was no way for users to enable it.
New behavior:
Users can now set
spark.sql.streaming.offsetLog.formatVersionto choose between two offset log formats:Example:
The checkpoint will use OffsetMap format with source IDs as map keys instead of relying on source ordering.
Important notes:
Comparison:
This is a user-facing change compared to released Spark versions as it introduces a new public configuration that was not previously available.
How was this patch tested?
Unit tests
Was this patch authored or co-authored using generative AI tooling?
No