Conversation

@LuciferYang
Contributor

@LuciferYang LuciferYang commented Aug 10, 2025

What changes were proposed in this pull request?

SPARK-52787 and SPARK-52630 reorganized the directory structure of the streaming-related code, but did not align the code's package names with the new directory structure. This pull request therefore introduces the following changes:

  1. org.apache.spark.sql.execution.streaming.Source, org.apache.spark.sql.execution.streaming.Sink, org.apache.spark.sql.execution.streaming.runtime.ManifestFileCommitProtocol, and org.apache.spark.sql.execution.streaming.ConsoleSinkProvider are moved back to their original directories to stay consistent with the directory structure. Their package names are left unchanged, because renaming them would make MiMa (the binary compatibility checking tool) fail or would cause forward compatibility issues.

  2. The package names of the remaining streaming-related code are corrected to match the directory structure (see the sketch below).
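To illustrate the kind of mismatch item 2 fixes, here is a hedged sketch; the file path and class name are hypothetical placeholders, not actual Spark code:

```scala
// File (hypothetical):
// sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/runtime/ExampleRuntimeClass.scala
//
// Before this PR, the file kept its pre-move package clause:
//   package org.apache.spark.sql.execution.streaming            // stale
//
// After this PR, the package clause matches the directory:
package org.apache.spark.sql.execution.streaming.runtime

// Placeholder class standing in for the real file contents.
class ExampleRuntimeClass
```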

Why are the changes needed?

The package name of the code should be kept as consistent as possible with the directory structure.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Pass GitHub Actions

Was this patch authored or co-authored using generative AI tooling?

No

@LuciferYang
Contributor Author

There might be test failures, and I will fix them.

@LuciferYang LuciferYang changed the title [SPARK-53233][SS] Make the code related to streaming use the correct package name [SPARK-53233][SQL][SS][MLLIB][CONNECT] Make the code related to streaming use the correct package name Aug 10, 2025
@LuciferYang LuciferYang marked this pull request as ready for review August 10, 2025 14:00
@LuciferYang
Contributor Author

LuciferYang commented Aug 10, 2025

cc @anishshri-db @HeartSaVioR @dongjoon-hyun @HyukjinKwon @peter-toth @yaooqinn

I have refactored the code once more, and all tests pass, as does the MiMa check, so there are no binary compatibility issues within the promised scope.

However, this PR also includes a change to a configuration's default value and a revision to the description in an SPI file. What are your opinions on these changes?

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @LuciferYang .

@dongjoon-hyun
Member

cc @cloud-fan too since this is a massive change across 271 files.

@anishshri-db
Contributor

@LuciferYang - is it possible to also keep the Source.scala and Sink.scala files within the new directories?

@anishshri-db
Contributor

@LuciferYang - thanks for making this refactoring change. Mostly looks good - I just have a couple of small questions. Thanks!

Contributor

@anishshri-db anishshri-db left a comment

lgtm pending green CI

@LuciferYang
Contributor Author


The latest code has passed all tests.

@dongjoon-hyun
Member

Thank you, @LuciferYang and @anishshri-db .

Merged to master.

@LuciferYang
Contributor Author

Thanks @dongjoon-hyun @anishshri-db @peter-toth

@cloud-fan
Contributor

Hi @LuciferYang @dongjoon-hyun @anishshri-db, while I agree it's better to keep the package name consistent with the directory structure, is it really worthwhile to make such invasive changes and also break third-party streaming sources (the Pulsar connector, for example)? Spark does not prevent people from using its internal APIs, but there is no backward compatibility guarantee on internal APIs. So we can break these internal APIs, but we should only do it when we have to, and this does not seem like a "have to" case to me. What do you think?

@dongjoon-hyun
Member

To @cloud-fan , I thought the original refactoring (SPARK-52787 and SPARK-52630) and this were all in Apache Spark 4.1.0? Do you mean this is an invasive internal change from 4.0.0 (or 4.0.1)?

@cloud-fan
Contributor

Reorganizing the directory structure is also an invasive change, but at least it doesn't break binary compatibility of any API. I think we should try to keep binary compatibility in all releases, even for Spark 5.0 in the future. We can break internal APIs when we have to, but this doesn't seem like a "have to" case to me.

@LuciferYang
Contributor Author

If it's necessary to maintain this kind of internal API compatibility, I suggest reverting this part of the code to its initial state, that is, reverting the current PR as well as the changes from SPARK-52787 and SPARK-52630.

@dongjoon-hyun
Member

I agree with @LuciferYang .

@cloud-fan
Contributor

I don't have a strong opinion on this. This directory-structure reorganization does separate the code better, but it also introduces the package-name inconsistency problem. cc @anishshri-db

@anishshri-db
Contributor

@cloud-fan - a few thoughts here:

  • I feel like we should still retain the new directory structure. At least it helps organize the code and components logically; otherwise we just have a ton of files in a single directory.
  • In the longer term, I feel we should also allow the package names to reflect this structure, both within Spark code and for external dependencies. By the way, is it expected for external sources to rely on spark.sql.execution imports? I understand not breaking stuff under sql/api, but I feel we should have some path that lets us make this transition, since this seems like an engine internal in many ways. I guess the only way is to allow both import paths for a few releases? I'm not sure if we could scope this down to some list of imports to keep the changes contained.

@HeartSaVioR
Contributor

I think this is rather a fundamental problem: what is really a public API and what is not? We limit ourselves too much by treating a non-public API as a "pseudo" public API whenever anything references it. We can't even blame users, because our policy for marking public APIs is so odd that only a few people in the Spark community understand it.

Let's face the fact: claiming an API is public only if it has Scaladoc/Javadoc/Python docs does not work.

@cloud-fan
Contributor

How about we fix the Pulsar data source for now and fix other breakages case by case? I think we can add a package object in org.apache.spark.sql.execution.streaming, plus some alias classes/objects, to keep binary compatibility.
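A minimal sketch of what such compatibility shims could look like - the class names and the checkpointing subpackage below are illustrative placeholders, not the actual Spark code:

```scala
// Relocated "real" implementation (hypothetical, simplified).
package org.apache.spark.sql.execution.streaming.checkpointing {
  class ExampleMetadataLog(val path: String) {
    def getLatestBatchId(): Option[Long] = None
  }
}

// Shim under the old package, so existing third-party bytecode keeps linking.
package org.apache.spark.sql.execution.streaming {
  // A thin subclass preserves the old binary class name.
  @deprecated("Moved to org.apache.spark.sql.execution.streaming.checkpointing", "4.1.0")
  class ExampleMetadataLog(path: String)
    extends checkpointing.ExampleMetadataLog(path)
}
```

Note that a package object with type aliases only restores source compatibility; compiled third-party jars need a real class file under the old name, which is what the subclass shim above provides.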

@HeartSaVioR
Contributor

Hmm, OK - they are actually following what we do with the Kafka data source. HDFSMetadataLog and SerializedOffset are probably things a third party would use if they copy and modify a built-in data source.

Ideally speaking, we should consider moving these classes to a common package (the execution package is definitely an internal one), but that effort would still leave Pulsar broken, so it is probably better done only in a major release, with prior discussion.

I'm fine with whatever way we fix this issue: 1) revert entirely, 2) revert case by case, or 3) don't revert, but add alias classes (which extend the refactored classes under the old names) for compatibility.

@dongjoon-hyun
Member

To @cloud-fan :

How about we fix the Pulsar data source for now and fix other breakages case by case? ...
add some alias classes/objects to keep binary compatibility.

The above idea sounds like a rather ad hoc approach to me. Technically, if we want to achieve what you claim, the Apache Spark community might need to enforce running the MiMa test for everything.

Without any systematic support, the goal (or the promise) is going to be fragile, because nobody guarantees it. In addition, I'm not sure about the current status of the master branch for other code paths.

Do you have any suggestions for your requirements? Do we have a practical way to meet your goal here?

@cloud-fan
Contributor

It's not the first time we have added compatibility fixes for internal APIs after finding they break third-party libraries. As I mentioned earlier, we can break internal APIs when we have to, and now I think an ad hoc fix is better than reverting (partially or fully) or leaving things broken.

We will keep doing this until Spark Connect gets more adoption, so that internal APIs are no longer on the classpath and such issues are avoided entirely.

@LuciferYang LuciferYang deleted the SPARK-53233 branch September 17, 2025 05:48
@dongjoon-hyun
Member

Okay, please make a PR in the direction you propose and ping us. I'll try to help as much as possible to recover the ideal status, @cloud-fan.

@dongjoon-hyun
Member

BTW, do you know or have some reference about the status of Spark Connect adoption? I'm just curious.

We will keep doing this until Spark Connect gets more adoption, so that internal APIs are no longer on the classpath and such issues are avoided entirely.

@cloud-fan
Contributor

We can probably track the PyPI downloads of pyspark-connect and the tarball downloads of the new Spark Connect package. But I'm not exactly sure where to find the data...

@LuciferYang
Contributor Author

At present, internal package names such as org.apache.spark.sql.execution.* are excluded from the MiMa check. This makes it difficult for us to recognize when we've inadvertently broken something important while modifying the code.

Therefore, should we refine the relevant rules? For example, only exclude org.apache.spark.sql.execution.xxx.* while not excluding org.apache.spark.sql.execution.streaming.* (this would require the relevant experts to handle the refinement).

This approach might let developers and reviewers notice such issues before code is merged, and then consult experts in the relevant areas to determine whether these internal APIs are suitable for modification.
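For illustration only, a narrowed exclusion list could look roughly like this sketch; Spark's real default excludes live in project/MimaExcludes.scala, and the subpackage names here are placeholders:

```scala
import com.typesafe.tools.mima.core._

// Instead of one blanket rule such as
//   ProblemFilters.exclude[Problem]("org.apache.spark.sql.execution.*")
// list the execution subpackages individually, and deliberately leave
// org.apache.spark.sql.execution.streaming.* visible to MiMa:
val narrowedExcludes: Seq[ProblemFilter] = Seq(
  ProblemFilters.exclude[Problem]("org.apache.spark.sql.execution.adaptive.*"),
  ProblemFilters.exclude[Problem]("org.apache.spark.sql.execution.exchange.*"),
  ProblemFilters.exclude[Problem]("org.apache.spark.sql.execution.joins.*")
  // No entry for org.apache.spark.sql.execution.streaming.*, so accidental
  // breaks there would now fail the MiMa check.
)
```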

What are your thoughts on this? @cloud-fan @dongjoon-hyun @HeartSaVioR @anishshri-db

@anishshri-db
Contributor

anishshri-db commented Sep 17, 2025

now I think an ad hoc fix is better

I think I agree with @cloud-fan that an ad hoc fix is better here.

@cloud-fan
Contributor

We are allowed to change APIs in org.apache.spark.sql.execution.* since they are internal. Finding broken third-party libraries is a manual process (I found the Pulsar issue in our internal CI). I don't think we should restrict ourselves from changing internal APIs, and it's fine to keep this "case by case" approach.

The monthly preview releases should also help Spark plugin developers detect compatibility issues earlier.

@LuciferYang
Contributor Author

Yes, I agree with the ad hoc fix as well. I just want to explore whether there are proactive ways to identify such issues in advance, so as to minimize rework as much as possible.

cloud-fan added a commit that referenced this pull request Sep 19, 2025
…pache.spark.sql.execution.streaming

### What changes were proposed in this pull request?

This is a follow-up of #51959. Although internal APIs are allowed to change, it's still better to keep compatibility when possible, to avoid breaking existing Spark plugins.

This PR brings back `HDFSMetadataLog` and `SerializedOffset` to the original package, to avoid breaking the pulsar data source: https://github.com/streamnative/pulsar-spark/blob/master/src/main/scala/org/apache/spark/sql/pulsar/PulsarSources.scala#L27

### Why are the changes needed?

Avoid breaking Spark plugins

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

manual test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #52387 from cloud-fan/compat.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
…pache.spark.sql.execution.streaming