Conversation

@LuciferYang
Contributor

@LuciferYang LuciferYang commented Aug 10, 2025

What changes were proposed in this pull request?

SPARK-52787 and SPARK-52630 reorganized the directory structure of the streaming-related code, but did not align the code's package names with the new directory structure. This pull request therefore introduces the following changes:

  1. org.apache.spark.sql.execution.streaming.Source, org.apache.spark.sql.execution.streaming.Sink, org.apache.spark.sql.execution.streaming.runtime.ManifestFileCommitProtocol, and org.apache.spark.sql.execution.streaming.ConsoleSinkProvider are moved back to their original directories to stay consistent with the directory structure. Their package names are left unchanged, because renaming them would make MiMa (the binary compatibility checking tool) fail or would cause forward compatibility issues.

  2. The package names of the remaining streaming-related code are corrected to match the directory structure (see the sketch below).
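To illustrate the kind of mismatch item 2 fixes, here is a hedged sketch; the file path and class name are hypothetical placeholders, not actual Spark code:

```scala
// File (hypothetical):
// sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/runtime/ExampleRuntimeClass.scala
//
// Before this PR, the file kept its pre-move package clause:
//   package org.apache.spark.sql.execution.streaming            // stale
//
// After this PR, the package clause matches the directory:
package org.apache.spark.sql.execution.streaming.runtime

// Placeholder class standing in for the real file contents.
class ExampleRuntimeClass
```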

Why are the changes needed?

The package name of the code should be kept as consistent as possible with the directory structure.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Pass GitHub Actions

Was this patch authored or co-authored using generative AI tooling?

No

@LuciferYang
Contributor Author

There might be test failures, and I will fix them.

@LuciferYang LuciferYang changed the title [SPARK-53233][SS] Make the code related to streaming use the correct package name [SPARK-53233][SQL][SS][MLLIB][CONNECT] Make the code related to streaming use the correct package name Aug 10, 2025
@LuciferYang LuciferYang marked this pull request as ready for review August 10, 2025 14:00
@LuciferYang
Contributor Author

LuciferYang commented Aug 10, 2025

cc @anishshri-db @HeartSaVioR @dongjoon-hyun @HyukjinKwon @peter-toth @yaooqinn

I have refactored the code once more, and all tests pass, as does the MiMa check, so there are no binary compatibility issues within the promised scope.

However, this PR also includes a change to a configuration's default value and a revision to the description in an SPI file. What are your opinions on these changes?

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you, @LuciferYang .

@dongjoon-hyun
Member

cc @cloud-fan too since this is a massive change across 271 files.

@anishshri-db
Contributor

@LuciferYang - is it possible to also keep the Source.scala and Sink.scala files within the new directories?

@anishshri-db
Contributor

@LuciferYang - thanks for making this refactoring change. Mostly looks good - I just have a couple of small questions. Thanks!

Contributor

@anishshri-db anishshri-db left a comment

lgtm pending green CI

@LuciferYang
Contributor Author


The latest code has passed all tests.

@dongjoon-hyun
Member

Thank you, @LuciferYang and @anishshri-db .

Merged to master.

@LuciferYang
Contributor Author

Thanks @dongjoon-hyun @anishshri-db @peter-toth

@cloud-fan
Contributor

Hi @LuciferYang @dongjoon-hyun @anishshri-db, while I agree it's better to keep the package name consistent with the directory structure, is it really worthwhile to make such invasive changes and also break third-party streaming sources (the Pulsar connector, for example)? Spark does not prevent people from using its internal APIs, but there is no backward compatibility guarantee on internal APIs. So we can break these internal APIs, but we should only do it when we have to, and this does not seem like a "have to" case to me. What do you think?

@dongjoon-hyun
Member

To @cloud-fan , I thought the original refactoring (SPARK-52787 and SPARK-52630) and this were all in Apache Spark 4.1.0? Do you mean this is an invasive internal change from 4.0.0 (or 4.0.1)?

@cloud-fan
Contributor

Reorganizing the directory structure is also an invasive change, but at least it doesn't break binary compatibility of any API. I think we should try to keep binary compatibility in all releases, even for Spark 5.0 in the future. We can break internal APIs when we have to, but this doesn't seem like a "have to" case to me.

@LuciferYang
Contributor Author

If it's necessary to maintain this kind of internal API compatibility, I suggest reverting this part of the code to its initial state, that is, reverting the current PR as well as the changes from SPARK-52787 and SPARK-52630.

@dongjoon-hyun
Member

I agree with @LuciferYang .

@cloud-fan
Contributor

I don't have a strong opinion on this. This directory-structure reorganization does separate the code better, but it also introduces the package-name inconsistency problem. cc @anishshri-db

@anishshri-db
Contributor

@cloud-fan - a few thoughts here:

  • I feel like we should still retain the new directory structure. At least it helps organize the code and components logically; otherwise we just have a ton of files in a single directory.
  • In the longer term, I feel we should also allow the package names to reflect this structure, both within Spark code and for external dependencies. By the way, is it expected for external sources to rely on spark.sql.execution imports? I understand not breaking stuff under sql/api, but I feel we should have some path that lets us make this transition, since this seems like an engine internal in many ways. I guess the only way is to allow both import paths for a few releases? I'm not sure if we could scope this down to some list of imports to keep the changes contained.

@HeartSaVioR
Contributor

I think this is rather a fundamental problem: what is really a public API and what is not? We limit ourselves too much by treating a non-public API as a "pseudo" public API whenever anything references it. We can't even blame users, because our policy for marking public APIs is so odd that only a few people in the Spark community understand it.

Let's face the fact: claiming an API is public only if it has Scaladoc/Javadoc/Python docs does not work.

@cloud-fan
Contributor

How about we fix the Pulsar data source for now and fix other breakages case by case? I think we can add a package object in org.apache.spark.sql.execution.streaming, plus some alias classes/objects, to keep binary compatibility.
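A minimal sketch of what such compatibility shims could look like - the class names and the checkpointing subpackage below are illustrative placeholders, not the actual Spark code:

```scala
// Relocated "real" implementation (hypothetical, simplified).
package org.apache.spark.sql.execution.streaming.checkpointing {
  class ExampleMetadataLog(val path: String) {
    def getLatestBatchId(): Option[Long] = None
  }
}

// Shim under the old package, so existing third-party bytecode keeps linking.
package org.apache.spark.sql.execution.streaming {
  // A thin subclass preserves the old binary class name.
  @deprecated("Moved to org.apache.spark.sql.execution.streaming.checkpointing", "4.1.0")
  class ExampleMetadataLog(path: String)
    extends checkpointing.ExampleMetadataLog(path)
}
```

Note that a package object with type aliases only restores source compatibility; compiled third-party jars need a real class file under the old name, which is what the subclass shim above provides.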

@HeartSaVioR
Contributor

Hmm, OK - they are actually following what we do with the Kafka data source. HDFSMetadataLog and SerializedOffset are probably things a third party would use if they copy and modify a built-in data source.

Ideally speaking, we should consider moving these classes to a common package (the execution package is definitely an internal one), but that effort would still leave Pulsar broken, so it is probably better done only in a major release, with prior discussion.

I'm fine with whatever way we fix this issue: 1) revert entirely, 2) revert case by case, or 3) don't revert, but add alias classes (which extend the refactored classes under the old names) for compatibility.

@dongjoon-hyun
Member

To @cloud-fan :

How about we fix the Pulsar data source for now and fix other breakages case by case? ...
add some alias classes/objects to keep binary compatibility.

The above idea sounds like a rather ad hoc approach to me. Technically, if we want to achieve what you claim, the Apache Spark community might need to enforce running the MiMa test for everything.

Without any systematic support, the goal (or the promise) is going to be fragile, because nobody guarantees it. In addition, I'm not sure about the current status of the master branch for other code paths.

Do you have any suggestions for your requirements? Do we have a practical way to meet your goal here?

@cloud-fan
Contributor

It's not the first time we have added compatibility fixes for internal APIs after finding they break third-party libraries. As I mentioned earlier, we can break internal APIs when we have to, and now I think an ad hoc fix is better than reverting (partially or fully) or leaving things broken.

We will keep doing this until Spark Connect gets more adoption, so that internal APIs are no longer on the classpath and such issues are avoided entirely.

@LuciferYang LuciferYang deleted the SPARK-53233 branch September 17, 2025 05:48
@dongjoon-hyun
Member

Okay, please make a PR in the direction you propose and ping us. I'll try to help as much as possible to recover the ideal status, @cloud-fan.

@dongjoon-hyun
Member

BTW, do you know or have some reference about the status of Spark Connect adoption? I'm just curious.

We will keep doing this until Spark Connect gets more adoption, so that internal APIs are no longer on the classpath and such issues are avoided entirely.

@cloud-fan
Contributor

We can probably track the PyPI downloads of pyspark-connect and the tarball downloads of the new Spark Connect package. But I'm not exactly sure where to find the data...

@LuciferYang
Contributor Author

At present, internal package names such as org.apache.spark.sql.execution.* are excluded from the MiMa check. This makes it difficult for us to recognize when we've inadvertently broken something important while modifying the code.

Therefore, should we refine the relevant rules? For example, only exclude org.apache.spark.sql.execution.xxx.* while not excluding org.apache.spark.sql.execution.streaming.* (this would require the relevant experts to handle the refinement).

This approach might let developers and reviewers notice such issues before code is merged, and then consult experts in the relevant areas to determine whether these internal APIs are suitable for modification.
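For illustration only, a narrowed exclusion list could look roughly like this sketch; Spark's real default excludes live in project/MimaExcludes.scala, and the subpackage names here are placeholders:

```scala
import com.typesafe.tools.mima.core._

// Instead of one blanket rule such as
//   ProblemFilters.exclude[Problem]("org.apache.spark.sql.execution.*")
// list the execution subpackages individually, and deliberately leave
// org.apache.spark.sql.execution.streaming.* visible to MiMa:
val narrowedExcludes: Seq[ProblemFilter] = Seq(
  ProblemFilters.exclude[Problem]("org.apache.spark.sql.execution.adaptive.*"),
  ProblemFilters.exclude[Problem]("org.apache.spark.sql.execution.exchange.*"),
  ProblemFilters.exclude[Problem]("org.apache.spark.sql.execution.joins.*")
  // No entry for org.apache.spark.sql.execution.streaming.*, so accidental
  // breaks there would now fail the MiMa check.
)
```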

What are your thoughts on this? @cloud-fan @dongjoon-hyun @HeartSaVioR @anishshri-db

@anishshri-db
Contributor

anishshri-db commented Sep 17, 2025

now I think an ad hoc fix is better

I think I agree with @cloud-fan that an ad hoc fix is better here.

@cloud-fan
Contributor

We are allowed to change APIs in org.apache.spark.sql.execution.* since they are internal. Finding broken third-party libraries is a manual process (I found the Pulsar issue in our internal CI). I don't think we should restrict ourselves from changing internal APIs, and it's fine to keep this "case by case" approach.

The monthly preview releases should also help Spark plugin developers detect compatibility issues earlier.

@LuciferYang
Contributor Author

Yes, I agree with the ad hoc fix as well. I just want to explore whether there are proactive ways to identify such issues in advance, so as to minimize rework as much as possible.

cloud-fan added a commit that referenced this pull request Sep 19, 2025
…pache.spark.sql.execution.streaming

### What changes were proposed in this pull request?

This is a follow-up of #51959. Although internal APIs are allowed to change, it's still better to keep compatibility when possible, to avoid breaking existing Spark plugins.

This PR brings back `HDFSMetadataLog` and `SerializedOffset` to the original package, to avoid breaking the pulsar data source: https://github.com/streamnative/pulsar-spark/blob/master/src/main/scala/org/apache/spark/sql/pulsar/PulsarSources.scala#L27

### Why are the changes needed?

Avoid breaking Spark plugins

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

manual test

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #52387 from cloud-fan/compat.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
…pache.spark.sql.execution.streaming