[SPARK-53233][SQL][SS][MLLIB][CONNECT] Make the code related to streaming uses the correct package name
#51959
Conversation
There might be test failures, and I will fix them.
cc @anishshri-db @HeartSaVioR @dongjoon-hyun @HyukjinKwon @peter-toth @yaooqinn I have refactored a version of the code, and all tests have passed, as well as the MiMa check. Therefore, there are no binary compatibility issues within the promised scope. However, there are also changes involving a modification to a configuration's default value and a revision to the description in an SPI file. What are your opinions on these changes?
dongjoon-hyun left a comment
+1, LGTM. Thank you, @LuciferYang.
cc @cloud-fan too since this is a massive change across 271 files.
@LuciferYang - is it possible to keep the
@LuciferYang - thanks for making this refactoring change. Mostly looks good - just have a couple of small questions. Thanks!
anishshri-db left a comment
lgtm pending green CI
Thank you, @LuciferYang and @anishshri-db. Merged to master.
Hi @LuciferYang @dongjoon-hyun @anishshri-db, while I agree it's better to keep the package name consistent with the directory structure, is it really worthwhile to make such invasive changes and also break third-party streaming sources (pulsar, for example)? Spark does not stop people from using its internal APIs, but there is no backward compatibility guarantee on internal APIs. So we can break these internal APIs, but we should only do it when we have to, and this does not seem like a "have to" case to me. What do you think?
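For context, the kind of third-party code at stake looks roughly like the sketch below. It is modeled on pulsar-spark's PulsarSources.scala (linked later in this thread), not an exact copy; the helper object and method are illustrative.

```scala
package org.apache.spark.sql.pulsar

// Sketch: a connector compiled against the old fully-qualified names.
// If these classes move to a new package, this import no longer
// resolves, and jars already built against it fail to link at runtime.
import org.apache.spark.sql.execution.streaming.{HDFSMetadataLog, SerializedOffset}

private object PulsarSourcesSketch {
  // e.g. turning recovered checkpoint text back into an offset
  def offsetFromText(text: String): SerializedOffset = SerializedOffset(text)
}
```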
To @cloud-fan, I thought the original refactoring (SPARK-52787 and SPARK-52630) and this were all in Apache Spark 4.1.0? Do you mean this is an invasive internal change from 4.0.0 (or 4.0.1)?
Reorganizing the directory structure is also an invasive change, but at least it doesn't break binary compatibility of any API. I think we should try to keep binary compatibility in all releases, even for Spark 5.0 in the future. We can break internal APIs when we have to, but this doesn't seem like a "have to" case to me.
If it's necessary to maintain such internal API compatibility, I suggest reverting this part of the code to its initial state - that is, reverting the current PR as well as the changes from SPARK-52787 and SPARK-52630.
I agree with @LuciferYang.
I don't have a strong opinion on this. The directory structure reorganizing does separate the code better, but it also introduces the package name inconsistency problem. cc @anishshri-db
cc @HeartSaVioR and @ericm-db from the original PRs, too. @cloud-fan's suggestion applies to all of them.
@cloud-fan - a few thoughts here:
I think this is rather a fundamental problem - what is really a public API and what is not? We are limiting ourselves too much by treating a non-public API as a "pseudo" public API whenever anything references it. We can't even blame them, because we have a weird policy of marking public APIs that only a few people in the Spark community would understand. Let's face the fact: claiming an API is public only if it has scaladoc/javadoc/Python doc does not work.
How about we fix the pulsar data source for now and fix other breakages case by case? I think we can add a package object in
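A minimal sketch of that package-object idea, assuming the moved class now lives in a hypothetical `checkpointing` subpackage (the real post-refactoring location may differ). Note that a type alias restores source compatibility only: aliases are erased at compile time, so jars already compiled against the old class name still won't link.

```scala
// Sketch only: the `checkpointing` location is an assumption for
// illustration, not the actual post-refactoring layout.
package org.apache.spark.sql.execution

package object streaming {
  // The old name org.apache.spark.sql.execution.streaming.SerializedOffset
  // keeps compiling for plugins that build from source.
  type SerializedOffset =
    org.apache.spark.sql.execution.streaming.checkpointing.SerializedOffset
  // Companion forwarder so `SerializedOffset(...)` still constructs values.
  val SerializedOffset =
    org.apache.spark.sql.execution.streaming.checkpointing.SerializedOffset
}
```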
Hmm, OK, they are actually following what we do with the Kafka data source. HDFSMetadataLog and SerializedOffset are probably something a third party would use if they copy and modify the built-in data source. Ideally, we should consider moving these classes to the common package (the execution package is definitely an internal one), but that effort would still leave pulsar broken, so it's probably better done only in a major release, with prior discussion. I'm fine with whatever way fixes this issue: 1) revert entirely, 2) revert case by case, 3) don't revert, but add alias classes (extending the refactored class but keeping the old name) for compatibility.
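A minimal sketch of option 3, again assuming `HDFSMetadataLog` moved to a hypothetical `checkpointing` subpackage; the constructor signature below is illustrative rather than Spark's exact one. Because the old name remains a real class on the classpath, jars compiled against it keep linking, with method calls resolving up the hierarchy to the refactored class:

```scala
package org.apache.spark.sql.execution.streaming

import scala.reflect.ClassTag

import org.apache.spark.sql.SparkSession

// Compatibility alias at the old fully-qualified name (sketch; the
// target package and constructor parameters are assumptions).
@deprecated("use org.apache.spark.sql.execution.streaming.checkpointing.HDFSMetadataLog", "4.1.0")
class HDFSMetadataLog[T <: AnyRef: ClassTag](sparkSession: SparkSession, path: String)
  extends checkpointing.HDFSMetadataLog[T](sparkSession, path)
```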
To @cloud-fan:
The above idea sounds like a kind of ad hoc approach to me. Technically, if we want to achieve what you claim to achieve, the Apache Spark community might need to enforce running the MiMa test for everything. Without any systematic support, the goal (or the promise) is going to be fragile, because nobody guarantees it. In addition, I'm not sure about the current status of
Do you have any suggestions for your requirements? Do we have a practical way to meet your goal here?
It's not the first time that we've added compatibility fixes for internal APIs after finding that they break third-party libraries. As I mentioned earlier, we can break internal APIs when we have to, and right now I think an ad hoc fix is better than reverting (partially or fully) or leaving it broken. We will keep doing this until Spark Connect gets more adoption, so that internal APIs no longer exist in the classpath and such issues are avoided entirely.
Okay, please make a PR to drive in your direction and ping us. I'll try to help as much as possible to recover the ideal status, @cloud-fan.
BTW, do you know or have some reference about the status of
We can probably track the PyPI downloads of
At present, internal package names are excluded during the MiMa check, such as
Therefore, should we refine the relevant rules? For example, only exclude
This approach might enable developers and reviewers to notice related issues before code is merged, and then further consult experts in the relevant fields to determine whether these internal APIs are suitable for modification. What are your thoughts on this? @cloud-fan @dongjoon-hyun @HeartSaVioR @anishshri-db
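For reference, a hedged sketch of how such rules look in MiMa's filter DSL, in the style of Spark's `project/MimaExcludes.scala`; the package patterns below are illustrative, not the actual rules in the repository:

```scala
import com.typesafe.tools.mima.core._

// Today: a blanket filter silences every binary change under an
// internal package, so moves like this one are never flagged.
val blanketExclude =
  ProblemFilters.exclude[Problem]("org.apache.spark.sql.execution.*")

// The refinement discussed above: drop the blanket rule and list only
// the specific, reviewed changes, so new breakages surface in CI.
val reviewedExcludes = Seq(
  ProblemFilters.exclude[MissingClassProblem](
    "org.apache.spark.sql.execution.streaming.HDFSMetadataLog")
)
```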
I think I agree with @cloud-fan that an ad hoc fix is better here.
We are allowed to change APIs in
With the monthly preview releases, I think it can help Spark plugin developers detect compatibility issues earlier.
Yes, I agree with the ad hoc fix as well. I just want to explore whether there are any proactive ways to identify issues in advance, so as to minimize rework as much as possible.
…pache.spark.sql.execution.streaming

### What changes were proposed in this pull request?

This is a followup of #51959 . Although internal APIs are allowed to be changed, it's still better to keep compatibility if possible to avoid breaking existing Spark plugins. This PR brings back `HDFSMetadataLog` and `SerializedOffset` to the original package, to avoid breaking the pulsar data source: https://github.com/streamnative/pulsar-spark/blob/master/src/main/scala/org/apache/spark/sql/pulsar/PulsarSources.scala#L27

### Why are the changes needed?

Avoid breaking Spark plugins

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #52387 from cloud-fan/compat.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?

SPARK-52787 and SPARK-52630 reorganized the directory structure of the streaming-related code, but failed to align the code's package names with the directory structure. Therefore, this pull request introduces the following changes (see the sketch after this list):

- `org.apache.spark.sql.execution.streaming.Source`, `org.apache.spark.sql.execution.streaming.Sink`, `org.apache.spark.sql.execution.streaming.runtime.ManifestFileCommitProtocol` and `org.apache.spark.sql.execution.streaming.ConsoleSinkProvider` are moved back to their original directories to maintain consistency with the directory structure. Their package names have not been modified, because doing so would cause MiMa (the binary compatibility checking tool) to fail or result in forward compatibility issues.
- The package names of other streaming-related code are corrected to ensure consistency with the directory structure.
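For illustration (a sketch, not code from the PR): a file at, say, `sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/runtime/MicroBatchExecution.scala` is now expected to declare the package that matches its directory:

```scala
// Package clause mirrors the on-disk path .../execution/streaming/runtime/
package org.apache.spark.sql.execution.streaming.runtime

class MicroBatchExecution // ... (body elided)
```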
Why are the changes needed?

The package name of the code should be kept as consistent as possible with the directory structure.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

No