[pull] master from apache:master #19
Merged
…s.xml`

### What changes were proposed in this pull request?
This PR changes the suppressed files from `sql/core/src/main/java/org/apache/spark/sql/api.java/*` to `sql/core/src/main/java/org/apache/spark/sql/api/java/*`; the former appears to be a wrong code path.

### Why are the changes needed?
Correct the `files` content in `checkstyle-suppressions.xml`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #38469 from LuciferYang/fix-java-supperessions.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Before this PR, calling `collect()` threw an exception recommending `toPandas()` instead. With this PR, `collect()` returns a list of PySpark `Row`s.

### Why are the changes needed?
Improve API coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #38409 from amaliujia/python_support_collect.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
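A minimal sketch of the behavior this enables in the Spark Connect Python client; the query and column names are made up for illustration:
```py
# Hypothetical session against a Spark Connect endpoint.
df = spark.sql("SELECT 1 AS id, 'a' AS name")

rows = df.collect()  # previously raised an exception suggesting toPandas()
print(rows)          # [Row(id=1, name='a')]
```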
### What changes were proposed in this pull request?
This reverts commit 9fc3aa0.

### Why are the changes needed?
The upgrade breaks the `dev/sbt-checkstyle` script; below is the error:
```
[error] org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 10; DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
[error]   at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
[error]   at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
[error]   at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
[error]   at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
[error]   at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1473)
[error]   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:914)
[error]   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
[error]   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
[error]   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
[error]   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
[error]   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
[error]   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
[error]   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
[error]   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
[error]   at scala.xml.factory.XMLLoader.parse(XMLLoader.scala:73)
[error]   at scala.xml.factory.XMLLoader.loadXML(XMLLoader.scala:54)
[error]   at scala.xml.factory.XMLLoader.loadXML$(XMLLoader.scala:53)
[error]   at scala.xml.XML$.loadXML(XML.scala:62)
[error]   at scala.xml.factory.XMLLoader.loadString(XMLLoader.scala:92)
[error]   at scala.xml.factory.XMLLoader.loadString$(XMLLoader.scala:92)
[error]   at scala.xml.XML$.loadString(XML.scala:62)
[error]   at com.etsy.sbt.checkstyle.Checkstyle$.checkstyle(Checkstyle.scala:35)
[error]   at com.etsy.sbt.checkstyle.CheckstylePlugin$autoImport$.$anonfun$checkstyleTask$1(CheckstylePlugin.scala:36)
[error]   at com.etsy.sbt.checkstyle.CheckstylePlugin$autoImport$.$anonfun$checkstyleTask$1$adapted(CheckstylePlugin.scala:34)
[error]   at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error]   at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:62)
[error]   at sbt.std.Transform$$anon$4.work(Transform.scala:68)
[error]   at sbt.Execute.$anonfun$submit$2(Execute.scala:282)
[error]   at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:23)
[error]   at sbt.Execute.work(Execute.scala:291)
[error]   at sbt.Execute.$anonfun$submit$1(Execute.scala:282)
[error]   at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error]   at sbt.CompletionService$$anon$2.call(CompletionService.scala:64)
[error]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error]   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error]   at java.lang.Thread.run(Thread.java:748)
```

Closes #38476 from linhongliu-db/fix-sbt.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…itrary columns

### What changes were proposed in this pull request?
Support DataFrame creation from a 2d NumPy array with an arbitrary number of columns.

### Why are the changes needed?
Currently, DataFrame creation from a 2d ndarray works only with exactly 2 columns. We should provide complete support for DataFrame creation from a 2d ndarray. Part of [SPARK-39405](https://issues.apache.org/jira/browse/SPARK-39405).

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```py
>>> spark.createDataFrame(np.array([[1], [2]])).dtypes
Traceback (most recent call last):
...
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (2, 1), indices imply (2, 2)
>>> spark.createDataFrame(np.array([[1, 1, 1], [2, 2, 2]])).dtypes
Traceback (most recent call last):
...
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (2, 3), indices imply (2, 2)
```

After:
```py
>>> spark.createDataFrame(np.array([[1], [2]])).dtypes
[('value', 'bigint')]
>>> spark.createDataFrame(np.array([[1, 1, 1], [2, 2, 2]])).dtypes
[('_1', 'bigint'), ('_2', 'bigint'), ('_3', 'bigint')]
```

### How was this patch tested?
Unit tests.

Closes #38473 from xinrong-meng/ncol_ndarr.

Authored-by: Xinrong Meng <xinrong@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…or classes

### What changes were proposed in this pull request?
This PR replaces `TypeCheckFailure` with `DataTypeMismatch` in type checks in the conditional expressions, including:
1. If (2): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala#L61-L67
2. CaseWhen (2): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala#L175-L183
3. InSubquery (2): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L378-L396
4. In (1): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L453

### Why are the changes needed?
Migration onto error classes unifies Spark SQL error messages.

### Does this PR introduce _any_ user-facing change?
Yes. The PR changes user-facing error messages.

### How was this patch tested?
1. Add new UTs
2. Update existing UTs
3. Pass GA

Closes #38438 from panbingkun/SPARK-40748.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
A minor change to fix a Scala-related compilation warning:
```
[WARNING] /spark-source/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala:105: [deprecation | origin= | version=2.13.7] Wrap `given` in backticks to use it as an identifier, it will become a keyword in Scala 3.
```

### Why are the changes needed?
Fix a Scala-related compilation warning.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #38478 from LuciferYang/minor-wrap-given.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Fix a few wrong or misleading comments in DAGSchedulerSuite.

### Why are the changes needed?
The wrong or misleading comments in DAGSchedulerSuite cause confusion and make it harder to understand the code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No code changes, pure comment changes. Original tests pass.

Closes #38371 from JiexingLi/fix-comments.

Authored-by: JiexingLi <jiexing.li@databricks.com>
Signed-off-by: Mridul <mridul<at>gmail.com>
### What changes were proposed in this pull request?
This PR aims to update `cloudpickle` to v2.2.0 for Apache Spark 3.4.0.

### Why are the changes needed?
SPARK-37457 updated `cloudpickle` to v2.0.0 for Apache Spark 3.3.0. This update brings the latest bug fixes:
- https://github.com/cloudpipe/cloudpickle/releases/tag/v2.2.0
- https://github.com/cloudpipe/cloudpickle/releases/tag/2.1.0

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs.

Closes #38474 from dongjoon-hyun/SPARK-40991.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…lient should have a default value=1

### What changes were proposed in this pull request?
To match the existing Python DataFrame API, this PR makes `Range.step` required, and the Python client keeps `1` as the default value for this field.

### Why are the changes needed?
Matching the existing DataFrame API.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
UT

Closes #38471 from amaliujia/range_step_required.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
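A short sketch of what the client-side default means in practice (hedged; this mirrors the existing `spark.range` API rather than the Connect proto internals):
```py
# Both calls produce the same Range relation: the Python client fills in
# step=1 when the caller does not pass it, so the now-required proto field
# is always populated.
spark.range(10)        # start=0, end=10, step defaults to 1
spark.range(0, 10, 1)  # step passed explicitly
```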
### What changes were proposed in this pull request?
This PR makes Bloom filter join use a larger number of bits to build the Bloom filter when the row count exists.

### Why are the changes needed?
To fix Bloom filter join not filtering out as much data as it could when CBO is enabled. For example, TPC-DS q64:

CBO is enabled | CBO is disabled
-- | --
<img width="282" height="600" alt="image" src="https://user-images.githubusercontent.com/5399861/187076753-2e9ccc72-0289-4537-a6d9-3a01a37bf6cd.png"> | <img width="373" height="600" alt="image" src="https://user-images.githubusercontent.com/5399861/187076786-c982e711-52e2-4199-ba42-e1100f57287b.png">
<img width="532" height="400" alt="image" src="https://user-images.githubusercontent.com/5399861/187075553-bd6956b7-8f1f-4df5-82b7-d010defb6d21.png"> | <img width="622" height="400" alt="image" src="https://user-images.githubusercontent.com/5399861/187075588-254c3246-b9af-403c-8df7-d8344fd1d2a4.png">

After this PR:

Build bloom filter | Filter data
-- | --
<img width="262" height="600" alt="image" src="https://user-images.githubusercontent.com/5399861/187075676-85b2afae-03a0-4430-9c4e-2679c6ef62f7.png"> | <img width="509" height="600" alt="image" src="https://user-images.githubusercontent.com/5399861/187075713-41173dc1-d01d-476a-b218-5c67be823e1b.png">

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test.

Closes #37697 from wangyum/SPARK-40248.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Yuming Wang <wgyumg@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
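The sizing rule this relies on is the standard Bloom filter formula: for n expected items and false-positive probability p, the optimal bit count is m = -n·ln(p)/(ln 2)². A minimal sketch of that formula (not Spark's actual implementation; the 3% FPP is an assumed example value):
```py
import math

def optimal_bloom_bits(expected_rows: int, fpp: float = 0.03) -> int:
    """Classic Bloom filter sizing: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-expected_rows * math.log(fpp) / (math.log(2) ** 2))

# With CBO row-count statistics, a 10M-row build side at 3% FPP needs
# roughly 73 million bits (about 9 MB) instead of a fixed default size.
print(optimal_bloom_bits(10_000_000))  # ~7.3e7 bits
```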
…error classes

### What changes were proposed in this pull request?
This PR replaces `TypeCheckFailure` with `DataTypeMismatch` in type checks in the complex type creator expressions, including:
1. CreateMap (3): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L205-L214
2. CreateNamedStruct (3): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L445-L457
3. UpdateFields (2): https://github.com/apache/spark/blob/1431975723d8df30a25b2333eddcfd0bb6c57677/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L670-L673

### Why are the changes needed?
Migration onto error classes unifies Spark SQL error messages.

### Does this PR introduce _any_ user-facing change?
Yes. The PR changes user-facing error messages.

### How was this patch tested?
1. Add new UTs
2. Update existing UTs
3. Pass GA

Closes #38463 from panbingkun/SPARK-40374.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Every time an entry in the offset log or commit log needs to be accessed, we read it from disk, which is slow. A cache of recent entries can speed up reads. There is already a caching mechanism in OffsetSeqLog; let's replace it with an implementation in HDFSMetadataLog (the parent class) so that reads can be served from the in-memory cache for both the offset log and the commit log.

### Why are the changes needed?
Improve read speeds for entries in the offset log and commit log.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests should suffice.

Closes #38430 from jerrypeng/SPARK-40957.

Authored-by: Jerry Peng <jerry.peng@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
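To make the caching idea concrete, here is a minimal LRU sketch keyed by batch id (an illustration only; Spark's actual HDFSMetadataLog is Scala and its cache size and eviction policy may differ):
```py
from collections import OrderedDict

class MetadataLogCache:
    """LRU cache of recent offset/commit log entries, keyed by batch id."""

    def __init__(self, capacity: int = 10):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, batch_id):
        if batch_id in self._entries:
            self._entries.move_to_end(batch_id)  # mark as recently used
            return self._entries[batch_id]
        return None  # miss: caller falls back to reading the file from disk

    def put(self, batch_id, entry):
        self._entries[batch_id] = entry
        self._entries.move_to_end(batch_id)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
```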
### What changes were proposed in this pull request?
This PR aims to upgrade RoaringBitmap to 0.9.35.

### Why are the changes needed?
This version brings some bug fixes:
- RoaringBitmap/RoaringBitmap#587
- RoaringBitmap/RoaringBitmap#588

Other changes are listed here: RoaringBitmap/RoaringBitmap@0.9.32...0.9.35

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes #38465 from LuciferYang/rbitmap-0935.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
…String

### What changes were proposed in this pull request?
This patch adds documentation describing how clients should handle connecting to the Spark Connect endpoint. gRPC as a protocol is well documented and has many options; however, this does not make it easy for users to reason about how to correctly configure gRPC for their use cases. To overcome this, the document defines a client connection string that is parsed by the different language clients and used to properly configure the gRPC client.

### Why are the changes needed?
Documentation and design specification for clients implementing the Spark Connect protocol.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc only.

Closes #38470 from grundprinzip/client-connection.

Lead-authored-by: Martin Grund <martin.grund@databricks.com>
Co-authored-by: Martin Grund <grundprinzip@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
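Spark Connect connection strings take the shape `sc://host:port/;key=value;key2=value2`. A rough parsing sketch of that shape (the parameter names and the 15002 default port are assumptions for illustration, not the spec itself):
```py
from urllib.parse import urlsplit

def parse_connect_uri(uri: str) -> dict:
    parts = urlsplit(uri)
    if parts.scheme != "sc":
        raise ValueError("Spark Connect URIs use the sc:// scheme")
    # The path looks like "/;key=value;key2=value2"; skip the leading "/".
    params = dict(kv.split("=", 1) for kv in parts.path.split(";")[1:] if kv)
    return {"host": parts.hostname, "port": parts.port or 15002, **params}

print(parse_connect_uri("sc://localhost:15002/;user_id=alice;token=ABCD"))
# {'host': 'localhost', 'port': 15002, 'user_id': 'alice', 'token': 'ABCD'}
```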
…overage in Python client

### What changes were proposed in this pull request?
This PR tests `session.sql` in the Python client, covering both the `toProto` path and the data collection path.

### Why are the changes needed?
Improve testing coverage.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
UT

Closes #38472 from amaliujia/test_sql_in_python_client.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fix the typo in the doc filename: `coient` -> `client`.

### Why are the changes needed?
Fix typo.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
UT

Closes #38487 from amaliujia/follow_up_docs.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…lient

### What changes were proposed in this pull request?
1. Improve testing coverage for `Union` and `UnionAll` (they are actually both `UnionAll`).
2. Add the API which does `UnionByName`.

### Why are the changes needed?
Improve API coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #38453 from amaliujia/python_union.

Authored-by: Rui Wang <rui.wang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
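A quick sketch of the distinction (hedged; this mirrors PySpark's existing `unionByName` semantics, with made-up column names):
```py
df1 = spark.sql("SELECT 1 AS a, 2 AS b")
df2 = spark.sql("SELECT 3 AS b, 4 AS a")

df1.union(df2)        # positional: pairs df1.a with df2.b
df1.unionByName(df2)  # aligns columns by name instead of position
```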
…o `INVALID_IDENTIFIER`

### What changes were proposed in this pull request?
In the PR, I propose to assign the proper name `INVALID_IDENTIFIER` to the legacy error class `_LEGACY_ERROR_TEMP_0040`, and modify the test suites to use `checkError()`, which checks the error class name, context, etc.

### Why are the changes needed?
The proper name improves the user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
Yes, the PR changes a user-facing error message.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "core/testOnly *SparkThrowableSuite"
$ build/sbt "test:testOnly *ErrorParserSuite"
```

Closes #38484 from MaxGekk/invalid-identifier-error-class.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
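For a sense of when this error class fires, an unquoted identifier containing a hyphen is a typical trigger (a hedged sketch; the table name is made up, and the exact message text may differ):
```py
try:
    spark.sql("CREATE TABLE test-table (c1 INT) USING PARQUET")
except Exception as e:
    # Expected to report error class INVALID_IDENTIFIER (formerly
    # _LEGACY_ERROR_TEMP_0040) and suggest back-quoting: `test-table`
    print(e)
```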
### What changes were proposed in this pull request?
Fix the filename in the doc: `client_connection_string.md` -> `client-connection-string.md`.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
GitHub CI

Closes #38493 from dengziming/minor-docs.

Authored-by: dengziming <dengziming@bytedance.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change the plain cast to `PartitioningUtils.castPartitionSpec` in `convertToPartIdent`, so the behavior follows `STORE_ASSIGNMENT_POLICY`.

### Why are the changes needed?
Make the v2 ALTER PARTITION code path follow `STORE_ASSIGNMENT_POLICY` in ANSI mode.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Removed the test `SPARK-40798: Alter partition should verify partition value - legacy`. This change is already covered by `SPARK-40798: Alter partition should verify partition value`.

Closes #38449 from ulysses-you/SPARK-40798-follow.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…tElementNames in Mima for Scala 2.13

### What changes were proposed in this pull request?
This PR is a follow-up of #35594 that recovers the Mima compatibility test for Scala 2.13.

### Why are the changes needed?
To fix the broken Mima build (https://github.com/apache/spark/actions/runs/3380379538/jobs/5613108397):
```
[error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.3.0! Found 2 potential problems (filtered 945)
[error]  * method productElementName(Int)java.lang.String in object org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages#Shutdown does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages#Shutdown.productElementName")
[error]  * method productElementNames()scala.collection.Iterator in object org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages#Shutdown does not have a correspondent in current version
[error]    filter with: ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages#Shutdown.productElementNames")
```

### Does this PR introduce _any_ user-facing change?
No, dev-only.

### How was this patch tested?
CI in this PR should test it out. After that, the scheduled jobs for Scala 2.13 will test this out.

Closes #38492 from HyukjinKwon/SPARK-38270-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
… code for MergeScalarSubqueries

### What changes were proposed in this pull request?
Recently, I read `MergeScalarSubqueries` because it is a feature used to improve performance. I found the parameters of `ScalarSubqueryReference` hard to understand, so I want to add some comments on them. Additionally, the private method `supportedAggregateMerge` of `MergeScalarSubqueries` looks redundant; this PR simplifies the code.

### Why are the changes needed?
Improve the readability and simplify the code of `MergeScalarSubqueries`.

### Does this PR introduce _any_ user-facing change?
No. It just improves the readability and simplifies the code of `MergeScalarSubqueries`.

### How was this patch tested?
Existing tests.

Closes #38461 from beliefer/SPARK-34079_followup.

Authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…L status in UI
### What changes were proposed in this pull request?
Use the `SparkListenerSQLExecutionEnd` event to track whether the SQL/DataFrame execution has completed, instead of using the job status.
### Why are the changes needed?
A SQL query may succeed even with some failed jobs. For example, in an inner join with one empty side and one large side, the plan can finish while the large side is still running.
### Does this PR introduce _any_ user-facing change?
Yes, it corrects the SQL status in the UI.
### How was this patch tested?
Added a test for backward compatibility and tested manually:
```sql
CREATE TABLE t1 (c1 int) USING PARQUET;
CREATE TABLE t2 USING PARQUET AS SELECT 1 AS c2;
```
```bash
./bin/spark-sql -e "SELECT /*+ merge(tmp) */ * FROM t1 JOIN (SELECT c2, java_method('java.lang.Thread', 'sleep', 10000L) FROM t2) tmp ON c1 = c2;"
```
before:
<img width="1712" alt="image" src="https://user-images.githubusercontent.com/12025282/196576790-7e4eeb29-024f-4ac3-bdec-f4e894448b57.png">
after:
<img width="1709" alt="image" src="https://user-images.githubusercontent.com/12025282/196576674-15d80366-bd42-417b-80bf-eeec0b1ef046.png">
Closes #38302 from ulysses-you/sql-end.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…lve `dev/sbt-checkstyle` run failed with sbt 1.7.3

### What changes were proposed in this pull request?
This PR upgrades `sbt-checkstyle-plugin` to 4.0.0 to resolve the `dev/sbt-checkstyle` failure with sbt 1.7.3. The new version also checks the generated source code, so some new suppression rules have been added to `dev/checkstyle-suppressions.xml`.

### Why are the changes needed?
#38476 reverted the sbt 1.7.3 upgrade because `dev/sbt-checkstyle` failed:
```
[error] org.xml.sax.SAXParseException; lineNumber: 18; columnNumber: 10; DOCTYPE is disallowed when the feature "http://apache.org/xml/features/disallow-doctype-decl" set to true.
[error]   at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:203)
[error]   at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
[error]   at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:400)
[error]   at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:327)
[error]   at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1473)
[error]   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl$PrologDriver.next(XMLDocumentScannerImpl.java:914)
[error]   at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:602)
[error]   at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:505)
[error]   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:842)
[error]   at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
[error]   at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
[error]   at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1213)
[error]   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:643)
[error]   at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:327)
[error]   at scala.xml.factory.XMLLoader.parse(XMLLoader.scala:73)
[error]   at scala.xml.factory.XMLLoader.loadXML(XMLLoader.scala:54)
[error]   at scala.xml.factory.XMLLoader.loadXML$(XMLLoader.scala:53)
[error]   at scala.xml.XML$.loadXML(XML.scala:62)
[error]   at scala.xml.factory.XMLLoader.loadString(XMLLoader.scala:92)
[error]   at scala.xml.factory.XMLLoader.loadString$(XMLLoader.scala:92)
[error]   at scala.xml.XML$.loadString(XML.scala:62)
[error]   at com.etsy.sbt.checkstyle.Checkstyle$.checkstyle(Checkstyle.scala:35)
[error]   at com.etsy.sbt.checkstyle.CheckstylePlugin$autoImport$.$anonfun$checkstyleTask$1(CheckstylePlugin.scala:36)
[error]   at com.etsy.sbt.checkstyle.CheckstylePlugin$autoImport$.$anonfun$checkstyleTask$1$adapted(CheckstylePlugin.scala:34)
[error]   at scala.Function1.$anonfun$compose$1(Function1.scala:49)
[error]   at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:62)
[error]   at sbt.std.Transform$$anon$4.work(Transform.scala:68)
[error]   at sbt.Execute.$anonfun$submit$2(Execute.scala:282)
[error]   at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:23)
[error]   at sbt.Execute.work(Execute.scala:291)
[error]   at sbt.Execute.$anonfun$submit$1(Execute.scala:282)
[error]   at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265)
[error]   at sbt.CompletionService$$anon$2.call(CompletionService.scala:64)
[error]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error]   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error]   at java.lang.Thread.run(Thread.java:748)
```
This PR upgrades `sbt-checkstyle-plugin` to 4.0.0 to resolve the above issue. https://github.com/stringbean/sbt-checkstyle-plugin was forked from etsy/sbt-checkstyle-plugin in 2022 after the latter became unmaintained; the release notes are as follows:
- https://github.com/stringbean/sbt-checkstyle-plugin/releases/tag/3.2.0
- https://github.com/stringbean/sbt-checkstyle-plugin/releases/tag/3.3.0
- https://github.com/stringbean/sbt-checkstyle-plugin/releases/tag/v4.0.0

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Manual test of `dev/sbt-checkstyle` with sbt 1.7.3 and this PR: `Checkstyle checks passed.`

Closes #38481 from LuciferYang/173-pass-checkstyle.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Strip leading `-` characters from the resource name prefix.

### Why are the changes needed?
Leading `-` characters are not allowed in a resource name prefix (in particular, `spark.kubernetes.executor.podNamePrefix`).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes #38331 from tobiasstadler/fix-SPARK-40869.

Lead-authored-by: Tobias Stadler <ts.stadler@gmx.de>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
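The fix amounts to normalizing the generated prefix so it starts with an alphanumeric character, as Kubernetes resource names require. A tiny sketch of the idea (illustrative; the function name is made up and this is not Spark's Scala code):
```py
import re

def clean_resource_name_prefix(prefix: str) -> str:
    """Drop leading '-' characters so the prefix is a valid start of a
    Kubernetes resource name (e.g. an executor pod name)."""
    return re.sub(r"^-+", "", prefix)

assert clean_resource_name_prefix("-my-app") == "my-app"
assert clean_resource_name_prefix("my-app") == "my-app"
```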
### What changes were proposed in this pull request?
This PR aims to re-upgrade sbt to 1.7.3, since SPARK-40996 solved the `dev/sbt-checkstyle` execution failure.

### Why are the changes needed?
The release notes are as follows; this version just updates sbt's underlying Coursier from 2.1.0-M2 to 2.1.0-M7-18-g67daad6a9:
- https://github.com/sbt/sbt/releases/tag/v1.7.3

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Manual test: run `dev/sbt-checkstyle` with this PR
```
Checkstyle checks passed.
```

Closes #38502 from LuciferYang/SPARK-40976-2.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…tests for JDK 9+

### What changes were proposed in this pull request?
This PR is a follow-up for #38277. The change is required due to test failures on JDK 11 and JDK 17; the patch disables the unit test for JDK 9+. This is a temporary measure while I am debugging and working on the fix for higher versions of the JDK.

### Why are the changes needed?
Fixes the test failure on JDK 11.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A.

Closes #38504 from sadikovi/fix-symlink-test.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…se-sensitive

### What changes were proposed in this pull request?
Add one sentence to the documentation stating that connection string parameters are case-sensitive.

### Why are the changes needed?
Developer experience.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc only

Closes #38501 from grundprinzip/SPARK-41001-v21.

Authored-by: Martin Grund <martin.grund@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
pull bot pushed a commit that referenced this pull request on Aug 12, 2024:
…eption

### What changes were proposed in this pull request?
This PR reworks group-by on map type to fix two issues:
- A "cannot bind reference" exception at runtime, since the attribute was wrapped by `MapSort` and we did not transform the plan with the new output.
- The rule that adds `MapSort` should be placed before `PullOutGroupingExpressions` to avoid complex expressions remaining in grouping keys.

### Why are the changes needed?
To fix the issues. For example:
```
select map(1, id) from range(10) group by map(1, id);

[INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
org.apache.spark.SparkException: [INTERNAL_ERROR] Couldn't find _groupingexpression#18 in [mapsort(_groupingexpression#18)#19] SQLSTATE: XX000
  at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
  at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:81)
  at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:470)
```

### Does this PR introduce _any_ user-facing change?
No, not yet released.

### How was this patch tested?
Improved the tests to add more cases.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47545 from ulysses-you/maptype.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: youxiduo <youxiduo@corp.netease.com>
pull bot pushed a commit that referenced this pull request on Jan 23, 2025:
…IN-subquery

### What changes were proposed in this pull request?
This PR adds code to `RewritePredicateSubquery#apply` to explicitly handle the case where an `Aggregate` node contains an aggregate expression in the left-hand operand of an IN-subquery expression. The explicit handler moves the IN-subquery expressions out of the `Aggregate` and into a parent `Project` node. The `Aggregate` will continue to perform the aggregations that were used as an operand to the IN-subquery expression, but will not include the IN-subquery expression itself. After pulling up IN-subquery expressions into a `Project` node, `RewritePredicateSubquery#apply` is called again to handle the `Project` as a `UnaryNode`. The `Join` will now be inserted between the `Project` and the `Aggregate` node, and the join condition will use an attribute rather than an aggregate expression, e.g.:
```
Project [col1#32, exists#42 AS (sum(col2) IN (listquery()))#40]
+- Join ExistenceJoin(exists#42), (sum(col2)#41L = c2#39L)
   :- Aggregate [col1#32], [col1#32, sum(col2#33) AS sum(col2)#41L]
   :  +- LocalRelation [col1#32, col2#33]
   +- LocalRelation [c2#39L]
```
`sum(col2)#41L` in the above join condition, despite how it looks, is the name of the attribute, not an aggregate expression.

### Why are the changes needed?
The following query fails:
```
create or replace temp view v1(c1, c2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1);
create or replace temp view v2(col1, col2) as values (1, 2), (1, 3), (2, 2), (3, 7), (3, 1);

select col1, sum(col2) in (select c2 from v1) from v2 group by col1;
```
It fails with this error:
```
[INTERNAL_ERROR] Cannot generate code for expression: sum(input[1, int, false]) SQLSTATE: XX000
```
With SPARK_TESTING=1, it fails with this error:
```
[PLAN_VALIDATION_FAILED_RULE_IN_BATCH] Rule org.apache.spark.sql.catalyst.optimizer.RewritePredicateSubquery in batch RewriteSubquery generated an invalid plan: Special expressions are placed in the wrong plan:
Aggregate [col1#11], [col1#11, first(exists#20, false) AS (sum(col2) IN (listquery()))#19]
+- Join ExistenceJoin(exists#20), (sum(col2#12) = c2#18L)
   :- LocalRelation [col1#11, col2#12]
   +- LocalRelation [c2#18L]
```
The issue is that `RewritePredicateSubquery` builds a `Join` operator where the join condition contains an aggregate expression. The bug is in the handler for `UnaryNode` in `RewritePredicateSubquery#apply`, which adds a `Join` below the `Aggregate` and assumes that the left-hand operand of the IN-subquery can be used in the join condition. This works fine for most cases, but not when the left-hand operand is an aggregate expression. This PR moves the offending IN-subqueries to a `Project` node, with the aggregates replaced by attributes referring to the aggregate expressions. The resulting join condition now uses those attributes rather than the actual aggregate expressions.

### Does this PR introduce _any_ user-facing change?
No, other than allowing this type of query to succeed.

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#48627 from bersprockets/aggregate_in_set_issue.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
huangxiaopingRD pushed a commit that referenced this pull request on Sep 2, 2025:
Same change as apache#48627 above (cherry picked from commit e02ff1c). Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )