forked from apache/spark
sync #15
Merged
Conversation
### What changes were proposed in this pull request?
This is a followup of #28695, to fix the problem completely.

The root cause is that `df("col").as("name")` is no longer a column reference, and should not carry the special column metadata. However, this was broken in ba7adc4#diff-ac415c903887e49486ba542a65eec980L1050-L1053

This PR fixes the regression by stripping the special column metadata in `Column.name`, which restores the behavior before #28326.

### Why are the changes needed?
Fix a regression. We shouldn't fail if there is no ambiguous self-join.

### Does this PR introduce _any_ user-facing change?
Yes, the query in the test can run now.

### How was this patch tested?
Updated test.

Closes #28783 from cloud-fan/self-join.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
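As a rough illustration of the case this unblocks (a minimal PySpark sketch, not code from the PR; the data and column names are made up):

```python
# Minimal sketch: an aliased column is no longer treated as a bare column
# reference, so this query should run rather than trip the ambiguous
# self-join check when no real self-join is involved.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,)], ["col"])

df.select(df["col"].alias("name")).show()
```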
…lizers.py

### What changes were proposed in this pull request?
This PR aims to support both pandas 0.23 and 1.0.

### Why are the changes needed?
```
$ pip install pandas==0.23.2
$ python -c "import pandas.CategoricalDtype"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas.CategoricalDtype'
$ python -c "from pandas.api.types import CategoricalDtype"
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the Jenkins.
```
$ pip freeze | grep pandas
pandas==0.23.2
$ python/run-tests.py --python-executables python --modules pyspark-sql
...
Tests passed in 359 seconds
```

Closes #28789 from williamhyun/williamhyun-patch-2.

Authored-by: William Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
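A version-tolerant import along these lines (a sketch of the general pattern, not the exact change in the PR) keeps both pandas lines working:

```python
# Sketch: prefer the stable public location, which exists on pandas 0.23 and 1.0.
try:
    from pandas.api.types import CategoricalDtype
except ImportError:
    # Defensive fallback only; pandas.api.types is the documented location
    # for 0.23+, so this branch is an assumption for other layouts.
    from pandas import CategoricalDtype
```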
…he case sensitivity in grouped and cogrouped pandas UDFs

### What changes were proposed in this pull request?
This is another approach to fix the issue. See the previous attempt, #28745. It was too invasive, so I took a more conservative approach.

This PR proposes to resolve the grouping attributes separately first, so they can be properly referred to when `FlatMapGroupsInPandas` and `FlatMapCoGroupsInPandas` are resolved without ambiguity.

Previously,

```python
from pyspark.sql.functions import *
df = spark.createDataFrame([[1, 1]], ["column", "Score"])

@pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
    return pdf.assign(Score=0.5)

df.groupby('COLUMN').apply(my_pandas_udf).show()
```

failed as below:

```
pyspark.sql.utils.AnalysisException: "Reference 'COLUMN' is ambiguous, could be: COLUMN, COLUMN.;"
```

because the unresolved `COLUMN` in `FlatMapGroupsInPandas` doesn't know which reference to take from the child projection.

After this fix, it resolves the child projection first with the grouping keys, and passes to `FlatMapGroupsInPandas` the positionally selected attribute from the child projection as a grouping key.

### Why are the changes needed?
To resolve grouping keys correctly.

### Does this PR introduce _any_ user-facing change?
Yes,

```python
from pyspark.sql.functions import *
df = spark.createDataFrame([[1, 1]], ["column", "Score"])

@pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
    return pdf.assign(Score=0.5)

df.groupby('COLUMN').apply(my_pandas_udf).show()
```

```python
df1 = spark.createDataFrame([(1, 1)], ("column", "value"))
df2 = spark.createDataFrame([(1, 1)], ("column", "value"))

df1.groupby("COLUMN").cogroup(
    df2.groupby("COLUMN")
).applyInPandas(lambda r, l: r + l, df1.schema).show()
```

Before:

```
pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could be: COLUMN, COLUMN.;
```

```
pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input columns: [COLUMN, COLUMN, value, value];;
'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], <lambda>(column#9L, value#10L, column#13L, value#14L), [column#22L, value#23L]
:- Project [COLUMN#9L, column#9L, value#10L]
:  +- LogicalRDD [column#9L, value#10L], false
+- Project [COLUMN#13L, column#13L, value#14L]
   +- LogicalRDD [column#13L, value#14L], false
```

After:

```
+------+-----+
|column|Score|
+------+-----+
|     1|  0.5|
+------+-----+
```

```
+------+-----+
|column|value|
+------+-----+
|     2|    2|
+------+-----+
```

### How was this patch tested?
Unit tests were added, and the change was manually tested.

Closes #28777 from HyukjinKwon/SPARK-31915-another.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
…riftCLIService to getPortNumber"

This reverts commit 02f32cf.
### What changes were proposed in this pull request?
This PR updates the test case to accept the Hadoop 2/3 error message correctly.

### Why are the changes needed?
SPARK-31935 (#28760) breaks the Hadoop 3.2 UT because Hadoop 2 and Hadoop 3 have different exception messages.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the Jenkins with both Hadoop 2/3, or do the following manually.

**Hadoop 2.7**
```
$ build/sbt "sql/testOnly *.FileBasedDataSourceSuite -- -z SPARK-31935"
...
[info] All tests passed.
```

**Hadoop 3.2**
```
$ build/sbt "sql/testOnly *.FileBasedDataSourceSuite -- -z SPARK-31935" -Phadoop-3.2
...
[info] All tests passed.
```

Closes #28791 from dongjoon-hyun/SPARK-31935.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…e conversion

### What changes were proposed in this pull request?
When using pyarrow to convert a Pandas categorical column, use `is_categorical` instead of trying to import `CategoricalDtype`.

### Why are the changes needed?
The import for `CategoricalDtype` changed from Pandas 0.23 to 1.0, and pyspark currently tries both locations. Using `is_categorical` is a more stable API.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #28793 from BryanCutler/arrow-use-is_categorical-SPARK-31964.

Authored-by: Bryan Cutler <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
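For illustration, a dtype check along these lines (a sketch; the exact call site in pyspark's Arrow serializers differs) sidesteps the import entirely:

```python
# Sketch: detect a categorical Series without importing CategoricalDtype.
import pandas as pd
from pandas.api.types import is_categorical

s = pd.Series(["a", "b", "a"], dtype="category")
if is_categorical(s):
    # Cast back to the categories' underlying dtype before handing the
    # values to pyarrow, mirroring the idea described above.
    s = s.astype(s.cat.categories.dtype)
```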
…ld pattern is missing
### What changes were proposed in this pull request?
If a datetime pattern contains no year field, the day-of-year field should not be ignored if it exists,
e.g.
```
spark-sql> select to_timestamp('31', 'DD');
1970-01-01 00:00:00
spark-sql> select to_timestamp('31 30', 'DD dd');
1970-01-30 00:00:00
spark.sql.legacy.timeParserPolicy legacy
spark-sql> select to_timestamp('31', 'DD');
1970-01-31 00:00:00
spark-sql> select to_timestamp('31 30', 'DD dd');
NULL
```
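For reference, the same behavior through the DataFrame API (a sketch using `pyspark.sql.functions.to_timestamp`; the expected output assumes the fixed, non-legacy parser):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("31",)], ["s"])

# With the fix, the day-of-year pattern 'DD' is honored even though the
# pattern contains no year field: expect 1970-01-31 00:00:00.
df.select(F.to_timestamp("s", "DD")).show()
```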
This PR only fixes some corner cases that use the 'D' pattern to parse datetimes without a 'y' pattern.
### Why are the changes needed?
Fix some corner cases.
### Does this PR introduce _any_ user-facing change?
Yes, the day-of-year field will no longer be ignored.
### How was this patch tested?
Added unit tests.
Closes #28766 from yaooqinn/SPARK-31939.
Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
… condition

### What changes were proposed in this pull request?
This reverts commit b9737c3 while keeping the following changes:
* set the default value of `spark.sql.adaptive.skewJoin.skewedPartitionFactor` to 5
* improve tests
* remove unused imports

### Why are the changes needed?
As discussed in #28669 (comment), revert SPARK-31864 for optimizing skew join to work for extremely clustered keys.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #28770 from manuzhang/spark-31942.

Authored-by: manuzhang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
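For context, the retained default can still be overridden per session (a sketch; 5 is the new default mentioned above):

```python
# Sketch: tune the skew-join settings kept by this revert.
spark.conf.set("spark.sql.adaptive.enabled", True)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True)
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", 5)
```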
### What changes were proposed in this pull request?
This PR updates the test case to accept the Hadoop 2/3 error message correctly.

### Why are the changes needed?
SPARK-31935 (#28760) breaks the Hadoop 3.2 UT because Hadoop 2 and Hadoop 3 have different exception messages. In #28791, two test suites were missed by the fix.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test.

Closes #28796 from gengliangwang/SPARK-31926-followup.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…egistration to test conditionally
### What changes were proposed in this pull request?
This PR proposes to move the doctests in `registerJavaUDAF` and `registerJavaFunction` to the proper unittests that run conditionally when the test classes are present.
Both tests are dependent on the test classes in JVM side, `test.org.apache.spark.sql.JavaStringLength` and `test.org.apache.spark.sql.MyDoubleAvg`. So if you run the tests against the plain `sbt package`, it fails as below:
```
**********************************************************************
File "/.../spark/python/pyspark/sql/udf.py", line 366, in pyspark.sql.udf.UDFRegistration.registerJavaFunction
Failed example:
spark.udf.registerJavaFunction(
"javaStringLength", "test.org.apache.spark.sql.JavaStringLength", IntegerType())
Exception raised:
Traceback (most recent call last):
...
test.org.apache.spark.sql.JavaStringLength, please make sure it is on the classpath;
...
6 of 7 in pyspark.sql.udf.UDFRegistration.registerJavaFunction
2 of 4 in pyspark.sql.udf.UDFRegistration.registerJavaUDAF
***Test Failed*** 8 failures.
```
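As a rough sketch of the conditional pattern (not the actual test code; the `_jvm_class_exists` helper and its Py4J probe are illustrative assumptions):

```python
# Sketch: run the Java UDF registration test only when the JVM test
# classes from test:package are actually on the classpath.
import unittest
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

def _jvm_class_exists(spark, name):
    # Illustrative helper: probe the JVM for the class via Py4J.
    try:
        spark.sparkContext._jvm.java.lang.Class.forName(name)
        return True
    except Exception:
        return False

class UDFTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.spark = SparkSession.builder.getOrCreate()

    def test_register_java_function(self):
        if not _jvm_class_exists(self.spark, "test.org.apache.spark.sql.JavaStringLength"):
            self.skipTest("JVM test classes not built; run ./build/sbt test:package")
        self.spark.udf.registerJavaFunction(
            "javaStringLength", "test.org.apache.spark.sql.JavaStringLength", IntegerType())
        [row] = self.spark.sql("SELECT javaStringLength('test')").collect()
        self.assertEqual(row[0], 4)
```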
### Why are the changes needed?
In order to support running the tests against the plain SBT build. See also https://spark.apache.org/developer-tools.html
### Does this PR introduce _any_ user-facing change?
No, it's test-only.
### How was this patch tested?
Manually tested as below:
```bash
./build/sbt -DskipTests -Phive-thriftserver clean package
cd python
./run-tests --python-executable=python3 --testname="pyspark.sql.udf UserDefinedFunction"
./run-tests --python-executable=python3 --testname="pyspark.sql.tests.test_udf UDFTests"
```
```bash
./build/sbt -DskipTests -Phive-thriftserver clean test:package
cd python
./run-tests --python-executable=python3 --testname="pyspark.sql.udf UserDefinedFunction"
./run-tests --python-executable=python3 --testname="pyspark.sql.tests.test_udf UDFTests"
```
Closes #28795 from HyukjinKwon/SPARK-31965.
Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… sql.execution package

### What changes were proposed in this pull request?
This PR proposes to remove the package-private modifiers in classes/objects in the sql.execution package, as per SPARK-16964.

### Why are the changes needed?
This is per a post-hoc review comment; see #24996 (comment).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A

Closes #28790 from HeartSaVioR/SPARK-28199-FOLLOWUP-apply-SPARK-16964.

Authored-by: Jungtaek Lim (HeartSaVioR) <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…gisticRegressionWithSGDTests.test_training_and_prediction

### What changes were proposed in this pull request?
This is similar to 64cb6f7. The test `StreamingLogisticRegressionWithSGDTests.test_training_and_prediction` also seems flaky. This PR just increases the timeout to 3 mins too. The cause is very likely the time elapsed. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123787/testReport/pyspark.mllib.tests.test_streaming_algorithms/StreamingLogisticRegressionWithSGDTests/test_training_and_prediction/

```
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 330, in test_training_and_prediction
    eventually(condition, timeout=60.0)
  File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/testing/utils.py", line 90, in eventually
    % (timeout, lastValue))
AssertionError: Test failed due to timeout after 60 sec, with last condition returning: Latest errors: 0.67, 0.71, 0.78, 0.7, 0.75, 0.74, 0.73, 0.69, 0.62, 0.71, 0.69, 0.75, 0.72, 0.77, 0.71, 0.74, 0.76, 0.78, 0.7, 0.78, 0.8, 0.74, 0.77, 0.75, 0.76, 0.76, 0.75
```

### Why are the changes needed?
To make PR builds more stable.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Jenkins will test them out.

Closes #28798 from HyukjinKwon/SPARK-31966.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
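The change itself boils down to a larger timeout in the polling helper (a sketch of the pattern; `eventually` and its `timeout` keyword appear in the traceback above, while the condition body is made up):

```python
from pyspark.testing.utils import eventually

def condition():
    # Illustrative condition: return True once the streaming model's
    # recent error rate drops below the asserted threshold.
    return True

# Before: eventually(condition, timeout=60.0) was flaky on slow executors.
# After: give the streaming job up to 3 minutes to converge.
eventually(condition, timeout=180.0)
```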