remove unnecessary evaluation from SortOrder #3

cloud-fan · 2015-12-02T15:45:13Z

No description provided.

Currently the size of a cached batch is only controlled by `batchSize` (default value is 10000), which does not work well with the size of serialized columns (for example, complex types). The memory used to build the batch is not accounted for, so it's easy to OOM (especially after unified memory management). This PR introduces a hard limit of 4M for total columns (up to 50 columns of uncompressed primitive columns). This also changes the way the buffer grows: double it each time, then trim it once finished. cc liancheng Author: Davies Liu <[email protected]> Closes apache#9760 from davies/cache_limit.
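The "double it each time, then trim it once finished" growth strategy from the cache-limit commit above can be sketched roughly as follows. This is a minimal illustration only, not Spark's column builder code: the class `GrowableBuffer` and its method names are hypothetical.

```python
class GrowableBuffer:
    """Hypothetical append-only byte buffer that doubles when full."""

    def __init__(self, initial_capacity: int = 16) -> None:
        self._buf = bytearray(initial_capacity)
        self._size = 0  # number of bytes actually written

    @property
    def capacity(self) -> int:
        return len(self._buf)

    def append(self, data: bytes) -> None:
        needed = self._size + len(data)
        new_capacity = max(self.capacity, 1)
        # Double the capacity until the new data fits.
        while new_capacity < needed:
            new_capacity *= 2
        if new_capacity != self.capacity:
            grown = bytearray(new_capacity)
            grown[: self._size] = self._buf[: self._size]
            self._buf = grown
        self._buf[self._size : needed] = data
        self._size = needed

    def build(self) -> bytes:
        # Trim the unused tail once the buffer is finished.
        return bytes(self._buf[: self._size])
```

Doubling keeps the number of reallocations logarithmic in the final size, and the final trim avoids holding on to the unused tail.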

This adds an extra filter for private or protected classes. We only filter for package private right now. Author: Timothy Hunter <[email protected]> Closes apache#9697 from thunterdb/spark-11732.

…ude_example JIRA link: https://issues.apache.org/jira/browse/SPARK-11729 Author: Xusen Yin <[email protected]> Closes apache#9713 from yinxusen/SPARK-11729.

Add save/load to LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator, Model pairs. Moved LogisticRegressionReader/Writer to within LogisticRegressionModel CC: mengxr Author: Joseph K. Bradley <[email protected]> Closes apache#9749 from jkbradley/lr-io-2.

This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley Author: Xiangrui Meng <[email protected]> Closes apache#9776 from mengxr/SPARK-11764.

These events happen normally during the app's lifecycle, so printing out ERROR logs all the time is misleading, and can actually affect usability of interactive shells. Author: Marcelo Vanzin <[email protected]> Closes apache#9772 from vanzin/SPARK-11786.

… a batch We checkpoint both when generating a batch and when completing a batch. When the processing time of a batch is greater than the batch interval, checkpointing for completing an old batch may run after checkpointing for generating a new batch. If this happens, the checkpoint of the old batch actually has the latest information, so we want to recover from it. This PR uses the latest checkpoint time as the file name, so that we can always recover from the latest checkpoint file. Author: Shixiong Zhu <[email protected]> Closes apache#9707 from zsxwing/fix-checkpoint.
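The naming rule in the checkpoint commit above can be pictured with a small sketch. This is an assumption-laden illustration, not the streaming checkpoint writer itself: the class `CheckpointNamer`, the `checkpoint-<time>` file name pattern, and the millisecond unit are all hypothetical.

```python
class CheckpointNamer:
    """Hypothetical helper that names checkpoint files by the latest time seen."""

    def __init__(self) -> None:
        self.latest_time_ms = 0

    def file_name_for(self, checkpoint_time_ms: int) -> str:
        # A late checkpoint for an old batch still carries the newest state,
        # so it is written under the latest checkpoint time seen so far.
        self.latest_time_ms = max(self.latest_time_ms, checkpoint_time_ms)
        return f"checkpoint-{self.latest_time_ms}"
```

With this rule, a checkpoint for batch time 1000 that runs after the checkpoint for batch time 2000 is still written under the name for time 2000, so recovery from the newest file name picks up the newest state.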

…ng for those busy executors With dynamic allocation, busy executors are sometimes killed by mistake. Some executors with assigned tasks are killed because they have been idle for long enough (say 60 seconds). The root cause is that the Task-Launch listener event is asynchronous. For example, some executors are being assigned tasks but have not sent out the listener notification yet. Meanwhile, the dynamic allocation's executor idle time is up (e.g., 60 seconds), which triggers a killExecutor event at the same time. 1. The timer expires before the listener event arrives. 2. Then, the task goes on to run on that killed/killing executor, which finally leads to task failure. Here is the proposal to fix it. We can add a force control for killExecutor. If the force control is not set (i.e., false), we should check whether the executor being killed is idle or busy. If the executor has some assignment, we should not kill it and should return false (to indicate killing failure). In dynamic allocation, we should turn off force killing (i.e., force = false), so that an attempt to kill a busy executor results in a killing failure. The executor timer then remains valid. Later on, when the task assignment event arrives, we can remove the idle timer accordingly. This way we avoid falsely killing busy executors in dynamic allocation. For the remaining usages, end users can decide for themselves whether to use force killing. If that option is turned on, killExecutor will act without any status checking. Author: Grace <[email protected]> Author: Andrew Or <[email protected]> Author: Jie Huang <[email protected]> Closes apache#7888 from GraceH/forcekill.
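The force-control proposal above boils down to one extra check before killing. The following is a minimal sketch under stated assumptions, not the scheduler backend's actual code: `ExecutorTracker`, `running_tasks`, and `kill_executor` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutorTracker:
    """Hypothetical bookkeeping: executor id -> number of running tasks."""

    running_tasks: dict = field(default_factory=dict)
    killed: set = field(default_factory=set)

    def is_busy(self, executor_id: str) -> bool:
        return self.running_tasks.get(executor_id, 0) > 0

    def kill_executor(self, executor_id: str, force: bool = False) -> bool:
        # Without force, refuse to kill an executor that still has work
        # and report the failure to the caller.
        if not force and self.is_busy(executor_id):
            return False
        self.killed.add(executor_id)
        return True
```

In this sketch dynamic allocation would call `kill_executor(executor_id, force=False)` and keep the idle timer when the call returns `False`, while other callers may pass `force=True` to skip the status check.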

Author: Rohan Bhanderi <[email protected]> Closes apache#9781 from RohanBhanderi/patch-3.

Sometimes, EmbeddedZookeeper may need more than 6 seconds to set up on a slow Jenkins worker. So just increase the timeout; it won't increase the test time if the test passes. Author: Shixiong Zhu <[email protected]> Closes apache#9778 from zsxwing/SPARK-11790.

…two params have both in error msg When we exceed the max memory tell users to increase both params instead of just the one. Author: Holden Karau <[email protected]> Closes apache#9758 from holdenk/SPARK-11771-maximum-memory-in-yarn-is-controlled-by-two-params-have-both-in-error-msg.

… response Author: Jacek Lewandowski <[email protected]> Closes apache#9692 from jacek-lewandowski/SPARK-11726.

Fixed the merge conflicts in apache#7410 Closes apache#7410 Author: Shixiong Zhu <[email protected]> Author: jerryshao <[email protected]> Author: jerryshao <[email protected]> Closes apache#9742 from zsxwing/pr7410.

…y for maps. I also wrote a test case -- but unfortunately the test case is not working due to SPARK-11795. Author: Reynold Xin <[email protected]> Closes apache#9784 from rxin/SPARK-11503.

Fix the serialization of RoaringBitmap with the Kryo serializer. This PR came from metamx#1, thanks to drcrallen Author: Davies Liu <[email protected]> Author: Charles Allen <[email protected]> Closes apache#9748 from davies/SPARK-11016.

This PR upgrades the version of RoaringBitmap to 0.5.10 to optimize the memory layout, which will be much smaller when most of the blocks are empty. This PR is based on apache#9661 (fix conflicts), see all of the comments at apache#9661 . Author: Kent Yao <[email protected]> Author: Davies Liu <[email protected]> Author: Charles Allen <[email protected]> Closes apache#9746 from davies/roaring_mapstatus.

The default serialization of UTF8String with Kryo may not be correct (BYTE_ARRAY_OFFSET could be different across JVMs) Author: Davies Liu <[email protected]> Closes apache#9704 from davies/kyro_string.

…erialization They were previously using Spark's default serializer for serialization. Author: Reynold Xin <[email protected]> Closes apache#9787 from rxin/SPARK-11797.

For the bug described at [SPARK-11755](https://issues.apache.org/jira/browse/SPARK-11755): after exporting `predict`, we can get the help information from both the SparkR and the base R package, like the following:

```
> help(predict)
Help on topic ‘predict’ was found in the following packages:

  Package               Library
  SparkR                /Users/yanboliang/data/trunk2/spark/R/lib
  stats                 /Library/Frameworks/R.framework/Versions/3.2/Resources/library

Choose one

1: Make predictions from a model {SparkR}
2: Model Predictions {stats}
```

Author: Yanbo Liang <[email protected]> Closes apache#9732 from yanboliang/spark-11755.

…ener bus's thread See discussion toward the tail of apache#9723 From zsxwing : ``` The user should not call stop or other long-time work in a listener since it will block the listener thread, and prevent from stopping SparkContext/StreamingContext. I cannot see an approach since we need to stop the listener bus's thread before stopping SparkContext/StreamingContext totally. ``` Proposed solution is to prevent the call to StreamingContext#stop() in the listener bus's thread. Author: tedyu <[email protected]> Closes apache#9741 from tedyu/master.

I have added a unit test for ML's StandardScaler by comparing with R's output. Please review it for me. Thx. Author: RoyGaoVLIS <[email protected]> Closes apache#6665 from RoyGao/7013.

Support the years between 0 <= year < 1000 Author: Davies Liu <[email protected]> Closes apache#9701 from davies/leading_zero.

…xample JIRA issue https://issues.apache.org/jira/browse/SPARK-11728. The ml-ensembles.md file contains `OneVsRestExample`. Instead of writing new code files of two `OneVsRestExample`s, I use two existing files in the examples directory, they are `OneVsRestExample.scala` and `JavaOneVsRestExample.scala`. Author: Xusen Yin <[email protected]> Closes apache#9716 from yinxusen/SPARK-11728.

Author: Wenchen Fan <[email protected]> Closes apache#9783 from cloud-fan/postgre.

I also found a bug with self-joins returning incorrect results in the Dataset API. Two test cases attached and filed SPARK-11803. Author: Reynold Xin <[email protected]> Closes apache#9789 from rxin/SPARK-11802.

…n of UnsafeHashedRelations https://issues.apache.org/jira/browse/SPARK-11792 Right now, SizeEstimator will "think" a small UnsafeHashedRelation is several GBs. Author: Yin Huai <[email protected]> Closes apache#9788 from yhuai/SPARK-11792.

…aredStatement.executeUpdate for DDLs New changes with JDBCRDD Author: somideshmukh <[email protected]> Closes apache#9733 from somideshmukh/SomilBranch-1.1.

"Force" the executor ID sort with Int. Author: Jean-Baptiste Onofré <[email protected]> Closes apache#9165 from jbonofre/SPARK-6541.

Update to Commons Collections 3.2.2 to avoid any potential remote code execution vulnerability Author: Sean Owen <[email protected]> Closes apache#9731 from srowen/SPARK-11652.

It was multiplying by U instead of dividing by U Author: Viveka Kulharia <[email protected]> Closes apache#9771 from vivkul/patch-1.

…lize `JavaSerializerInstance.serialize` uses `ByteArrayOutputStream.toByteArray` to get the serialized data. `ByteArrayOutputStream.toByteArray` needs to copy the content in the internal array to a new array. However, since the array will be converted to `ByteBuffer` at once, we can avoid the memory copy. This PR added `ByteBufferOutputStream` to access the protected `buf` and convert it to a `ByteBuffer` directly. Author: Shixiong Zhu <[email protected]> Closes apache#10051 from zsxwing/SPARK-12060.

This PR backports PR apache#10039 to master Author: Cheng Lian <[email protected]> Closes apache#10063 from liancheng/spark-12046.doc-fix.master.

…ill fail The reason is that, for a single column `RowEncoder` (or a single field product encoder), when we use it as the encoder for the grouping key, we should also combine the grouping attributes, although there is only one grouping attribute. Author: Wenchen Fan <[email protected]> Closes apache#10059 from cloud-fan/bug.

…mpatible with encoder schema When we build the `fromRowExpression` for an encoder, we set up a lot of "unresolved" stuff and lost the required data type, which may lead to runtime error if the real type doesn't match the encoder's schema. For example, we build an encoder for `case class Data(a: Int, b: String)` and the real type is `[a: int, b: long]`, then we will hit runtime error and say that we can't construct class `Data` with int and long, because we lost the information that `b` should be a string. Author: Wenchen Fan <[email protected]> Closes apache#9840 from cloud-fan/err-msg.

create java version of `constructorFor` and `extractorFor` in `JavaTypeInference` Author: Wenchen Fan <[email protected]> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <[email protected]> Closes apache#9937 from cloud-fan/pojo.

Persist and Unpersist exist in both RDD and Dataframe APIs. I think they are still very critical in Dataset APIs. Not sure if my understanding is correct? If so, could you help me check if the implementation is acceptable? Please provide your opinions. marmbrus rxin cloud-fan Thank you very much! Author: gatorsmile <[email protected]> Author: xiaoli <[email protected]> Author: Xiao Li <[email protected]> Closes apache#9889 from gatorsmile/persistDS.

andrewor14 the same PR as in branch 1.5 harishreedharan Author: woj-i <[email protected]> Closes apache#9859 from woj-i/master.

This commit upgrades the Tachyon dependency from 0.8.1 to 0.8.2. Author: Josh Rosen <[email protected]> Closes apache#10054 from JoshRosen/upgrade-to-tachyon-0.8.2.

This bug was exposed as memory corruption in Timsort which uses copyMemory to copy large regions that can overlap. The prior implementation did not handle this case half the time and always copied forward, resulting in the data being corrupt. Author: Nong Li <[email protected]> Closes apache#10068 from nongli/spark-12030.
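To see why always copying forward corrupts overlapping regions, as described in the commit above, consider this small sketch. It is an illustration only, using a Python `bytearray` in place of raw memory. The functions `copy_forward_only` and `copy_within` are hypothetical names and not the `copyMemory` implementation.

```python
def copy_forward_only(buf: bytearray, src: int, dst: int, length: int) -> None:
    """Naive copy: wrong when dst > src and the regions overlap."""
    for i in range(length):
        buf[dst + i] = buf[src + i]


def copy_within(buf: bytearray, src: int, dst: int, length: int) -> None:
    """Overlap-safe copy: pick the direction based on src and dst."""
    if dst < src:
        # Destination is before the source: copy front to back.
        for i in range(length):
            buf[dst + i] = buf[src + i]
    else:
        # Destination is after the source: copy back to front so that
        # source bytes are read before they are overwritten.
        for i in range(length - 1, -1, -1):
            buf[dst + i] = buf[src + i]


broken = bytearray(b"abcdef")
copy_forward_only(broken, 0, 2, 4)   # overwrites source bytes before reading them

fixed = bytearray(b"abcdef")
copy_within(fixed, 0, 2, 4)          # yields b"ababcd"
```

The naive version reads bytes it has already overwritten, which is the kind of corruption that can surface in a sort that shifts large overlapping runs.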

The solution is to save the RDD partitioner in a separate file in the RDD checkpoint directory. That is, `<checkpoint dir>/_partitioner`. In most cases, whether the RDD partitioner was recovered or not does not affect the correctness; it only reduces performance. So this solution makes a best-effort attempt to save and recover the partitioner. If either fails, the checkpointing is not affected. This makes this patch safe and backward compatible. Author: Tathagata Das <[email protected]> Closes apache#9983 from tdas/SPARK-12004.

…ce.serialize" This reverts commit 1401166.

https://issues.apache.org/jira/browse/SPARK-11961 Author: Xusen Yin <[email protected]> Closes apache#9965 from yinxusen/SPARK-11961.

… recovery issue Fixed a minor race condition in apache#10017 Closes apache#10017 Author: jerryshao <[email protected]> Author: Shixiong Zhu <[email protected]> Closes apache#10074 from zsxwing/review-pr10017.

…g up TestHive.reset() When profiling HiveCompatibilitySuite, I noticed that most of the time seems to be spent in expensive `TestHive.reset()` calls. This patch speeds up suites based on HiveComparisionTest, such as HiveCompatibilitySuite, with the following changes: - Avoid `TestHive.reset()` whenever possible: - Use a simple set of heuristics to guess whether we need to call `reset()` in between tests. - As a safety-net, automatically re-run failed tests by calling `reset()` before the re-attempt. - Speed up the expensive parts of `TestHive.reset()`: loading the `src` and `srcpart` tables took roughly 600ms per test, so we now avoid this by using a simple heuristic which only loads those tables by tests that reference them. This is based on simple string matching over the test queries which errs on the side of loading in more situations than might be strictly necessary. After these changes, HiveCompatibilitySuite seems to run in about 10 minutes. This PR is a revival of apache#6663, an earlier experimental PR from June, where I played around with several possible speedups for this suite. Author: Josh Rosen <[email protected]> Closes apache#10055 from JoshRosen/speculative-testhive-reset.

The issue is that the output committer is not idempotent and retry attempts will fail because the output file already exists. It is not safe to clean up the file as this output committer is by design not retryable. Currently, the job fails with a confusing file exists error. This patch is a stopgap to tell the user to look at the top of the error log for the proper message. This is difficult to test locally as Spark is hardcoded not to retry. Manually verified by upping the retry attempts. Author: Nong Li <[email protected]> Author: Nong Li <[email protected]> Closes apache#10080 from nongli/spark-11328.

…data source When querying a Timestamp or Date column like the following `val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" < end)`, the generated SQL query is "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0". It should have quotes around the Timestamp/Date value, such as "TIMESTAMP_COLUMN >= '2015-01-01 00:00:00.0'". Author: Huaxin Gao <[email protected]> Closes apache#9872 from huaxingao/spark-11788.
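The quoting rule from the commit above can be illustrated with a short sketch. It is only an approximation of the idea, not the JDBC data source's filter compiler: `compile_value` and `compile_filter` are hypothetical helpers, and Python's timestamp formatting differs slightly from the `2015-01-01 00:00:00.0` form shown in the commit message.

```python
import datetime


def compile_value(value):
    """Render a Python value as a SQL literal (hypothetical helper)."""
    if isinstance(value, (datetime.datetime, datetime.date)):
        # Timestamps and dates must be quoted, otherwise the database
        # sees bare tokens such as 2015-01-01 00:00:00.
        return f"'{value}'"
    if isinstance(value, str):
        # Escape embedded single quotes by doubling them.
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def compile_filter(column: str, op: str, value) -> str:
    return f"{column} {op} {compile_value(value)}"


clause = compile_filter("TIMESTAMP_COLUMN", ">=", datetime.datetime(2015, 1, 1))
```

Outside of generated pushdown filters, bound parameters are usually a safer choice than string literals.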

https://issues.apache.org/jira/browse/SPARK-11352 Author: Yin Huai <[email protected]> Closes apache#10072 from yhuai/SPARK-11352.

…ild of the current TreeNode, we should only return the simpleString. In TreeNode's argString, if a TreeNode is not a child of the current TreeNode, we will only return the simpleString. I tested the [following case provided by Cristian](https://issues.apache.org/jira/browse/SPARK-11596?focusedCommentId=15019241&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15019241).

```
val c = (1 to 20).foldLeft[Option[DataFrame]] (None) { (curr, idx) =>
  println(s"PROCESSING >>>>>>>>>>> $idx")
  val df = sqlContext.sparkContext.parallelize((0 to 10).zipWithIndex).toDF("A", "B")
  val union = curr.map(_.unionAll(df)).getOrElse(df)
  union.cache()
  Some(union)
}

c.get.explain(true)
```

Without the change, `c.get.explain(true)` took 100s. With the change, `c.get.explain(true)` took 26ms. https://issues.apache.org/jira/browse/SPARK-11596 Author: Yin Huai <[email protected]> Closes apache#10079 from yhuai/SPARK-11596.

Garbage collection triggers cleanups. If the driver JVM is huge and there is little memory pressure, we may never clean up shuffle files on executors. This is a problem for long-running applications (e.g. streaming). Author: Andrew Or <[email protected]> Closes apache#10070 from andrewor14/periodic-gc.

The existing `spark.memory.fraction` (default 0.75) gives the system 25% of the space to work with. For small heaps, this is not enough: e.g. default 1GB leaves only 250MB system memory. This is especially a problem in local mode, where the driver and executor are crammed in the same JVM. Members of the community have reported driver OOM's in such cases. **New proposal.** We now reserve 300MB before taking the 75%. For 1GB JVMs, this leaves `(1024 - 300) * 0.75 = 543MB` for execution and storage. This is proposal (1) listed in the [JIRA](https://issues.apache.org/jira/browse/SPARK-12081). Author: Andrew Or <[email protected]> Closes apache#10081 from andrewor14/unified-memory-small-heaps.
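The arithmetic in the proposal above can be checked with a few lines. This is a sketch of the calculation only: the function names are hypothetical, and the constants are taken from the commit message (300MB reserved, fraction 0.75).

```python
RESERVED_MB = 300        # reserved before applying the fraction (from the commit message)
MEMORY_FRACTION = 0.75   # default spark.memory.fraction quoted above


def legacy_usable_memory_mb(heap_mb: float, fraction: float = MEMORY_FRACTION) -> float:
    """Old behaviour: take the fraction of the whole heap."""
    return heap_mb * fraction


def usable_memory_mb(heap_mb: float,
                     reserved_mb: float = RESERVED_MB,
                     fraction: float = MEMORY_FRACTION) -> float:
    """New proposal: reserve a fixed amount first, then take the fraction."""
    return (heap_mb - reserved_mb) * fraction


new_mb = usable_memory_mb(1024)         # (1024 - 300) * 0.75 = 543.0
old_mb = legacy_usable_memory_mb(1024)  # 1024 * 0.75 = 768.0
```

For a 1GB heap the memory left outside execution and storage grows from about 256MB to 481MB, which is the headroom the proposal is after.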

### What changes were proposed in this pull request?

`org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite` failed lately. A look at the logs shows only the following fact, without any details:

```
Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database
```

Since the issue is intermittent and cannot be reproduced, we should add more debug information and wait for a reproduction with the extended logs.

### Why are the changes needed?

The failing test doesn't give enough debug information.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

I've started the test manually and checked that such additional debug messages show up:

```
>>> KrbApReq: APOptions are 00000000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Looking for keys for: kafka/localhostEXAMPLE.COM
Added key: 17version: 0
Added key: 23version: 0
Added key: 16version: 0
Found unsupported keytype (3) for kafka/localhostEXAMPLE.COM
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Using builtin default etypes for permitted_enctypes
default etypes for permitted_enctypes: 17 16 23.
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
MemoryCache: add 1571936500/174770/16C565221B70AAB2BEFE31A83D13A2F4/client/localhostEXAMPLE.COM to client/localhostEXAMPLE.COM|kafka/localhostEXAMPLE.COM
MemoryCache: Existing AuthList:
#3: 1571936493/200803/8CD70D280B0862C5DA1FF901ECAD39FE/client/localhostEXAMPLE.COM
#2: 1571936499/985009/BAD33290D079DD4E3579A8686EC326B7/client/localhostEXAMPLE.COM
#1: 1571936499/995208/B76B9D78A9BE283AC78340157107FD40/client/localhostEXAMPLE.COM
```

Closes apache#26252 from gaborgsomogyi/SPARK-29580.

Authored-by: Gabor Somogyi <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

### What changes were proposed in this pull request?

This PR proposes to make `PythonFunction` hold `Seq[Byte]` instead of `Array[Byte]`, so that the cache manager can compare whether the byte arrays have the same values.

### Why are the changes needed?

Currently the cache manager doesn't use the cache for a `udf` if the `udf` is created again, even if the function is the same.

```py
>>> func = lambda x: x

>>> df = spark.range(1)
>>> df.select(udf(func)("id")).cache()
```

```py
>>> df.select(udf(func)("id")).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#14 AS <lambda>(id)#12]
+- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#14]
   +- *(1) Range (0, 1, step=1, splits=12)
```

This is because `PythonFunction` holds `Array[Byte]`, and the `equals` method of an array returns true only when both arrays are the same instance.

### Does this PR introduce _any_ user-facing change?

Yes, if the user reuses the Python function for the UDF, the cache manager will detect the same function and use the cache for it.

### How was this patch tested?

I added a test case and also tested manually.

```py
>>> df.select(udf(func)("id")).explain()
== Physical Plan ==
InMemoryTableScan [<lambda>(id)#12]
   +- InMemoryRelation [<lambda>(id)#12], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(2) Project [pythonUDF0#5 AS <lambda>(id)#3]
            +- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#5]
               +- *(1) Range (0, 1, step=1, splits=12)
```

Closes apache#28774 from ueshin/issues/SPARK-31945/udf_cache.

Authored-by: Takuya UESHIN <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

… without WindowExpression

### What changes were proposed in this pull request?

Add WindowFunction check at `CheckAnalysis`.

### Why are the changes needed?

Provide a friendly error message.

**BEFORE**
```scala
scala> sql("select rank() from values(1)").show
java.lang.UnsupportedOperationException: Cannot generate code for expression: rank()
```

**AFTER**
```scala
scala> sql("select rank() from values(1)").show
org.apache.spark.sql.AnalysisException: Window function rank() requires an OVER clause.;;
Project [rank() AS RANK()#3]
+- LocalRelation [col1#2]
```

### Does this PR introduce _any_ user-facing change?

Yes, users will be given a better error message.

### How was this patch tested?

Pass the newly added UT.

Closes apache#28808 from ulysses-you/SPARK-31975.

Authored-by: ulysses <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

…ly equivalent children in `RewriteDistinctAggregates`

### What changes were proposed in this pull request?

In `RewriteDistinctAggregates`, when grouping aggregate expressions by function children, treat children that are semantically equivalent as the same.

### Why are the changes needed?

This PR will reduce the number of projections in the Expand operator when there are multiple distinct aggregations with superficially different children. In some cases, it will eliminate the need for an Expand operator.

Example: In the following query, the Expand operator creates 3\*n rows (where n is the number of incoming rows) because it has a projection for each of function children `b + 1`, `1 + b` and `c`.

```
create or replace temp view v1 as
select * from values
(1, 2, 3.0),
(1, 3, 4.0),
(2, 4, 2.5),
(2, 3, 1.0)
v1(a, b, c);

select
  a,
  count(distinct b + 1),
  avg(distinct 1 + b) filter (where c > 0),
  sum(c)
from
  v1
group by a;
```

The Expand operator has three projections (each producing a row for each incoming row):

```
[a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for regular aggregation)
[a#87, (b#88 + 1), null, 1, null, null],          <== projection #2 (for distinct aggregation of b + 1)
[a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for distinct aggregation of 1 + b)
```

In reality, the Expand only needs one projection for `1 + b` and `b + 1`, because they are semantically equivalent.

With the proposed change, the Expand operator's projections look like this:

```
[a#67, null, 0, null, UnscaledValue(c#69)],  <== projection #1 (for regular aggregations)
[a#67, (b#68 + 1), 1, (c#69 > 0.0), null]],  <== projection #2 (for distinct aggregation on b + 1 and 1 + b)
```

With one less projection, Expand produces 2\*n rows instead of 3\*n rows, but still produces the correct result.

In the case where all distinct aggregates have semantically equivalent children, the Expand operator is not needed at all.

Benchmark code in the JIRA (SPARK-40382).

Before the PR:

```
distinct aggregates:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
all semantically equivalent                       14721          14859         195          5.7         175.5       1.0X
some semantically equivalent                      14569          14572           5          5.8         173.7       1.0X
none semantically equivalent                      14408          14488         113          5.8         171.8       1.0X
```

After the PR:

```
distinct aggregates:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
all semantically equivalent                        3658           3692          49         22.9          43.6       1.0X
some semantically equivalent                       9124           9214         127          9.2         108.8       0.4X
none semantically equivalent                      14601          14777         250          5.7         174.1       0.3X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

Closes apache#37825 from bersprockets/rewritedistinct_issue.

Authored-by: Bruce Robbins <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
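The grouping idea in the `RewriteDistinctAggregates` change above, treating `b + 1` and `1 + b` as the same child, can be sketched with a toy canonicalizer. This is a simplified illustration and not Catalyst's expression canonicalization: expressions are modeled as nested tuples, and `canonicalize`, `group_by_children`, and the aggregate labels are hypothetical.

```python
from collections import defaultdict

COMMUTATIVE = {"+", "*"}


def canonicalize(expr):
    """Return a canonical form in which commutative operands are ordered."""
    if not isinstance(expr, tuple):
        return expr  # leaf: a column name or a literal
    op, *args = expr
    args = [canonicalize(a) for a in args]
    if op in COMMUTATIVE:
        args.sort(key=repr)  # any stable total order works for this sketch
    return (op, *args)


def group_by_children(aggregates):
    """Group aggregate labels by the canonical form of their child expression."""
    groups = defaultdict(list)
    for label, child in aggregates:
        groups[canonicalize(child)].append(label)
    return dict(groups)


groups = group_by_children([
    ("count(distinct b + 1)", ("+", "b", 1)),
    ("avg(distinct 1 + b)", ("+", 1, "b")),
])
```

Here both distinct aggregates end up in a single group, which corresponds to needing one projection for them instead of two.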

…edExpression()

### What changes were proposed in this pull request?

In `EquivalentExpressions.addExpr()`, add a guard `supportedExpression()` to make it consistent with `addExprTree()` and `getExprState()`.

### Why are the changes needed?

This fixes a regression caused by apache#39010, which added the `supportedExpression()` guard to `addExprTree()` and `getExprState()` but not to `addExpr()`.

One example of a use case affected by the inconsistency is the `PhysicalAggregation` pattern in physical planning. There, it calls `addExpr()` to deduplicate the aggregate expressions, and then calls `getExprState()` to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions to go out of sync, eventually resulting in a query execution error (or whole-stage codegen error).

### Does this PR introduce _any_ user-facing change?

This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions.

Example running the SQL command:

```sql
select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2)
```

Example error message before the fix:

```
java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))#3]
```

After the fix this error is gone.

### How was this patch tested?

Added new test cases to `SubexpressionEliminationSuite` for the immediate issue, and to `DataFrameAggregateSuite` for an example of the user-visible symptom.

Closes apache#40473 from rednaxelafx/spark-42851.

Authored-by: Kris Mok <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

…edExpression()

### What changes were proposed in this pull request?

In `EquivalentExpressions.addExpr()`, add a guard `supportedExpression()` to make it consistent with `addExprTree()` and `getExprState()`.

### Why are the changes needed?

This fixes a regression caused by apache#39010, which added the `supportedExpression()` guard to `addExprTree()` and `getExprState()` but not to `addExpr()`.

One example of a use case affected by the inconsistency is the `PhysicalAggregation` pattern in physical planning. There, it calls `addExpr()` to deduplicate the aggregate expressions, and then calls `getExprState()` to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions to go out of sync, eventually resulting in a query execution error (or whole-stage codegen error).

### Does this PR introduce _any_ user-facing change?

This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions.

Example running the SQL command:

```sql
select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2)
```

Example error message before the fix:

```
java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))#3]
```

After the fix this error is gone.

### How was this patch tested?

Added new test cases to `SubexpressionEliminationSuite` for the immediate issue, and to `DataFrameAggregateSuite` for an example of the user-visible symptom.

Closes apache#40473 from rednaxelafx/spark-42851.

Authored-by: Kris Mok <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit ef0a76e)
Signed-off-by: Wenchen Fan <[email protected]>

Davies Liu and others added 30 commits November 17, 2015 12:50

[SPARK-11732] Removes some MiMa false positives

fa603e0

This adds an extra filter for private or protected classes. We only filter for package private right now. Author: Timothy Hunter <[email protected]> Closes apache#9697 from thunterdb/spark-11732.

[SPARK-11729] Replace example code in ml-linear-methods.md using incl…

328eb49

…ude_example JIRA link: https://issues.apache.org/jira/browse/SPARK-11729 Author: Xusen Yin <[email protected]> Closes apache#9713 from yinxusen/SPARK-11729.

[SPARK-11764][ML] make Param.jsonEncode/jsonDecode support Vector

3e9e638

This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley Author: Xiangrui Meng <[email protected]> Closes apache#9776 from mengxr/SPARK-11764.

[MINOR] Correct comments in JavaDirectKafkaWordCount

e29656f

Author: Rohan Bhanderi <[email protected]> Closes apache#9781 from RohanBhanderi/patch-3.

[SPARK-11726] Throw exception on timeout when waiting for REST server…

b362d50

… response Author: Jacek Lewandowski <[email protected]> Closes apache#9692 from jacek-lewandowski/SPARK-11726.

[SPARK-11793][SQL] Dataset should set the resolved encoders internall…

ed8d153

…y for maps. I also wrote a test case -- but unfortunately the test case is not working due to SPARK-11795. Author: Reynold Xin <[email protected]> Closes apache#9784 from rxin/SPARK-11503.

[SPARK-11737] [SQL] Fix serialization of UTF8String with Kyro

98be816

The default serialization of UTF8String with Kryo may not be correct (BYTE_ARRAY_OFFSET could be different across JVMs) Author: Davies Liu <[email protected]> Closes apache#9704 from davies/kyro_string.

[SPARK-11797][SQL] collect, first, and take should use encoders for s…

91f4b6f

…erialization They were previously using Spark's default serializer for serialization. Author: Reynold Xin <[email protected]> Closes apache#9787 from rxin/SPARK-11797.

[SPARK-7013][ML][TEST] Add unit test for spark.ml StandardScaler

67a5132

I have added a unit test for ML's StandardScaler by comparing with R's output. Please review it for me. Thx. Author: RoyGaoVLIS <[email protected]> Closes apache#6665 from RoyGao/7013.

[SPARK-11643] [SQL] parse year with leading zero

2f191c6

Support the years between 0 <= year < 1000 Author: Davies Liu <[email protected]> Closes apache#9701 from davies/leading_zero.

[SPARK-10186][SQL][FOLLOW-UP] simplify test

8019f66

Author: Wenchen Fan <[email protected]> Closes apache#9783 from cloud-fan/postgre.

[SPARK-11802][SQL] Kryo-based encoder for opaque types in Datasets

5e2b444

I also found a bug with self-joins returning incorrect results in the Dataset API. Two test cases attached and filed SPARK-11803. Author: Reynold Xin <[email protected]> Closes apache#9789 from rxin/SPARK-11802.

[SPARK-10946][SQL] JDBC - Use Statement.executeUpdate instead of Prep…

b8f4379

…aredStatement.executeUpdate for DDLs New changes with JDBCRDD Author: somideshmukh <[email protected]> Closes apache#9733 from somideshmukh/SomilBranch-1.1.

[SPARK-6541] Sort executors by ID (numeric)

e62820c

"Force" the executor ID sort with Int. Author: Jean-Baptiste Onofré <[email protected]> Closes apache#9165 from jbonofre/SPARK-6541.

[SPARK-11652][CORE] Remote code execution with InvokerTransformer

9631ca3

Update to Commons Collections 3.2.2 to avoid any potential remote code execution vulnerability Author: Sean Owen <[email protected]> Closes apache#9731 from srowen/SPARK-11652.

rmse was wrongly calculated

1429e0a

It was multiplying by U instead of dividing by U Author: Viveka Kulharia <[email protected]> Closes apache#9771 from vivkul/patch-1.

zsxwing and others added 21 commits December 1, 2015 09:45

[SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues

69dbe6b

This PR backports PR apache#10039 to master Author: Cheng Lian <[email protected]> Closes apache#10063 from liancheng/spark-12046.doc-fix.master.

[SPARK-11821] Propagate Kerberos keytab for all environments

6a8cf80

andrewor14 the same PR as in branch 1.5 harishreedharan Author: woj-i <[email protected]> Closes apache#9859 from woj-i/master.

[SPARK-12065] Upgrade Tachyon from 0.8.1 to 0.8.2

34e7093

This commit upgrades the Tachyon dependency from 0.8.1 to 0.8.2. Author: Josh Rosen <[email protected]> Closes apache#10054 from JoshRosen/upgrade-to-tachyon-0.8.2.

Revert "[SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstan…

328b757

…ce.serialize" This reverts commit 1401166.

[SPARK-11961][DOC] Add docs of ChiSqSelector

e76431f

https://issues.apache.org/jira/browse/SPARK-11961 Author: Xusen Yin <[email protected]> Closes apache#9965 from yinxusen/SPARK-11961.

[SPARK-12002][STREAMING][PYSPARK] Fix python direct stream checkpoint…

f292018

… recovery issue Fixed a minor race condition in apache#10017 Closes apache#10017 Author: jerryshao <[email protected]> Author: Shixiong Zhu <[email protected]> Closes apache#10074 from zsxwing/review-pr10017.

[SPARK-11352][SQL] Escape */ in the generated comments.

5872a9d

https://issues.apache.org/jira/browse/SPARK-11352 Author: Yin Huai <[email protected]> Closes apache#10072 from yhuai/SPARK-11352.

remove unnecessary evaluation from SortOrder

772dc7f

cloud-fan force-pushed the order-by branch from 57405b3 to 772dc7f Compare December 2, 2015 15:46

fix comment

a3e1313

cloud-fan closed this Dec 3, 2015
