[SPARK-6761][ML][SQL] Approximate quantiles by thunterdb · Pull Request #3 · viirya/spark-1

thunterdb · 2016-02-18T20:15:59Z

Hi @viirya, I took your original PR and made some changes to improve correctness and performance. I suggest we merge these changes into your public PR (apache#6042) and have someone from SQL or MLlib review them. What do you think?

Test coverage: should be close to 100% now (with 48 tests that cover most of the scenarios)

Performance: the cost of running the algorithm is now negligible in benchmarks; most of the time is spent deserializing RDDs. This was verified by running synthetic benchmarks with the sampler: the time spent in the quantile code itself is now < 5% of the total runtime as reported by VisualVM. However, the cost of using RDDs in DataFrame.describe is prohibitively high, so I suggest just leaving the quantiles in DataFrame.stats for now.

Note that this code only works by appending to buffers or by writing full buffers in an amortized manner, so it should be very easy to port to UDAFs later if someone wants to do it.
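To make the "append to buffers, flush full buffers in an amortized manner" idea concrete, here is a minimal Python sketch of a buffered quantile summary. It is not the algorithm from this PR: it is a generic deterministic compactor hierarchy, and the class name `CompactingQuantiles` and the parameter `k` are made up for illustration.

```python
class CompactingQuantiles:
    """Toy quantile summary; NOT the algorithm from this PR (illustrative only)."""

    def __init__(self, k=64):
        if k < 2 or k % 2:
            raise ValueError("k must be an even number >= 2")
        self.k = k            # buffer capacity per level (hypothetical parameter)
        self.levels = [[]]    # levels[h] holds items that each stand for 2**h inputs
        self.count = 0

    def insert(self, x):
        # Appending is O(1); the sort happens only when a buffer is full,
        # so its cost is amortized over k inserts.
        self.levels[0].append(x)
        self.count += 1
        if len(self.levels[0]) >= self.k:
            self._compact(0)

    def _compact(self, h):
        items = sorted(self.levels[h])
        # Keep an odd item at this level so the total weight is preserved.
        leftover = [items.pop()] if len(items) % 2 else []
        self.levels[h] = leftover
        if h + 1 == len(self.levels):
            self.levels.append([])
        # Every other item is promoted and now represents twice as many inputs.
        self.levels[h + 1].extend(items[1::2])
        if len(self.levels[h + 1]) >= self.k:
            self._compact(h + 1)

    def query(self, q):
        if self.count == 0:
            raise ValueError("empty summary")
        weighted = sorted(
            (v, 2 ** h) for h, level in enumerate(self.levels) for v in level
        )
        target = q * self.count
        seen = 0
        for value, weight in weighted:
            seen += weight
            if seen >= target:
                return value
        return weighted[-1][0]


summary = CompactingQuantiles(k=64)
for i in range(1000):
    summary.insert(i)
median_estimate = summary.query(0.5)  # approximate, not exact
```

Because the only mutations are appends and whole-buffer rewrites, a structure of this shape maps naturally onto an aggregation buffer, which is the property the comment above points at.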

… hive metadata format This PR adds a new table option (`skip_hive_metadata`) that'd allow the user to skip storing the table metadata in hive metadata format. While this could be useful in general, the specific use-case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024) which in turn prevents such tables from being queried in SparkSQL. Author: Sameer Agarwal <sameer@databricks.com> Closes apache#10826 from sameeragarwal/skip-hive-metadata.

…uctions for streaming-akka project Since `actorStream` is an external project, we should add the linking and deploying instructions for it. A follow up PR of apache#10744 Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#10856 from zsxwing/akka-link-instruction.

https://issues.apache.org/jira/browse/SPARK-11923 Author: Xusen Yin <yinxusen@gmail.com> Closes apache#10186 from yinxusen/SPARK-11923.

…r other than its parent class https://issues.apache.org/jira/browse/SPARK-12952 Author: Xusen Yin <yinxusen@gmail.com> Closes apache#10863 from yinxusen/SPARK-12952.

…r in dev/run-tests This patch improves our `dev/run-tests` script to test modules in a topologically-sorted order based on modules' dependencies. This will help to ensure that bugs in upstream projects are not misattributed to downstream projects because those projects' tests were the first ones to exhibit the failure. Topological sorting is also useful for shortening the feedback loop when testing pull requests: if I make a change in SQL then the SQL tests should run before MLlib, not after. In addition, this patch also updates our test module definitions to split `sql` into `catalyst`, `sql`, and `hive` in order to allow more tests to be skipped when changing only `hive/` files. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#10885 from JoshRosen/SPARK-8725.

Otherwise the `^` character is always marked as error in IntelliJ since it represents an unclosed superscript markup tag. Author: Cheng Lian <lian@databricks.com> Closes apache#10926 from liancheng/agg-doc-fix.

The environment variable ADD_FILES was created for adding Python files on the Spark context to be distributed to executors (SPARK-865); it is deprecated now. Users are encouraged to use --py-files for adding Python files. Author: Jeff Zhang <zjffdu@apache.org> Closes apache#10913 from zjffdu/SPARK-12993.

The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh). Author: Holden Karau <holden@us.ibm.com> Closes apache#10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.

Right now RpcEndpointRef.ask may throw exception in some corner cases, such as calling ask after stopping RpcEnv. It's better to avoid throwing exception from RpcEndpointRef.ask. We can send the exception to the future for `ask`. Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#10568 from zsxwing/send-ask-fail.

… Add LibSVMOutputWriter The behavior of LibSVMRelation is not changed except adding LibSVMOutputWriter

* Partition is still not supported
* Multiple input paths is not supported

Author: Jeff Zhang <zjffdu@apache.org> Closes apache#9595 from zjffdu/SPARK-11622.

This patch adds support for complex types for ColumnarBatch. ColumnarBatch supports structs and arrays. There is a simple mapping between the richer catalyst types to these two. Strings are treated as an array of bytes. ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist of just leaf nodes. Structs represent an internal node with one child for each field. Arrays are internal nodes with one child. Structs just contain nullability. Arrays contain offsets and lengths into the child array. This structure is able to handle arbitrary nesting. It has the key property that we maintain columnar throughout and that primitive types are only stored in the leaf nodes and contiguous across rows. For example, if the schema is ``` array<array<int>> ``` There are three columns in the schema. The internal nodes each have one child. The leaf node contains all the int data stored consecutively. As part of this, this patch adds append APIs in addition to the Put APIs (e.g. putLong(rowid, v) vs appendLong(v)). These APIs are necessary when the batch contains variable length elements. The vectors are not fixed length and will grow as necessary. This should make the usage a lot simpler for the writer. Author: Nong Li <nong@databricks.com> Closes apache#10820 from nongli/spark-12854.
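The `array<array<int>>` layout described above (two internal array columns holding offsets and lengths, one leaf column holding all ints consecutively) can be sketched in plain Python. The function and key names are invented for this illustration and do not correspond to the ColumnarBatch API.

```python
def encode(rows):
    """Flatten rows of type array<array<int>> into three append-only columns."""
    outer_offsets, outer_lengths = [], []   # one entry per row
    inner_offsets, inner_lengths = [], []   # one entry per inner array
    leaf = []                               # all ints, stored consecutively
    for row in rows:
        outer_offsets.append(len(inner_offsets))
        outer_lengths.append(len(row))
        for inner in row:
            inner_offsets.append(len(leaf))
            inner_lengths.append(len(inner))
            leaf.extend(inner)              # append-style write; the column grows as needed
    return {
        "outer_offsets": outer_offsets,
        "outer_lengths": outer_lengths,
        "inner_offsets": inner_offsets,
        "inner_lengths": inner_lengths,
        "leaf": leaf,
    }


def decode_row(cols, i):
    """Rebuild row i by following offsets and lengths down to the leaf column."""
    start, n = cols["outer_offsets"][i], cols["outer_lengths"][i]
    result = []
    for j in range(start, start + n):
        s, m = cols["inner_offsets"][j], cols["inner_lengths"][j]
        result.append(cols["leaf"][s:s + m])
    return result


rows = [[[1, 2], [3]], [[4, 5, 6]]]
cols = encode(rows)
```

The writer never needs to know a row id in advance, which is why an append-style API is convenient once elements have variable length.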

…not be regularized The intercept in Logistic Regression represents a prior on categories which should not be regularized. In MLlib, the regularization is handled through Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in ML framework handles this properly, and we should call the implementation in ML from MLlib since the majority of users are still using the MLlib API. Note that both of them are doing feature scalings to improve the convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution. Previously partially reviewed at apache#6386 (comment) re-opening for dbtsai to review. Author: Holden Karau <holden@us.ibm.com> Author: Holden Karau <holden@pigscanfly.ca> Closes apache#10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.

Add ```covar_samp``` and ```covar_pop``` for SparkR. Should we also provide a ```cov``` alias for ```covar_samp```? There is a ```cov``` implementation at stats.R which masks ```stats::cov``` already, but this may introduce a breaking API change. cc sun-rui felixcheung shivaram Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#10829 from yanboliang/spark-12903.

This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs. Author: Cheng Lian <lian@databricks.com> Closes apache#10911 from liancheng/cms-df-api.
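As a rough illustration of building a Count-Min Sketch through an aggregate-style fold (per-partition updates followed by a merge), here is a small self-contained Python sketch. It does not use spark-sketch; the class, the SHA-256 based hashing, and the `aggregate` helper are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Toy Count-Min Sketch; illustrative only, not the spark-sketch class."""

    def __init__(self, depth=4, width=64):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-looking hash per row, derived from SHA-256.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._index(item, r)] += count
        return self

    def merge(self, other):
        for r in range(self.depth):
            for c in range(self.width):
                self.table[r][c] += other.table[r][c]
        return self

    def estimate(self, item):
        # The minimum over rows never underestimates the true count.
        return min(self.table[r][self._index(item, r)] for r in range(self.depth))


def aggregate(partitions, zero, seq_op, comb_op):
    """Local stand-in for an RDD-style aggregate: fold each partition, then merge."""
    partials = []
    for part in partitions:
        acc = zero()
        for x in part:
            acc = seq_op(acc, x)
        partials.append(acc)
    result = zero()
    for p in partials:
        result = comb_op(result, p)
    return result


partitions = [["a", "b", "a"], ["a", "c"]]
sketch = aggregate(
    partitions,
    CountMinSketch,
    lambda s, x: s.add(x),
    lambda a, b: a.merge(b),
)
```

The sketch is mergeable because its tables add cell by cell, which is what makes a partition-wise build possible in the first place.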

This PR is a follow-up of PR apache#10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical. In this PR, a new SQL option `spark.sql.nativeView.canonical` is added. When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from view definition logical plans. If we failed to map the plan to SQL, we fallback to the original native view approach. One important issue this PR fixes is that, now we can use CTE when defining a view. Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`. However, HiveQL parser doesn't allow CTE appearing as a subquery. Namely, something like this is disallowed: ```sql SELECT n FROM ( WITH w AS (SELECT 1 AS n) SELECT * FROM w ) v ``` This PR fixes this issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during analysis phase, thus there won't be CTE expressions in the generated SQL query string). Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes apache#10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.

… shutdown If there's an RPC issue while sparkContext is alive but stopped (which would happen only when executing SparkContext.stop), log a warning instead. This is a common occurrence. vanzin Author: Nishkam Ravi <nishkamravi@gmail.com> Author: nishkamravi2 <nishkamravi@gmail.com> Closes apache#10881 from nishkamravi2/master_netty.

…erties https://issues.apache.org/jira/browse/SPARK-12780 Author: Xusen Yin <yinxusen@gmail.com> Closes apache#10724 from yinxusen/SPARK-12780.

There are some typos or plain unintelligible sentences in the metrics template. Author: BenFradet <benjamin.fradet@gmail.com> Closes apache#10902 from BenFradet/SPARK-12983.

…in cluster mode JIRA 1680 added a property called spark.yarn.appMasterEnv. This PR draws users' attention to this special case by adding an explanation in configuration.html#environment-variables Author: Andrew <weiner.andrew.j@gmail.com> Closes apache#10869 from weineran/branch-yarn-docs.

…o_test() There's a minor bug in how we handle the `root` module in the `modules_to_test()` function in `dev/run-tests.py`: since `root` now depends on `build` (since every test needs to run on any build test), we now need to check for the presence of root in `modules_to_test` instead of `changed_modules`. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#10933 from JoshRosen/build-module-fix.

…ith `None` triggers cryptic failure The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works. Author: Jason Lee <cjlee@us.ibm.com> Closes apache#8969 from jasoncl/SPARK-10847.

The high level idea is that instead of having the executors send both accumulator updates and TaskMetrics, we should have them send only accumulator updates. This eliminates the need to maintain both code paths since one can be implemented in terms of the other. This effort is split into two parts: **SPARK-12895: Implement TaskMetrics using accumulators.** TaskMetrics is basically just a bunch of accumulable fields. This patch makes TaskMetrics a syntactic wrapper around a collection of accumulators so we don't need to send TaskMetrics from the executors to the driver. **SPARK-12896: Send only accumulator updates to the driver.** Now that TaskMetrics are expressed in terms of accumulators, we can capture all TaskMetrics values if we just send accumulator updates from the executors to the driver. This completes the parent issue SPARK-10620. While an effort has been made to preserve as much of the public API as possible, there were a few known breaking DeveloperApi changes that would be very awkward to maintain. I will gather the full list shortly and post it here. Note: This was once part of apache#10717. This patch is split out into its own patch from there to make it easier for others to review. Other smaller pieces have already been merged into master. Author: Andrew Or <andrew@databricks.com> Closes apache#10835 from andrewor14/task-metrics-use-accums.

…s API contract Spark's `Partition` and `RDD.partitions` APIs have a contract which requires custom implementations of `RDD.partitions` to ensure that for all `x`, `rdd.partitions(x).index == x`; in other words, the `index` reported by a partition needs to match its position in the partitions array. If a custom RDD implementation violates this contract, then Spark has the potential to become stuck in an infinite recomputation loop when recomputing a subset of an RDD's partitions, since the tasks that are actually run will not correspond to the missing output partitions that triggered the recomputation. Here's a link to a notebook which demonstrates this problem: https://rawgit.com/JoshRosen/e520fb9a64c1c97ec985/raw/5e8a5aa8d2a18910a1607f0aa4190104adda3424/Violating%2520RDD.partitions%2520contract.html In order to guard against this infinite loop behavior, this patch modifies Spark so that it fails fast and refuses to compute RDDs whose `partitions` violate the API contract. Author: Josh Rosen <joshrosen@databricks.com> Closes apache#10932 from JoshRosen/SPARK-13021.
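The contract `rdd.partitions(x).index == x` and the fail-fast behavior can be sketched as a simple validation step. The `Partition` dataclass and `check_partitions` function below are hypothetical Python stand-ins, not Spark's classes or the check this patch adds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    index: int  # hypothetical stand-in for a partition's reported index


def check_partitions(partitions):
    """Fail fast if any partition's index differs from its array position."""
    for position, partition in enumerate(partitions):
        if partition.index != position:
            raise ValueError(
                f"partitions({position}).index == {partition.index}, "
                f"but it should equal {position}"
            )
    return partitions


valid = check_partitions([Partition(0), Partition(1), Partition(2)])
```

Raising at validation time turns a potential infinite recomputation loop into an immediate, attributable error.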

This PR integrates Bloom filter from spark-sketch into DataFrame. This version resorts to RDD.aggregate for building the filter. A more performant UDAF version can be built in future follow-up PRs. This PR also adds two specific `put` versions (`putBinary` and `putLong`) to `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`. Author: Wenchen Fan <wenchen@databricks.com> Closes apache#10937 from cloud-fan/bloom-filter.
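For readers unfamiliar with the structure, here is a minimal Python sketch of a Bloom filter with type-specific put methods. The names `put_long` and `put_binary` only mirror the `putLong` and `putBinary` methods mentioned above; the hashing scheme and sizes are assumptions, and this is not the spark-sketch implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; illustrative only."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, data):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def put_binary(self, data):
        for p in self._positions(data):
            self.bits[p] = 1

    def put_long(self, value):
        # Encode the integer directly instead of going through a generic object path.
        self.put_binary(value.to_bytes(8, "big", signed=True))

    def might_contain_long(self, value):
        data = value.to_bytes(8, "big", signed=True)
        return all(self.bits[p] for p in self._positions(data))

    def merge(self, other):
        # Bitwise OR, so filters built per partition can be combined.
        for i in range(self.num_bits):
            self.bits[i] |= other.bits[i]
        return self


left, right = BloomFilter(), BloomFilter()
for v in range(5):
    left.put_long(v)
for v in range(5, 10):
    right.put_long(v)
combined = left.merge(right)
```

A lookup can return a false positive but never a false negative, so every inserted value is reported as possibly present after the merge.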

…Parser commands to new Parser This PR moves all the functionality provided by the SparkSQLParser/ExtendedHiveQlParser to the new Parser hierarchy (SparkQl/HiveQl). This also improves the current SET command parsing: the current implementation swallows ```set role ...``` and ```set autocommit ...``` commands, this PR respects these commands (and passes them on to Hive). This PR and apache#10723 end the use of Parser-Combinator parsers for SQL parsing. As a result we can also remove the ```AbstractSQLParser``` in Catalyst. The PR is marked WIP as long as it doesn't pass all tests. cc rxin viirya winningsix (this touches apache#10144) Author: Herman van Hovell <hvanhovell@questtec.nl> Closes apache#10905 from hvanhovell/SPARK-12866.

by explicitly marking annotated parameters as vals (SI-8813). Caused by apache#10835. Author: Andrew Or <andrew@databricks.com> Closes apache#10955 from andrewor14/fix-scala211.

…tch.Row These two classes became identical as the implementation progressed. Author: Nong Li <nong@databricks.com> Closes apache#10952 from nongli/spark-13045.

this is stated for --packages and --repositories. Without stating it for --jars, people expect a standard java classpath to work, with expansion and using a different delimiter than a comma. Currently this is only stated in the --help for spark-submit "Comma-separated list of local jars to include on the driver and executor classpaths." Author: James Lohse <jimlohse@users.noreply.github.com> Closes apache#10890 from jimlohse/patch-1.

…n Sketch This PR is a follow-up of apache#10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`. Author: Cheng Lian <lian@databricks.com> Closes apache#10968 from liancheng/cms-specialized.

… configs are being set Users unknowingly try to set core Spark configs in SQLContext but later realise that it didn't work. eg. sqlContext.sql("SET spark.shuffle.memoryFraction=0.4"). This PR adds a warning message when such operations are done. Author: Tejas Patil <tejasp@fb.com> Closes apache#10849 from tejasapatil/SPARK-12926.

…for 2.x Phase 1: update plugin versions, test dependencies, some example and third-party versions Author: Sean Owen <sowen@cloudera.com> Closes apache#11206 from srowen/SPARK-13324.

…pares Option and String directly.

## What changes were proposed in this pull request?

Fix some comparisons between unequal types that cause IJ warnings and in at least one case a likely bug (TaskSetManager)

## How was this patch tested?

Running Jenkins tests

Author: Sean Owen <sowen@cloudera.com> Closes apache#11253 from srowen/SPARK-13371.

thunterdb · 2016-02-18T20:17:58Z

Oops, I realized the PR is huge because I merged the latest spark master. If you do that with your PR, it will be much smaller.

Currently, the columns in projects of Expand that are not used by Aggregate are not pruned; this PR fixes that. Author: Davies Liu <davies@databricks.com> Closes apache#11225 from davies/fix_pruning_expand.

This PR supports codegen for broadcast outer join. In order to reduce duplicated code, this PR merges HashJoin and HashOuterJoin together (also BroadcastHashJoin and BroadcastHashOuterJoin). Author: Davies Liu <davies@databricks.com> Closes apache#11130 from davies/gen_out.

viirya · 2016-02-19T07:11:34Z

@thunterdb Thanks for submitting this PR. However, I noticed that your PR is against my master branch, while my original PR comes from the approximate_quantile branch. That is why your PR is so huge. Can you re-submit your PR against that branch?

Thanks.

Fix for incorrect memory in Spark UI as per SPARK-5768 Author: Joshi <rekhajoshm@gmail.com> Author: Rekha Joshi <rekhajoshm@gmail.com> Closes apache#6972 from rekhajoshm/SPARK-5768 and squashes the following commits: b678a91 [Joshi] Fix for incorrect memory in Spark UI 2fe53d9 [Joshi] Fix for incorrect memory in Spark UI eb823b8 [Joshi] SPARK-5768: Fix for incorrect memory in Spark UI 0be142d [Rekha Joshi] Merge pull request #3 from apache/master 106fd8e [Rekha Joshi] Merge pull request #2 from apache/master e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master (cherry picked from commit 085a721) Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

This makes sure attempts are listed in the order they were executed, and that the app's state matches the state of the most current attempt. Author: Joshi <rekhajoshm@gmail.com> Author: Rekha Joshi <rekhajoshm@gmail.com> Closes apache#7253 from rekhajoshm/SPARK-8593 and squashes the following commits: 874dd80 [Joshi] History Server: updated order for multiple attempts(logcleaner) 716e0b1 [Joshi] History Server: updated order for multiple attempts(descending start time works everytime) 548c753 [Joshi] History Server: updated order for multiple attempts(descending start time works everytime) 83306a8 [Joshi] History Server: updated order for multiple attempts(descending start time) b0fc922 [Joshi] History Server: updated order for multiple attempts(updated comment) cc0fda7 [Joshi] History Server: updated order for multiple attempts(updated test) 304cb0b [Joshi] History Server: updated order for multiple attempts(reverted HistoryPage) 85024e8 [Joshi] History Server: updated order for multiple attempts a41ac4b [Joshi] History Server: updated order for multiple attempts ab65fa1 [Joshi] History Server: some attempt completed to work with showIncomplete 0be142d [Rekha Joshi] Merge pull request #3 from apache/master 106fd8e [Rekha Joshi] Merge pull request #2 from apache/master e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master (cherry picked from commit 42d8a01) Signed-off-by: Sean Owen <sowen@cloudera.com>

…ve path. When I run a command such as sc.addFile("../test.txt"), it does not work and throws an exception: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:../test.txt at org.apache.hadoop.fs.Path.initialize(Path.java:206) at org.apache.hadoop.fs.Path.<init>(Path.java:172) ........ ....... Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:../test.txt at java.net.URI.checkPath(URI.java:1804) at java.net.URI.<init>(URI.java:752) at org.apache.hadoop.fs.Path.initialize(Path.java:203) Author: DoingDone9 <799203320@qq.com> Closes apache#4993 from DoingDone9/relativePath and squashes the following commits: ee375cd [DoingDone9] Update SparkContextSuite.scala d594e16 [DoingDone9] Update SparkContext.scala 0ff3fa8 [DoingDone9] test for add file dced8eb [DoingDone9] Update SparkContext.scala e4a13fe [DoingDone9] getCanonicalPath 161cae3 [DoingDone9] Merge pull request #4 from apache/master c87e8b6 [DoingDone9] Merge pull request #3 from apache/master cb1852d [DoingDone9] Merge pull request #2 from apache/master c3f046f [DoingDone9] Merge pull request #1 from apache/master (cherry picked from commit 00e730b) Signed-off-by: Sean Owen <sowen@cloudera.com>

…, because this will make some UDAFs fail to work. Spark avoids the old interface of Hive, so some UDAFs such as "org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage" cannot work. Author: DoingDone9 <799203320@qq.com> Closes apache#5131 from DoingDone9/udaf and squashes the following commits: 9de08d0 [DoingDone9] Update HiveUdfSuite.scala 49c62dc [DoingDone9] Update hiveUdfs.scala 98b134f [DoingDone9] Merge pull request #5 from apache/master 161cae3 [DoingDone9] Merge pull request #4 from apache/master c87e8b6 [DoingDone9] Merge pull request #3 from apache/master cb1852d [DoingDone9] Merge pull request #2 from apache/master c3f046f [DoingDone9] Merge pull request #1 from apache/master (cherry picked from commit 968408b) Signed-off-by: Michael Armbrust <michael@databricks.com>

thunterdb · 2016-02-19T18:00:10Z

I am closing this PR, see #4 instead. Sorry for the confusion.

Empty commit to test 'Co-authored-by' and 'Signed-off-by'

### What changes were proposed in this pull request?

`org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite` has failed lately. The logs show only the following, without any details:

```
Caused by: sbt.ForkMain$ForkError: sun.security.krb5.KrbException: Server not found in Kerberos database (7) - Server not found in Kerberos database
```

Since the issue is intermittent and cannot be reproduced, we should add more debug information and wait for it to reproduce with the extended logs.

### Why are the changes needed?

The failing test doesn't give enough debug information.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

I've started the test manually and checked that such additional debug messages show up:

```
>>> KrbApReq: APOptions are 00000000 00000000 00000000 00000000
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Looking for keys for: kafka/localhostEXAMPLE.COM
Added key: 17version: 0
Added key: 23version: 0
Added key: 16version: 0
Found unsupported keytype (3) for kafka/localhostEXAMPLE.COM
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
Using builtin default etypes for permitted_enctypes
default etypes for permitted_enctypes: 17 16 23.
>>> EType: sun.security.krb5.internal.crypto.Aes128CtsHmacSha1EType
MemoryCache: add 1571936500/174770/16C565221B70AAB2BEFE31A83D13A2F4/client/localhostEXAMPLE.COM to client/localhostEXAMPLE.COM|kafka/localhostEXAMPLE.COM
MemoryCache: Existing AuthList:
#3: 1571936493/200803/8CD70D280B0862C5DA1FF901ECAD39FE/client/localhostEXAMPLE.COM
#2: 1571936499/985009/BAD33290D079DD4E3579A8686EC326B7/client/localhostEXAMPLE.COM
#1: 1571936499/995208/B76B9D78A9BE283AC78340157107FD40/client/localhostEXAMPLE.COM
```

Closes apache#26252 from gaborgsomogyi/SPARK-29580. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

### What changes were proposed in this pull request?

This PR proposes to make `PythonFunction` hold `Seq[Byte]` instead of `Array[Byte]` so that the cache manager can compare whether the byte arrays have the same values.

### Why are the changes needed?

Currently the cache manager doesn't use the cache for `udf` if the `udf` is created again, even if the function is the same.

```py
>>> func = lambda x: x

>>> df = spark.range(1)
>>> df.select(udf(func)("id")).cache()
```

```py
>>> df.select(udf(func)("id")).explain()
== Physical Plan ==
*(2) Project [pythonUDF0#14 AS <lambda>(id)#12]
+- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#14]
   +- *(1) Range (0, 1, step=1, splits=12)
```

This is because `PythonFunction` holds `Array[Byte]`, and the `equals` method of an array returns true only when both arrays are the same instance.

### Does this PR introduce _any_ user-facing change?

Yes, if the user reuses the Python function for the UDF, the cache manager will detect the same function and use the cache for it.

### How was this patch tested?

I added a test case and tested manually.

```py
>>> df.select(udf(func)("id")).explain()
== Physical Plan ==
InMemoryTableScan [<lambda>(id)#12]
   +- InMemoryRelation [<lambda>(id)#12], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(2) Project [pythonUDF0#5 AS <lambda>(id)#3]
            +- BatchEvalPython [<lambda>(id#0L)], [pythonUDF0#5]
               +- *(1) Range (0, 1, step=1, splits=12)
```

Closes apache#28774 from ueshin/issues/SPARK-31945/udf_cache. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>

… without WindowExpression

### What changes were proposed in this pull request?

Add WindowFunction check at `CheckAnalysis`.

### Why are the changes needed?

Provide a friendly error message.

**BEFORE**

```scala
scala> sql("select rank() from values(1)").show
java.lang.UnsupportedOperationException: Cannot generate code for expression: rank()
```

**AFTER**

```scala
scala> sql("select rank() from values(1)").show
org.apache.spark.sql.AnalysisException: Window function rank() requires an OVER clause.;;
Project [rank() AS RANK()#3]
+- LocalRelation [col1#2]
```

### Does this PR introduce _any_ user-facing change?

Yes, the user will be given a better error message.

### How was this patch tested?

Pass the newly added UT.

Closes apache#28808 from ulysses-you/SPARK-31975. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>

acu-bsc-75, fix bug in SessionWindowStateStoreSaveExec, when update state in append mode

…ly equivalent children in `RewriteDistinctAggregates`

### What changes were proposed in this pull request?

In `RewriteDistinctAggregates`, when grouping aggregate expressions by function children, treat children that are semantically equivalent as the same.

### Why are the changes needed?

This PR will reduce the number of projections in the Expand operator when there are multiple distinct aggregations with superficially different children. In some cases, it will eliminate the need for an Expand operator.

Example: In the following query, the Expand operator creates 3\*n rows (where n is the number of incoming rows) because it has a projection for each of function children `b + 1`, `1 + b` and `c`.

```
create or replace temp view v1 as
select * from values
(1, 2, 3.0),
(1, 3, 4.0),
(2, 4, 2.5),
(2, 3, 1.0)
v1(a, b, c);

select
  a,
  count(distinct b + 1),
  avg(distinct 1 + b) filter (where c > 0),
  sum(c)
from
  v1
group by a;
```

The Expand operator has three projections (each producing a row for each incoming row):

```
[a#87, null, null, 0, null, UnscaledValue(c#89)], <== projection #1 (for regular aggregation)
[a#87, (b#88 + 1), null, 1, null, null], <== projection #2 (for distinct aggregation of b + 1)
[a#87, null, (1 + b#88), 2, (c#89 > 0.0), null]], <== projection #3 (for distinct aggregation of 1 + b)
```

In reality, the Expand only needs one projection for `1 + b` and `b + 1`, because they are semantically equivalent.

With the proposed change, the Expand operator's projections look like this:

```
[a#67, null, 0, null, UnscaledValue(c#69)], <== projection #1 (for regular aggregations)
[a#67, (b#68 + 1), 1, (c#69 > 0.0), null]], <== projection #2 (for distinct aggregation on b + 1 and 1 + b)
```

With one less projection, Expand produces 2\*n rows instead of 3\*n rows, but still produces the correct result.

In the case where all distinct aggregates have semantically equivalent children, the Expand operator is not needed at all.

Benchmark code in the JIRA (SPARK-40382).

Before the PR:

```
distinct aggregates:            Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------
all semantically equivalent             14721          14859         195         5.7         175.5       1.0X
some semantically equivalent            14569          14572           5         5.8         173.7       1.0X
none semantically equivalent            14408          14488         113         5.8         171.8       1.0X
```

After the PR:

```
distinct aggregates:            Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
--------------------------------------------------------------------------------------------------------------
all semantically equivalent              3658           3692          49        22.9          43.6       1.0X
some semantically equivalent             9124           9214         127         9.2         108.8       0.4X
none semantically equivalent            14601          14777         250         5.7         174.1       0.3X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests.

Closes apache#37825 from bersprockets/rewritedistinct_issue. Authored-by: Bruce Robbins <bersprockets@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
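The idea of grouping function children by semantic equivalence, so that `b + 1` and `1 + b` share one group, can be sketched with a small canonicalization step. The tuple-based expression encoding and both function names are assumptions for this example; Spark's own canonicalization is considerably more involved.

```python
COMMUTATIVE = {"+", "*"}  # operators whose operand order does not matter


def canonicalize(expr):
    """Return a canonical form; expressions are nested tuples like ("+", "b", 1)."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [canonicalize(a) for a in args]
    if op in COMMUTATIVE:
        # Order operands deterministically so b + 1 and 1 + b coincide.
        args.sort(key=repr)
    return (op, *args)


def group_distinct_children(children):
    """Group aggregate-function children by their canonical form."""
    groups = {}
    for child in children:
        groups.setdefault(canonicalize(child), []).append(child)
    return groups


# Children of the distinct aggregates in a query like the one above,
# plus a third, unrelated child "c" added for contrast.
children = [("+", "b", 1), ("+", 1, "b"), "c"]
groups = group_distinct_children(children)
```

Each group corresponds to one projection, so fewer groups means fewer rows produced per incoming row.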

…edExpression()

### What changes were proposed in this pull request?

In `EquivalentExpressions.addExpr()`, add a guard `supportedExpression()` to make it consistent with `addExprTree()` and `getExprState()`.

### Why are the changes needed?

This fixes a regression caused by apache#39010 which added the `supportedExpression()` to `addExprTree()` and `getExprState()` but not `addExpr()`.

One example of a use case affected by the inconsistency is the `PhysicalAggregation` pattern in physical planning. There, it calls `addExpr()` to deduplicate the aggregate expressions, and then calls `getExprState()` to deduplicate the result expressions. Guarding inconsistently will cause the aggregate and result expressions to go out of sync, eventually resulting in a query execution error (or whole-stage codegen error).

### Does this PR introduce _any_ user-facing change?

This fixes a regression affecting Spark 3.3.2+, where it may manifest as an error running aggregate operators with higher-order functions.

Example running the SQL command:

```sql
select max(transform(array(id), x -> x)), max(transform(array(id), x -> x)) from range(2)
```

Example error message before the fix:

```
java.lang.IllegalStateException: Couldn't find max(transform(array(id#0L), lambdafunction(lambda x#2L, lambda x#2L, false)))#4 in [max(transform(array(id#0L), lambdafunction(lambda x#1L, lambda x#1L, false)))#3]
```

After the fix this error is gone.

### How was this patch tested?

Added new test cases to `SubexpressionEliminationSuite` for the immediate issue, and to `DataFrameAggregateSuite` for an example of user-visible symptom.

Closes apache#40473 from rednaxelafx/spark-42851. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit ef0a76e) Signed-off-by: Wenchen Fan <wenchen@databricks.com>

sameeragarwal and others added 30 commits January 26, 2016 07:50

[SPARK-11923][ML] Python API for ml.feature.ChiSqSelector

8beab68

https://issues.apache.org/jira/browse/SPARK-11923 Author: Xusen Yin <yinxusen@gmail.com> Closes apache#10186 from yinxusen/SPARK-11923.

[SPARK-12952] EMLDAOptimizer initialize() should return EMLDAOptimize…

fbf7623

…r other than its parent class https://issues.apache.org/jira/browse/SPARK-12952 Author: Xusen Yin <yinxusen@gmail.com> Closes apache#10863 from yinxusen/SPARK-12952.

[SQL] Minor Scaladoc format fix

83507fe

Otherwise the `^` character is always marked as error in IntelliJ since it represents an unclosed superscript markup tag. Author: Cheng Lian <lian@databricks.com> Closes apache#10926 from liancheng/agg-doc-fix.

[SPARK-12780] Inconsistency returning value of ML python models' prop…

4db255c

…erties https://issues.apache.org/jira/browse/SPARK-12780 Author: Xusen Yin <yinxusen@gmail.com> Closes apache#10724 from yinxusen/SPARK-12780.

[SPARK-12983][CORE][DOC] Correct metrics.properties.template

90b0e56

There are some typos or plain unintelligible sentences in the metrics template. Author: BenFradet <benjamin.fradet@gmail.com> Closes apache#10902 from BenFradet/SPARK-12983.

[HOTFIX] Fix Scala 2.11 compilation

d702f0c

by explicitly marking annotated parameters as vals (SI-8813). Caused by apache#10835. Author: Andrew Or <andrew@databricks.com> Closes apache#10955 from andrewor14/fix-scala211.

[SPARK-13045] [SQL] Remove ColumnVector.Struct in favor of ColumnarBa…

4a09123

…tch.Row These two classes became identical as the implementation progressed. Author: Nong Li <nong@databricks.com> Closes apache#10952 from nongli/spark-13045.

srowen and others added 9 commits February 17, 2016 19:03

[SPARK-13324][CORE][BUILD] Update plugin, test, example dependencies …

b844048

…for 2.x Phase 1: update plugin versions, test dependencies, some example and third-party versions Author: Sean Owen <sowen@cloudera.com> Closes apache#11206 from srowen/SPARK-13324.

Add github pull request template

892b2dd

insert tests

e48badd

Merge remote-tracking branch 'upstream/master' into spark-6761b

699808a

tentative batch algorithm

2cba6c1

finally batch sampling is working

cbb1bb5

cleanups

773b20f

finished cleanups

5167bad

Davies Liu and others added 3 commits February 18, 2016 13:07

[SPARK-13351][SQL] fix column pruning on Expand

26f38bb

Currently, the columns in projects of Expand that are not used by Aggregate are not pruned; this PR fixes that. Author: Davies Liu <davies@databricks.com> Closes apache#11225 from davies/fix_pruning_expand.

Merge remote-tracking branch 'upstream/master' into approximate_quantile

daaa196

thunterdb added 2 commits February 19, 2016 09:55

merge with branch

5176054

fix import order

d607fda

thunterdb closed this Feb 19, 2016

viirya pushed a commit that referenced this pull request Aug 6, 2018

Merge pull request #3 from HyukjinKwon/test-script

7dd0173

Empty commit to test 'Co-authored-by' and 'Signed-off-by'

viirya pushed a commit that referenced this pull request Feb 16, 2021

Merge pull request #3 from LiangchangZ/SPARK-10816

e0f3160

acu-bsc-75, fix bug in SessionWindowStateStoreSaveExec, when update state in append mode

thunterdb wants to merge 252 commits into viirya:master from thunterdb:spark-6761b

20 participants
