
Conversation


@Kevy123 Kevy123 commented Dec 27, 2016

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

adrian-wang and others added 30 commits March 23, 2015 11:46
This PR might have some issues with #3732,
and this would have merge conflicts with #3820, so the review can be delayed till those two are merged.

Author: Daoyuan Wang <[email protected]>

Closes #3822 from adrian-wang/parquetdate and squashes the following commits:

2c5d54d [Daoyuan Wang] add a test case
faef887 [Daoyuan Wang] parquet support for primitive date
97e9080 [Daoyuan Wang] parquet support for date type

(cherry picked from commit 4659468)
Signed-off-by: Cheng Lian <[email protected]>
#5082

/cc liancheng

Author: Yadong Qi <[email protected]>

Closes #5132 from watermen/sql-missingInput-new and squashes the following commits:

1e5bdc5 [Yadong Qi] Check the missingInput simply

(cherry picked from commit 9f3273b)
Signed-off-by: Cheng Lian <[email protected]>
…e query

One more thing, if this PR is considered OK: it might make sense to add extra `.jdbc()` APIs to SQLContext that take Properties.

Author: Volodymyr Lyubinets <[email protected]>

Closes #4859 from vlyubin/jdbcProperties and squashes the following commits:

7a8cfda [Volodymyr Lyubinets] Support jdbc connection properties in OPTIONS part of the query

(cherry picked from commit bfd3ee9)
Signed-off-by: Michael Armbrust <[email protected]>
…tor for all types of operator

In `CheckAnalysis`, `Filter` and `Aggregate` are checked in separate case clauses, so unresolved operators and missing input attributes are never caught for other types of operators.

This PR also removes the `prettyString` call when generating the error message for missing input attributes, because the result of `prettyString` doesn't contain expression IDs and may give confusing messages like

> resolved attributes a missing from a
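
A simplified sketch of the catch-all check, assuming Catalyst's existing `failAnalysis` helper and `LogicalPlan` members (`missingInput`, `inputSet`, `resolved`, `simpleString`):

```scala
// Apply the missing-attribute and unresolved-operator checks to every operator,
// not only to Filter and Aggregate, keeping expression IDs in the message.
plan.foreachUp {
  case o if o.missingInput.nonEmpty =>
    failAnalysis(
      s"resolved attribute(s) ${o.missingInput.mkString(",")} missing from " +
        s"${o.inputSet.mkString(",")} in operator ${o.simpleString}")
  case o if !o.resolved =>
    failAnalysis(s"unresolved operator ${o.simpleString}")
  case _ => // other operators are fine
}
```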

cc rxin


Author: Cheng Lian <[email protected]>

Closes #5129 from liancheng/spark-6452 and squashes the following commits:

52cdc69 [Cheng Lian] Addresses comments
029f9bd [Cheng Lian] Checks for missing attributes and unresolved operator for all types of operator

(cherry picked from commit 1afcf77)
Signed-off-by: Michael Armbrust <[email protected]>
As for "notebook --pylab inline" is not supported any more, update the related documentation for this.

Author: Cong Yue <[email protected]>

Closes #5111 from yuecong/patch-1 and squashes the following commits:

872df76 [Cong Yue] Update the command to use IPython notebook

(cherry picked from commit c12312f)
Signed-off-by: Sean Owen <[email protected]>
…en running FlumeStreamSuite

When we run FlumeStreamSuite on Jenkins, sometimes we get error like as follows.

    sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 52 times over 10.094849836 seconds. Last failure message: Error connecting to localhost/127.0.0.1:23456.
        at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
        at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
        at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
        at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
        at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
        at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:116)
        at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74)
        at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply$mcV$sp(FlumeStreamSuite.scala:66)
        at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply(FlumeStreamSuite.scala:66)
        at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply(FlumeStreamSuite.scala:66)
        at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
        at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
        at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
        at org.scalatest.Transformer.apply(Transformer.scala:22)
        at org.scalatest.Transformer.apply(Transformer.scala:20)
        at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
        at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
        at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
        at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
        at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
        at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
        at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
        at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)

This error is caused by the check-then-act logic used to find a free port:

      /** Find a free port */
      private def findFreePort(): Int = {
        Utils.startServiceOnPort(23456, (trialPort: Int) => {
          val socket = new ServerSocket(trialPort)
          socket.close()
          (null, trialPort)
        }, conf)._2
      }

Removing the check-then-act is not easy, but we can reduce the chance of hitting the error by choosing a random value for the initial port instead of 23456.
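
A minimal sketch of the mitigation, assuming the same `Utils.startServiceOnPort` helper and `conf` as above; the random range is illustrative:

```scala
import java.net.ServerSocket
import scala.util.Random

/** Find a free port, starting the search at a random candidate instead of 23456. */
private def findFreePort(): Int = {
  val candidatePort = Random.nextInt(50000) + 10000  // random start in [10000, 60000)
  Utils.startServiceOnPort(candidatePort, (trialPort: Int) => {
    val socket = new ServerSocket(trialPort)
    socket.close()
    (null, trialPort)
  }, conf)._2
}
```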

Author: Kousuke Saruta <[email protected]>

Closes #4337 from sarutak/SPARK-5559 and squashes the following commits:

16f109f [Kousuke Saruta] Added `require` to Utils#startServiceOnPort
c39d8b6 [Kousuke Saruta] Merge branch 'SPARK-5559' of github.com:sarutak/spark into SPARK-5559
1610ba2 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5559
33357e3 [Kousuke Saruta] Changed "findFreePort" method in MQTTStreamSuite and FlumeStreamSuite so that it can choose valid random port
a9029fe [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5559
9489ef9 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5559
8212e42 [Kousuke Saruta] Modified default port used in FlumeStreamSuite from 23456 to random value

(cherry picked from commit 85cf063)
Signed-off-by: Sean Owen <[email protected]>
To make the Cross-Validation example code snippet easier to copy/paste, LabeledDocument/Document need to be defined in it, since they are defined in a previous example.

Author: Peter Rudenko <[email protected]>

Closes #5135 from petro-rudenko/patch-3 and squashes the following commits:

5190c75 [Peter Rudenko] Fix primitive types for java examples.
1d35383 [Peter Rudenko] [SQL][docs][minor] Define LabeledDocument/Document classes in CV example

(cherry picked from commit 08d4528)
Signed-off-by: Sean Owen <[email protected]>
Add checkpointInterval to ALS (a minimal usage sketch follows the list) to prevent:

1. StackOverflow exceptions caused by long lineage,
2. large shuffle files generated during iterations,
3. slow recovery when some node fails.
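
A minimal usage sketch of the new parameter (the checkpoint directory and data names are illustrative, not part of this patch):

```scala
import org.apache.spark.ml.recommendation.ALS

// Checkpointing only happens when a checkpoint directory is set.
sc.setCheckpointDir("/tmp/spark-checkpoints")

val als = new ALS()
  .setMaxIter(30)
  .setCheckpointInterval(10)  // truncate the lineage every 10 iterations

val model = als.fit(training)  // `training` is an assumed DataFrame of ratings
```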

srowen coderxiang

Author: Xiangrui Meng <[email protected]>

Closes #5076 from mengxr/SPARK-5955 and squashes the following commits:

df56791 [Xiangrui Meng] update impl to reuse code
29affcb [Xiangrui Meng] do not materialize factors in implicit
20d3f7f [Xiangrui Meng] add checkpointInterval to ALS

(cherry picked from commit 6b36470)
Signed-off-by: Xiangrui Meng <[email protected]>

Conflicts:
	mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
For example, one might expect the following code to work, but it does not.  Now you will at least get a warning with a suggestion to use aliases.

```scala
val df = sqlContext.load(path, "parquet")
val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns"))
val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend"))
val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")
```
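
A sketch of the alias-based workaround such a warning points to, reusing the names from the snippet above:

```scala
// Alias each side so the join condition references two distinct attribute sets
// instead of comparing a column with itself (which is trivially true).
val t = txns.as("t")
val s = spend.as("s")
val rmJoin = t.join(s, $"t.cust_id" === $"s.cust_id", "inner")
```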

Author: Michael Armbrust <[email protected]>

Closes #5163 from marmbrus/selfJoinError and squashes the following commits:

16c1f0b [Michael Armbrust] fix visibility
1b57e8d [Michael Armbrust] Warn when constructing trivially true equals predicate

(cherry picked from commit 32efadd)
Signed-off-by: Michael Armbrust <[email protected]>
Otherwise we will leak files when spilling occurs.

Author: Michael Armbrust <[email protected]>

Closes #5161 from marmbrus/cleanupAfterSort and squashes the following commits:

cb13d3c [Michael Armbrust] hint to inferencer
cdebdf5 [Michael Armbrust] Use completion iterator to close external sorter

(cherry picked from commit 26c6ce3)
Signed-off-by: Michael Armbrust <[email protected]>
Due to a recent change that made `StructType` a `Seq` we started inadvertently turning `StructType`s into generic `Traversable` when attempting nested tree transformations.  In this PR we explicitly avoid descending into `DataType`s to avoid this bug.

Author: Michael Armbrust <[email protected]>

Closes #5157 from marmbrus/udfFix and squashes the following commits:

26f7087 [Michael Armbrust] Fix transformations of TreeNodes that hold StructTypes

(cherry picked from commit 3fa3d12)
Signed-off-by: Michael Armbrust <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #5155 from marmbrus/errorMessages and squashes the following commits:

b898188 [Michael Armbrust] Fix formatting of error messages.

(cherry picked from commit 046c1e2)
Signed-off-by: Michael Armbrust <[email protected]>
Previously it was okay to throw away subqueries after analysis, as we would never try to use that tree for resolution again.  However, with eager analysis in `DataFrame`s this can cause errors for queries such as:

```scala
val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("x.str").count()
```

As a result, in this PR we defer the elimination of subqueries until the optimization phase.

Author: Michael Armbrust <[email protected]>

Closes #5160 from marmbrus/subqueriesInDfs and squashes the following commits:

a9bb262 [Michael Armbrust] Update Optimizer.scala
27d25bf [Michael Armbrust] fix hive tests
9137e03 [Michael Armbrust] add type
81cd597 [Michael Armbrust] Avoid eliminating subqueries until optimization

(cherry picked from commit cbeaf9e)
Signed-off-by: Michael Armbrust <[email protected]>
Avoid unclear match errors and use `AnalysisException`.

Author: Michael Armbrust <[email protected]>

Closes #5158 from marmbrus/dataSourceError and squashes the following commits:

af9f82a [Michael Armbrust] Yins comment
90c6ba4 [Michael Armbrust] Better error messages for invalid data sources

(cherry picked from commit a8f51b8)
Signed-off-by: Michael Armbrust <[email protected]>
…g to load classes (master branch PR)

ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang.  See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction.

This patch fixes this issue by ensuring proper cleanup of these resources.  It also adds logging for unexpected error cases.

This PR is an extended version of #4935 and adds a regression test.
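
A minimal sketch of the cleanup pattern (the helper is hypothetical, not Spark's actual API):

```scala
import java.io.{ByteArrayOutputStream, InputStream}

/** Hypothetical helper: read all class bytes and always release the connection. */
def readClassBytes(in: InputStream): Array[Byte] = {
  val out = new ByteArrayOutputStream()
  try {
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      n = in.read(buf)
    }
    out.toByteArray
  } finally {
    in.close()  // leaking this stream is what exhausted the class server's thread pool
  }
}
```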

Author: Josh Rosen <[email protected]>

Closes #4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits:

e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream
961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail
7ee2261 [Josh Rosen] Add a failing regression test
e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader

(cherry picked from commit 7215aa7)
Signed-off-by: Andrew Or <[email protected]>
…lyst

I think after this PR, we can finally turn the rule on. There are still some smaller ones that need to be fixed, but those are easier.

Author: Reynold Xin <[email protected]>

Closes #5162 from rxin/catalyst-explicit-types and squashes the following commits:

e7eac03 [Reynold Xin] [SPARK-6428][SQL] Added explicit types for all public methods in catalyst.

(cherry picked from commit 7334801)
Signed-off-by: Reynold Xin <[email protected]>

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
It would be great to fix this for 1.3, since the fix is surgical and it helps understandability for users.

cc shivaram pwendell

Author: Kay Ousterhout <[email protected]>

Closes #4839 from kayousterhout/SPARK-6088 and squashes the following commits:

3ab012c [Kay Ousterhout] Update getting result time incrementally, correctly set GET_RESULT status
f346b49 [Kay Ousterhout] Typos
748ea6b [Kay Ousterhout] Fixed build failure
84d617c [Kay Ousterhout] [SPARK-6088] Correct how tasks that get remote results are shown in the UI.

(cherry picked from commit 6948ab6)
Signed-off-by: Andrew Or <[email protected]>
Opening shuffle files can take significant time when the disk is
contended, especially when using ext3. While writing data to
a file can avoid hitting disk (and instead hit the buffer
cache), opening a file always involves writing some metadata
about the file to disk, so the open time can be a very significant
portion of the shuffle write time. In one job I ran recently, the time to
write shuffle data to the file was only 4ms for each task, but
the time to open the file was about 100x as long (~400ms).

When we add metrics about spilled data (#2504), we should ensure
that the file open time is also included there.
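
A sketch of the measurement being added (the file path and metric variable are illustrative):

```scala
import java.io.{File, FileOutputStream}

var shuffleWriteTimeNs = 0L                       // illustrative metric accumulator
val shuffleFile = new File("/tmp/shuffle_0_0_0")  // illustrative shuffle output file

// Opening the file writes metadata to disk even when the data itself only hits
// the buffer cache, so the open time belongs in the shuffle write time too.
val openStartNs = System.nanoTime()
val out = new FileOutputStream(shuffleFile, true)
shuffleWriteTimeNs += System.nanoTime() - openStartNs
```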

Author: Kay Ousterhout <[email protected]>

Closes #4550 from kayousterhout/SPARK-3570 and squashes the following commits:

ea3a4ae [Kay Ousterhout] Added comment about excluded open time
fdc5185 [Kay Ousterhout] Improved comment
42b7e43 [Kay Ousterhout] Fixed parens for nanotime
2423555 [Kay Ousterhout] [SPARK-3570] Include time to open files in shuffle write time.

(cherry picked from commit d8ccf65)
Signed-off-by: Andrew Or <[email protected]>
Clarify the local directories usage in YARN

Author: Christophe Préaud <[email protected]>

Closes #5165 from preaudc/yarn-doc-local-dirs and squashes the following commits:

6912b90 [Christophe Préaud] Fix some formatting issues.
4fa8ec2 [Christophe Préaud] Merge remote-tracking branch 'upstream/master' into yarn-doc-local-dirs
eaaf519 [Christophe Préaud] Clarify the local directories usage in YARN
436fb7d [Christophe Préaud] Revert "Clarify the local directories usage in YARN"
876ae5e [Christophe Préaud] Clarify the local directories usage in YARN
608dbfa [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
a49a2ce [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
9ba89ca [Christophe Préaud] Ensure that files are fetched atomically
54419ae [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
c6a5590 [Christophe Préaud] Revert commit 8ea871f
7456a33 [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
8ea871f [Christophe Préaud] Ensure that files are fetched atomically
Needed to import the types specifically, not the more general pyspark.sql

Author: Bill Chambers <[email protected]>
Author: anabranch <[email protected]>

Closes #5179 from anabranch/master and squashes the following commits:

8fa67bf [anabranch] Corrected SqlContext Import
603b080 [Bill Chambers] [DOCUMENTATION]Fixed Missing Type Import in Documentation

(cherry picked from commit c5cc414)
Signed-off-by: Reynold Xin <[email protected]>
…ghts) should initialize numFeatures

In GeneralizedLinearAlgorithm, ```numFeatures``` defaults to -1, and we need to update it to the correct value when we call run() to train a model.
```LogisticRegressionWithLBFGS.run(input)``` works well, but when we call ```LogisticRegressionWithLBFGS.run(input, initialWeights)``` to train a multiclass classification model, it throws an exception because numFeatures is not updated.
In this PR, we just update numFeatures at the beginning of GeneralizedLinearAlgorithm.run(input, initialWeights) and add a test case.
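
A simplified sketch of the fix described above (the helper name is hypothetical; in the real class this happens at the top of `run(input, initialWeights)`):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// If numFeatures was never set (default -1), derive it from the first example
// before the caller-supplied initial weights are validated or used.
def resolveNumFeatures(input: RDD[LabeledPoint], current: Int): Int =
  if (current < 0) input.map(_.features.size).first() else current
```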

Author: Yanbo Liang <[email protected]>

Closes #5167 from yanboliang/spark-6496 and squashes the following commits:

8131c48 [Yanbo Liang] LogisticRegressionWithLBFGS.run(input, initialWeights) should initialize numFeatures

(cherry picked from commit 10c7860)
Signed-off-by: Sean Owen <[email protected]>
…n LDAModel.scala

Remove unicode characters from MLlib file.

Author: Michael Griffiths <[email protected]>
Author: Griffiths, Michael (NYC-RPM) <[email protected]>

Closes #4815 from msjgriffiths/SPARK-6063 and squashes the following commits:

bcd7de1 [Griffiths, Michael (NYC-RPM)] Change \u201D quote marks around 'theta' to standard single apostrophe (\x27)
38eb535 [Michael Griffiths] Merge pull request #2 from apache/master
b08e865 [Michael Griffiths] Merge pull request #1 from apache/master
…, because this makes some UDAFs unable to work.

Spark avoids the old interface of Hive, so some UDAFs such as "org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage" cannot work.

Author: DoingDone9 <[email protected]>

Closes #5131 from DoingDone9/udaf and squashes the following commits:

9de08d0 [DoingDone9] Update HiveUdfSuite.scala
49c62dc [DoingDone9] Update hiveUdfs.scala
98b134f [DoingDone9] Merge pull request #5 from apache/master
161cae3 [DoingDone9] Merge pull request #4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master

(cherry picked from commit 968408b)
Signed-off-by: Michael Armbrust <[email protected]>
The `ParquetConversions` analysis rule generates a hash map, which maps from the original `MetastoreRelation` instances to the newly created `ParquetRelation2` instances. However, `MetastoreRelation.equals` doesn't compare output attributes. Thus, if a single metastore Parquet table appears multiple times in a query, only a single entry ends up in the hash map, and the conversion is not correctly performed.

Proper fix for this issue should be overriding `equals` and `hashCode` for MetastoreRelation. Unfortunately, this breaks more tests than expected. It's possible that these tests are ill-formed from the very beginning. As 1.3.1 release is approaching, we'd like to make the change more surgical to avoid potential regressions. The proposed fix here is to make both the metastore relations and their output attributes as keys in the hash map used in ParquetConversions.


Author: Cheng Lian <[email protected]>

Closes #5183 from liancheng/spark-6450 and squashes the following commits:

3536780 [Cheng Lian] Fixes metastore Parquet table conversion

(cherry picked from commit 8c3b005)
Signed-off-by: Michael Armbrust <[email protected]>
Previously this could result in sets comparing as equal when in fact the right was a proper subset of the left.
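
A simplified sketch of the corrected comparison, shown over plain Scala sets rather than Catalyst's `AttributeSet`:

```scala
// Containment alone is not enough: a proper subset on the right would still
// pass, so the sizes must match as well.
def setsEqual[A](left: Set[A], right: Set[A]): Boolean =
  left.size == right.size && left.forall(right.contains)
```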

Based on #5133 by sisihj

Author: sisihj <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #5194 from marmbrus/pr/5133 and squashes the following commits:

5ed4615 [Michael Armbrust] fix imports
d4cbbc0 [Michael Armbrust] Add test cases
0a0834f [sisihj]  AttributeSet.equal should compare size

(cherry picked from commit 276ef1c)
Signed-off-by: Michael Armbrust <[email protected]>
```
>>> df[df.name.inSet("Bob", "Mike")].collect()
[Row(age=5, name=u'Bob')]
>>> df[df.age.inSet([1, 2, 3])].collect()
[Row(age=2, name=u'Alice')]
```

Author: Davies Liu <[email protected]>

Closes #5190 from davies/in and squashes the following commits:

6b73a47 [Davies Liu] Column.inSet() in Python

(cherry picked from commit f535802)
Signed-off-by: Reynold Xin <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #5191 from marmbrus/kryoRowsWithSchema and squashes the following commits:

bb83522 [Michael Armbrust] Fix serialization of GenericRowWithSchema using kryo
f914f16 [Michael Armbrust] Add no arg constructor to GenericRowWithSchema

(cherry picked from commit f88f51b)
Signed-off-by: Cheng Lian <[email protected]>
…t schema to support dropping of columns using replace columns

Currently in the parquet relation 2 implementation, an error is thrown if the merged schema is not exactly the same as the metastore schema.
But to support cases like deletion of a column using the replace columns command, we can relax the restriction so that the query still works even if the metastore schema is only a subset of the merged parquet schema.
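
A simplified sketch of the relaxed compatibility check, comparing column names only (the real check also reconciles types and nullability):

```scala
import org.apache.spark.sql.types.StructType

// Instead of requiring the merged Parquet schema to equal the metastore schema,
// only require every metastore column to be present in the merged schema.
def isCompatible(metastoreSchema: StructType, mergedSchema: StructType): Boolean = {
  val mergedNames = mergedSchema.fieldNames.map(_.toLowerCase).toSet
  metastoreSchema.fieldNames.map(_.toLowerCase).forall(mergedNames.contains)
}
```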

Author: Yash Datta <[email protected]>

Closes #5141 from saucam/replace_col and squashes the following commits:

e858d5b [Yash Datta] SPARK-6471: Fix test cases, add a new test case for metastore schema to be subset of parquet schema
5f2f467 [Yash Datta] SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

(cherry picked from commit 1c05027)
Signed-off-by: Cheng Lian <[email protected]>
When running "bin/computer-classpath.sh", the output will be:
:/spark/conf:/spark/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.5.0-cdh5.2.0.jar:/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/spark/lib_managed/jars/datanucleus-core-3.2.10.jar
Java will add the current working dir to the CLASSPATH, if the first ":" exists, which is not expected by spark users.
For example, if I call spark-shell in the folder /root. And there exists a "core-site.xml" under /root/. Spark will use this file as HADOOP CONF file, even if I have already set HADOOP_CONF_DIR=/etc/hadoop/conf.

Author: guliangliang <[email protected]>

Closes #5156 from marsishandsome/Spark6491 and squashes the following commits:

5ae214f [guliangliang] use appendToClasspath to change CLASSPATH
b21f3b2 [guliangliang] keep the classpath order
5d1f870 [guliangliang] [SPARK-6491] Spark will put the current working dir to the CLASSPATH
viadea and others added 23 commits August 10, 2015 17:18
…ry files.

Spark streaming deletes the temp file and backup files without checking if they exist or not

Author: Hao Zhu <[email protected]>

Closes #8082 from viadea/master and squashes the following commits:

242d05f [Hao Zhu] [SPARK-9801][Streaming]No need to check the existence of those files
fd143f2 [Hao Zhu] [SPARK-9801][Streaming]Check if backupFile exists before deleting backupFile files.
087daf0 [Hao Zhu] SPARK-9801

(cherry picked from commit 3c9802d)
Signed-off-by: Tathagata Das <[email protected]>
…when a grouping expression is used as an argument of the aggregate function

https://issues.apache.org/jira/browse/SPARK-10169

Author: Wenchen Fan <[email protected]>
Author: Yin Huai <[email protected]>

Closes #8380 from yhuai/aggTransformDown-branch1.3.
…eter setting

Added check for positive block size with a note that -1 for auto-configuring is not supported

Author: Bryan Cutler <[email protected]>

Closes #8363 from BryanCutler/ml.ALS-neg-blocksize-8400-1.3.
…nitialization

* do not cache first cost RDD
* change following cost RDD cache level to MEMORY_AND_DISK
* remove the Vector wrapper to save an object per instance

Further improvements will be addressed in SPARK-10329

cc: yu-iskw HuJiayin

Author: Xiangrui Meng <[email protected]>

Closes #8526 from mengxr/SPARK-10354.

(cherry picked from commit f0f563a)
Signed-off-by: Xiangrui Meng <[email protected]>
…= 0.0 for some subset of matrix multiplications

Apply fixes for alpha, beta parameter handling in gemm/gemv from #8525 to branch 1.3

CC mengxr brkyvz

Author: Sean Owen <[email protected]>

Closes #8572 from srowen/SPARK-10353.2.
…ld be 0.0 (original: 1.0)

Small typo in the example for `LabeledPoint` in the MLlib docs.

Author: Sean Paradiso <[email protected]>

Closes #8680 from sparadiso/docs_mllib_smalltypo.

(cherry picked from commit 1dc7548)
Signed-off-by: Xiangrui Meng <[email protected]>
…rialization

Python time values return a floating point value; we need to cast to an integer before serializing with struct.pack('!q', value).

https://issues.apache.org/jira/browse/SPARK-6931

Author: Bryan Cutler <[email protected]>

Closes #8594 from BryanCutler/py-write_long-backport-6931-1.2.

(cherry picked from commit 4862a80)
Signed-off-by: Davies Liu <[email protected]>
Previously, project/plugins.sbt explicitly set scalaVersion to 2.10.4. This can cause issues when using a version of sbt that is compiled against a different version of Scala (for example sbt 0.13.9 uses 2.10.5). Removing this explicit setting will cause build files to be compiled and run against the same version of Scala that sbt is compiled against.

Note that this only applies to the project build files (items in project/), it is distinct from the version of Scala we target for the actual spark compilation.

Author: Ahir Reddy <[email protected]>

Closes #8709 from ahirreddy/sbt-scala-version-fix.

(cherry picked from commit 9bbe33f)
Signed-off-by: Sean Owen <[email protected]>
…keys

JIRA: https://issues.apache.org/jira/browse/SPARK-10642

When calling `rdd.lookup()` on a RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`.

Author: Liang-Chi Hsieh <[email protected]>

Closes #8796 from viirya/fix-pyrdd-lookup.

(cherry picked from commit 136c77d)
Signed-off-by: Davies Liu <[email protected]>
As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to use our custom SCP-based mechanism for archiving Jenkins logs on the master machine; this has been superseded by the use of a Jenkins plugin which archives the logs and provides public links to view them.

Per shaneknapp, we should remove this log syncing mechanism if it is no longer necessary; removing the need to SCP from the Jenkins workers to the masters is a desired step as part of some larger Jenkins infra refactoring.

Author: Josh Rosen <[email protected]>

Closes #8793 from JoshRosen/remove-jenkins-ssh-to-master.

(cherry picked from commit f1c9115)
Signed-off-by: Josh Rosen <[email protected]>
…mitCoordinator (branch-1.3 backport)

This is a backport of #8544 to `branch-1.3` for inclusion in 1.3.2.

Author: Josh Rosen <[email protected]>

Closes #8790 from JoshRosen/SPARK-10381-1.3.
Author: Reynold Xin <[email protected]>

Closes #8894 from rxin/branch-1.3.
The created decimal is wrong when using `Decimal(unscaled, precision, scale)` with unscaled > 1e18, precision > 18, and scale > 0.

This bug has existed since the beginning.
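
An illustrative instance of the failing combination (the concrete numbers are assumptions chosen only to satisfy the stated conditions):

```scala
import org.apache.spark.sql.types.Decimal

// unscaled > 1e18, precision > 18, scale > 0: before the fix, the value built
// here did not represent 12345678901234567.89 as expected.
val d = Decimal(1234567890123456789L, 20, 2)
```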

Author: Davies Liu <[email protected]>

Closes #9014 from davies/fix_decimal.

(cherry picked from commit 37526ac)
Signed-off-by: Davies Liu <[email protected]>

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
…when asked for index after the last non-zero entry

See #9009 for details.

Author: zero323 <[email protected]>

Closes #9062 from zero323/SPARK-10973_1.3.
…atrix returns incorrect answer in some cases

Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.

Supersedes #9293

Author: Sean Owen <[email protected]>

Closes #9309 from srowen/SPARK-11302.2.

(cherry picked from commit 826e1e3)
Signed-off-by: Xiangrui Meng <[email protected]>
….3 backport)

This is a branch-1.3 backport of #9382, a fix for SPARK-11424.

Author: Josh Rosen <[email protected]>

Closes #9423 from JoshRosen/hadoop-decompressor-pooling-fix-branch-1.3.
The code in convertToCanonicalEdges is such that srcIds are smaller than dstIds, but the scaladoc suggested otherwise. This has been fixed.
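
A usage sketch consistent with the corrected doc, assuming a SparkContext `sc` (e.g., in spark-shell):

```scala
import org.apache.spark.graphx.{Edge, Graph}

val edges = sc.parallelize(Seq(Edge(3L, 1L, 1), Edge(1L, 3L, 1), Edge(2L, 5L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// After canonicalization every edge satisfies srcId < dstId, and the two
// opposite-direction edges between 1 and 3 are merged by the supplied function.
val canonical = graph.convertToCanonicalEdges(_ + _)
canonical.edges.collect().foreach(e => assert(e.srcId < e.dstId))
```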

Author: Gaurav Kumar <[email protected]>

Closes #9666 from gauravkumar37/patch-1.

(cherry picked from commit df0e318)
Signed-off-by: Reynold Xin <[email protected]>
jira: https://issues.apache.org/jira/browse/SPARK-11813

I found the problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits:
1. Performance improvement from less serialization.
2. A large increase in the capacity of Word2Vec.
Currently in Word2Vec's fit, the closure mainly includes the serialized Word2Vec instance and 2 global tables.
The main part of Word2Vec is the vocab, of size roughly vocab * 40 * 2 * 4 = 320 * vocab bytes.
The 2 global tables take vocab * vectorSize * 8 bytes each; if vectorSize = 20, that's 160 * vocab bytes.

Their sum cannot exceed Int.MaxValue due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab decreases the size of the serialized closure, especially when vectorSize is small, and thus allows a larger vocabulary.

Actually there's another possible fix: make local copies of fields to avoid including Word2Vec in the closure (a sketch follows below). Let me know if that's preferred.
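
A minimal sketch of that "local copy" alternative (the class and fields are illustrative, not the actual Word2Vec code):

```scala
import org.apache.spark.rdd.RDD

class Trainer(val vectorSize: Int, val vocab: Array[String]) extends Serializable {
  def fit(sentences: RDD[Seq[String]]): Long = {
    // Copy the needed field into a local val: referencing `vectorSize` directly
    // inside the closure would capture `this`, dragging the whole vocab along.
    val localVectorSize = vectorSize
    sentences.map(_.length * localVectorSize).count()
  }
}
```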

Author: Yuhao Yang <[email protected]>

Closes #9803 from hhbyyh/w2vVocab.

(cherry picked from commit e391abd)
Signed-off-by: Xiangrui Meng <[email protected]>
…ceByKeyAndWindow

invFunc is optional and can be None. Instead of invFunc (the parameter), invReduceFunc (a local function) was checked for truthiness (that is, not None, in this context). A local function is never None, so the case of invFunc=None (a common one when an inverse reduce function is not defined) was treated incorrectly, resulting in loss of data.

In addition, the docstring used wrong parameter names; these are also fixed.

Author: David Tolpin <[email protected]>

Closes #9775 from dtolpin/master.

(cherry picked from commit 599a8c6)
Signed-off-by: Tathagata Das <[email protected]>
…branch 1.3

JIRA: https://issues.apache.org/jira/browse/SPARK-13464

## What changes were proposed in this pull request?

While backporting an MLlib feature, I found that a cleanly checked-out branch-1.3 codebase would fail the test `test_reduce_by_key_and_window_with_none_invFunc` in pyspark/streaming. We should fix it.

## How was this patch tested?

Unit test `test_reduce_by_key_and_window_with_none_invFunc` is fixed.

Author: Liang-Chi Hsieh <[email protected]>

Closes #11339 from viirya/fix-streaming-test-branch-1.3.
…tionClustering failed test

## What changes were proposed in this pull request?

Backport JIRA-SPARK-12363 to branch-1.3.

## How was this patch tested?

Unit test.

cc mengxr

Author: Liang-Chi Hsieh <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes #11265 from viirya/backport-12363-1.3 and squashes the following commits:

ec076dd [Liang-Chi Hsieh] Fix scala style.
7a3ef5f [Xiangrui Meng] use Graph instead of GraphImpl and update tests and example based on PIC paper
b86018d [Liang-Chi Hsieh] Remove setRun and fix PowerIterationClustering failed test.
@HyukjinKwon
Member

Hi @Kevy123, it seems this pull request was mistakenly opened. Could you please close it?

@AmplabJenkins

Can one of the admins verify this patch?

srowen added a commit to srowen/spark that referenced this pull request Jan 1, 2017
@srowen srowen mentioned this pull request Jan 1, 2017
@asfgit asfgit closed this in ba48812 Jan 2, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#12968
Closes apache#16215
Closes apache#16212
Closes apache#16086
Closes apache#15713
Closes apache#16413
Closes apache#16396

Author: Sean Owen <[email protected]>

Closes apache#16447 from srowen/CloseStalePRs.
