
Conversation


@Kevy123 Kevy123 commented Dec 27, 2016

What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

adrian-wang and others added 30 commits March 23, 2015 11:46
This PR might have some issues with #3732,
and this would have merge conflicts with #3820, so the review can be delayed till those two are merged.

Author: Daoyuan Wang <[email protected]>

Closes #3822 from adrian-wang/parquetdate and squashes the following commits:

2c5d54d [Daoyuan Wang] add a test case
faef887 [Daoyuan Wang] parquet support for primitive date
97e9080 [Daoyuan Wang] parquet support for date type

(cherry picked from commit 4659468)
Signed-off-by: Cheng Lian <[email protected]>
#5082

/cc liancheng

Author: Yadong Qi <[email protected]>

Closes #5132 from watermen/sql-missingInput-new and squashes the following commits:

1e5bdc5 [Yadong Qi] Check the missingInput simply

(cherry picked from commit 9f3273b)
Signed-off-by: Cheng Lian <[email protected]>
…e query

One more thing, if this PR is considered OK: it might make sense to add extra `.jdbc()` APIs to SQLContext that take Properties.

Author: Volodymyr Lyubinets <[email protected]>

Closes #4859 from vlyubin/jdbcProperties and squashes the following commits:

7a8cfda [Volodymyr Lyubinets] Support jdbc connection properties in OPTIONS part of the query

(cherry picked from commit bfd3ee9)
Signed-off-by: Michael Armbrust <[email protected]>
…tor for all types of operator

In `CheckAnalysis`, `Filter` and `Aggregate` are checked in separate case clauses, so unresolved operators and missing input attributes are never caught for other types of operators.

This PR also removes the `prettyString` call when generating the error message for missing input attributes, because the result of `prettyString` doesn't contain expression IDs and may give confusing messages like

> resolved attributes a missing from a
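
A simplified sketch of the catch-all check, assuming Catalyst's existing `failAnalysis` helper and `LogicalPlan` members (`missingInput`, `inputSet`, `resolved`, `simpleString`):

```scala
// Apply the missing-attribute and unresolved-operator checks to every operator,
// not only to Filter and Aggregate, keeping expression IDs in the message.
plan.foreachUp {
  case o if o.missingInput.nonEmpty =>
    failAnalysis(
      s"resolved attribute(s) ${o.missingInput.mkString(",")} missing from " +
        s"${o.inputSet.mkString(",")} in operator ${o.simpleString}")
  case o if !o.resolved =>
    failAnalysis(s"unresolved operator ${o.simpleString}")
  case _ => // other operators are fine
}
```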

cc rxin


Author: Cheng Lian <[email protected]>

Closes #5129 from liancheng/spark-6452 and squashes the following commits:

52cdc69 [Cheng Lian] Addresses comments
029f9bd [Cheng Lian] Checks for missing attributes and unresolved operator for all types of operator

(cherry picked from commit 1afcf77)
Signed-off-by: Michael Armbrust <[email protected]>
As for "notebook --pylab inline" is not supported any more, update the related documentation for this.

Author: Cong Yue <[email protected]>

Closes #5111 from yuecong/patch-1 and squashes the following commits:

872df76 [Cong Yue] Update the command to use IPython notebook

(cherry picked from commit c12312f)
Signed-off-by: Sean Owen <[email protected]>
…en running FlumeStreamSuite

When we run FlumeStreamSuite on Jenkins, sometimes we get error like as follows.

    sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 52 times over 10.094849836 seconds. Last failure message: Error connecting to localhost/127.0.0.1:23456.
        at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
        at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
        at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
        at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
        at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
        at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:116)
        at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74)
        at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply$mcV$sp(FlumeStreamSuite.scala:66)
        at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply(FlumeStreamSuite.scala:66)
        at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply(FlumeStreamSuite.scala:66)
        at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
        at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
        at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
        at org.scalatest.Transformer.apply(Transformer.scala:22)
        at org.scalatest.Transformer.apply(Transformer.scala:20)
        at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
        at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
        at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
        at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
        at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
        at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
        at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
        at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)

This error is caused by the check-then-act logic used to find a free port:

      /** Find a free port */
      private def findFreePort(): Int = {
        Utils.startServiceOnPort(23456, (trialPort: Int) => {
          val socket = new ServerSocket(trialPort)
          socket.close()
          (null, trialPort)
        }, conf)._2
      }

Removing the check-then-act is not easy, but we can reduce the chance of hitting the error by choosing a random value for the initial port instead of 23456.
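
A minimal sketch of the mitigation, assuming the same `Utils.startServiceOnPort` helper and `conf` as above; the random range is illustrative:

```scala
import java.net.ServerSocket
import scala.util.Random

/** Find a free port, starting the search at a random candidate instead of 23456. */
private def findFreePort(): Int = {
  val candidatePort = Random.nextInt(50000) + 10000  // random start in [10000, 60000)
  Utils.startServiceOnPort(candidatePort, (trialPort: Int) => {
    val socket = new ServerSocket(trialPort)
    socket.close()
    (null, trialPort)
  }, conf)._2
}
```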

Author: Kousuke Saruta <[email protected]>

Closes #4337 from sarutak/SPARK-5559 and squashes the following commits:

16f109f [Kousuke Saruta] Added `require` to Utils#startServiceOnPort
c39d8b6 [Kousuke Saruta] Merge branch 'SPARK-5559' of github.com:sarutak/spark into SPARK-5559
1610ba2 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5559
33357e3 [Kousuke Saruta] Changed "findFreePort" method in MQTTStreamSuite and FlumeStreamSuite so that it can choose valid random port
a9029fe [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5559
9489ef9 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5559
8212e42 [Kousuke Saruta] Modified default port used in FlumeStreamSuite from 23456 to random value

(cherry picked from commit 85cf063)
Signed-off-by: Sean Owen <[email protected]>
To make the Cross-Validation example code snippet easier to copy/paste, LabeledDocument/Document need to be defined in it, since they are defined in a previous example.

Author: Peter Rudenko <[email protected]>

Closes #5135 from petro-rudenko/patch-3 and squashes the following commits:

5190c75 [Peter Rudenko] Fix primitive types for java examples.
1d35383 [Peter Rudenko] [SQL][docs][minor] Define LabeledDocument/Document classes in CV example

(cherry picked from commit 08d4528)
Signed-off-by: Sean Owen <[email protected]>
Add checkpointInterval to ALS (a minimal usage sketch follows the list) to prevent:

1. StackOverflow exceptions caused by long lineage,
2. large shuffle files generated during iterations,
3. slow recovery when some node fails.
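
A minimal usage sketch of the new parameter (the checkpoint directory and data names are illustrative, not part of this patch):

```scala
import org.apache.spark.ml.recommendation.ALS

// Checkpointing only happens when a checkpoint directory is set.
sc.setCheckpointDir("/tmp/spark-checkpoints")

val als = new ALS()
  .setMaxIter(30)
  .setCheckpointInterval(10)  // truncate the lineage every 10 iterations

val model = als.fit(training)  // `training` is an assumed DataFrame of ratings
```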

srowen coderxiang

Author: Xiangrui Meng <[email protected]>

Closes #5076 from mengxr/SPARK-5955 and squashes the following commits:

df56791 [Xiangrui Meng] update impl to reuse code
29affcb [Xiangrui Meng] do not materialize factors in implicit
20d3f7f [Xiangrui Meng] add checkpointInterval to ALS

(cherry picked from commit 6b36470)
Signed-off-by: Xiangrui Meng <[email protected]>

Conflicts:
	mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
For example, one might expect the following code to work, but it does not.  Now you will at least get a warning with a suggestion to use aliases.

```scala
val df = sqlContext.load(path, "parquet")
val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns"))
val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend"))
val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")
```
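
A sketch of the alias-based workaround such a warning points to, reusing the names from the snippet above:

```scala
// Alias each side so the join condition references two distinct attribute sets
// instead of comparing a column with itself (which is trivially true).
val t = txns.as("t")
val s = spend.as("s")
val rmJoin = t.join(s, $"t.cust_id" === $"s.cust_id", "inner")
```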

Author: Michael Armbrust <[email protected]>

Closes #5163 from marmbrus/selfJoinError and squashes the following commits:

16c1f0b [Michael Armbrust] fix visibility
1b57e8d [Michael Armbrust] Warn when constructing trivially true equals predicate

(cherry picked from commit 32efadd)
Signed-off-by: Michael Armbrust <[email protected]>
Otherwise we will leak files when spilling occurs.

Author: Michael Armbrust <[email protected]>

Closes #5161 from marmbrus/cleanupAfterSort and squashes the following commits:

cb13d3c [Michael Armbrust] hint to inferencer
cdebdf5 [Michael Armbrust] Use completion iterator to close external sorter

(cherry picked from commit 26c6ce3)
Signed-off-by: Michael Armbrust <[email protected]>
Due to a recent change that made `StructType` a `Seq` we started inadvertently turning `StructType`s into generic `Traversable` when attempting nested tree transformations.  In this PR we explicitly avoid descending into `DataType`s to avoid this bug.

Author: Michael Armbrust <[email protected]>

Closes #5157 from marmbrus/udfFix and squashes the following commits:

26f7087 [Michael Armbrust] Fix transformations of TreeNodes that hold StructTypes

(cherry picked from commit 3fa3d12)
Signed-off-by: Michael Armbrust <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #5155 from marmbrus/errorMessages and squashes the following commits:

b898188 [Michael Armbrust] Fix formatting of error messages.

(cherry picked from commit 046c1e2)
Signed-off-by: Michael Armbrust <[email protected]>
Previously it was okay to throw away subqueries after analysis, as we would never try to use that tree for resolution again.  However, with eager analysis in `DataFrame`s this can cause errors for queries such as:

```scala
val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("x.str").count()
```

As a result, in this PR we defer the elimination of subqueries until the optimization phase.

Author: Michael Armbrust <[email protected]>

Closes #5160 from marmbrus/subqueriesInDfs and squashes the following commits:

a9bb262 [Michael Armbrust] Update Optimizer.scala
27d25bf [Michael Armbrust] fix hive tests
9137e03 [Michael Armbrust] add type
81cd597 [Michael Armbrust] Avoid eliminating subqueries until optimization

(cherry picked from commit cbeaf9e)
Signed-off-by: Michael Armbrust <[email protected]>
Avoid unclear match errors and use `AnalysisException`.

Author: Michael Armbrust <[email protected]>

Closes #5158 from marmbrus/dataSourceError and squashes the following commits:

af9f82a [Michael Armbrust] Yins comment
90c6ba4 [Michael Armbrust] Better error messages for invalid data sources

(cherry picked from commit a8f51b8)
Signed-off-by: Michael Armbrust <[email protected]>
…g to load classes (master branch PR)

ExecutorClassLoader does not ensure proper cleanup of network connections that it opens. If it fails to load a class, it may leak partially-consumed InputStreams that are connected to the REPL's HTTP class server, causing that server to exhaust its thread pool, which can cause the entire job to hang.  See [SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more details, including a bug reproduction.

This patch fixes this issue by ensuring proper cleanup of these resources.  It also adds logging for unexpected error cases.

This PR is an extended version of #4935 and adds a regression test.
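
A minimal sketch of the cleanup pattern (the helper is hypothetical, not Spark's actual API):

```scala
import java.io.{ByteArrayOutputStream, InputStream}

/** Hypothetical helper: read all class bytes and always release the connection. */
def readClassBytes(in: InputStream): Array[Byte] = {
  val out = new ByteArrayOutputStream()
  try {
    val buf = new Array[Byte](8192)
    var n = in.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      n = in.read(buf)
    }
    out.toByteArray
  } finally {
    in.close()  // leaking this stream is what exhausted the class server's thread pool
  }
}
```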

Author: Josh Rosen <[email protected]>

Closes #4944 from JoshRosen/executorclassloader-leak-master-branch and squashes the following commits:

e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable keep-alive by closing error stream
961c284 [Josh Rosen] Roll back changes that were added to get the regression test to fail
7ee2261 [Josh Rosen] Add a failing regression test
e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader

(cherry picked from commit 7215aa7)
Signed-off-by: Andrew Or <[email protected]>
…lyst

I think after this PR, we can finally turn the rule on. There are still some smaller ones that need to be fixed, but those are easier.

Author: Reynold Xin <[email protected]>

Closes #5162 from rxin/catalyst-explicit-types and squashes the following commits:

e7eac03 [Reynold Xin] [SPARK-6428][SQL] Added explicit types for all public methods in catalyst.

(cherry picked from commit 7334801)
Signed-off-by: Reynold Xin <[email protected]>

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala
It would be great to fix this for 1.3, since the fix is surgical and it helps understandability for users.

cc shivaram pwendell

Author: Kay Ousterhout <[email protected]>

Closes #4839 from kayousterhout/SPARK-6088 and squashes the following commits:

3ab012c [Kay Ousterhout] Update getting result time incrementally, correctly set GET_RESULT status
f346b49 [Kay Ousterhout] Typos
748ea6b [Kay Ousterhout] Fixed build failure
84d617c [Kay Ousterhout] [SPARK-6088] Correct how tasks that get remote results are shown in the UI.

(cherry picked from commit 6948ab6)
Signed-off-by: Andrew Or <[email protected]>
Opening shuffle files can take significant time when the disk is
contended, especially when using ext3. While writing data to
a file can avoid hitting disk (and instead hit the buffer
cache), opening a file always involves writing some metadata
about the file to disk, so the open time can be a very significant
portion of the shuffle write time. In one job I ran recently, the time to
write shuffle data to the file was only 4ms for each task, but
the time to open the file was about 100x as long (~400ms).

When we add metrics about spilled data (#2504), we should ensure
that the file open time is also included there.
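
A sketch of the measurement being added (the file path and metric variable are illustrative):

```scala
import java.io.{File, FileOutputStream}

var shuffleWriteTimeNs = 0L                       // illustrative metric accumulator
val shuffleFile = new File("/tmp/shuffle_0_0_0")  // illustrative shuffle output file

// Opening the file writes metadata to disk even when the data itself only hits
// the buffer cache, so the open time belongs in the shuffle write time too.
val openStartNs = System.nanoTime()
val out = new FileOutputStream(shuffleFile, true)
shuffleWriteTimeNs += System.nanoTime() - openStartNs
```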

Author: Kay Ousterhout <[email protected]>

Closes #4550 from kayousterhout/SPARK-3570 and squashes the following commits:

ea3a4ae [Kay Ousterhout] Added comment about excluded open time
fdc5185 [Kay Ousterhout] Improved comment
42b7e43 [Kay Ousterhout] Fixed parens for nanotime
2423555 [Kay Ousterhout] [SPARK-3570] Include time to open files in shuffle write time.

(cherry picked from commit d8ccf65)
Signed-off-by: Andrew Or <[email protected]>
Clarify the local directories usage in YARN

Author: Christophe Préaud <[email protected]>

Closes #5165 from preaudc/yarn-doc-local-dirs and squashes the following commits:

6912b90 [Christophe Préaud] Fix some formatting issues.
4fa8ec2 [Christophe Préaud] Merge remote-tracking branch 'upstream/master' into yarn-doc-local-dirs
eaaf519 [Christophe Préaud] Clarify the local directories usage in YARN
436fb7d [Christophe Préaud] Revert "Clarify the local directories usage in YARN"
876ae5e [Christophe Préaud] Clarify the local directories usage in YARN
608dbfa [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
a49a2ce [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
9ba89ca [Christophe Préaud] Ensure that files are fetched atomically
54419ae [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
c6a5590 [Christophe Préaud] Revert commit 8ea871f
7456a33 [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
8ea871f [Christophe Préaud] Ensure that files are fetched atomically
Needed to import the types specifically, not the more general pyspark.sql

Author: Bill Chambers <[email protected]>
Author: anabranch <[email protected]>

Closes #5179 from anabranch/master and squashes the following commits:

8fa67bf [anabranch] Corrected SqlContext Import
603b080 [Bill Chambers] [DOCUMENTATION]Fixed Missing Type Import in Documentation

(cherry picked from commit c5cc414)
Signed-off-by: Reynold Xin <[email protected]>
…ghts) should initialize numFeatures

In GeneralizedLinearAlgorithm, ```numFeatures``` defaults to -1, and we need to update it to the correct value when we call run() to train a model.
```LogisticRegressionWithLBFGS.run(input)``` works well, but when we call ```LogisticRegressionWithLBFGS.run(input, initialWeights)``` to train a multiclass classification model, it throws an exception because numFeatures is not updated.
In this PR, we just update numFeatures at the beginning of GeneralizedLinearAlgorithm.run(input, initialWeights) and add a test case.
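
A simplified sketch of the fix described above (the helper name is hypothetical; in the real class this happens at the top of `run(input, initialWeights)`):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// If numFeatures was never set (default -1), derive it from the first example
// before the caller-supplied initial weights are validated or used.
def resolveNumFeatures(input: RDD[LabeledPoint], current: Int): Int =
  if (current < 0) input.map(_.features.size).first() else current
```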

Author: Yanbo Liang <[email protected]>

Closes #5167 from yanboliang/spark-6496 and squashes the following commits:

8131c48 [Yanbo Liang] LogisticRegressionWithLBFGS.run(input, initialWeights) should initialize numFeatures

(cherry picked from commit 10c7860)
Signed-off-by: Sean Owen <[email protected]>
…n LDAModel.scala

Remove unicode characters from MLlib file.

Author: Michael Griffiths <[email protected]>
Author: Griffiths, Michael (NYC-RPM) <[email protected]>

Closes #4815 from msjgriffiths/SPARK-6063 and squashes the following commits:

bcd7de1 [Griffiths, Michael (NYC-RPM)] Change \u201D quote marks around 'theta' to standard single apostrophe (\x27)
38eb535 [Michael Griffiths] Merge pull request #2 from apache/master
b08e865 [Michael Griffiths] Merge pull request #1 from apache/master
…, because this makes some UDAFs unable to work.

Spark avoids the old interface of Hive, so some UDAFs such as "org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage" cannot work.

Author: DoingDone9 <[email protected]>

Closes #5131 from DoingDone9/udaf and squashes the following commits:

9de08d0 [DoingDone9] Update HiveUdfSuite.scala
49c62dc [DoingDone9] Update hiveUdfs.scala
98b134f [DoingDone9] Merge pull request #5 from apache/master
161cae3 [DoingDone9] Merge pull request #4 from apache/master
c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master

(cherry picked from commit 968408b)
Signed-off-by: Michael Armbrust <[email protected]>
The `ParquetConversions` analysis rule generates a hash map, which maps from the original `MetastoreRelation` instances to the newly created `ParquetRelation2` instances. However, `MetastoreRelation.equals` doesn't compare output attributes. Thus, if a single metastore Parquet table appears multiple times in a query, only a single entry ends up in the hash map, and the conversion is not correctly performed.

Proper fix for this issue should be overriding `equals` and `hashCode` for MetastoreRelation. Unfortunately, this breaks more tests than expected. It's possible that these tests are ill-formed from the very beginning. As 1.3.1 release is approaching, we'd like to make the change more surgical to avoid potential regressions. The proposed fix here is to make both the metastore relations and their output attributes as keys in the hash map used in ParquetConversions.


Author: Cheng Lian <[email protected]>

Closes #5183 from liancheng/spark-6450 and squashes the following commits:

3536780 [Cheng Lian] Fixes metastore Parquet table conversion

(cherry picked from commit 8c3b005)
Signed-off-by: Michael Armbrust <[email protected]>
Previously this could result in sets comparing as equal when in fact the right was a proper subset of the left.
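
A simplified sketch of the corrected comparison, shown over plain Scala sets rather than Catalyst's `AttributeSet`:

```scala
// Containment alone is not enough: a proper subset on the right would still
// pass, so the sizes must match as well.
def setsEqual[A](left: Set[A], right: Set[A]): Boolean =
  left.size == right.size && left.forall(right.contains)
```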

Based on #5133 by sisihj

Author: sisihj <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #5194 from marmbrus/pr/5133 and squashes the following commits:

5ed4615 [Michael Armbrust] fix imports
d4cbbc0 [Michael Armbrust] Add test cases
0a0834f [sisihj]  AttributeSet.equal should compare size

(cherry picked from commit 276ef1c)
Signed-off-by: Michael Armbrust <[email protected]>
```
>>> df[df.name.inSet("Bob", "Mike")].collect()
[Row(age=5, name=u'Bob')]
>>> df[df.age.inSet([1, 2, 3])].collect()
[Row(age=2, name=u'Alice')]
```

Author: Davies Liu <[email protected]>

Closes #5190 from davies/in and squashes the following commits:

6b73a47 [Davies Liu] Column.inSet() in Python

(cherry picked from commit f535802)
Signed-off-by: Reynold Xin <[email protected]>
Author: Michael Armbrust <[email protected]>

Closes #5191 from marmbrus/kryoRowsWithSchema and squashes the following commits:

bb83522 [Michael Armbrust] Fix serialization of GenericRowWithSchema using kryo
f914f16 [Michael Armbrust] Add no arg constructor to GenericRowWithSchema

(cherry picked from commit f88f51b)
Signed-off-by: Cheng Lian <[email protected]>
…t schema to support dropping of columns using replace columns

Currently in the parquet relation 2 implementation, an error is thrown if the merged schema is not exactly the same as the metastore schema.
But to support cases like deletion of a column using the replace columns command, we can relax the restriction so that the query still works even if the metastore schema is only a subset of the merged parquet schema.
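
A simplified sketch of the relaxed compatibility check, comparing column names only (the real check also reconciles types and nullability):

```scala
import org.apache.spark.sql.types.StructType

// Instead of requiring the merged Parquet schema to equal the metastore schema,
// only require every metastore column to be present in the merged schema.
def isCompatible(metastoreSchema: StructType, mergedSchema: StructType): Boolean = {
  val mergedNames = mergedSchema.fieldNames.map(_.toLowerCase).toSet
  metastoreSchema.fieldNames.map(_.toLowerCase).forall(mergedNames.contains)
}
```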

Author: Yash Datta <[email protected]>

Closes #5141 from saucam/replace_col and squashes the following commits:

e858d5b [Yash Datta] SPARK-6471: Fix test cases, add a new test case for metastore schema to be subset of parquet schema
5f2f467 [Yash Datta] SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns

(cherry picked from commit 1c05027)
Signed-off-by: Cheng Lian <[email protected]>
When running "bin/computer-classpath.sh", the output will be:
:/spark/conf:/spark/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.5.0-cdh5.2.0.jar:/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/spark/lib_managed/jars/datanucleus-core-3.2.10.jar
Java will add the current working dir to the CLASSPATH, if the first ":" exists, which is not expected by spark users.
For example, if I call spark-shell in the folder /root. And there exists a "core-site.xml" under /root/. Spark will use this file as HADOOP CONF file, even if I have already set HADOOP_CONF_DIR=/etc/hadoop/conf.

Author: guliangliang <[email protected]>

Closes #5156 from marsishandsome/Spark6491 and squashes the following commits:

5ae214f [guliangliang] use appendToClasspath to change CLASSPATH
b21f3b2 [guliangliang] keep the classpath order
5d1f870 [guliangliang] [SPARK-6491] Spark will put the current working dir to the CLASSPATH
viadea and others added 23 commits August 10, 2015 17:18
…ry files.

Spark streaming deletes the temp file and backup files without checking if they exist or not

Author: Hao Zhu <[email protected]>

Closes #8082 from viadea/master and squashes the following commits:

242d05f [Hao Zhu] [SPARK-9801][Streaming]No need to check the existence of those files
fd143f2 [Hao Zhu] [SPARK-9801][Streaming]Check if backupFile exists before deleting backupFile files.
087daf0 [Hao Zhu] SPARK-9801

(cherry picked from commit 3c9802d)
Signed-off-by: Tathagata Das <[email protected]>
…when a grouping expression is used as an argument of the aggregate function

https://issues.apache.org/jira/browse/SPARK-10169

Author: Wenchen Fan <[email protected]>
Author: Yin Huai <[email protected]>

Closes #8380 from yhuai/aggTransformDown-branch1.3.
…eter setting

Added check for positive block size with a note that -1 for auto-configuring is not supported

Author: Bryan Cutler <[email protected]>

Closes #8363 from BryanCutler/ml.ALS-neg-blocksize-8400-1.3.
…nitialization

* do not cache first cost RDD
* change following cost RDD cache level to MEMORY_AND_DISK
* remove the Vector wrapper to save an object per instance

Further improvements will be addressed in SPARK-10329

cc: yu-iskw HuJiayin

Author: Xiangrui Meng <[email protected]>

Closes #8526 from mengxr/SPARK-10354.

(cherry picked from commit f0f563a)
Signed-off-by: Xiangrui Meng <[email protected]>
…= 0.0 for some subset of matrix multiplications

Apply fixes for alpha, beta parameter handling in gemm/gemv from #8525 to branch 1.3

CC mengxr brkyvz

Author: Sean Owen <[email protected]>

Closes #8572 from srowen/SPARK-10353.2.
…ld be 0.0 (original: 1.0)

Small typo in the example for `LabeledPoint` in the MLlib docs.

Author: Sean Paradiso <[email protected]>

Closes #8680 from sparadiso/docs_mllib_smalltypo.

(cherry picked from commit 1dc7548)
Signed-off-by: Xiangrui Meng <[email protected]>
…rialization

Python time values return a floating point value; we need to cast to an integer before serializing with struct.pack('!q', value).

https://issues.apache.org/jira/browse/SPARK-6931

Author: Bryan Cutler <[email protected]>

Closes #8594 from BryanCutler/py-write_long-backport-6931-1.2.

(cherry picked from commit 4862a80)
Signed-off-by: Davies Liu <[email protected]>
Previously, project/plugins.sbt explicitly set scalaVersion to 2.10.4. This can cause issues when using a version of sbt that is compiled against a different version of Scala (for example sbt 0.13.9 uses 2.10.5). Removing this explicit setting will cause build files to be compiled and run against the same version of Scala that sbt is compiled against.

Note that this only applies to the project build files (items in project/), it is distinct from the version of Scala we target for the actual spark compilation.

Author: Ahir Reddy <[email protected]>

Closes #8709 from ahirreddy/sbt-scala-version-fix.

(cherry picked from commit 9bbe33f)
Signed-off-by: Sean Owen <[email protected]>
…keys

JIRA: https://issues.apache.org/jira/browse/SPARK-10642

When calling `rdd.lookup()` on a RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`.

Author: Liang-Chi Hsieh <[email protected]>

Closes #8796 from viirya/fix-pyrdd-lookup.

(cherry picked from commit 136c77d)
Signed-off-by: Davies Liu <[email protected]>
As of https://issues.apache.org/jira/browse/SPARK-7561, we no longer need to use our custom SCP-based mechanism for archiving Jenkins logs on the master machine; this has been superseded by the use of a Jenkins plugin which archives the logs and provides public links to view them.

Per shaneknapp, we should remove this log syncing mechanism if it is no longer necessary; removing the need to SCP from the Jenkins workers to the masters is a desired step as part of some larger Jenkins infra refactoring.

Author: Josh Rosen <[email protected]>

Closes #8793 from JoshRosen/remove-jenkins-ssh-to-master.

(cherry picked from commit f1c9115)
Signed-off-by: Josh Rosen <[email protected]>
…mitCoordinator (branch-1.3 backport)

This is a backport of #8544 to `branch-1.3` for inclusion in 1.3.2.

Author: Josh Rosen <[email protected]>

Closes #8790 from JoshRosen/SPARK-10381-1.3.
Author: Reynold Xin <[email protected]>

Closes #8894 from rxin/branch-1.3.
The created decimal is wrong when using `Decimal(unscaled, precision, scale)` with unscaled > 1e18, precision > 18, and scale > 0.

This bug has existed since the beginning.
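
An illustrative instance of the failing combination (the concrete numbers are assumptions chosen only to satisfy the stated conditions):

```scala
import org.apache.spark.sql.types.Decimal

// unscaled > 1e18, precision > 18, scale > 0: before the fix, the value built
// here did not represent 12345678901234567.89 as expected.
val d = Decimal(1234567890123456789L, 20, 2)
```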

Author: Davies Liu <[email protected]>

Closes #9014 from davies/fix_decimal.

(cherry picked from commit 37526ac)
Signed-off-by: Davies Liu <[email protected]>

Conflicts:
	sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala
…when asked for index after the last non-zero entry

See #9009 for details.

Author: zero323 <[email protected]>

Closes #9062 from zero323/SPARK-10973_1.3.
…atrix returns incorrect answer in some cases

Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.

Supersedes #9293

Author: Sean Owen <[email protected]>

Closes #9309 from srowen/SPARK-11302.2.

(cherry picked from commit 826e1e3)
Signed-off-by: Xiangrui Meng <[email protected]>
….3 backport)

This is a branch-1.3 backport of #9382, a fix for SPARK-11424.

Author: Josh Rosen <[email protected]>

Closes #9423 from JoshRosen/hadoop-decompressor-pooling-fix-branch-1.3.
The code in convertToCanonicalEdges is such that srcIds are smaller than dstIds, but the scaladoc suggested otherwise. This has been fixed.
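
A usage sketch consistent with the corrected doc, assuming a SparkContext `sc` (e.g., in spark-shell):

```scala
import org.apache.spark.graphx.{Edge, Graph}

val edges = sc.parallelize(Seq(Edge(3L, 1L, 1), Edge(1L, 3L, 1), Edge(2L, 5L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// After canonicalization every edge satisfies srcId < dstId, and the two
// opposite-direction edges between 1 and 3 are merged by the supplied function.
val canonical = graph.convertToCanonicalEdges(_ + _)
canonical.edges.collect().foreach(e => assert(e.srcId < e.dstId))
```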

Author: Gaurav Kumar <[email protected]>

Closes #9666 from gauravkumar37/patch-1.

(cherry picked from commit df0e318)
Signed-off-by: Reynold Xin <[email protected]>
jira: https://issues.apache.org/jira/browse/SPARK-11813

I found the problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits:
1. Performance improvement from less serialization.
2. A large increase in the capacity of Word2Vec.
Currently in Word2Vec's fit, the closure mainly includes the serialized Word2Vec instance and 2 global tables.
The main part of Word2Vec is the vocab, of size roughly vocab * 40 * 2 * 4 = 320 * vocab bytes.
The 2 global tables take vocab * vectorSize * 8 bytes each; if vectorSize = 20, that's 160 * vocab bytes.

Their sum cannot exceed Int.MaxValue due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab decreases the size of the serialized closure, especially when vectorSize is small, and thus allows a larger vocabulary.

Actually there's another possible fix: make local copies of fields to avoid including Word2Vec in the closure (a sketch follows below). Let me know if that's preferred.
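
A minimal sketch of that "local copy" alternative (the class and fields are illustrative, not the actual Word2Vec code):

```scala
import org.apache.spark.rdd.RDD

class Trainer(val vectorSize: Int, val vocab: Array[String]) extends Serializable {
  def fit(sentences: RDD[Seq[String]]): Long = {
    // Copy the needed field into a local val: referencing `vectorSize` directly
    // inside the closure would capture `this`, dragging the whole vocab along.
    val localVectorSize = vectorSize
    sentences.map(_.length * localVectorSize).count()
  }
}
```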

Author: Yuhao Yang <[email protected]>

Closes #9803 from hhbyyh/w2vVocab.

(cherry picked from commit e391abd)
Signed-off-by: Xiangrui Meng <[email protected]>
…ceByKeyAndWindow

invFunc is optional and can be None. Instead of invFunc (the parameter), invReduceFunc (a local function) was checked for truthiness (that is, not None, in this context). A local function is never None, so the case of invFunc=None (a common one when an inverse reduce function is not defined) was treated incorrectly, resulting in loss of data.

In addition, the docstring used wrong parameter names; these are also fixed.

Author: David Tolpin <[email protected]>

Closes #9775 from dtolpin/master.

(cherry picked from commit 599a8c6)
Signed-off-by: Tathagata Das <[email protected]>
…branch 1.3

JIRA: https://issues.apache.org/jira/browse/SPARK-13464

## What changes were proposed in this pull request?

While backporting an MLlib feature, I found that a cleanly checked-out branch-1.3 codebase would fail the test `test_reduce_by_key_and_window_with_none_invFunc` in pyspark/streaming. We should fix it.

## How was this patch tested?

Unit test `test_reduce_by_key_and_window_with_none_invFunc` is fixed.

Author: Liang-Chi Hsieh <[email protected]>

Closes #11339 from viirya/fix-streaming-test-branch-1.3.
…tionClustering failed test

## What changes were proposed in this pull request?

Backport JIRA-SPARK-12363 to branch-1.3.

## How was this patch tested?

Unit test.

cc mengxr

Author: Liang-Chi Hsieh <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes #11265 from viirya/backport-12363-1.3 and squashes the following commits:

ec076dd [Liang-Chi Hsieh] Fix scala style.
7a3ef5f [Xiangrui Meng] use Graph instead of GraphImpl and update tests and example based on PIC paper
b86018d [Liang-Chi Hsieh] Remove setRun and fix PowerIterationClustering failed test.
@HyukjinKwon
Member

Hi @Kevy123, it seems this pull request was mistakenly opened. Could you please close it?

@AmplabJenkins

Can one of the admins verify this patch?

srowen added a commit to srowen/spark that referenced this pull request Jan 1, 2017
@srowen srowen mentioned this pull request Jan 1, 2017
@asfgit asfgit closed this in ba48812 Jan 2, 2017
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#12968
Closes apache#16215
Closes apache#16212
Closes apache#16086
Closes apache#15713
Closes apache#16413
Closes apache#16396

Author: Sean Owen <[email protected]>

Closes apache#16447 from srowen/CloseStalePRs.
