Skip to content

Conversation

@debasish83
Copy link

ALS is a generic algorithm for matrix factorization which is equally applicable for both feature space and similarity space. Current ALS support L2 regularization and positivity constraint. This PR introduces userConstraint and productConstraint to ALS and let's the user select different constraints for user and product solves. The supported constraints are the following:

  1. SMOOTH : default ALS with L2 regularization
  2. POSITIVE: ALS with positive factors
  3. BOUNDS: ALS with factors bounded within upper and lower bound (default within 0 and 1)
  4. SPARSE: ALS with L1 regularization
  5. EQUALITY: ALS with equality constraint (default the factors sum up to 1 and positive)

First let's focus on the problem formulation. Both implicit and explicit feedback ALS formulation can be written as a quadratic minimization problem. The quadratic objective can be written as xtHx + ctx. Each of the respective constraints take the following form:
minimize xtHx + ctx
s.t ||x||1 <= c (SPARSE constraint)

We rewrite the objective as f(x) = xtHx + ctx and the constraint as an indicator function g(x)

Now minimization of f(x) + g(x) can be carried out using various forward backward splitting algorithms. We choose ADMM for the first version based on our experimentation with ECOS IP solver and MOSEK comparisons. I will document the comparisons.

Details of the algorithm are in the following reference:
http://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf

Right now the default parameters of alpha, rho are set as 1.0 but the following issues show up in experiments with MovieLens dataset:

  1. ~3X higher iterations as compared to NNLS
  2. For SPARSE we are hitting the max iterations (400) around 10% of the time
  3. For EQUALITY rho is set at 50 based on a reference from Professor Boyd on optimal control

We choose ADMM as the baseline solver but this PR will explore the following solver enhancements to decrease the iteration count:

  1. Accelerated ADMM using Nesterov acceleration
  2. FISTA style forward backward splitting

For use-cases the PR is focused on the following:

  1. Sparse matrix factorization to improve recommendation
    On Movielens data right now the RMSE with SPARSE is 10% (1.04) lower than the Mahout/Spark baseline (0.9) but have not looked into map, prec@k and ndcg@k measures. Using the PR from @coderxiang to look into IR measures.
    Example run:
    MASTER=spark://localhost:7077 ./bin/run-example mllib.MovieLensALS --rank 20 --numIterations 10 --userConstraint SMOOTH --lambdaUser 0.065 --productConstraint SPARSE --lambdaProduct 0.1 --kryo hdfs://localhost:8020/sandbox/movielens/
  2. Topic modeling using LSA
    References:
    2007 Sparse coding: papers.nips.cc/paper/2979-efficient-sparse-coding-algorithms.pdf
    2011 Sparse Latent Semantic Analysis LSA(some of it is implemented in Graphlab):
    https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf
    2012 Sparse Coding + MR/MPI Microsoft: http://web.stanford.edu/group/mmds/slides2012/s-hli.pdf
    Implementing the 20NG flow to validate the sparse coding result improvement over LDA based topic modeling.
  3. Topic modeling using PLSA
    Reference:
    Tutorial on Probabilistic Topic Modeling: Additive Regularization for Stochastic Matrix Factorization
    The EQUALITY formulation with a Quadratic loss is an approximation to the KL divergence loss being used in PLSA. We are interested to see if it improves the result further as compared to the Sparse coding.

Next steps:

  1. Improve the convergence rate of forward-backward splitting on quadratic problems
  2. Move the test-cases to QuadraticMinimizerSuite.scala
  3. Generate results for each of the use-cases and add tests related to each use-case

Related future PRs:

  1. Scale the factorization rank and remove the need to construct H matrix
  2. Replace the quadratic loss xtHx + ctx with a Convex loss

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@debasish83
Copy link
Author

@mengxr could you please take a first pass at it...I am focused at decreasing the iteration count of the proximal algorithm

@rezazadeh could you please see if the quadratic problem ideas mentioned in your paper (http://arxiv.org/pdf/1410.0342v1.pdf) are captured in this PR. We would like to integrate the convex loss at the earliest as some of our use-cases we would like to experiment with hinge/huber loss...

mengxr and others added 10 commits October 8, 2014 22:35
Got warning msg:

~~~
[warn] /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/feature/Normalizer.scala:50: method norm in trait NumericOps is deprecated: Use norm(XXX) instead of XXX.norm
[warn]     var norm = vector.toBreeze.norm(p)
~~~

dbtsai

Author: Xiangrui Meng <[email protected]>

Closes apache#2718 from mengxr/SPARK-3856 and squashes the following commits:

4f38169 [Xiangrui Meng] use norm operator
Upgrade to akka 2.3.4

Author: Anand Avati <[email protected]>

Closes apache#1685 from avati/SPARK-1812-akka-2.3 and squashes the following commits:

57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO
2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
Truncate appName in WebUI if it is too long.

Author: Xiangrui Meng <[email protected]>

Closes apache#2707 from mengxr/truncate-app-name and squashes the following commits:

87834ce [Xiangrui Meng] move scala import below java
c7111dc [Xiangrui Meng] truncate appName in WebUI if it is too long
It took me a minute to track this down, so I thought it could be useful to have it in the docs.

I'm unsure if 512mb is the default for spark.driver.memory? Also - there could be a better value for the 'description' to differentiate it from spark.executor.memory.

Author: nartz <[email protected]>
Author: Nathan Artz <[email protected]>

Closes apache#2410 from nartz/docs/add-spark-driver-memory-to-config-docs and squashes the following commits:

a2f6c62 [nartz] Update configuration.md
74521b8 [Nathan Artz] add spark.driver.memory to config docs
Currently, the implementation does one unnecessary aggregation step. The aggregation step for level L (to choose splits) gives enough information to set the predictions of any leaf nodes at level L+1. We can use that info and skip the aggregation step for the last level of the tree (which only has leaf nodes).

### Implementation Details

Each node now has a `impurity` field and the `predict` is changed from type `Double` to type `Predict`(this can be used to compute predict probability in the future) When compute best splits for each node, we also compute impurity and predict for the child nodes, which is used to constructed newly allocated child nodes. So at level L, we have set impurity and predict for nodes at level L +1.
If level L+1 is the last level, then we can avoid aggregation. What's more, calculation of parent impurity in

Top nodes for each tree needs to be treated differently because we have to compute impurity and predict for them first. In `binsToBestSplit`, if current node is top node(level == 0), we calculate impurity and predict first.
after finding best split, top node's predict and impurity is set to the calculated value. Non-top nodes's impurity and predict are already calculated and don't need to be recalculated again. I have considered to add a initialization step to set top nodes' impurity and predict and then we can treat all nodes in the same way, but this will need a lot of duplication of code(all the code to do seq operation(BinSeqOp) needs to be duplicated), so I choose the current way.

 CC mengxr manishamde jkbradley, please help me review this, thanks.

Author: Qiping Li <[email protected]>

Closes apache#2708 from chouqin/avoid-agg and squashes the following commits:

8e269ea [Qiping Li] adjust code and comments
eefeef1 [Qiping Li] adjust comments and check child nodes' impurity
c41b1b6 [Qiping Li] fix pyspark unit test
7ad7a71 [Qiping Li] fix unit test
822c912 [Qiping Li] add comments and unit test
e41d715 [Qiping Li] fix bug in test suite
6cc0333 [Qiping Li] SPARK-3158: Avoid 1 extra aggregation for DecisionTree training
cc mengxr

Author: GuoQiang Li <[email protected]>

Closes apache#2730 from witgo/SPARK-3856 and squashes the following commits:

2cffce1 [GuoQiang Li] use norm operator after breeze 0.10 upgrade
… mo...

...re logs to avoid Executors swallowing errors

This PR made the following changes:
* Register a callback to `Connection` so that the error will be propagated properly.
* Add more logs so that the errors won't be swallowed by Executors.
* Use trySuccess/tryFailure because `Promise` doesn't allow to call success/failure more than once.

Author: zsxwing <[email protected]>

Closes apache#2593 from zsxwing/SPARK-3741 and squashes the following commits:

1d5aed5 [zsxwing] Fix naming
0b8a61c [zsxwing] Merge branch 'master' into SPARK-3741
764aec5 [zsxwing] [SPARK-3741] Make ConnectionManager propagate errors properly and add more logs to avoid Executors swallowing errors
Author: Vida Ha <[email protected]>

Closes apache#2621 from vidaha/vida/SPARK-3752 and squashes the following commits:

d7fdbbc [Vida Ha] Add tests for different UDF's
The In case class is replaced by a InSet class in case all the filters are literals, which uses a hashset instead of Sequence, thereby giving significant performance improvement (earlier the seq was using a worst case linear match (exists method) since expressions were assumed in the filter list) . Maximum improvement should be visible in case small percentage of large data matches the filter list.

Author: Yash Datta <[email protected]>

Closes apache#2561 from saucam/branch-1.1 and squashes the following commits:

4bf2d19 [Yash Datta] SPARK-3711: 1. Fix code style and import order             2. Fix optimization condition             3. Add tests for null in filter list             4. Add test case that optimization is not triggered in case of attributes in filter list
afedbcd [Yash Datta] SPARK-3711: 1. Add test cases for InSet class in ExpressionEvaluationSuite             2. Add class OptimizedInSuite on the lines of ConstantFoldingSuite, for the optimized In clause
0fc902f [Yash Datta] SPARK-3711: UnaryMinus will be handled by constantFolding
bd84c67 [Yash Datta] SPARK-3711: Incorporate review comments. Move optimization of In clause to Optimizer.scala by adding a rule. Add appropriate comments
430f5d1 [Yash Datta] SPARK-3711: Optimize the filter list in case of negative values as well
bee98aa [Yash Datta] SPARK-3711: Optimize where in clause filter queries
To fix two issues in CliSuite
1 CliSuite throw IndexOutOfBoundsException:
Exception in thread "Thread-6" java.lang.IndexOutOfBoundsException: 6
	at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:43)
	at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
	at org.apache.spark.sql.hive.thriftserver.CliSuite.org$apache$spark$sql$hive$thriftserver$CliSuite$$captureOutput$1(CliSuite.scala:67)
	at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
	at org.apache.spark.sql.hive.thriftserver.CliSuite$$anonfun$4.apply(CliSuite.scala:78)
	at scala.sys.process.ProcessLogger$$anon$1.out(ProcessLogger.scala:96)
	at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
	at scala.sys.process.BasicIO$$anonfun$processOutFully$1.apply(BasicIO.scala:135)
	at scala.sys.process.BasicIO$.readFully$1(BasicIO.scala:175)
	at scala.sys.process.BasicIO$.processLinesFully(BasicIO.scala:179)
	at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:164)
	at scala.sys.process.BasicIO$$anonfun$processFully$1.apply(BasicIO.scala:162)
	at scala.sys.process.ProcessBuilderImpl$Simple$$anonfun$3.apply$mcV$sp(ProcessBuilderImpl.scala:73)
	at scala.sys.process.ProcessImpl$Spawn$$anon$1.run(ProcessImpl.scala:22)

Actually, it is the Mutil-Threads lead to this problem.

2 Using ```line.startsWith``` instead ```line.contains``` to assert expected answer. This is a tiny bug in CliSuite, for test case "Simple commands", there is a expected answers "5", if we use ```contains``` that means output like "14/10/06 11:```5```4:36 INFO CliDriver: Time taken: 1.078 seconds" or "14/10/06 11:54:36 INFO StatsReportListener: 	0%	```5```%	10%	25%	50%	75%	90%	95%	100%" will make the assert true.

Author: scwf <[email protected]>

Closes apache#2666 from scwf/clisuite and squashes the following commits:

11430db [scwf] fix-clisuite
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

cocoatomo and others added 13 commits October 9, 2014 13:46
…nit-tests.log

./python/run-tests script display messages about which test it is running currently on stdout but not write them on unit-tests.log.
It is harder for us to recognize what test programs were executed and which test was failed.

Author: cocoatomo <[email protected]>

Closes apache#2724 from cocoatomo/issues/3868-display-testing-module-name and squashes the following commits:

c63d9fa [cocoatomo] [SPARK-3868][PySpark] Hard to recognize which module is tested from unit-tests.log
In JSONRDD.scala, add 'case TimestampType' in the enforceCorrectType function and a toTimestamp function.

Author: Mike Timper <[email protected]>

Closes apache#2720 from mtimper/master and squashes the following commits:

9386ab8 [Mike Timper] Fix and tests for SPARK-3853
This PR aims to provide a way to skip/query corrupt JSON records. To do so, we introduce an internal column to hold corrupt records (the default name is `_corrupt_record`. This name can be changed by setting the value of `spark.sql.columnNameOfCorruptRecord`). When there is a parsing error, we will put the corrupt record in its unparsed format to the internal column. Users can skip/query this column through SQL.

* To query those corrupt records
```
-- For Hive parser
SELECT `_corrupt_record`
FROM jsonTable
WHERE `_corrupt_record` IS NOT NULL
-- For our SQL parser
SELECT _corrupt_record
FROM jsonTable
WHERE _corrupt_record IS NOT NULL
```
* To skip corrupt records and query regular records
```
-- For Hive parser
SELECT field1, field2
FROM jsonTable
WHERE `_corrupt_record` IS NULL
-- For our SQL parser
SELECT field1, field2
FROM jsonTable
WHERE _corrupt_record IS NULL
```

Generally, it is not recommended to change the name of the internal column. If the name has to be changed to avoid possible name conflicts, you can use `sqlContext.setConf(SQLConf.COLUMN_NAME_OF_CORRUPT_RECORD, <new column name>)` or `sqlContext.sql(SET spark.sql.columnNameOfCorruptRecord=<new column name>)`.

Author: Yin Huai <[email protected]>

Closes apache#2680 from yhuai/corruptJsonRecord and squashes the following commits:

4c9828e [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord
309616a [Yin Huai] Change the default name of corrupt record to "_corrupt_record".
b4a3632 [Yin Huai] Merge remote-tracking branch 'upstream/master' into corruptJsonRecord
9375ae9 [Yin Huai] Set the column name of corrupt json record back to the default one after the unit test.
ee584c0 [Yin Huai] Provide a way to query corrupt json records as unparsed strings.
chenghao-intel assigned this to me, check PR apache#2284 for previous discussion

Author: Daoyuan Wang <[email protected]>

Closes apache#2529 from adrian-wang/rowapi and squashes the following commits:

c6594b2 [Daoyuan Wang] using boxed
7b7e6e3 [Daoyuan Wang] update pattern match
7a39456 [Daoyuan Wang] rename file and refresh getAs[T]
4c18c29 [Daoyuan Wang] remove setAs[T] and null judge
1614493 [Daoyuan Wang] add missing row api
The alias parameter is being ignored, which makes it more difficult to specify a qualifier for Generator expressions.

Author: Nathan Howell <[email protected]>

Closes apache#2721 from NathanHowell/SPARK-3858 and squashes the following commits:

8aa0f43 [Nathan Howell] [SPARK-3858][SQL] Pass the generator alias into logical plan node
…SQL.

"case when" conditional function is already supported in Spark SQL but there is no support in SqlParser. So added parser support to it.

Author : ravipesala ravindra.pesalahuawei.com

Author: ravipesala <[email protected]>

Closes apache#2678 from ravipesala/SPARK-3813 and squashes the following commits:

70c75a7 [ravipesala] Fixed styles
713ea84 [ravipesala] Updated as per admin comments
709684f [ravipesala] Changed parser to support case when function.
…upport improvements:

This pull request addresses a few issues related to PySpark's IPython support:

- Fix the remaining uses of the '-u' flag, which IPython doesn't support (see SPARK-3772).
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old name is reserved in case we ever want to allow the worker Python options to be customized (this variable was introduced in apache#2554 and hasn't landed in a release yet, so this doesn't break any compatibility).
- Introduce a PYSPARK_DRIVER_PYTHON option that allows the driver to use `ipython` while the workers use a different Python version.
- Attempt to use Python 2.7 by default if PYSPARK_PYTHON is not specified.
- Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid breaking existing example programs).

There are more details in a block comment in `bin/pyspark`.

Author: Josh Rosen <[email protected]>

Closes apache#2651 from JoshRosen/SPARK-3772 and squashes the following commits:

7b8eb86 [Josh Rosen] More changes to PySpark python executable configuration:
c4f5778 [Josh Rosen] [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes:
This prevents it from changing during serialization, leading to corrupted results.

Author: Michael Armbrust <[email protected]>

Closes apache#2656 from marmbrus/generateBug and squashes the following commits:

efa32eb [Michael Armbrust] Store the output of a generator in a val. This prevents it from changing during serialization.
…ls.createTempDir

I noticed a few issues with how temp directories are created and deleted:

*Minor*

* Guava's `Files.createTempDir()` plus `File.deleteOnExit()` is used in many tests to make a temp dir, but `Utils.createTempDir()` seems to be the standard Spark mechanism
* Call to `File.deleteOnExit()` could be pushed into `Utils.createTempDir()` as well, along with this replacement
* _I messed up the message in an exception in `Utils` in SPARK-3794; fixed here_

*Bit Less Minor*

* `Utils.deleteRecursively()` fails immediately if any `IOException` occurs, instead of trying to delete any remaining files and subdirectories. I've observed this leave temp dirs around. I suggest changing it to continue in the face of an exception and throw one of the possibly several exceptions that occur at the end.
* `Utils.createTempDir()` will add a JVM shutdown hook every time the method is called. Even if the subdir is the parent of another parent dir, since this check is inside the hook. However `Utils` manages a set of all dirs to delete on shutdown already, called `shutdownDeletePaths`. A single hook can be registered to delete all of these on exit. This is how Tachyon temp paths are cleaned up in `TachyonBlockManager`.

I noticed a few other things that might be changed but wanted to ask first:

* Shouldn't the set of dirs to delete be `File`, not just `String` paths?
* `Utils` manages the set of `TachyonFile` that have been registered for deletion, but the shutdown hook is managed in `TachyonBlockManager`. Should this logic not live together, and not in `Utils`? it's more specific to Tachyon, and looks a slight bit odd to import in such a generic place.

Author: Sean Owen <[email protected]>

Closes apache#2670 from srowen/SPARK-3811 and squashes the following commits:

071ae60 [Sean Owen] Update per @vanzin's review
da0146d [Sean Owen] Make Utils.deleteRecursively try to delete all paths even when an exception occurs; use one shutdown hook instead of one per method call to delete temp dirs
3a0faa4 [Sean Owen] Standardize on Utils.createTempDir instead of Files.createTempDir
This PR is a follow up of apache#2590, and tries to introduce a top level SQL parser entry point for all SQL dialects supported by Spark SQL.

A top level parser `SparkSQLParser` is introduced to handle the syntaxes that all SQL dialects should recognize (e.g. `CACHE TABLE`, `UNCACHE TABLE` and `SET`, etc.). For all the syntaxes this parser doesn't recognize directly, it fallbacks to a specified function that tries to parse arbitrary input to a `LogicalPlan`. This function is typically another parser combinator like `SqlParser`. DDL syntaxes introduced in apache#2475 can be moved to here.

The `ExtendedHiveQlParser` now only handle Hive specific extensions.

Also took the chance to refactor/reformat `SqlParser` for better readability.

Author: Cheng Lian <[email protected]>

Closes apache#2698 from liancheng/gen-sql-parser and squashes the following commits:

ceada76 [Cheng Lian] Minor styling fixes
9738934 [Cheng Lian] Minor refactoring, removes optional trailing ";" in the parser
bb2ab12 [Cheng Lian] SET property value can be empty string
ce8860b [Cheng Lian] Passes test suites
e86968e [Cheng Lian] Removes debugging code
8bcace5 [Cheng Lian] Replaces digit.+ to rep1(digit) (Scala style checking doesn't like it)
d15d54f [Cheng Lian] Unifies SQL and HiveQL parsers
…Y_AND_DISK

Using `MEMORY_AND_DISK` as default storage level for in-memory table caching. Due to the in-memory columnar representation, recomputing an in-memory cached table partitions can be very expensive.

Author: Cheng Lian <[email protected]>

Closes apache#2686 from liancheng/spark-3824 and squashes the following commits:

35d2ed0 [Cheng Lian] Removes extra space
1ab7967 [Cheng Lian] Reduces test data size to fit DiskStore.getBytes()
ba565f0 [Cheng Lian] Maks CachedBatch serializable
07f0204 [Cheng Lian] Sets in-memory table default storage level to MEMORY_AND_DISK
The queries like SELECT a.key FROM (SELECT key FROM src) \`a\` does not work as backticks in subquery aliases are not handled properly. This PR fixes that.

Author : ravipesala ravindra.pesalahuawei.com

Author: ravipesala <[email protected]>

Closes apache#2737 from ravipesala/SPARK-3834 and squashes the following commits:

0e0ab98 [ravipesala] Fixing issue in backtick handling for subquery aliases
@rchanda
Copy link

rchanda commented Oct 10, 2014

Related Future PRs: In mlib QP Solver (QuadraticMinimizer.scala)
3.Replace dense gram matrix to sparse. (Move from jblas dense matrix to breeze sparse matrix).
4.Also add inequality constraints to the QP, now it only has equality constraints.

avati and others added 3 commits October 10, 2014 00:46
This is a second rev of the Akka upgrade (earlier merged, but reverted). I made a slight modification which is that I also upgrade Hive to deal with a compatibility issue related to the protocol buffers library.

Author: Anand Avati <[email protected]>
Author: Patrick Wendell <[email protected]>

Closes apache#2752 from pwendell/akka-upgrade and squashes the following commits:

4c7ca3f [Patrick Wendell] Upgrading to new hive->protobuf version
57a2315 [Anand Avati] SPARK-1812: streaming - remove tests which depend on akka.actor.IO
2a551d3 [Anand Avati] SPARK-1812: core - upgrade to akka 2.3.4
…ionManager

In general, individual shuffle blocks are frequently small, so mmapping them often creates a lot of waste. It may not be bad to mmap the larger ones, but it is pretty inconvenient to get configuration into ManagedBuffer, and besides it is unlikely to help all that much.

Author: Aaron Davidson <[email protected]>

Closes apache#2742 from aarondav/mmap and squashes the following commits:

a152065 [Aaron Davidson] Add other pathway back
52b6cd2 [Aaron Davidson] [SPARK-3889] Attempt to avoid SIGBUS by not mmapping files in ConnectionManager
Use AutoBatchedSerializer by default, which will choose the proper batch size based on size of serialized objects, let the size of serialized batch fall in into  [64k - 640k].

In JVM, the serializer will also track the objects in batch to figure out duplicated objects, larger batch may cause OOM in JVM.

Author: Davies Liu <[email protected]>

Closes apache#2740 from davies/batchsize and squashes the following commits:

52cdb88 [Davies Liu] update docs
185f2b9 [Davies Liu] use AutoBatchedSerializer by default
gvramana and others added 12 commits October 31, 2014 11:30
…stamp values

In org.apache.hadoop.hive.serde2.io.TimestampWritable.set , if the next entry is null then current time stamp object is being reset.
However because of this hiveinspectors:unwrap cannot use the same timestamp object without creating a copy.

Author: Venkata Ramana G <ramana.gollamudihuawei.com>

Author: Venkata Ramana Gollamudi <[email protected]>

Closes apache#3019 from gvramana/spark_4077 and squashes the following commits:

32d818f [Venkata Ramana Gollamudi] fixed check style
fa01e71 [Venkata Ramana Gollamudi] cloned timestamp object as org.apache.hadoop.hive.serde2.io.TimestampWritable.set will reset current time object
…rk SQL and HQL

if the query contains "not between" does not work like.
SELECT * FROM src where key not between 10 and 20'

Author: ravipesala <[email protected]>

Closes apache#3017 from ravipesala/SPARK-4154 and squashes the following commits:

65fc89e [ravipesala] Handled admin comments
32e6d42 [ravipesala] 'not between' is not working
This PR adds support for the `ADD FILE` Hive command, and removes `ShellCommand` and `SourceCommand`. The reason is described in [this SPARK-2220 comment](https://issues.apache.org/jira/browse/SPARK-2220?focusedCommentId=14191841&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14191841).

Author: Cheng Lian <[email protected]>

Closes apache#3038 from liancheng/hive-commands and squashes the following commits:

6db61e0 [Cheng Lian] Fixes remaining Hive commands
…ors exist

WebUI

Author: Mark Mims <[email protected]>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <[email protected]>

Closes apache#3031 from mmm/remove-accumulators-col and squashes the following commits:

6141cb3 [Mark Mims] reformat to satisfy scalastyle linelength.  build failed from jenkins https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22604/
390893b [Mark Mims] cleanup
c28c449 [Mark Mims] looking much better now... minimal explicit formatting.  Now, see if any sort keys make sense
fb72156 [Mark Mims] mimic hasInput.  The basics work here, but wanna clean this up with maybeAccumulators for column content
Then we can do `rdd.setName('abc').cache().count()`.

Author: Xiangrui Meng <[email protected]>

Closes apache#3011 from mengxr/rdd-setname and squashes the following commits:

10d0d60 [Xiangrui Meng] update test
4ac3bbd [Xiangrui Meng] return self in rdd.setName
We have shell scripts and Windows batch files, so we should enforce proper EOL character.

Author: Kousuke Saruta <[email protected]>

Closes apache#2726 from sarutak/eol-enforcement and squashes the following commits:

9748c3f [Kousuke Saruta] Fixed make.bat
252de89 [Kousuke Saruta] Removed extra characters from make.bat
5b81c00 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
8633ed2 [Kousuke Saruta] merge branch 'master' of git://git.apache.org/spark into eol-enforcement
5d630d8 [Kousuke Saruta] Merged
ba10797 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
7407515 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
772fd4e [Kousuke Saruta] Normized EOL character in make.bat and compute-classpath.cmd
ac7f873 [Kousuke Saruta] Added an entry for .gitattributes to .rat-excludes
1570e77 [Kousuke Saruta] Added .gitattributes
This is caused by this commit: acd4ac7

Author: andrewor14 <[email protected]>
Author: Andrew Or <[email protected]>

Closes apache#3041 from andrewor14/yarn-hot-fix and squashes the following commits:

e5deba1 [andrewor14] Add new line at the end (minor)
aa998e8 [Andrew Or] Compilation hot fix
Author: Sandy Ryza <[email protected]>

Closes apache#3043 from sryza/sandy-spark-4175 and squashes the following commits:

e327340 [Sandy Ryza] SPARK-4175. Exception on stage page
Implementation of various multi-label classification measures, including: Hamming-loss, strict and default Accuracy, macro-averaged Precision, Recall and F1-measure based on documents and labels, micro-averaged measures: https://issues.apache.org/jira/browse/SPARK-2329

Multi-class measures are currently in the following pull request: apache#1155

Author: Alexander Ulanov <[email protected]>
Author: avulanov <[email protected]>

Closes apache#1270 from avulanov/multilabelmetrics and squashes the following commits:

fc8175e [Alexander Ulanov] Merge with previous updates
43a613e [Alexander Ulanov] Addressing reviewers comments: change Set to Array
517a594 [avulanov] Addressing reviewers comments: Scala style
cf4222b [avulanov] Addressing reviewers comments: renaming. Added label method that returns the list of labels
1843f73 [Alexander Ulanov] Scala style fix
79e8476 [Alexander Ulanov] Replacing fold(_ + _) with sum as suggested by srowen
ca46765 [Alexander Ulanov] Cosmetic changes: Apache header and parameter explanation
40593f5 [Alexander Ulanov] Multi-label metrics: Hamming-loss, strict and normal accuracy, fix to macro measures, bunch of tests
ad62df0 [Alexander Ulanov] Comments and scala style check
154164b [Alexander Ulanov] Multilabel evaluation metics and tests: macro precision and recall averaged by docs, micro and per-class precision and recall averaged by class
This pull request refers to issue: https://issues.apache.org/jira/browse/SPARK-3838

Python example for word2vec
mengxr

Author: Anant <[email protected]>

Closes apache#2952 from anantasty/SPARK-3838 and squashes the following commits:

87bd723 [Anant] remove stop line
4bd439e [Anant] Changes as per code review. Fized error in word2vec python example, simplified example in docs.
3d3c9ee [Anant] Added empty line after python imports
0c90c31 [Anant] Fixed erroneous code. I was still treating each line to be a single word instead of 16 words
ee4f5f6 [Anant] Fixes from code review comments
c637bcf [Anant] Added word2vec python example to docs
269f31f [Anant] added example in docs
c015b14 [Anant] Added python example for word2vec
Given the popular demand for gradient boosting and AdaBoost in MLlib, I am creating a WIP branch for early feedback on gradient boosting with AdaBoost to follow soon after this PR is accepted. This is based on work done along with hirakendu that was pending due to decision tree optimizations and random forests work.

Ideally, boosting algorithms should work with any base learners.  This will soon be possible once the MLlib API is finalized -- we want to ensure we use a consistent interface for the underlying base learners. In the meantime, this PR uses decision trees as base learners for the gradient boosting algorithm. The current PR allows "pluggable" loss functions and provides least squares error and least absolute error by default.

Here is the task list:
- [x] Gradient boosting support
- [x] Pluggable loss functions
- [x] Stochastic gradient boosting support – Re-use the BaggedPoint approach used for RandomForest.
- [x] Binary classification support
- [x] Support configurable checkpointing – This approach will avoid long lineage chains.
- [x] Create classification and regression APIs
- [x] Weighted Ensemble Model -- created a WeightedEnsembleModel class that can be used by ensemble algorithms such as random forests and boosting.
- [x] Unit Tests

Future work:
+ Multi-class classification is currently not supported by this PR since it requires discussion on the best way to support "deviance" as a loss function.
+ BaggedRDD caching -- Avoid repeating feature to bin mapping for each tree estimator after standard API work is completed.

cc: jkbradley hirakendu mengxr etrain atalwalkar chouqin

Author: Manish Amde <[email protected]>
Author: manishamde <[email protected]>

Closes apache#2607 from manishamde/gbt and squashes the following commits:

991c7b5 [Manish Amde] public api
ff2a796 [Manish Amde] addressing comments
b4c1318 [Manish Amde] removing spaces
8476b6b [Manish Amde] fixing line length
0183cb9 [Manish Amde] fixed naming and formatting issues
1c40c33 [Manish Amde] add newline, removed spaces
e33ab61 [Manish Amde] minor comment
eadbf09 [Manish Amde] parameter renaming
035a2ed [Manish Amde] jkbradley formatting suggestions
9f7359d [Manish Amde] simplified gbt logic and added more tests
49ba107 [Manish Amde] merged from master
eff21fe [Manish Amde] Added gradient boosting tests
3fd0528 [Manish Amde] moved helper methods to new class
a32a5ab [Manish Amde] added test for subsampling without replacement
781542a [Manish Amde] added support for fractional subsampling with replacement
3a18cc1 [Manish Amde] cleaned up api for conversion to bagged point and moved tests to it's own test suite
0e81906 [Manish Amde] improving caching unpersisting logic
d971f73 [Manish Amde] moved RF code to use WeightedEnsembleModel class
fee06d3 [Manish Amde] added weighted ensemble model
1b01943 [Manish Amde] add weights for base learners
9bc6e74 [Manish Amde] adding random seed as parameter
d2c8323 [Manish Amde] Merge branch 'master' into gbt
2ae97b7 [Manish Amde] added documentation for the loss classes
9366b8f [Manish Amde] minor: using numTrees instead of trees.size
3b43896 [Manish Amde] added learning rate for prediction
9b2e35e [Manish Amde] Merge branch 'master' into gbt
6a11c02 [manishamde] fixing formatting
823691b [Manish Amde] fixing RF test
1f47941 [Manish Amde] changing access modifier
5b67102 [Manish Amde] shortened parameter list
5ab3796 [Manish Amde] minor reformatting
9155a9d [Manish Amde] consolidated boosting configuration and added public API
631baea [Manish Amde] Merge branch 'master' into gbt
2cb1258 [Manish Amde] public API support
3b8ffc0 [Manish Amde] added documentation
8e10c63 [Manish Amde] modified unpersist strategy
f62bc48 [Manish Amde] added unpersist
bdca43a [Manish Amde] added timing parameters
2fbc9c7 [Manish Amde] fixing binomial classification prediction
6dd4dd8 [Manish Amde] added support for log loss
9af0231 [Manish Amde] classification attempt
62cc000 [Manish Amde] basic checkpointing
4784091 [Manish Amde] formatting
78ed452 [Manish Amde] added newline and fixed if statement
3973dd1 [Manish Amde] minor indicating subsample is double during comparison
aa8fae7 [Manish Amde] minor refactoring
1a8031c [Manish Amde] sampling with replacement
f1c9ef7 [Manish Amde] Merge branch 'master' into gbt
cdceeef [Manish Amde] added documentation
6251fd5 [Manish Amde] modified method name
5538521 [Manish Amde] disable checkpointing for now
0ae1c0a [Manish Amde] basic gradient boosting code from earlier branches
freeman-lab and others added 9 commits October 31, 2014 22:30
This adds a Streaming KMeans algorithm to MLlib. It uses an update rule that generalizes the mini-batch KMeans update to incorporate a decay factor, which allows past data to be forgotten. The decay factor can be specified explicitly, or via a more intuitive "fractional decay" setting, in units of either data points or batches.

The PR includes:
- StreamingKMeans algorithm with decay factor settings
- Usage example
- Additions to documentation clustering page
- Unit tests of basic behavior and decay behaviors

tdas mengxr rezazadeh

Author: freeman <[email protected]>
Author: Jeremy Freeman <[email protected]>
Author: Xiangrui Meng <[email protected]>

Closes apache#2942 from freeman-lab/streaming-kmeans and squashes the following commits:

b2e5b4a [freeman] Fixes to docs / examples
078617c [Jeremy Freeman] Merge pull request apache#1 from mengxr/SPARK-3254
2e682c0 [Xiangrui Meng] take discount on previous weights; use BLAS; detect dying clusters
0411bf5 [freeman] Change decay parameterization
9f7aea9 [freeman] Style fixes
374a706 [freeman] Formatting
ad9bdc2 [freeman] Use labeled points and predictOnValues in examples
77dbd3f [freeman] Make initialization check an assertion
9cfc301 [freeman] Make random seed an argument
44050a9 [freeman] Simpler constructor
c7050d5 [freeman] Fix spacing
2899623 [freeman] Use pattern matching for clarity
a4a316b [freeman] Use collect
1472ec5 [freeman] Doc formatting
ea22ec8 [freeman] Fix imports
2086bdc [freeman] Log cluster center updates
ea9877c [freeman] More documentation
9facbe3 [freeman] Bug fix
5db7074 [freeman] Example usage for StreamingKMeans
f33684b [freeman] Add explanation and example to docs
b5b5f8d [freeman] Add better documentation
a0fd790 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
9fd9c15 [freeman] Merge remote-tracking branch 'upstream/master' into streaming-kmeans
b93350f [freeman] Streaming KMeans with decay
I recommend upgrading roaring to 0.4.5 as it fixes a rarely occurring bug in iterators (that would otherwise throw an unwarranted exception). The upgrade should have no other consequence.

Author: Daniel Lemire <[email protected]>

Closes apache#3044 from lemire/master and squashes the following commits:

54018c5 [Daniel Lemire] Recommended update to roaring 0.4.5 (bug fix release)
048933e [Daniel Lemire] Merge remote-tracking branch 'upstream/master'
431f3a0 [Daniel Lemire] Recommended bug fix release
Changing the default number of edge partitions to match spark parallelism.

Author: Joseph E. Gonzalez <[email protected]>

Closes apache#3006 from jegonzal/default_partitions and squashes the following commits:

a9a5c4f [Joseph E. Gonzalez] Changing the default number of edge partitions to match spark parallelism
Accumulate sizes of all the EdgePartitions just like the VertexRDD.

Author: luluorta <[email protected]>

Closes apache#2975 from luluorta/graph-edge-count and squashes the following commits:

86ef0e5 [luluorta] Add overrided count for edge counting of EdgeRDD.
@debasish83 debasish83 closed this Nov 12, 2014
@debasish83 debasish83 reopened this Nov 12, 2014
@debasish83 debasish83 closed this Nov 12, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.