
Conversation

@mengxr (Contributor) commented Mar 17, 2015

Add checkpointInterval to ALS to prevent:

  1. StackOverflowError caused by long lineage,
  2. large shuffle files generated during iterations,
  3. slow recovery when some nodes fail.

@srowen @coderxiang
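The fix works by cutting the RDD lineage every checkpointInterval iterations, so the DAG cannot grow without bound. A minimal plain-Python sketch of the schedule (not Spark code; the interval value and function name are illustrative):

```python
def checkpoint_iterations(num_iterations, checkpoint_interval):
    """Iterations at which a checkpoint would be taken, truncating lineage."""
    return [it for it in range(1, num_iterations + 1)
            if it % checkpoint_interval == 0]

# With 10 iterations and an interval of 3, lineage is cut at iterations 3, 6, 9;
# between checkpoints the lineage depth stays bounded by the interval.
print(checkpoint_iterations(10, 3))  # [3, 6, 9]
```

In real Spark the checkpoint also replaces the in-memory shuffle dependencies, which is why it additionally bounds shuffle-file growth and recovery time.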

@SparkQA commented Mar 17, 2015

Test build #28739 has finished for PR 5076 at commit 20d3f7f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KMeansModel(Saveable, Loader):

@coderxiang (Contributor)

I've seen the first point before and thus I'm +1 for this change.

Contributor

I kind of forget how the checkpoint gets executed here. Is this count necessary? Or is this for caching?

Contributor Author

Ah, for implicit preference, this is not necessary because we are computing YtY anyway.
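For context on that remark: in the implicit-preference formulation, ALS computes the Gram matrix YtY (the sum of outer products of the factor rows) every iteration, and doing so already touches every factor row, so an extra count() just to materialize the RDD is redundant. A plain-Python sketch of the Gram computation (shapes and names are illustrative, not the actual Spark implementation):

```python
def compute_yty(factors, rank):
    """Y^T * Y as the sum of rank x rank outer products of the factor rows."""
    yty = [[0.0] * rank for _ in range(rank)]
    for row in factors:
        for i in range(rank):
            for j in range(rank):
                yty[i][j] += row[i] * row[j]
    return yty

# Two rank-2 factor rows; iterating over every row is what forces
# materialization as a side effect.
print(compute_yty([[1.0, 2.0], [3.0, 4.0]], 2))  # [[10.0, 14.0], [14.0, 20.0]]
```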

@SparkQA commented Mar 18, 2015

Test build #28801 has finished for PR 5076 at commit 29affcb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor, Author) commented Mar 18, 2015

test this please

@SparkQA commented Mar 18, 2015

Test build #28820 has finished for PR 5076 at commit 29affcb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 20, 2015

Test build #28903 has finished for PR 5076 at commit df56791.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor, Author) commented Mar 20, 2015

test this please

@SparkQA commented Mar 20, 2015

Test build #28908 has finished for PR 5076 at commit df56791.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@coderxiang (Contributor)

LGTM!

@mengxr (Contributor, Author) commented Mar 20, 2015

Thanks! Merged into master.

@asfgit asfgit closed this in 6b36470 Mar 20, 2015
asfgit pushed a commit that referenced this pull request Mar 24, 2015
Add checkpointInterval to ALS to prevent:

1. StackOverflowError caused by long lineage,
2. large shuffle files generated during iterations,
3. slow recovery when some nodes fail.

srowen coderxiang

Author: Xiangrui Meng <[email protected]>

Closes #5076 from mengxr/SPARK-5955 and squashes the following commits:

df56791 [Xiangrui Meng] update impl to reuse code
29affcb [Xiangrui Meng] do not materialize factors in implicit
20d3f7f [Xiangrui Meng] add checkpointInterval to ALS

(cherry picked from commit 6b36470)
Signed-off-by: Xiangrui Meng <[email protected]>

Conflicts:
	mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala
@mengxr (Contributor, Author) commented Mar 24, 2015

Merged this into branch-1.3 as well because this helps with scalability.

@aremirata

Hi guys,

First of all, I would like to thank you for developing Spark and making it open source so that we can use it. I'm new to Spark and Scala and am working on a project involving matrix factorization in Spark. I have a problem running ALS in Spark: it throws a StackOverflowError due to a long lineage chain, according to comments I found online. One suggestion is to use setCheckpointInterval so that the RDDs are checkpointed every 10-20 iterations, which prevents the error. I just want to ask for details on how to do checkpointing with ALS. I am using the spark-kernel developed by IBM (https://github.com/ibm-et/spark-kernel) instead of spark-shell.

Here are some of my specific questions regarding details on checkpoint:

  1. Setting the checkpoint directory through SparkContext.setCheckpointDir() requires a Hadoop-compatible directory. Can we use any available HDFS-compatible directory?
  2. What does this comment in the ALS checkpointing code mean: "If the checkpoint directory is not set in [[org.apache.spark.SparkContext]], this setting is ignored."?
  3. Is calling setCheckpointInterval the only code I need to add to make checkpointing work for ALS?
  4. I am getting this error: Name: java.lang.IllegalArgumentException, Message: Wrong FS: expected file:///. How can I solve this? What is the proper way of using checkpointing?

Thanks a lot!
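Regarding question 4 above: a "Wrong FS" error from Hadoop generally means the scheme of the checkpoint path does not match the scheme of the cluster's default filesystem (for example, an hdfs:// path when the default filesystem is the local file:/// one, or vice versa); the usual remedy is to pass setCheckpointDir a fully qualified URI on the default filesystem. A plain-Python sketch of the scheme check (simulating the idea, not the actual Hadoop FileSystem code; the hostname and paths are hypothetical):

```python
from urllib.parse import urlparse

def scheme_matches(default_fs_uri, path_uri):
    """True when the path's scheme matches the default filesystem's scheme.

    Hadoop's FileSystem path check fails with "Wrong FS" when they differ.
    """
    expected = urlparse(default_fs_uri).scheme or "file"
    # a path with no scheme inherits the default filesystem's scheme
    actual = urlparse(path_uri).scheme or expected
    return expected == actual

# A local default FS rejects an hdfs:// checkpoint dir, and vice versa;
# a scheme-less path is resolved against the default FS and is accepted.
print(scheme_matches("file:///", "hdfs://namenode:8020/ckpt"))  # False
print(scheme_matches("hdfs://namenode:8020", "/user/ckpt"))     # True
```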
