-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-14661] [MLlib] trim PCAModel by required explained variance #12419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm no expert in the ML domain, but from a user perspective, this breaks API backwards compatibility.
An alternative could be to create a new method and factor out common behaviour shared with the current
computePrincipalComponentsAndExplainedVarianceinto a private utility method.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sethah @jodersky It looks like the comment Since("1.6.0") is false becaue this method is not available in spark 1.6 - this change was merged to master instead of 1.6 branch. Do you still consider this change as API breaking given that it modifies API that wasn't yet released? If yes then I'll do as @jodersky said and introduce a new method and move common code to a new private one. I'd really like to have this feature in MLlib version because I use it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about the breakage, nevertheless I would recommend implementing a new method regardless. I find the method's parameter type
Either[Int, Double]to be quite confusing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah having one method to mean two things using an
Eitheris too strange. At least, you would provide two overloads. And then, no reason to overload versus given them distinct and descriptive names.I don't understand the question about unreleased APIs -- 1.6.0 was released a while ago and this method takes an Int parameter there. We certainly want to keep the ability to set a fixed number of principal components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is RowMatrix as in 1.6.1 release: https://github.com/apache/spark/blob/15de51c238a7340fa81cb0b80d029a05d97bfc5c/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala am I correct? If yes then can you find there a method named computePrincipalComponentsAndExplainedVariance? I can't, yet on master it is annotated with Since("1.6.0") - isn't it false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha you're right, it wasn't in 1.6. This is my fault: 21b3d2a
It never was added to branch 1.6, despite the apparent intention. At this point I think it should be considered 2.0+ and you can fix that annotation here. So yeah this method was never 'released'. Still I think we want to do something different with the argument anyway.