-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-15009][PYTHON][ML] Construct a CountVectorizerModel from a vocabulary list #16770
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-15009][PYTHON][ML] Construct a CountVectorizerModel from a vocabulary list #16770
Conversation
|
This is currently not working because of The correct way to fix this is to change as was done in #14653 from SPARK-10931 PySpark ML Models should contain Param values. I would like to take over SPARK-10931 and simplify it to just include the above fix to |
|
Test build #72257 has finished for PR 16770 at commit
|
|
Test build #75771 has finished for PR 16770 at commit
|
…yet working because missing param _copyValues from estimator to model
da65f4b to
e94dde3
Compare
|
ping @holdenk |
|
Awesome! So if folks are OK with this I'm going to save the review for this Friday during the live code review ( see https://www.youtube.com/watch?v=lugG_2QU6YU ). The review comments will of course end up on the PR so don't feel like you have to tune in. |
|
Test build #87986 has finished for PR 16770 at commit
|
|
Test build #87988 has finished for PR 16770 at commit
|
holdenk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @BryanCutler for the work on this and waiting such a very long time for review. I've got a few questions and a small change suggestion I'd love to see. Feel free to ping me when this PR is ready for review again.
| model.setMinTF(minTF) | ||
| if binary is not None: | ||
| model.setBinary(binary) | ||
| model._set(vocabSize=len(vocabulary)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for _set rather than set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only difference is set checks to make sure the param is valid, which isn't really needed since this is internal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
| return self._call_java("vocabulary") | ||
|
|
||
| @since("2.4.0") | ||
| def setMinTF(self, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to have the setters in both the model and the estimator maybe we should consider putting it in the shared params class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree but I was trying to match the Scala API. My only thought is it was done this way to leave it up to the implementations if they allow setting the params. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable to me.
| self.assertEqual(feature, expected) | ||
|
|
||
| def test_count_vectorizer_from_vocab(self): | ||
| model = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good first test, I'd love to also see it with empty vocab, and also one that uses the default values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good idea, I'll add those
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
| for name, cls in inspect.getmembers(module, inspect.isclass): | ||
| if not name.endswith('Model') and issubclass(cls, JavaParams)\ | ||
| and not inspect.isabstract(cls): | ||
| if not name.endswith('Model') and not name.endswith('Params')\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I've understood whats happening here, were avoiding doing the default params test on non-concrete classes like the base params shared between the model and the estimator and instead testing just the model and estimator on their own right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's pretty much right but this is only checking estimators and skips models also. We should have an explicit check for CountVectorizer.from_vocabulary here too since that is possible. Unfortunately, a new param maxDF was added to Scala recently and the param check will fail. Once that is in Python, we can add the check for it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable. I look forward to us automatically catching models with missing params eventually as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm helping get maxDF in python now, so after that's done I'll make a followup to add this
python/pyspark/ml/feature.py
Outdated
| >>> loadedModel = CountVectorizerModel.load(modelPath) | ||
| >>> loadedModel.vocabulary == model.vocabulary | ||
| True | ||
| >>> fromVocabModel = CountVectorizerModel.from_vocabulary(model.vocabulary, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be better with an explicit manual array rather than model.vocabulary to show folks how to expect to use it? What are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, totally agree let me change it
|
Thanks for the review @holdenk and for doing the livestream, hopefully it was helpful to folks! I'll make some updates and ping when ready. |
…ing default params
| # Test model with default settings can transform | ||
| model_default = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words") | ||
| transformed_list = model_default.transform(dataset).collect() | ||
| self.assertEqual(len(transformed_list), 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doctest uses default values for all params except outputCol and checks the transformed values, so this is really just testing that nothing fails if all param default values are used including outputCol
| model_default = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words") | ||
| transformed_list = model_default.transform(dataset)\ | ||
| .select(model_default.getOrDefault(model_default.outputCol)).collect() | ||
| self.assertEqual(len(transformed_list), 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doctest uses default values for all params except outputCol and checks the transformed values, so this is really just testing that nothing fails if all param default values are used including outputCol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
|
Test build #88238 has finished for PR 16770 at commit
|
|
Test build #88239 has finished for PR 16770 at commit
|
holdenk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. LGTM
| model.setMinTF(minTF) | ||
| if binary is not None: | ||
| model.setBinary(binary) | ||
| model._set(vocabSize=len(vocabulary)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
| return self._call_java("vocabulary") | ||
|
|
||
| @since("2.4.0") | ||
| def setMinTF(self, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable to me.
| model_default = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words") | ||
| transformed_list = model_default.transform(dataset)\ | ||
| .select(model_default.getOrDefault(model_default.outputCol)).collect() | ||
| self.assertEqual(len(transformed_list), 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sgtm
| for name, cls in inspect.getmembers(module, inspect.isclass): | ||
| if not name.endswith('Model') and issubclass(cls, JavaParams)\ | ||
| and not inspect.isabstract(cls): | ||
| if not name.endswith('Model') and not name.endswith('Params')\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable. I look forward to us automatically catching models with missing params eventually as well.
|
Looks good to me, merged to master. Thanks! |
|
Thanks @holdenk! |
…abulary list ## What changes were proposed in this pull request? Added a class method to construct CountVectorizerModel from a list of vocabulary strings, equivalent to the Scala version. Introduced a common param base class `_CountVectorizerParams` to allow the Python model to also own the parameters. This now matches the Scala class hierarchy. ## How was this patch tested? Added to CountVectorizer doctests to do a transform on a model constructed from vocab, and unit test to verify params and vocab are constructed correctly. Author: Bryan Cutler <[email protected]> Closes apache#16770 from BryanCutler/pyspark-CountVectorizerModel-vocab_ctor-SPARK-15009.
What changes were proposed in this pull request?
Added a class method to construct CountVectorizerModel from a list of vocabulary strings, equivalent to the Scala version. Introduced a common param base class
_CountVectorizerParamsto allow the Python model to also own the parameters. This now matches the Scala class hierarchy.How was this patch tested?
Added to CountVectorizer doctests to do a transform on a model constructed from vocab, and unit test to verify params and vocab are constructed correctly.