[SPARK-15009][PYTHON][ML] Construct a CountVectorizerModel from a vocabulary list #16770

BryanCutler · 2017-02-01T19:51:52Z

What changes were proposed in this pull request?

Added a class method to construct CountVectorizerModel from a list of vocabulary strings, equivalent to the Scala version. Introduced a common param base class _CountVectorizerParams to allow the Python model to also own the parameters. This now matches the Scala class hierarchy.

How was this patch tested?

Added to CountVectorizer doctests to do a transform on a model constructed from vocab, and unit test to verify params and vocab are constructed correctly.
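For readers unfamiliar with what a model built from a fixed vocabulary does at transform time, here is a minimal pure-Python sketch. It is not Spark's implementation and does not use PySpark; the function name `vectorize` and its tuple return shape are assumptions made for illustration, and the `minTF` filter is deliberately left out. It reproduces the sparse vectors shown in the doctest output for the vocabulary ["a", "b", "c"].

```python
from collections import Counter


def vectorize(tokens, vocabulary, binary=False):
    """Count vocabulary terms in one document.

    Returns (size, indices, values), a plain-tuple stand-in for a sparse
    vector. Hypothetical helper for illustration only; not a PySpark API.
    """
    # Map each vocabulary term to its position in the list.
    index_of = {term: i for i, term in enumerate(vocabulary)}
    # Count only tokens that appear in the vocabulary; others are ignored.
    counts = Counter(index_of[t] for t in tokens if t in index_of)
    indices = sorted(counts)
    # With binary=True every non-zero count is reported as 1.0.
    values = [1.0 if binary else float(counts[i]) for i in indices]
    return (len(vocabulary), indices, values)


vocab = ["a", "b", "c"]
doc0 = vectorize(["a", "b", "c"], vocab)
doc1 = vectorize(["a", "b", "b", "c", "a"], vocab)
```

The position of a term in the vocabulary list determines its vector index, which is why the caller has to supply the list in the intended order.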

BryanCutler · 2017-02-01T20:10:11Z

This is currently not working because of param issues. For a model constructed from a vocabulary to transform a DataFrame, the model class first needed the inputCol and outputCol params (HasInputCol and HasOutputCol). With those added, the normal flow of fitting the estimator and then transforming fails, because the CountVectorizer estimator never copies its param values to the CountVectorizerModel. This causes test failures because the output column name is wrong on the transformed DataFrame.

File "spark/python/pyspark/ml/feature.py", line 233, in __main__.CountVectorizer
Failed example:
    model.transform(df).show(truncate=False)
Expected:
    +-----+---------------+-------------------------+
    |label|raw            |vectors                  |
    +-----+---------------+-------------------------+
    |0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
    |1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
    +-----+---------------+-------------------------+
    ...
Got:
    +-----+---------------+-------------------------------------------------+
    |label|raw            |CountVectorizerModel_4514bd7bded7359f0828__output|
    +-----+---------------+-------------------------------------------------+
    |0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])                        |
    |1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])                        |
    +-----+---------------+-------------------------------------------------+

The correct way to fix this is to change JavaEstimator._fit in wrapper.py to include a call to _copyValues like

def _fit(self, dataset):
    java_model = self._fit_java(dataset)
    model = self._create_model(java_model)
    return self._copyValues(model)

as was done in #14653 from SPARK-10931 PySpark ML Models should contain Param values.

I would like to take over SPARK-10931 and simplify it to include just the above fix to wrapper.py, implemented for the CountVectorizer class. The remaining classes can be handled in pieces as follow-on tasks. Once SPARK-10931 is resolved, this PR should work too. What are your thoughts, @holdenk and @jkbradley?
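The failure mode described above can be shown with a toy sketch. Every class and method name here (`ToyParams`, `ToyEstimator`, `ToyModel`, `set_params`, `_copy_values`) is hypothetical and only imitates the idea of PySpark's param handling; it is not the real `wrapper.py` code. The point is that a model created without copying the estimator's values falls back to its own uid-based default output column, which is what the failed doctest shows.

```python
import uuid


class ToyParams:
    """Hypothetical stand-in for a params-owning object; not a PySpark class."""

    def __init__(self):
        self.uid = "%s_%s" % (type(self).__name__, uuid.uuid4().hex[:12])
        # The default output column is derived from the object's own uid.
        self._defaults = {"outputCol": self.uid + "__output"}
        self._values = {}

    def set_params(self, **kwargs):
        self._values.update(kwargs)
        return self

    def get(self, name):
        # Explicitly set values win over defaults.
        return self._values.get(name, self._defaults[name])

    def _copy_values(self, to):
        # Copy explicitly set values onto another params object.
        for name, value in self._values.items():
            if name in to._defaults:
                to._values[name] = value
        return to


class ToyModel(ToyParams):
    pass


class ToyEstimator(ToyParams):
    def fit_without_copy(self):
        # Mirrors the bug: the model never sees the estimator's settings.
        return ToyModel()

    def fit_with_copy(self):
        # Mirrors the proposed fix: copy values before returning the model.
        return self._copy_values(ToyModel())


est = ToyEstimator().set_params(outputCol="vectors")
broken = est.fit_without_copy()
fixed = est.fit_with_copy()
```

In this sketch `broken.get("outputCol")` is a generated name ending in `__output`, while `fixed.get("outputCol")` is the `"vectors"` value set on the estimator.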

SparkQA · 2017-02-01T20:12:44Z

Test build #72257 has finished for PR 16770 at commit da65f4b.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class CountVectorizerModel(JavaModel, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):

SparkQA · 2017-04-13T09:06:37Z

Test build #75771 has finished for PR 16770 at commit da65f4b.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class CountVectorizerModel(JavaModel, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):

BryanCutler · 2018-03-06T00:57:10Z

ping @holdenk

holdenk · 2018-03-06T01:02:45Z

Awesome! So if folks are OK with this I'm going to save the review for this Friday during the live code review ( see https://www.youtube.com/watch?v=lugG_2QU6YU ). The review comments will of course end up on the PR so don't feel like you have to tune in.

SparkQA · 2018-03-06T01:06:29Z

Test build #87986 has finished for PR 16770 at commit e94dde3.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class _CountVectorizerParams(JavaParams, HasInputCol, HasOutputCol):
class CountVectorizer(JavaEstimator, _CountVectorizerParams, JavaMLReadable, JavaMLWritable):
class CountVectorizerModel(JavaModel, _CountVectorizerParams, JavaMLReadable, JavaMLWritable):

SparkQA · 2018-03-06T02:39:14Z

Test build #87988 has finished for PR 16770 at commit 8860641.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

Thanks @BryanCutler for the work on this and waiting such a very long time for review. I've got a few questions and a small change suggestion I'd love to see. Feel free to ping me when this PR is ready for review again.

holdenk · 2018-03-09T19:23:37Z

python/pyspark/ml/feature.py

+            model.setMinTF(minTF)
+        if binary is not None:
+            model.setBinary(binary)
+        model._set(vocabSize=len(vocabulary))


Any reason for _set rather than set?

The only difference is set checks to make sure the param is valid, which isn't really needed since this is internal.

holdenk · 2018-03-09T19:25:04Z

python/pyspark/ml/feature.py

        return self._call_java("vocabulary")

+    @since("2.4.0")
+    def setMinTF(self, value):


If we're going to have the setters in both the model and the estimator maybe we should consider putting it in the shared params class?

I agree but I was trying to match the Scala API. My only thought is it was done this way to leave it up to the implementations if they allow setting the params. What do you think?

Sounds reasonable to me.

holdenk · 2018-03-09T19:26:12Z

python/pyspark/ml/tests.py

            self.assertEqual(feature, expected)

+    def test_count_vectorizer_from_vocab(self):
+        model = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words",


Good first test, I'd love to also see it with empty vocab, and also one that uses the default values.

Yeah, good idea, I'll add those

holdenk · 2018-03-09T19:28:21Z

python/pyspark/ml/tests.py

            for name, cls in inspect.getmembers(module, inspect.isclass):
-                if not name.endswith('Model') and issubclass(cls, JavaParams)\
-                        and not inspect.isabstract(cls):
+                if not name.endswith('Model') and not name.endswith('Params')\


Just to make sure I've understood what's happening here: we're avoiding the default params test on non-concrete classes, like the base params class shared between the model and the estimator, and instead testing just the model and the estimator in their own right?

Yes, that's pretty much right, but this only checks estimators and skips models as well. We should have an explicit check for CountVectorizerModel.from_vocabulary here too, since constructing a model that way is now possible. Unfortunately, a new param maxDF was recently added in Scala, so the param check would fail. Once that is in Python, we can add the check for it here.

Sounds reasonable. I look forward to us automatically catching models with missing params eventually as well.

I'm helping get maxDF in python now, so after that's done I'll make a followup to add this
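The class filter discussed in this thread can be sketched on toy classes. The diff above is cut off after the `'Params'` condition, so the remaining conditions below (the `issubclass` and `inspect.isabstract` checks) are an assumption carried over from the removed lines, and all `Toy*` names are hypothetical stand-ins rather than PySpark classes.

```python
import inspect
import types


class ToyJavaParams:
    """Hypothetical stand-in for the JavaParams base class."""


class _ToyCountVectorizerParams(ToyJavaParams):
    """Shared param base class; should be skipped by the filter."""


class ToyCountVectorizer(_ToyCountVectorizerParams):
    """Estimator; should be checked."""


class ToyCountVectorizerModel(_ToyCountVectorizerParams):
    """Model; skipped because its name ends with 'Model'."""


# Build a throwaway module so inspect.getmembers has something to scan.
toy_module = types.ModuleType("toy_feature")
for cls in (ToyJavaParams, _ToyCountVectorizerParams,
            ToyCountVectorizer, ToyCountVectorizerModel):
    setattr(toy_module, cls.__name__, cls)

# Keep only concrete param-owning classes that are neither models
# nor shared param base classes.
checked = [
    name
    for name, cls in inspect.getmembers(toy_module, inspect.isclass)
    if not name.endswith("Model")
    and not name.endswith("Params")
    and issubclass(cls, ToyJavaParams)
    and not inspect.isabstract(cls)
]
```

Only the estimator survives the filter here, which matches the reply above: the default-params check covers estimators and skips both models and the shared base class.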

holdenk · 2018-03-09T19:29:28Z

python/pyspark/ml/feature.py

    >>> loadedModel = CountVectorizerModel.load(modelPath)
    >>> loadedModel.vocabulary == model.vocabulary
    True
+    >>> fromVocabModel = CountVectorizerModel.from_vocabulary(model.vocabulary,


This might be better with an explicit, hand-written list rather than model.vocabulary, to show folks how they're expected to use it. What are your thoughts?

Yeah, totally agree let me change it

BryanCutler · 2018-03-14T18:07:57Z

Thanks for the review @holdenk and for doing the livestream, hopefully it was helpful to folks! I'll make some updates and ping when ready.

BryanCutler · 2018-03-14T19:01:28Z

python/pyspark/ml/tests.py

+        # Test model with default settings can transform
+        model_default = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words")
+        transformed_list = model_default.transform(dataset).collect()
+        self.assertEqual(len(transformed_list), 3)


The doctest uses default values for all params except outputCol and checks the transformed values, so this is really just testing that nothing fails if all param default values are used including outputCol

BryanCutler · 2018-03-14T19:19:58Z

python/pyspark/ml/tests.py

+        model_default = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words")
+        transformed_list = model_default.transform(dataset)\
+            .select(model_default.getOrDefault(model_default.outputCol)).collect()
+        self.assertEqual(len(transformed_list), 3)


The doctest uses default values for all params except outputCol and checks the transformed values, so this is really just testing that nothing fails if all param default values are used including outputCol

SparkQA · 2018-03-14T19:23:07Z

Test build #88238 has finished for PR 16770 at commit 5220ff1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-14T19:43:08Z

Test build #88239 has finished for PR 16770 at commit 7e05da4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

This looks good to me. LGTM

holdenk · 2018-03-16T18:23:21Z

python/pyspark/ml/feature.py

+            model.setMinTF(minTF)
+        if binary is not None:
+            model.setBinary(binary)
+        model._set(vocabSize=len(vocabulary))


holdenk · 2018-03-16T18:32:30Z

python/pyspark/ml/feature.py

        return self._call_java("vocabulary")

+    @since("2.4.0")
+    def setMinTF(self, value):


Sounds reasonable to me.

holdenk · 2018-03-16T18:33:26Z

python/pyspark/ml/tests.py

+        model_default = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words")
+        transformed_list = model_default.transform(dataset)\
+            .select(model_default.getOrDefault(model_default.outputCol)).collect()
+        self.assertEqual(len(transformed_list), 3)


holdenk · 2018-03-16T18:34:40Z

python/pyspark/ml/tests.py

            for name, cls in inspect.getmembers(module, inspect.isclass):
-                if not name.endswith('Model') and issubclass(cls, JavaParams)\
-                        and not inspect.isabstract(cls):
+                if not name.endswith('Model') and not name.endswith('Params')\


Sounds reasonable. I look forward to us automatically catching models with missing params eventually as well.

holdenk · 2018-03-16T18:45:16Z

Looks good to me, merged to master. Thanks!

BryanCutler · 2018-03-16T22:57:00Z

Thanks @holdenk!

…abulary list

## What changes were proposed in this pull request?

Added a class method to construct CountVectorizerModel from a list of vocabulary strings, equivalent to the Scala version. Introduced a common param base class `_CountVectorizerParams` to allow the Python model to also own the parameters. This now matches the Scala class hierarchy.

## How was this patch tested?

Added to CountVectorizer doctests to do a transform on a model constructed from vocab, and unit test to verify params and vocab are constructed correctly.

Author: Bryan Cutler <[email protected]>

Closes apache#16770 from BryanCutler/pyspark-CountVectorizerModel-vocab_ctor-SPARK-15009.

BryanCutler added 2 commits March 5, 2018 13:26

Added class method to construct CountVectorizerModel from vocab, not …

01e5a4b

…yet working because missing param _copyValues from estimator to model

updated CountVectorizerModel to use common param base class

e94dde3

BryanCutler force-pushed the pyspark-CountVectorizerModel-vocab_ctor-SPARK-15009 branch from da65f4b to e94dde3 Compare March 6, 2018 00:46

BryanCutler changed the title ~~[SPARK-15009][PYTHON][ML][WIP] Construct CountVectorizerModel from Vocabulary~~ [SPARK-15009][PYTHON][ML] Construct CountVectorizerModel from Vocabulary Mar 6, 2018

Added exception for checking default values of param base classes

8860641

BryanCutler changed the title ~~[SPARK-15009][PYTHON][ML] Construct CountVectorizerModel from Vocabulary~~ [SPARK-15009][PYTHON][ML] Construct a CountVectorizerModel from a vocabulary list Mar 9, 2018

holdenk requested changes Mar 9, 2018

use explicity vocab list in doc test, add test for empty vocab and us…

5220ff1

…ing default params

BryanCutler commented Mar 14, 2018

select default col in test

7e05da4

BryanCutler commented Mar 14, 2018

holdenk approved these changes Mar 16, 2018

asfgit closed this in 8a72734 Mar 16, 2018

BryanCutler deleted the pyspark-CountVectorizerModel-vocab_ctor-SPARK-15009 branch November 19, 2018 05:47
