Member

@BryanCutler BryanCutler commented Feb 1, 2017

What changes were proposed in this pull request?

Added a class method to construct CountVectorizerModel from a list of vocabulary strings, equivalent to the Scala version. Introduced a common param base class _CountVectorizerParams to allow the Python model to also own the parameters. This now matches the Scala class hierarchy.

How was this patch tested?

Added to CountVectorizer doctests to do a transform on a model constructed from vocab, and unit test to verify params and vocab are constructed correctly.
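As a rough illustration of what a model built from a fixed vocabulary computes, here is a minimal pure-Python sketch. It is not the PySpark implementation; the function name `count_vectorize` and its parameters are hypothetical, and the `min_tf` and `binary` handling is an assumption modeled on the documented behavior of those params.

```python
from collections import Counter


def count_vectorize(tokens, vocabulary, min_tf=1.0, binary=False):
    """Return (size, indices, values) for one document against a fixed vocabulary.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    # Position in the vocabulary list is the feature index.
    index = {term: i for i, term in enumerate(vocabulary)}
    # Count only tokens that appear in the vocabulary.
    counts = Counter(index[t] for t in tokens if t in index)
    # Assumption: min_tf >= 1 is an absolute count, min_tf < 1 is a
    # fraction of the document's total token count.
    threshold = min_tf if min_tf >= 1.0 else min_tf * len(tokens)
    kept = sorted((i, c) for i, c in counts.items() if c >= threshold)
    indices = [i for i, _ in kept]
    # With binary=True every surviving count is reported as 1.0.
    values = [1.0 if binary else float(c) for _, c in kept]
    return len(vocabulary), indices, values


print(count_vectorize(["a", "b", "c"], ["a", "b", "c"]))
print(count_vectorize(["a", "b", "b", "c", "a"], ["a", "b", "c"]))
```

The two calls mirror the rows of the doctest in this PR, with the vocabulary given in the order a, b, c.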

@BryanCutler
Member Author

This is currently not working because of param issues. For a model constructed from a vocabulary to transform a DataFrame, it was first necessary to add the inputCol and outputCol params to the model class. After that, the normal flow of fitting a model and then transforming fails, because the CountVectorizer estimator never copies its param values to the CountVectorizerModel. This causes test failures because the output column name is wrong on the transformed DataFrame:

File "spark/python/pyspark/ml/feature.py", line 233, in __main__.CountVectorizer
Failed example:
    model.transform(df).show(truncate=False)
Expected:
    +-----+---------------+-------------------------+
    |label|raw            |vectors                  |
    +-----+---------------+-------------------------+
    |0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])|
    |1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
    +-----+---------------+-------------------------+
    ...
Got:
    +-----+---------------+-------------------------------------------------+
    |label|raw            |CountVectorizerModel_4514bd7bded7359f0828__output|
    +-----+---------------+-------------------------------------------------+
    |0    |[a, b, c]      |(3,[0,1,2],[1.0,1.0,1.0])                        |
    |1    |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])                        |
    +-----+---------------+-------------------------------------------------+

The correct way to fix this is to change JavaEstimator._fit in wrapper.py to include a call to _copyValues like

def _fit(self, dataset):
    java_model = self._fit_java(dataset)
    model = self._create_model(java_model)
    return self._copyValues(model)

as was done in #14653 from SPARK-10931 PySpark ML Models should contain Param values.

I would like to take over SPARK-10931 and simplify it to include just the above fix to wrapper.py, implemented for the CountVectorizer class. The remaining classes can be handled in pieces as follow-up tasks. Once SPARK-10931 is resolved, this PR should work too. What are your thoughts, @holdenk and @jkbradley?
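To make the failure mode concrete, here is a toy sketch of the difference between a fit that copies param values onto the model and one that does not. None of these classes are PySpark classes; `ToyParams`, `ToyEstimator`, `ToyModel`, and the uid strings are hypothetical, and the uid-based default column name is only meant to resemble the one in the failing doctest output.

```python
class ToyParams:
    """Hypothetical stand-in for a params holder; not the PySpark class."""

    def __init__(self, uid):
        self.uid = uid
        self.param_map = {}

    def get_output_col(self):
        # Fall back to a uid-based default when outputCol was never set.
        return self.param_map.get("outputCol", self.uid + "__output")

    def copy_values_to(self, other):
        # Copy every explicitly set param onto the other instance.
        other.param_map.update(self.param_map)
        return other


class ToyModel(ToyParams):
    pass


class ToyEstimator(ToyParams):
    def fit_without_copy(self):
        # The model is created, but the estimator's params are lost.
        return ToyModel("ToyModel_1234")

    def fit_with_copy(self):
        # The model is created, then the estimator's params are copied over.
        return self.copy_values_to(ToyModel("ToyModel_1234"))


est = ToyEstimator("ToyEstimator_abcd")
est.param_map["outputCol"] = "vectors"
print(est.fit_without_copy().get_output_col())  # ToyModel_1234__output
print(est.fit_with_copy().get_output_col())     # vectors
```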

SparkQA commented Feb 1, 2017

Test build #72257 has finished for PR 16770 at commit da65f4b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CountVectorizerModel(JavaModel, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):

SparkQA commented Apr 13, 2017

Test build #75771 has finished for PR 16770 at commit da65f4b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CountVectorizerModel(JavaModel, HasInputCol, HasOutputCol, JavaMLReadable, JavaMLWritable):

@BryanCutler BryanCutler force-pushed the pyspark-CountVectorizerModel-vocab_ctor-SPARK-15009 branch from da65f4b to e94dde3 March 6, 2018 00:46
@BryanCutler BryanCutler changed the title from "[SPARK-15009][PYTHON][ML][WIP] Construct CountVectorizerModel from Vocabulary" to "[SPARK-15009][PYTHON][ML] Construct CountVectorizerModel from Vocabulary" Mar 6, 2018
@BryanCutler
Member Author

ping @holdenk

Contributor

holdenk commented Mar 6, 2018

Awesome! So if folks are OK with this, I'm going to save the review for this Friday during the live code review (see https://www.youtube.com/watch?v=lugG_2QU6YU). The review comments will of course end up on the PR, so don't feel like you have to tune in.

SparkQA commented Mar 6, 2018

Test build #87986 has finished for PR 16770 at commit e94dde3.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _CountVectorizerParams(JavaParams, HasInputCol, HasOutputCol):
  • class CountVectorizer(JavaEstimator, _CountVectorizerParams, JavaMLReadable, JavaMLWritable):
  • class CountVectorizerModel(JavaModel, _CountVectorizerParams, JavaMLReadable, JavaMLWritable):

SparkQA commented Mar 6, 2018

Test build #87988 has finished for PR 16770 at commit 8860641.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler BryanCutler changed the title from "[SPARK-15009][PYTHON][ML] Construct CountVectorizerModel from Vocabulary" to "[SPARK-15009][PYTHON][ML] Construct a CountVectorizerModel from a vocabulary list" Mar 9, 2018
Contributor

@holdenk holdenk left a comment

Thanks @BryanCutler for the work on this and waiting such a very long time for review. I've got a few questions and a small change suggestion I'd love to see. Feel free to ping me when this PR is ready for review again.

model.setMinTF(minTF)
if binary is not None:
    model.setBinary(binary)
model._set(vocabSize=len(vocabulary))
Contributor

Any reason for _set rather than set?

Member Author

The only difference is set checks to make sure the param is valid, which isn't really needed since this is internal.

Contributor

sgtm

return self._call_java("vocabulary")

@since("2.4.0")
def setMinTF(self, value):
Contributor

If we're going to have the setters in both the model and the estimator maybe we should consider putting it in the shared params class?

Member Author

I agree, but I was trying to match the Scala API. My guess is that it was done this way so that each implementation can decide whether it allows setting the params. What do you think?

Contributor

Sounds reasonable to me.
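The design choice discussed in this thread can be sketched with a toy hierarchy: the shared base exposes only the getter, and each concrete class decides whether to offer a setter. All class and method names below are hypothetical and are not the PySpark classes; the default value is an assumption for the sketch.

```python
class SharedVectorizerParams:
    """Hypothetical shared base: holds the value and exposes only the getter."""

    def __init__(self):
        self._min_tf = 1.0  # assumed default for this sketch

    def get_min_tf(self):
        return self._min_tf


class ToyEstimator(SharedVectorizerParams):
    def set_min_tf(self, value):
        self._min_tf = float(value)
        return self


class ToyModel(SharedVectorizerParams):
    # The model opts in to the setter explicitly.
    def set_min_tf(self, value):
        self._min_tf = float(value)
        return self


class ToyReadOnlyModel(SharedVectorizerParams):
    # No setter: this implementation chooses not to allow changing the value.
    pass


print(ToyModel().set_min_tf(2).get_min_tf())      # 2.0
print(hasattr(ToyReadOnlyModel(), "set_min_tf"))  # False
```

Moving the setter into the shared base would remove the duplication between `ToyEstimator` and `ToyModel`, at the cost of forcing it onto every subclass, including ones like `ToyReadOnlyModel`.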

self.assertEqual(feature, expected)

def test_count_vectorizer_from_vocab(self):
    model = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words",
Contributor

Good first test, I'd love to also see it with empty vocab, and also one that uses the default values.

Member Author

Yeah, good idea, I'll add those

Member Author

done

for name, cls in inspect.getmembers(module, inspect.isclass):
if not name.endswith('Model') and issubclass(cls, JavaParams)\
and not inspect.isabstract(cls):
if not name.endswith('Model') and not name.endswith('Params')\
Contributor

Just to make sure I've understood what's happening here: we're avoiding the default params test on non-concrete classes, like the base params class shared between the model and the estimator, and instead testing just the model and the estimator in their own right?

Member Author

Yes, that's pretty much right, but this only checks estimators; it skips models as well. We should have an explicit check for CountVectorizerModel.from_vocabulary here too, since that is possible. Unfortunately, a new param maxDF was added to Scala recently, so the param check would fail. Once that is in Python, we can add the check for it here.

Contributor

Sounds reasonable. I look forward to us automatically catching models with missing params eventually as well.

Member Author

I'm helping get maxDF into Python now, so after that's done I'll make a follow-up to add this.
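The class filter discussed in this thread can be sketched on a toy module. The classes and the module below are hypothetical stand-ins; the condition in the diff is truncated, so the assumption here is that the subclass and abstractness checks from the old condition are kept alongside the new `'Params'` suffix check.

```python
import inspect
import types
from abc import ABC, abstractmethod


class ToyJavaParams:
    """Hypothetical stand-in for the common params base class."""


class ToyVectorizerParams(ToyJavaParams):
    """Shared params base; skipped because its name ends with 'Params'."""


class ToyVectorizer(ToyVectorizerParams):
    """Concrete estimator; the only class the filter should select."""


class ToyVectorizerModel(ToyVectorizerParams):
    """Model; skipped because its name ends with 'Model'."""


class ToyAbstractStage(ToyJavaParams, ABC):
    """Abstract class; skipped by inspect.isabstract."""

    @abstractmethod
    def run(self):
        raise NotImplementedError


class Unrelated:
    """Not a params class; skipped by the issubclass check."""


# Build a throwaway module holding the toy classes.
toy_module = types.ModuleType("toy_feature")
for _cls in (ToyVectorizerParams, ToyVectorizer, ToyVectorizerModel,
             ToyAbstractStage, Unrelated):
    setattr(toy_module, _cls.__name__, _cls)


def classes_to_check(module, base):
    """Return names of concrete, non-model, non-params subclasses of base."""
    selected = []
    for name, cls in inspect.getmembers(module, inspect.isclass):
        if (not name.endswith("Model") and not name.endswith("Params")
                and issubclass(cls, base) and not inspect.isabstract(cls)):
            selected.append(name)
    return selected


print(classes_to_check(toy_module, ToyJavaParams))  # ['ToyVectorizer']
```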

>>> loadedModel = CountVectorizerModel.load(modelPath)
>>> loadedModel.vocabulary == model.vocabulary
True
>>> fromVocabModel = CountVectorizerModel.from_vocabulary(model.vocabulary,
Contributor

This might be better with an explicit, manually written array rather than model.vocabulary, to show folks how they are expected to use it. What are your thoughts?

Member Author

Yeah, totally agree, let me change it.

@BryanCutler
Member Author

Thanks for the review @holdenk and for doing the livestream, hopefully it was helpful to folks! I'll make some updates and ping when ready.

# Test model with default settings can transform
model_default = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words")
transformed_list = model_default.transform(dataset).collect()
self.assertEqual(len(transformed_list), 3)
Member Author

The doctest uses default values for all params except outputCol and checks the transformed values, so this is really just testing that nothing fails if default values are used for all params, including outputCol.

model_default = CountVectorizerModel.from_vocabulary(["a", "b", "c"], inputCol="words")
transformed_list = model_default.transform(dataset)\
    .select(model_default.getOrDefault(model_default.outputCol)).collect()
self.assertEqual(len(transformed_list), 3)
Member Author

The doctest uses default values for all params except outputCol and checks the transformed values, so this is really just testing that nothing fails if default values are used for all params, including outputCol.

Contributor

sgtm

SparkQA commented Mar 14, 2018

Test build #88238 has finished for PR 16770 at commit 5220ff1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 14, 2018

Test build #88239 has finished for PR 16770 at commit 7e05da4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@holdenk holdenk left a comment

This looks good to me. LGTM

Contributor

holdenk commented Mar 16, 2018

Looks good to me, merged to master. Thanks!

@asfgit asfgit closed this in 8a72734 Mar 16, 2018
@BryanCutler
Member Author

Thanks @holdenk!

mstewart141 pushed a commit to mstewart141/spark that referenced this pull request Mar 24, 2018
…abulary list

Author: Bryan Cutler <[email protected]>

Closes apache#16770 from BryanCutler/pyspark-CountVectorizerModel-vocab_ctor-SPARK-15009.
@BryanCutler BryanCutler deleted the pyspark-CountVectorizerModel-vocab_ctor-SPARK-15009 branch November 19, 2018 05:47