[SPARK-22974][ML] Attach attributes to output column of CountVectorModel by viirya · Pull Request #20313 · apache/spark

viirya · 2018-01-18T09:30:27Z

What changes were proposed in this pull request?

The output column of CountVectorModel lacks attributes, so a later transformer such as Interaction can raise an error because no attributes are available.

How was this patch tested?

Added test.
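
For readers unfamiliar with ML attributes, here is a minimal sketch of what "attaching attributes" to a vector column means at the schema level. It is not the patch itself: the object and method names are hypothetical, and it uses the public `NumericAttribute.defaultAttr` rather than the constructor call shown in the diff. It assumes `spark-mllib` is on the classpath.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.sql.types.StructField

// Hypothetical helper, for illustration only.
object VocabularyAttributes {
  // Builds a vector-typed schema field whose metadata describes one
  // numeric attribute per vocabulary term.
  def outputField(outputCol: String, vocabulary: Array[String]): StructField = {
    val attrs = Array.fill[Attribute](vocabulary.length)(NumericAttribute.defaultAttr)
    new AttributeGroup(outputCol, attrs).toStructField()
  }

  // Reads the attribute group back from the field and reports how many
  // attributes it describes.
  def attributeCount(field: StructField): Int =
    AttributeGroup.fromStructField(field).size
}
```

A downstream transformer can read the same metadata from the DataFrame schema with `AttributeGroup.fromStructField`.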

SparkQA · 2018-01-18T10:34:56Z

Test build #86332 has finished for PR 20313 at commit aeae308.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-01-18T11:45:06Z

cc @MLnick @WeichenXu123 @jkbradley

viirya · 2018-02-27T04:06:16Z

ping @jkbradley @MLnick again

WeichenXu123 · 2018-04-02T09:48:35Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

      Vectors.sparse(dictBr.value.size, effectiveCounts)
    }
-    dataset.withColumn($(outputCol), vectorizer(col($(inputCol))))
+    val attrs = vocabulary.map(_ => new NumericAttribute).asInstanceOf[Array[Attribute]]


These attributes add no useful statistics; they only allocate a large array. I think they should be generated lazily, e.g. only when a following transformer needs them.

viirya · 2018-06-12

Sorry for the late reply. Although I agree that these attributes don't provide much information, I'm wondering whether we can generate them lazily. At this point, I don't think we know whether a following transformer will need them.
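
To make the allocation concern concrete, here is a sketch of one alternative: an `AttributeGroup` that records only the number of attributes and allocates no per-term objects. The object and method names are hypothetical. This thread does not establish whether downstream transformers such as Interaction accept a size-only group, so treat this as an illustration of the trade-off rather than a fix.

```scala
import org.apache.spark.ml.attribute.AttributeGroup

// Hypothetical helper, for illustration only.
object SizeOnlyAttributes {
  // Records the vector size without materializing an Array[Attribute].
  def sizeOnlyGroup(outputCol: String, vocabularySize: Int): AttributeGroup =
    new AttributeGroup(outputCol, vocabularySize)
}
```

With this constructor, `size` returns the vocabulary size while `attributes` is `None`, so any consumer that requires per-feature attributes would still find none.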

PowerToThePeople111 · 2018-08-09 (edited)

I am also unsure whether these attributes can be generated after the CV transformer has been applied, since that could easily lead to inconsistent behaviour. What happens when you store the CV-transformed data with a Spark action directly after applying the CV? Wouldn't it materialize the DataFrame without attributes, since they were never explicitly used or needed? Now imagine you pass the CV-transformed DataFrame to another Transformer B that actually needs the attributes. I guess that transformer might fail, which it wouldn't if the DataFrame had not been materialized before B was applied.

Also, I do not think the information is totally useless: if you want to know which feature (semantically, not by index) corresponds to which LR coefficient, for example, it would be very helpful. In general, it should be easy to get the mapping between a vector index and the raw data from which it was created by applying a Pipeline, because it really helps to quickly sanity-check the resulting model and even to reuse the LR coefficients for other purposes. And sadly, this is especially true when the feature vector contains more than 20 elements.
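
The index-to-term mapping described above can be sketched in plain Scala. The inputs are stand-ins: in Spark they would presumably come from `CountVectorizerModel.vocabulary` and `LogisticRegressionModel.coefficients.toArray`, and the object and method names are hypothetical.

```scala
// Hypothetical helper, for illustration only.
object CoefficientsByTerm {
  // Pairs each vocabulary term with the coefficient at the same vector index
  // and sorts by absolute weight, largest first.
  def rank(vocabulary: Array[String], coefficients: Array[Double]): Seq[(String, Double)] = {
    require(vocabulary.length == coefficients.length,
      "vocabulary and coefficients must have the same length")
    vocabulary.zip(coefficients).toSeq.sortBy { case (_, weight) => -math.abs(weight) }
  }

  def main(args: Array[String]): Unit = {
    val ranked = rank(Array("spark", "ml", "vector"), Array(0.8, -1.5, 0.1))
    ranked.foreach { case (term, weight) => println(s"$term\t$weight") }
  }
}
```

This only works when the caller still has the fitted model at hand. Attributes stored in the column metadata would carry the same information along with the DataFrame.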

viirya · 2018-06-12T07:21:33Z

cc @dbtsai too.

dbtsai · 2018-08-14T05:04:46Z

LGTM. I think that for the transformer framework to work properly, CountVector is required to have attributes. That said, we should deal with the issue of allocating large attribute arrays for sparse cases as a separate task.

Merged into master.

tianxzhu · 2021-04-30T17:19:55Z

Hi, I encountered a similar issue with MinMaxScaler and StandardScaler. After scaling, the output vector column has no attributes, causing a following Interaction to fail.
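
One possible workaround for a vector column that has lost its attributes is to re-attach metadata by hand. This is a sketch under assumptions, not something proposed in this thread: the object, method names, and the `<column>_<index>` naming scheme are hypothetical, the caller must know the vector size, and the thread does not confirm that Interaction succeeds afterwards.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.Metadata

// Hypothetical helper, for illustration only.
object ReattachAttributes {
  // Builds ML attribute metadata with one named numeric attribute per vector slot.
  def vectorMetadata(colName: String, numFeatures: Int): Metadata = {
    val attrs = Array.tabulate[Attribute](numFeatures) { i =>
      NumericAttribute.defaultAttr.withName(s"${colName}_$i")
    }
    new AttributeGroup(colName, attrs).toMetadata()
  }

  // Re-aliases the column to itself so that the new metadata is attached.
  def withVectorAttributes(df: DataFrame, colName: String, numFeatures: Int): DataFrame =
    df.withColumn(colName, col(colName).as(colName, vectorMetadata(colName, numFeatures)))
}
```

For example, `ReattachAttributes.withVectorAttributes(scaled, "scaledFeatures", 3)` would return a DataFrame whose `scaledFeatures` column describes three numeric attributes, where `scaled` and the column name are placeholders.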

Attach attributes to output column of CountVectorModel.

aeae308

WeichenXu123 reviewed Apr 2, 2018

asfgit closed this in 3eb5209 Aug 14, 2018

zhengruifeng mentioned this pull request Dec 6, 2019

[SPARK-29914][ML][FOLLOWUP] CountVectorizer del big attribute array #26767

Closed

viirya deleted the SPARK-22974 branch December 27, 2023 18:21

viirya wants to merge 1 commit into apache:master from viirya:SPARK-22974


6 participants
