Skip to content

[SPARK-22974][ML] Attach attributes to output column of CountVectorModel#20313

Closed
viirya wants to merge 1 commit intoapache:masterfrom
viirya:SPARK-22974
Closed

[SPARK-22974][ML] Attach attributes to output column of CountVectorModel#20313
viirya wants to merge 1 commit intoapache:masterfrom
viirya:SPARK-22974

Conversation

@viirya
Copy link
Copy Markdown
Member

@viirya viirya commented Jan 18, 2018

What changes were proposed in this pull request?

The output column from CountVectorModel lacks attribute. So a later transformer like Interaction can raise error because no attribute available.

How was this patch tested?

Added test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Jan 18, 2018

Test build #86332 has finished for PR 20313 at commit aeae308.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Copy Markdown
Member Author

viirya commented Jan 18, 2018

cc @MLnick @WeichenXu123 @jkbradley

@viirya
Copy link
Copy Markdown
Member Author

viirya commented Feb 27, 2018

ping @jkbradley @MLnick again

Vectors.sparse(dictBr.value.size, effectiveCounts)
}
dataset.withColumn($(outputCol), vectorizer(col($(inputCol))))
val attrs = vocabulary.map(_ => new NumericAttribute).asInstanceOf[Array[Attribute]]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The attributes append no useful statistics but only allocate a large array. I think it should be generated lazily, e.g., when it needed in following transformer then we generate it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for replying late. Though I agree that this attributes don't provide much info, I'm wondering if we can let it lazily generated. At this point, I think we don't know if following transformer will need it or not?

Copy link
Copy Markdown

@PowerToThePeople111 PowerToThePeople111 Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also unsure, if those attributes can be generated after the application of the CV transformer, since you could easily create inconsistent behaviour:
what happens when you store the cv-transformed data using a spark-action directly after the application of the CV? Wouldnt it materialize the dataframe without attributes since they were not explicitly used and needed? Now imagine, you use the CV-transformed dataframe with another Transformer B which would actually need the attributes? I guess the transformer might fail, which it wouldnt if the dataframe was not materialized before B is being applied.

Also, I do not think, that the information is totally useless: if you want to know which feature (semanticwise, not indexwise) corresponds to which LR coefficient for example, this would be very helpful. In general, it should be possible to easily get the mapping between a vector index and the raw data from which it was created by the application of a Pipeline cause it really helps to quickly make a sanity check of the model created and even reuse the LR coefficients for other purposes. And sadly, this is especially true when the feature vector contains more than 20 elements.

@viirya
Copy link
Copy Markdown
Member Author

viirya commented Jun 12, 2018

cc @dbtsai too.

@dbtsai
Copy link
Copy Markdown
Member

dbtsai commented Aug 14, 2018

LGTM. I think to have transformer framework working properly, it's required to have attributes in CountVector. Being said that, we should deal with the issue of allocating big attributes for sparse cases in as a separate task.

Merged into master.

@tianxzhu
Copy link
Copy Markdown

Hi, I encountered a similar issue with MinMaxScaler and StandardScaler. After I applied scaling, the output vector column does not have attributes, causing following Interaction to fail.

@viirya viirya deleted the SPARK-22974 branch December 27, 2023 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants