[SPARK-22974][ML] Attach attributes to output column of CountVectorModel #20313
viirya wants to merge 1 commit into apache:master from
Test build #86332 has finished for PR 20313 at commit
ping @jkbradley @MLnick again
      Vectors.sparse(dictBr.value.size, effectiveCounts)
    }
    dataset.withColumn($(outputCol), vectorizer(col($(inputCol))))
    val attrs = vocabulary.map(_ => new NumericAttribute).asInstanceOf[Array[Attribute]]
The attributes add no useful statistics and only allocate a large array. I think they should be generated lazily, e.g., only when a following transformer actually needs them.
Sorry for the late reply. Though I agree that these attributes don't provide much info, I'm wondering whether we can generate them lazily. At this point, I think we don't know whether a following transformer will need them or not.
I am also unsure whether those attributes can be generated after the CV transformer has been applied, since you could easily create inconsistent behaviour:
what happens when you store the CV-transformed data using a Spark action directly after applying the CV? Wouldn't it materialize the dataframe without attributes, since they were not explicitly used or needed? Now imagine you use the CV-transformed dataframe with another Transformer B which actually needs the attributes. I guess that transformer might fail, which it wouldn't if the dataframe had not been materialized before B is applied.
Also, I do not think that the information is totally useless: if you want to know which feature (semantically, not by index) corresponds to which LR coefficient, for example, this would be very helpful. In general, it should be possible to easily get the mapping between a vector index and the raw data from which it was created by applying a Pipeline, because it really helps to quickly sanity-check the resulting model and even to reuse the LR coefficients for other purposes. And sadly, this is especially true when the feature vector contains more than 20 elements.
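To illustrate the index-to-feature mapping mentioned above, here is a minimal sketch of how named attributes on a vector column could be paired with coefficients. It assumes the attributes carry names (for example, vocabulary terms); `AttributeLookup`, `fieldFor` and `namedCoefficients` are hypothetical helper names for illustration, not part of Spark or of this patch.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.sql.types.StructField

// Hypothetical helpers, for illustration only.
object AttributeLookup {
  // Build a vector-column field whose i-th attribute is named after the
  // i-th vocabulary term (assumption: one numeric attribute per term).
  def fieldFor(colName: String, vocabulary: Array[String]): StructField = {
    val attrs = vocabulary.map(t => NumericAttribute.defaultAttr.withName(t): Attribute)
    new AttributeGroup(colName, attrs).toStructField()
  }

  // Pair each coefficient with the attribute name stored at the same vector
  // index; unnamed attributes fall back to their position as a string.
  def namedCoefficients(field: StructField, coefficients: Array[Double]): List[(String, Double)] = {
    val names = AttributeGroup.fromStructField(field).attributes
      .map(_.zipWithIndex.map { case (a, i) => a.name.getOrElse(i.toString) })
      .getOrElse(Array.empty[String])
    names.zip(coefficients).toList
  }
}
```

With a field read from a transformed dataframe's schema, `namedCoefficients` would give a quick sanity check of which term each coefficient belongs to. If the column carries no attributes at all, the result is simply empty.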
cc @dbtsai too.
LGTM. I think that, for the transformer framework to work properly, it's required to have attributes in
Merged into master.
Hi, I encountered a similar issue with MinMaxScaler and StandardScaler. After I applied scaling, the output vector column does not have attributes, causing a following Interaction stage to fail.
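As a possible workaround for cases like the one above, the following sketch re-aliases an existing vector column with default numeric attributes before it reaches a later stage. `AttachAttributes.withNumericAttributes` is a hypothetical helper, not a Spark API, and the caller is assumed to know the vector size; whether this is sufficient for a given downstream transformer is untested here.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper, for illustration only.
object AttachAttributes {
  // Return a dataframe in which `vectorCol` carries `size` default numeric
  // attributes in its column metadata; all other columns are passed through.
  def withNumericAttributes(df: DataFrame, vectorCol: String, size: Int): DataFrame = {
    val attrs = Array.fill[Attribute](size)(NumericAttribute.defaultAttr)
    val metadata = new AttributeGroup(vectorCol, attrs).toMetadata()
    val projected = df.columns.map { c =>
      // Column.as(alias, metadata) attaches the metadata to the aliased column.
      if (c == vectorCol) col(c).as(c, metadata) else col(c)
    }
    df.select(projected: _*)
  }
}
```

Calling `AttachAttributes.withNumericAttributes(scaled, "scaledFeatures", n)` before the next stage would make `AttributeGroup.fromStructField` see `n` attributes on that column; the column name and `n` here are placeholders.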
What changes were proposed in this pull request?
The output column from CountVectorModel lacks attributes, so a later transformer like Interaction can raise an error because no attributes are available.
How was this patch tested?
Added test.
Please review http://spark.apache.org/contributing.html before opening a pull request.