[SPARK-29914][ML] ML models attach metadata in transform/transformSchema #26547

zhengruifeng wants to merge 14 commits into apache:master

Conversation
Test build #113875 has finished for PR 26547 at commit

Test build #113876 has finished for PR 26547 at commit
Force-pushed from 5c621bb to 9aa6ae0
Test build #113982 has finished for PR 26547 at commit

Test build #113994 has finished for PR 26547 at commit
@viirya hi, I noticed that you did some work on attaching output attributes. Would you like to help review this? Thanks
also friendly ping @srowen

This PR aims to attach inferable attributes to output columns.
srowen left a comment:

It's a big change. Is there any downside? Do any of these take non-trivial extra time to compute and update? Conversely, does adding them help anything else optimize its operation?
viirya left a comment:

Thanks for pinging me. I will be on a flight today and cannot review this. I may have time to take a look in the next few days.
viirya left a comment:

This change adds metadata to many classes; is metadata useful for them all?
```scala
val vectorSize = data.head.size
```

```scala
// Cannot infer size of output vector, since no metadata is provided
```

```scala
    vecSize: Int): Unit = {
  import dataframe.sparkSession.implicits._
  val group = AttributeGroup.fromStructField(dataframe.schema(vecColName))
  assert(group.size === vecSize)
```
Can we add an error message to explain it when the condition fails?
There should not be non-trivial cost in updating the schema, since its logic is simple (similar operations like

Some downstream impls in the pipeline will try to use the metadata if provided; otherwise they need to trigger a job, such as a

Thanks for reviewing.

@viirya Thanks for reviewing!

I think it may be nice to provide as much metadata as possible in the output datasets, since downstream impls may use it in some way.
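The trade-off described above can be sketched in plain Scala. This is a simplified illustration with hypothetical types (not the actual Spark API): a downstream stage reads the vector size from metadata when present, and only falls back to inspecting the data itself — the analogue of triggering a job — when it is absent.

```scala
// Hypothetical simplified types: `meta` stands in for an AttributeGroup's
// numAttributes, `data` for the materialized vector column.
case class VectorColumn(meta: Option[Int], data: Vector[Vector[Double]])

def vectorSize(col: VectorColumn): Int =
  col.meta.getOrElse {
    // No metadata: we must touch the data (in Spark, this would trigger a job)
    col.data.head.size
  }
```

With metadata attached, `vectorSize` is a constant-time schema lookup; without it, the fallback has to read at least one row.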
I think the change is OK if it improves consistency.
```scala
val attr = if (numValues == 2) {
  BinaryAttribute.defaultAttr
    .withName(colName)
} else {
  NominalAttribute.defaultAttr
    .withName(colName)
    .withNumValues(numValues)
}
```
Not sure about this. Is the numValues == 2 case always a BinaryAttribute? Can a NominalAttribute not have two values?
Good point.

I found that existing impls like Bucketizer will check whether numValues == 2, so I think it is safe to only use NominalAttribute here.
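The convention in the diff above can be mimicked with a tiny self-contained model (hypothetical types, standing in for Spark's `ml.attribute` classes): two values maps to a binary attribute, anything else to a nominal one with an explicit cardinality.

```scala
// Hypothetical simplified attribute hierarchy.
sealed trait Attr { def name: String }
case class BinaryAttr(name: String) extends Attr
case class NominalAttr(name: String, numValues: Int) extends Attr

// Mirrors the diff: numValues == 2 is treated as binary,
// everything else as nominal with an explicit number of values.
def outputAttr(colName: String, numValues: Int): Attr =
  if (numValues == 2) BinaryAttr(colName)
  else NominalAttr(colName, numValues)
```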
```scala
/**
 * Update the metadata of an existing column. If the column does not exist, append it.
 */
```

This method has two functions, update and overwrite. We should add that to this doc.

```scala
def updateField(
    schema: StructType,
    field: StructField,
    overrideMeta: Boolean = true): StructType = {
```
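The intended semantics — update in place if the column exists, append otherwise, with `overrideMeta` controlling whether existing metadata is replaced — can be modeled with a simplified schema type. This is a sketch, not the actual `SchemaUtils` implementation:

```scala
// Hypothetical simplified types: a schema is an ordered list of named fields.
case class Field(name: String, meta: String)

def updateField(schema: Vector[Field], field: Field,
                overrideMeta: Boolean = true): Vector[Field] =
  if (schema.exists(_.name == field.name)) {
    // Column exists: replace its metadata only when overrideMeta is set
    if (overrideMeta) schema.map(f => if (f.name == field.name) field else f)
    else schema
  } else {
    // Column does not exist: append it
    schema :+ field
  }
```

Note that field order is preserved in both branches, which matters for schemas.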
```scala
  rootNode.predictImpl(features).prediction
}

@Since("1.4.0")
```
Force-pushed from efe911f to a690eb7
Test build #114399 has finished for PR 26547 at commit

Test build #114401 has finished for PR 26547 at commit
srowen left a comment:

OK by me if you're done and tests pass.
retest this please

Test build #114813 has finished for PR 26547 at commit
Force-pushed from 3d26d74 to 3eb87f6
Test build #114821 has finished for PR 26547 at commit

Merged to master, thanks all for reviewing!
```scala
val attrs: Array[Attribute] = vocabulary.map(_ => new NumericAttribute)
val field = new AttributeGroup($(outputCol), attrs).toStructField()
outputSchema = SchemaUtils.updateField(outputSchema, field)
```
vocabulary can be a big number, for example 1 << 18 by default. We will keep a big attribute array here. Do we actually need this metadata?
It looks like this just moved old code, so I'm just wondering if this will be a problem.
Sounds reasonable. I think we may change this place to only attach a size. I will send a follow-up.
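The proposed follow-up — attach only a size instead of one attribute per vocabulary entry — can be illustrated with simplified stand-ins for the attribute-group types (hypothetical names; Spark's `AttributeGroup` does offer a size-only constructor that serves this purpose):

```scala
// Hypothetical simplified model of an attribute group.
case class Group(name: String, attrs: Option[Vector[String]], numAttributes: Int)

// Materializes one attribute per vocabulary entry: O(vocabSize) memory.
def groupWithAttrs(name: String, vocabSize: Int): Group =
  Group(name, Some(Vector.fill(vocabSize)("numeric")), vocabSize)

// Attaches only the size: O(1) memory, still enough for downstream sizing.
def groupWithSizeOnly(name: String, vocabSize: Int): Group =
  Group(name, None, vocabSize)
```

With a default vocabulary of 1 << 18 entries, the size-only form avoids keeping a quarter-million-element array in the schema metadata.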
Sorry for the late reply. Looks fine to me.

@viirya Thanks very much for helping review this PR!
…Schema`

### What changes were proposed in this pull request?
1. `predictionCol` in `ml.classification` & `ml.clustering`: add `NominalAttribute`
2. `rawPredictionCol` in `ml.classification`: add `AttributeGroup` containing vector size = `numClasses`
3. `probabilityCol` in `ml.classification` & `ml.clustering`: add `AttributeGroup` containing vector size = `numClasses`/`k`
4. `leafCol` in GBT/RF: add `AttributeGroup` containing vector size = `numTrees`
5. `leafCol` in DecisionTree: add `NominalAttribute`
6. `outputCol` in models in `ml.feature`: add `AttributeGroup` containing vector size
7. `outputCol` in `UnaryTransformer`s in `ml.feature`: add `AttributeGroup` containing vector size

### Why are the changes needed?
Appended metadata can be used in downstream ops, like `Classifier.getNumClasses`.

There are many impls (like `Binarizer`/`Bucketizer`/`VectorAssembler`/`OneHotEncoder`/`FeatureHasher`/`HashingTF`/`VectorSlicer`/...) in `.ml` that append appropriate metadata in the `transform`/`transformSchema` method. However, there are also many impls that return no metadata in transformation, even though some metadata like `vector.size`/`numAttrs`/`attrs` can be easily inferred.

### Does this PR introduce any user-facing change?
Yes, it adds some metadata to transformed datasets.

### How was this patch tested?
Existing test suites and added test suites.

Closes apache#26547 from zhengruifeng/add_output_vecSize.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
```scala
    colName: String,
    numValues: Int): Unit = {
  import dataframe.sparkSession.implicits._
  val n = Attribute.fromStructField(dataframe.schema(colName)) match {
```
The Scala compiler prints this warning here:

Warning:(88, 38) match may not be exhaustive.
It would fail on the following inputs: NumericAttribute(), UnresolvedAttribute
  val n = Attribute.fromStructField(dataframe.schema(colName)) match {

Just in case, do we cover all cases?
I think those are all the cases that need to be covered. The warning could be avoided by adding a case that throws an exception. That kind of cleanup is fine across the code. It won't matter too much here, as it'll already generate an exception (correctly).
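The suggested cleanup — silencing the exhaustiveness warning with a catch-all case that throws — looks like this on a simplified sealed hierarchy (hypothetical types and illustrative return values, standing in for Spark's attribute classes):

```scala
// Hypothetical stand-ins for the attribute types named in the warning.
sealed trait Attr
case object NominalA extends Attr
case object BinaryA extends Attr
case object NumericA extends Attr
case object UnresolvedA extends Attr

// The catch-all makes the match exhaustive and fails fast with a clear
// message instead of an opaque MatchError. Values 3 and 2 are illustrative.
def numValues(attr: Attr): Int = attr match {
  case NominalA => 3
  case BinaryA  => 2
  case other    =>
    throw new IllegalArgumentException(s"Unsupported attribute type: $other")
}
```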