@@ -21,6 +21,7 @@ import org.apache.hadoop.fs.Path
import org.apache.spark.annotation.Since
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.linalg.{Vectors, VectorUDT}
import org.apache.spark.ml.param._
import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
@@ -264,7 +265,9 @@ class CountVectorizerModel(

Vectors.sparse(dictBr.value.size, effectiveCounts)
}
dataset.withColumn($(outputCol), vectorizer(col($(inputCol))))
val attrs = vocabulary.map(_ => new NumericAttribute).asInstanceOf[Array[Attribute]]
Contributor:
The attributes add no useful statistics; they only allocate a large array. I think they should be generated lazily, i.e., only when a downstream transformer actually needs them.

Member Author:
Sorry for the late reply. Though I agree that these attributes don't provide much information, I'm not sure they can be generated lazily: at this point we don't know whether a downstream transformer will need them or not.

@PowerToThePeople111 (Aug 9, 2018):

I am also unsure whether those attributes can be generated after the CountVectorizer transformer is applied, since that could easily create inconsistent behaviour:
what happens when you store the CV-transformed data with a Spark action directly after applying the CV? Wouldn't it materialize the DataFrame without attributes, since they were not explicitly used and needed? Now imagine you use the CV-transformed DataFrame with another Transformer B that actually needs the attributes. I guess that transformer would fail, which it wouldn't if the DataFrame had not been materialized before B was applied.

Also, I don't think the information is totally useless: if you want to know which feature (semantics-wise, not index-wise) corresponds to which LR coefficient, for example, it is very helpful. In general, it should be possible to easily get the mapping between a vector index and the raw data it was created from by the application of a Pipeline, because it really helps to quickly sanity-check the model and even reuse the LR coefficients for other purposes. And sadly, this is especially true when the feature vector contains more than 20 elements.
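A sketch of the index-to-name mapping described above, using Spark ML's `AttributeGroup.fromStructField`. The `transformed` DataFrame and the fitted `lrModel` are hypothetical placeholders, not names from this PR:

```scala
import org.apache.spark.ml.attribute.AttributeGroup

// Hypothetical: `transformed` is a DataFrame whose "features" column was
// produced by a Pipeline stage that attached attribute metadata.
val group = AttributeGroup.fromStructField(transformed.schema("features"))

// Recover a name per vector index, falling back to the index itself when
// the attribute carries no name.
val names: Array[String] = group.attributes match {
  case Some(attrs) =>
    attrs.map(a => a.name.getOrElse(a.index.map(_.toString).getOrElse("?")))
  case None =>
    Array.tabulate(group.size)(_.toString)
}

// Pair each logistic-regression coefficient with its source feature name
// for a quick sanity check of the model (hypothetical fitted model).
val coefficients = lrModel.coefficients.toArray
names.zip(coefficients).sortBy { case (_, c) => -math.abs(c) }
  .take(20)
  .foreach { case (name, coef) => println(f"$name%-20s $coef%+.4f") }
```

This is exactly the kind of inspection the attached `AttributeGroup` makes possible once `transform` writes it into the output column's metadata.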

val metadata = new AttributeGroup($(outputCol), attrs).toMetadata()
dataset.withColumn($(outputCol), vectorizer(col($(inputCol))), metadata)
}

@Since("1.5.0")
@@ -220,4 +220,20 @@ class CountVectorizerSuite extends SparkFunSuite with MLlibTestSparkContext
val newInstance = testDefaultReadWrite(instance)
assert(newInstance.vocabulary === instance.vocabulary)
}

test("SPARK-22974: CountVectorizerModel should attach proper attribute to output column") {
val df = spark.createDataFrame(Seq(
(0, 1.0, Array("a", "b", "c")),
(1, 2.0, Array("a", "b", "b", "c", "a", "d"))
)).toDF("id", "features1", "words")

val cvm = new CountVectorizerModel(Array("a", "b", "c"))
.setInputCol("words")
.setOutputCol("features2")

val df1 = cvm.transform(df)
val interaction = new Interaction().setInputCols(Array("features1", "features2"))
.setOutputCol("features")
// Interaction requires ML attributes on its vector input columns, so this
// call would fail if "features2" carried no attribute metadata.
interaction.transform(df1)
}
}
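The test relies on `Interaction` reading each input vector's size from its column metadata. A minimal way to inspect what the new code attaches — a sketch reusing the `df1` DataFrame from the test above, inside a running Spark session:

```scala
import org.apache.spark.ml.attribute.AttributeGroup

// Read back the AttributeGroup that CountVectorizerModel.transform wrote
// into the output column's metadata.
val group = AttributeGroup.fromStructField(df1.schema("features2"))

// With this patch, the group carries one numeric attribute per vocabulary
// term, so its size matches the vocabulary ("a", "b", "c").
assert(group.size == 3)
```

Before the patch, `fromStructField` on that column would yield a group with unknown size, which is what made downstream transformers such as `Interaction` fail.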