@gaborbarna gaborbarna commented Feb 3, 2018

No description provided.

Owner

@BenFradet BenFradet left a comment

Looks good to me, thanks a lot! 👍

Tell me when you're happy to merge.


import scala.collection.generic.IsTraversableOnce

final class Meta(val metadata: Metadata) extends StaticAnnotation
Owner

Could we have made this a case class?

Yeah I think we should do that, and remove the val keyword.
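
For reference, a minimal sketch of what the case class version could look like. `FakeMetadata` is a stand-in invented here so the snippet compiles without Spark; in the PR the parameter type is Spark's `Metadata`.

```scala
import scala.annotation.StaticAnnotation

// Hypothetical stand-in for org.apache.spark.sql.types.Metadata,
// used only to keep this sketch free of a Spark dependency.
final case class FakeMetadata(entries: Map[String, String])

// As a case class, `metadata` is already a public val, so the `val` keyword can go.
// Structural equality and a companion `apply` come for free.
final case class Meta(metadata: FakeMetadata) extends StaticAnnotation
```

With this shape, `Meta(FakeMetadata(Map("k" -> "v"))).metadata` is accessible without declaring the field explicitly.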

@kmate kmate left a comment

Looks OK so far, but I think we should add a schema mapping function before we can consider it complete: we must convert the flattened schema back to a nested one using Dataset.select to make .as[T] work again with this.
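
To illustrate the mapping being asked for, here is a rough, Spark-free model of the idea. It assumes flattened column names join parent and child field names with a dot; that separator, the `Col` types and the `nest` helper are all assumptions made for this sketch, not the library's API.

```scala
// Hypothetical model: a column is either a leaf (a flat column) or a struct of named children.
sealed trait Col
final case class Leaf(flatName: String) extends Col
final case class Struct(children: List[(String, Col)]) extends Col

object Nest {
  // Rebuild the nested layout from flat, dot-separated column names,
  // keeping the order in which each field first appears.
  // Assumes well-formed input: a name is never both a leaf and a prefix of another name.
  def nest(flatNames: List[String]): List[(String, Col)] = {
    def go(entries: List[(List[String], String)]): List[(String, Col)] = {
      val heads = entries.map(_._1.head).distinct
      heads.map { h =>
        val group = entries.filter(_._1.head == h)
        group match {
          // A single entry with no remaining path segments is a plain column.
          case (_ :: Nil, flat) :: Nil => h -> Leaf(flat)
          // Otherwise strip the shared prefix and recurse into the children.
          case _ => h -> Struct(go(group.map { case (path, flat) => (path.tail, flat) }))
        }
      }
    }
    go(flatNames.map(n => (n.split('.').toList, n)))
  }
}
```

In Spark terms, each `Struct` node would roughly correspond to a `struct(...)` column passed to `select`, after which `.as[T]` can see the nested shape again.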

StructTypeEncoder[Foo].encode shouldBe StructType(
StructField("a", StringType, false) ::
StructField("b", IntegerType, false, metadata) :: Nil
case class Bar(@Flatten(2) a: Seq[Foo], @Flatten(1, Seq("x", "y")) b: Map[Symbol, Foo], @Flatten c: Foo)

I think we should use Map[String, Foo] here as the keys in flatten are Strings anyway.

@gaborbarna gaborbarna changed the title from "[WIP] feat: flattened annotation support" to "feat: flattened annotation support" on Feb 13, 2018
}

implicit def recordEncoder[A, H <: HList, HA <: HList](
private def flattenFields(fields: Seq[StructField], dt: DataType, prefix: String, flatten: Flatten) =
Owner

could we specify the return type?
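
A small sketch of the point, using a made-up helper rather than the real `flattenFields` (the `prefixed` name and the dot separator are assumptions for illustration):

```scala
object Prefixes {
  // The compiler would infer Seq[String] here anyway; writing the type out
  // documents the contract and turns an accidental change of the body's type
  // into a compile error at the definition instead of at a distant call site.
  private def prefixed(prefix: String, names: Seq[String]): Seq[String] =
    names.map(n => s"$prefix.$n")

  def demo: Seq[String] = prefixed("foo", Seq("a", "b"))
}
```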

@@ -0,0 +1,197 @@
package ste
Owner

could we add the license header in this file and the associated spec?

}

object DFUtils {
implicit class EnhancedDF(df: DataFrame) {

I know that these are names of implicits, but I would rename at least the class. EnhancedDF should be something like FlattenedDataFrame. Also, I would split the method, and create a public wrapper around the select only, like selectNested: DataFrame. Then asNested can be implemented with selectNested and as.

Owner

@BenFradet BenFradet left a comment

Some nits, but it looks great 👍. Also, could we document those new features (Metadata and Flatten) in the readme in a separate PR?

tSelector.select(dfNested, parentPrefixes, flatten.tail)
}

private def getChildPrefixes(prefixes: Seq[Prefix], flatten: Option[Flatten]) =
Owner

we're also missing the return type here

}

private def getChildPrefixes(prefixes: Seq[Prefix], flatten: Option[Flatten]) =
flatten.map(_ match {
Owner

we can just do

flatten.map {
  case ...
}
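
Spelled out on a toy `Option`, the two forms behave the same; the tuple payload and method names below are invented for the example and are not the PR's `Flatten` type.

```scala
object PatternLambda {
  // Verbose form: a placeholder lambda that immediately matches on its argument.
  def verbose(o: Option[(String, Int)]): Option[String] =
    o.map(_ match { case (name, n) => s"$name.$n" })

  // Equivalent pattern-matching anonymous function, as suggested above.
  def concise(o: Option[(String, Int)]): Option[String] =
    o.map { case (name, n) => s"$name.$n" }
}
```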

}
}(breakOut)

private def orderedSelect(df: DataFrame, nestedCols: Map[Prefix, Column], fields: Seq[Prefix]) = {
Owner

missing return type here too


private def orderedSelect(df: DataFrame, nestedCols: Map[Prefix, Column], fields: Seq[Prefix]) = {
@tailrec
def loop(nestedCols: Map[Prefix, Column], fields: Seq[Prefix], cols: Seq[Column]): Seq[Column] = fields match {
Owner

could we make fields and cols lists instead of seqs?

StructTypeEncoder[Foo].encode shouldBe StructType(
StructField("a", StringType, false) ::
StructField("b", IntegerType, false, metadata) :: Nil
case class Bar(@Flatten(2) a: Seq[Foo], @Flatten(1, Seq("x", "y")) b: collection.Map[String, Foo], @Flatten c: Foo)
Owner

can't we have Map instead of collection.Map?

Contributor Author

Unfortunately not: this fix hasn't been backported to Spark 2.1 (apache/spark#16161).

Owner

ok, would you be against an upgrade to 2.2.1 in release 0.2.0?

Contributor Author

I was wrong: they did backport it to 2.1.x, so I'm upgrading the patch version. I think we should do the minor version update in a separate PR.


object StructSelectorSpec {
case class Foo(a: Int, b: String)
case class Bar(@Flatten(1, Seq("asd", "qwe")) foo: collection.Map[String, Foo], c: Int)
Owner

same remark for the collection.Map

@gaborbarna
Contributor Author

@BenFradet I'm going to write some documentation for these new features today hopefully

}

object DFUtils {
def selectNested[A](df: DataFrame)(implicit s: StructTypeSelector[A]): DataFrame = s.select(df, None)

I would move this into FlattenedDataFrame .

@tailrec
def loop(nestedCols: Map[Prefix, Column], fields: List[Prefix], cols: List[Column]): List[Column] = fields match {
case Nil => cols.reverse
case hd +: tail => nestedCols.find { case (p, _) => p.isParentOf(hd) } match {
Owner

since you have lists you can do hd :: tail, same thing below

Contributor Author

good idea, unfortunately shapeless overrides the :: definition

Owner

ah, too bad :(
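
For context, a minimal sketch of the two patterns discussed in this thread, on a toy loop (the names and the upper-casing step are made up for the example). In plain Scala both extractors work on a `List`; per the reply above, a shapeless import in scope makes `::` in patterns resolve to the HList one, which is why the PR stays with `+:`.

```scala
import scala.annotation.tailrec

object ListLoop {
  // Accumulate by prepending, then reverse once at the end.
  @tailrec
  def loop(fields: List[String], acc: List[String]): List[String] = fields match {
    case Nil        => acc.reverse
    case hd :: tail => loop(tail, hd.toUpperCase :: acc)
  }

  // Same traversal using the Seq extractor `+:`, which does not depend on
  // which `::` extractor happens to be in scope.
  @tailrec
  def loopSeq(fields: List[String], acc: List[String]): List[String] = fields match {
    case Nil        => acc.reverse
    case hd +: tail => loopSeq(tail, hd.toUpperCase :: acc)
  }
}
```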

case class Bar(@Flatten(1, Seq("asd", "qwe")) foo: Map[String, Foo], c: Int)
case class Baz(@Flatten(2) bar: Seq[Bar], e: Int)
case class Asd(@Flatten foo: Foo, x: Int)
}
Owner

Could we move this to the top of the spec? It makes things easier to understand.

Contributor Author

I moved the companion object up. It needs to be an object so that the Spark encoder can access the scope in which those case classes are defined.

Owner

Yeah, I don't have an issue with encapsulating them in an object; it's just that moving them to the top makes the tests easier to understand 👍

Owner

@BenFradet BenFradet left a comment

Looks great, two nits and I'll merge 👍 Thanks a lot for all your effort

@BenFradet
Owner

Merging, thanks a lot! 👍

@BenFradet BenFradet changed the title from "feat: flattened annotation support" to "Add support for a Flatten annotation" on Feb 15, 2018
@BenFradet BenFradet merged commit 449e6c5 into BenFradet:master Feb 15, 2018