Save genotype field indices during resolution #224
henrydavidge merged 19 commits into projectglow:master
Codecov Report
@@            Coverage Diff             @@
##           master     #224      +/-   ##
==========================================
+ Coverage   93.59%   93.63%   +0.03%
==========================================
  Files          88       88
  Lines        4232     4258      +26
  Branches      359      397      +38
==========================================
+ Hits         3961     3987      +26
  Misses        271      271
karenfeng
left a comment
LGTM overall, mostly just requests for documentation.
   * nested column pruning. Prefer writing new functions as rewrites when possible.
   */
- trait ExpectsGenotypeFields extends Expression {
+ @deprecated trait ExpectsGenotypeFields extends Expression {
For documentation, can we set the deprecated annotation's message and since fields?
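For reference, Scala's `@deprecated` annotation accepts a `message` and a `since` argument. A minimal sketch of what that could look like here, assuming a stand-in `Expression` trait so it compiles without Spark; the message text and the version string are placeholders, not the values used in this PR:

```scala
// Hypothetical stand-in for Catalyst's Expression so the sketch is self-contained.
trait Expression

// `message` explains what to use instead; `since` records the first release
// that carries the deprecation. Both values below are placeholders.
@deprecated(
  message = "Prefer implementing new functions as rewrites when possible.",
  since = "<release version>")
trait ExpectsGenotypeFields extends Expression
```

With both fields set, the compiler warning at each use site shows the message and the version, which doubles as documentation.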
- .withColumn(genotypesFieldName, arrays_zip(gSchema.get.fieldNames.map(col(_)): _*))
+ .withColumn(
+   genotypesFieldName,
+   arrays_zip(gSchema.get.fieldNames.map(name => col(name).as(name)): _*))
To clarify, we don't actually have to do this in the future, right?
Oops, this is a testing relic. I wanted to see if it would resolve the issue within the main change of the PR. It does not.
core/src/main/scala/io/projectglow/sql/optimizer/hlsOptimizerRules.scala
core/src/main/scala/io/projectglow/sql/expressions/VariantUtilExprs.scala
core/src/test/scala/io/projectglow/sql/util/ExpectsGenotypeFieldsSuite.scala
karenfeng
left a comment
LGTM, thanks! Just a small typo nit.
   * resolution.
   * @param size The number of fields in the struct
   * @param requiredFieldIndices The indices of required fields. 0 <= idx < size.
   * @param optionalFieldIndices The indices of optional fields. -1 if not the field is not present.
Typo: -1 if the field is not present
The DCO bot seems confused by some of the merges into this branch. I force set it to pass.
* Support numpy literals (projectglow#213) * WIP Signed-off-by: Karen Feng <karen.feng@databricks.com> * Ndarray Signed-off-by: Karen Feng <karen.feng@databricks.com> * More tests Signed-off-by: Karen Feng <karen.feng@databricks.com> * Support flat array Signed-off-by: Karen Feng <karen.feng@databricks.com> * Move registering Signed-off-by: Karen Feng <karen.feng@databricks.com> * pytest Signed-off-by: Karen Feng <karen.feng@databricks.com> * Make docstring more accurate Signed-off-by: Karen Feng <karen.feng@databricks.com> * idempotent registration Signed-off-by: Karen Feng <karen.feng@databricks.com> * Test fixup Signed-off-by: Karen Feng <karen.feng@databricks.com> * Move import out Signed-off-by: Karen Feng <karen.feng@databricks.com> * Add documentation for the GFF3 reader (projectglow#221) * doc Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> * doc Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> * notebook Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> * attributes Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> * yapf Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> Signed-off-by: Karen Feng <karen.feng@databricks.com> * Remove cross-build plugin Signed-off-by: Karen Feng <karen.feng@databricks.com> * wrap Signed-off-by: Karen Feng <karen.feng@databricks.com> * triple quote Signed-off-by: Karen Feng <karen.feng@databricks.com> * set 'for imports' and 'for builds' within sbt settings Signed-off-by: Karen Feng <karen.feng@databricks.com> * CircleCI config Signed-off-by: Karen Feng <karen.feng@databricks.com> * wrap release Signed-off-by: Karen Feng <karen.feng@databricks.com> * new rule Signed-off-by: Henry D <henrydavidge@gmail.com> * Support numpy literals (projectglow#213) * WIP Signed-off-by: Karen Feng <karen.feng@databricks.com> * Ndarray Signed-off-by: Karen Feng <karen.feng@databricks.com> * More tests Signed-off-by: Karen Feng <karen.feng@databricks.com> * Support flat array Signed-off-by: Karen Feng 
<karen.feng@databricks.com> * Move registering Signed-off-by: Karen Feng <karen.feng@databricks.com> * pytest Signed-off-by: Karen Feng <karen.feng@databricks.com> * Make docstring more accurate Signed-off-by: Karen Feng <karen.feng@databricks.com> * idempotent registration Signed-off-by: Karen Feng <karen.feng@databricks.com> * Test fixup Signed-off-by: Karen Feng <karen.feng@databricks.com> * Move import out Signed-off-by: Karen Feng <karen.feng@databricks.com> * Add documentation for the GFF3 reader (projectglow#221) * doc Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> * doc Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> * notebook Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> * attributes Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> * yapf Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> * More docs Signed-off-by: Karen Feng <karen.feng@databricks.com> * add tests Signed-off-by: Henry D <henrydavidge@gmail.com> * style Signed-off-by: Henry D <henrydavidge@gmail.com> * Fix IntelliJ import (projectglow#223) * Support numpy literals (projectglow#213) * WIP Signed-off-by: Karen Feng <karen.feng@databricks.com> * Ndarray Signed-off-by: Karen Feng <karen.feng@databricks.com> * More tests Signed-off-by: Karen Feng <karen.feng@databricks.com> * Support flat array Signed-off-by: Karen Feng <karen.feng@databricks.com> * Move registering Signed-off-by: Karen Feng <karen.feng@databricks.com> * pytest Signed-off-by: Karen Feng <karen.feng@databricks.com> * Make docstring more accurate Signed-off-by: Karen Feng <karen.feng@databricks.com> * idempotent registration Signed-off-by: Karen Feng <karen.feng@databricks.com> * Test fixup Signed-off-by: Karen Feng <karen.feng@databricks.com> * Move import out Signed-off-by: Karen Feng <karen.feng@databricks.com> * Add documentation for the GFF3 reader (projectglow#221) * doc Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> * doc Signed-off-by: kianfar77 
<kiavash.kianfar@databricks.com> * notebook Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> * attributes Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> * yapf Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com> Signed-off-by: Karen Feng <karen.feng@databricks.com> * Remove cross-build plugin Signed-off-by: Karen Feng <karen.feng@databricks.com> * wrap Signed-off-by: Karen Feng <karen.feng@databricks.com> * triple quote Signed-off-by: Karen Feng <karen.feng@databricks.com> * set 'for imports' and 'for builds' within sbt settings Signed-off-by: Karen Feng <karen.feng@databricks.com> * CircleCI config Signed-off-by: Karen Feng <karen.feng@databricks.com> * wrap release Signed-off-by: Karen Feng <karen.feng@databricks.com> * More docs Signed-off-by: Karen Feng <karen.feng@databricks.com> * Comments Signed-off-by: Karen Feng <karen.feng@databricks.com> Co-authored-by: Kiavash Kianfar <kiavash.kianfar@databricks.com> * comments Signed-off-by: Henry D <henrydavidge@gmail.com> * more docs Signed-off-by: Henry D <henrydavidge@gmail.com> * note pr Signed-off-by: Henry D <henrydavidge@gmail.com> Co-authored-by: Karen Feng <karen.feng@databricks.com> Co-authored-by: Kiavash Kianfar <kiavash.kianfar@databricks.com> Signed-off-by: Henry Davidge <hhd@databricks.com>
What changes are proposed in this pull request?
@karenfeng noticed that the genotype_states function did not work after splitting multiallelics. It turns out that this is because expressions like GenotypeStates that extend ExpectsGenotypeFields check during physical planning and execution that the struct field names match expectations. However, some expressions like ArraysZip lose the struct names after resolution.

Now, during resolution we store the list of field indices we need. Even if the names change, the indices stay the same (this property is also relied upon by Spark's built-in expressions like GetStructField).

Moving forward, we should prefer implementing new functions as rewrites whenever possible.
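To illustrate the idea, here is a minimal sketch in plain Scala with no Spark dependency. The names `ResolvedFieldIndices`, `GenotypeFieldResolver`, and `FieldIndexDemo` are hypothetical and are not the classes added in this PR; the sketch only assumes the convention from the docstring under review (required indices within bounds, -1 for an absent optional field).

```scala
// Hypothetical, simplified model of the idea; these are not Glow's actual classes.
final case class ResolvedFieldIndices(
    size: Int,
    requiredFieldIndices: Seq[Int],
    optionalFieldIndices: Seq[Int])

object GenotypeFieldResolver {
  // Look up positions by name once, while the field names are still reliable.
  def resolve(
      fieldNames: Seq[String],
      required: Seq[String],
      optional: Seq[String]): ResolvedFieldIndices = {
    val requiredIdx = required.map { name =>
      val idx = fieldNames.indexOf(name)
      require(idx >= 0, s"Missing required genotype field: $name")
      idx
    }
    // indexOf returns -1 when the name is absent, matching the "-1 if not present" convention.
    val optionalIdx = optional.map(name => fieldNames.indexOf(name))
    ResolvedFieldIndices(fieldNames.size, requiredIdx, optionalIdx)
  }
}

object FieldIndexDemo {
  // Resolution time: the struct still carries its original field names.
  val indices: ResolvedFieldIndices = GenotypeFieldResolver.resolve(
    fieldNames = Seq("sampleId", "calls"),
    required = Seq("calls"),
    optional = Seq("phased"))

  // Later: suppose the names were lost or replaced. The values are still in the
  // same positions, so the saved index finds the calls without any name lookup.
  val renamedRow: Seq[Any] = Seq("sample1", Seq(0, 1))
  val calls: Any = renamedRow(indices.requiredFieldIndices.head)
}
```

Because the lookup by name happens only once, any later renaming of the struct fields cannot affect which position is read.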
How is this patch tested?
(Details)