Skip to content

Comments

Save genotype field indices during resolution#224

Merged
henrydavidge merged 19 commits intoprojectglow:masterfrom
henrydavidge:fix-expr
Jun 2, 2020
Merged

Save genotype field indices during resolution#224
henrydavidge merged 19 commits intoprojectglow:masterfrom
henrydavidge:fix-expr

Conversation

@henrydavidge
Copy link
Contributor

What changes are proposed in this pull request?

@karenfeng noticed that the genotype_states function did not work after splitting multiallelics. It turns out that this is because expressions like GenotypeStates that extend ExpectsGenotypeFields check during physical planning and execution that the struct field names match expectations. However, some expressions like ArraysZip lose the struct names after resolution.

Now, during resolution we store the list of field indices we need. Even if the names change, the indices stay the same (this property is also relied upon by Spark's built-in expressions like GetStructField).

Moving forward, we should prefer implementing new functions as rewrites whenever possible.

How is this patch tested?

  • Unit tests
  • Integration tests
  • Manual tests

(Details)

@henrydavidge henrydavidge requested a review from karenfeng June 1, 2020 23:28
@codecov
Copy link

codecov bot commented Jun 1, 2020

Codecov Report

Merging #224 into master will increase coverage by 0.03%.
The diff coverage is 85.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #224      +/-   ##
==========================================
+ Coverage   93.59%   93.63%   +0.03%     
==========================================
  Files          88       88              
  Lines        4232     4258      +26     
  Branches      359      397      +38     
==========================================
+ Hits         3961     3987      +26     
  Misses        271      271              
Impacted Files Coverage Δ
...w/sql/expressions/PerSampleSummaryStatistics.scala 83.33% <75.00%> (+0.25%) ⬆️
...o/projectglow/sql/expressions/VariantQcExprs.scala 86.95% <77.77%> (-0.74%) ⬇️
...projectglow/sql/expressions/VariantUtilExprs.scala 97.10% <77.77%> (-0.66%) ⬇️
...cala/io/projectglow/sql/SqlExtensionProvider.scala 100.00% <100.00%> (ø)
...tglow/sql/expressions/SampleCallSummaryStats.scala 97.01% <100.00%> (+0.02%) ⬆️
.../projectglow/sql/optimizer/hlsOptimizerRules.scala 91.30% <100.00%> (+1.83%) ⬆️
...o/projectglow/sql/util/ExpectsGenotypeFields.scala 84.00% <100.00%> (+7.80%) ⬆️
core/src/main/scala/io/projectglow/functions.scala 35.48% <0.00%> (+3.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7548aa8...0d710dc. Read the comment docs.

Copy link
Collaborator

@karenfeng karenfeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, mostly just requests for documentation.

* nested column pruning. Prefer writing new functions as rewrites when possible.
*/
trait ExpectsGenotypeFields extends Expression {
@deprecated trait ExpectsGenotypeFields extends Expression {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For documentation, can we set the deprecated annotation's message and since fields?

.withColumn(genotypesFieldName, arrays_zip(gSchema.get.fieldNames.map(col(_)): _*))
.withColumn(
genotypesFieldName,
arrays_zip(gSchema.get.fieldNames.map(name => col(name).as(name)): _*))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, we don't actually have to do this in the future, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, this is a testing relic. I wanted to see if it would resolve the issue within the main change of the PR. It does not.

henrydavidge and others added 17 commits June 2, 2020 09:05
Signed-off-by: Henry D <henrydavidge@gmail.com>
* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Ndarray

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* More tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Support flat array

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Move registering

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* pytest

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Make docstring more accurate

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* idempotent registration

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Move import out

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* doc

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* doc

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* notebook

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* attributes

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* yapf

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Ndarray

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* More tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Support flat array

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Move registering

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* pytest

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Make docstring more accurate

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* idempotent registration

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Move import out

Signed-off-by: Karen Feng <karen.feng@databricks.com>
* doc

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* doc

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* notebook

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* attributes

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* yapf

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Henry D <henrydavidge@gmail.com>
Signed-off-by: Henry D <henrydavidge@gmail.com>
* Support numpy literals (projectglow#213)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Ndarray

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* More tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Support flat array

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Move registering

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* pytest

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Make docstring more accurate

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* idempotent registration

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Move import out

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add documentation for the GFF3 reader (projectglow#221)

* doc

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* doc

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* notebook

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* attributes

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* yapf

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Remove cross-build plugin

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* wrap

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* triple quote

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* set 'for imports' and 'for builds' within sbt settings

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* CircleCI config

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* wrap release

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* More docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>

Co-authored-by: Kiavash Kianfar <kiavash.kianfar@databricks.com>
Signed-off-by: Henry D <henrydavidge@gmail.com>
Signed-off-by: Henry D <henrydavidge@gmail.com>
Signed-off-by: Henry D <henrydavidge@gmail.com>
Copy link
Collaborator

@karenfeng karenfeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks! Just a small typo nit.

* resolution.
* @param size The number of fields in the struct
* @param requiredFieldIndices The indices of required fields. 0 <= idx < size.
* @param optionalFieldIndices The indices of optional fields. -1 if not the field is not present.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: -1 if the field is not present

@henrydavidge
Copy link
Contributor Author

The DCO bot seems confused by some of the merges into this branch. I force set it to pass.

@henrydavidge henrydavidge merged commit ec7da93 into projectglow:master Jun 2, 2020
henrydavidge added a commit to henrydavidge/glow that referenced this pull request Jun 22, 2020
* Support numpy literals (projectglow#213)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Ndarray

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* More tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Support flat array

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Move registering

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* pytest

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Make docstring more accurate

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* idempotent registration

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Move import out

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add documentation for the GFF3 reader (projectglow#221)

* doc

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* doc

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* notebook

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* attributes

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* yapf

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Remove cross-build plugin

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* wrap

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* triple quote

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* set 'for imports' and 'for builds' within sbt settings

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* CircleCI config

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* wrap release

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* new rule

Signed-off-by: Henry D <henrydavidge@gmail.com>

* Support numpy literals (projectglow#213)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Ndarray

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* More tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Support flat array

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Move registering

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* pytest

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Make docstring more accurate

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* idempotent registration

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Move import out

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add documentation for the GFF3 reader (projectglow#221)

* doc

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* doc

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* notebook

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* attributes

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* yapf

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* More docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* add tests

Signed-off-by: Henry D <henrydavidge@gmail.com>

* style

Signed-off-by: Henry D <henrydavidge@gmail.com>

* Fix IntelliJ import (projectglow#223)

* Support numpy literals (projectglow#213)

* WIP

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Ndarray

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* More tests

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Support flat array

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Move registering

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* pytest

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Make docstring more accurate

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* idempotent registration

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Test fixup

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Move import out

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Add documentation for the GFF3 reader (projectglow#221)

* doc

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* doc

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* notebook

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* attributes

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>

* yapf

Signed-off-by: kianfar77 <kiavash.kianfar@databricks.com>
Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Remove cross-build plugin

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* wrap

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* triple quote

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* set 'for imports' and 'for builds' within sbt settings

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* CircleCI config

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* wrap release

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* More docs

Signed-off-by: Karen Feng <karen.feng@databricks.com>

* Comments

Signed-off-by: Karen Feng <karen.feng@databricks.com>

Co-authored-by: Kiavash Kianfar <kiavash.kianfar@databricks.com>

* comments

Signed-off-by: Henry D <henrydavidge@gmail.com>

* more docs

Signed-off-by: Henry D <henrydavidge@gmail.com>

* note pr

Signed-off-by: Henry D <henrydavidge@gmail.com>

Co-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Kiavash Kianfar <kiavash.kianfar@databricks.com>
Signed-off-by: Henry Davidge <hhd@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants